Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pod connection issue reproducer #1564

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

amuniz
Copy link
Member

@amuniz amuniz commented May 29, 2024

For some reason I don't understand the pod jnlp container thinks it's connected but the controller thinks it's not and keeps waiting for it.

To run the test I'm using:

$ kind create cluster
$ MOUNT_M2=true ./test-in-k8s.sh -Dtest=KubernetesPipelineRJRTest#restartDuringPodLaunch

jnlp container logs:

May 29, 2024 9:41:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
May 29, 2024 9:41:12 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
May 29, 2024 9:41:12 AM hudson.remoting.Launcher createEngine
INFO: Setting up agent: test0-1-0jmg0-hp6gs-0hjx9
May 29, 2024 9:41:13 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3248.v65ecb_254c298
May 29, 2024 9:41:13 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
May 29, 2024 9:41:21 AM hudson.remoting.Launcher$CuiListener status
INFO: Locating server among [http://jenkins.kubernetes-plugin-test.svc.cluster.local:24289/jenkins/]
May 29, 2024 9:41:23 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins.kubernetes-plugin-test.svc.cluster.local
  Agent port:    24290
  Identity:      2f:8b:3e:9a:28:e7:d0:c5:24:25:18:72:60:8f:aa:b5
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Handshaking
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Connecting to jenkins.kubernetes-plugin-test.svc.cluster.local:24290
May 29, 2024 9:41:24 AM hudson.remoting.Launcher$CuiListener status
INFO: Server reports protocol JNLP4-connect-proxy not supported, skipping
May 29, 2024 9:41:24 AM hudson.remoting.Launcher$CuiListener status
INFO: Trying protocol: JNLP4-connect
May 29, 2024 9:41:26 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
May 29, 2024 9:41:31 AM hudson.remoting.Launcher$CuiListener status
INFO: Remote identity confirmed: 2f:8b:3e:9a:28:e7:d0:c5:24:25:18:72:60:8f:aa:b5
May 29, 2024 9:41:32 AM hudson.remoting.Launcher$CuiListener status
INFO: Connected

Controller logs:

2024-05-29 09:41:00.737+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes kubernetes-plugin-test/test0-1-0jmg0-hp6gs-0hjx9
2024-05-29 09:41:23.418+0000 [id=166]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection #2 from /10.244.0.14:36382 failed: null
2024-05-29 09:41:26.810+0000 [id=172]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #3 from /10.244.0.14:36398
2024-05-29 09:41:31.115+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (30/1,000): test0-1-0jmg0-hp6gs-0hjx9
2024-05-29 09:42:01.438+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (60/1,000): test0-1-0jmg0-hp6gs-0hjx9

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

For some reason I don't understand the pod jnlp container thinks it's
connected but the controller thinks it's not and keeps waiting for it.

`jnlp` container logs:

```
May 29, 2024 9:41:09 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
May 29, 2024 9:41:12 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
May 29, 2024 9:41:12 AM hudson.remoting.Launcher createEngine
INFO: Setting up agent: test0-1-0jmg0-hp6gs-0hjx9
May 29, 2024 9:41:13 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 3248.v65ecb_254c298
May 29, 2024 9:41:13 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
May 29, 2024 9:41:21 AM hudson.remoting.Launcher$CuiListener status
INFO: Locating server among [http://jenkins.kubernetes-plugin-test.svc.cluster.local:24289/jenkins/]
May 29, 2024 9:41:23 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins.kubernetes-plugin-test.svc.cluster.local
  Agent port:    24290
  Identity:      2f:8b:3e:9a:28:e7:d0:c5:24:25:18:72:60:8f:aa:b5
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Handshaking
May 29, 2024 9:41:23 AM hudson.remoting.Launcher$CuiListener status
INFO: Connecting to jenkins.kubernetes-plugin-test.svc.cluster.local:24290
May 29, 2024 9:41:24 AM hudson.remoting.Launcher$CuiListener status
INFO: Server reports protocol JNLP4-connect-proxy not supported, skipping
May 29, 2024 9:41:24 AM hudson.remoting.Launcher$CuiListener status
INFO: Trying protocol: JNLP4-connect
May 29, 2024 9:41:26 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
INFO: Waiting for ProtocolStack to start.
May 29, 2024 9:41:31 AM hudson.remoting.Launcher$CuiListener status
INFO: Remote identity confirmed: 2f:8b:3e:9a:28:e7:d0:c5:24:25:18:72:60:8f:aa:b5
May 29, 2024 9:41:32 AM hudson.remoting.Launcher$CuiListener status
INFO: Connected
```

Controller logs:

```
2024-05-29 09:41:00.737+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes kubernetes-plugin-test/test0-1-0jmg0-hp6gs-0hjx9
2024-05-29 09:41:23.418+0000 [id=166]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Connection jenkinsci#2 from /10.244.0.14:36382 failed: null
2024-05-29 09:41:26.810+0000 [id=172]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection jenkinsci#3 from /10.244.0.14:36398
2024-05-29 09:41:31.115+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (30/1,000): test0-1-0jmg0-hp6gs-0hjx9
2024-05-29 09:42:01.438+0000 [id=128]	INFO	o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (60/1,000): test0-1-0jmg0-hp6gs-0hjx9
```
Comment on lines +5 to +13
containers:
- name: jnlp
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 100m
memory: 256Mi
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove this, then the test passes O_o

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CPU limit would be throttling the pod and changing timing conditions. Maybe that is why.

Copy link
Member

@Vlatombe Vlatombe May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the CPU throttling is just too much, and the agent never ends up as "online" on the controller side within the expected timeout despite establishing the remoting connection.

This may be an opportunity to profile the agent initialization (remote classloading, etc.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least Waiting for agent to connect could be reworded to emphasize that the controller might be waiting for the agent to be fully initialized and ready to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants