
Running builds get stuck after Jenkins master pod deletion and recreation #542

Closed
ataillefer opened this issue Apr 12, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments


ataillefer commented Apr 12, 2021

Describe the bug
When deleting the Jenkins master pod while some builds are running, those builds get stuck after the new Jenkins master pod comes up.
I'm not sure whether this is expected, or how it could be solved.
Thanks for your help.

To Reproduce
Install the Jenkins operator with the Helm chart, keeping jenkins.enabled=true to install a Jenkins CR. Here's a minimal set of values:

jenkins:
  name: master
  image: jenkins/jenkins:lts
  basePlugins:
  - name: kubernetes
    version: "1.29.2"
  - ...
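
For reference, the install itself can look like this (a sketch: chart repo URL as in the operator docs; release name, namespace, and values file name are illustrative):

# Sketch, assuming the chart repo from the operator docs;
# release name, namespace and values file are illustrative.
helm repo add jenkins https://raw.githubusercontent.com/jenkinsci/kubernetes-operator/master/chart
helm install jenkins-operator jenkins/jenkins-operator \
  --namespace platform-staging \
  --values values.yaml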

Configure the Kubernetes plugin and add a pod template using CasC.
Add a multibranch pipeline job running a Jenkinsfile that uses this pod template (agent in the Jenkinsfile).
Run a build on the master branch of this job; let the build start and let the Kubernetes plugin provision an agent based on the above-mentioned pod template.
Verify that a new pod has been created for this agent in the current namespace.
Delete the Jenkins master pod, wait for the operator to recreate it and the Jenkins master to be up.
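
For example (a sketch; pod and namespace names are taken from the logs below and may differ in your setup):

# The agent pod (here nuxeo-platform-11-pppcs) should show up as Running.
kubectl get pods -n platform-staging
# Delete the Jenkins master pod and watch the operator recreate it.
kubectl delete pod jenkins-master -n platform-staging
kubectl get pods -n platform-staging -w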
Check the build logs in the Jenkins UI; the build is stuck with:

10:48:33  Resuming build at Mon Apr 12 08:48:33 UTC 2021 after Jenkins restart
10:48:33  Waiting to resume part of nuxeo » 11.x » nuxeo » master #12: ‘nuxeo-platform-11-pppcs’ is offline
10:48:34  Still trying to load Looking for path named ‘/home/jenkins/agent/workspace/nuxeo_11.x_nuxeo_master’ on computer named ‘nuxeo-platform-11-pppcs’
10:48:35  Waiting to resume part of nuxeo » 11.x » nuxeo » master #12: ‘nuxeo-platform-11-pppcs’ is offline
10:48:36  Still trying to load Looking for path named ‘/home/jenkins/agent/workspace/nuxeo_11.x_nuxeo_master’ on computer named ‘nuxeo-platform-11-pppcs’

Finally, the build ends with:

11:04:42  ERROR: Killed hudson.model.Queue$BuildableItem:ExecutorStepExecution.PlaceholderTask{runId=nuxeo/11.x/nuxeo/master#12,label=nuxeo-platform-11-pppcs,context=CpsStepContext[4:node]:Owner[nuxeo/11.x/nuxeo/master/12:nuxeo/11.x/nuxeo/master #12],cookie=586bb65f-dc33-4845-af05-40dda673d425,auth=null}:6 after waiting for 300,000 ms because we assume unknown Node nuxeo-platform-11-pppcs is never going to appear!
11:04:42  Finished: FAILURE

See the related logs below.

Why does it happen?

From what I understand, once the Jenkins master pod is recreated and up and running, the agent (aka node, aka pod; here nuxeo-platform-11-pppcs) on which the build was running no longer exists in Jenkins, hence the error in the JNLP container:

java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: nuxeo-platform-11-pppcs
	at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
	at hudson.remoting.Engine.innerRun(Engine.java:743)
	at hudson.remoting.Engine.run(Engine.java:518)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: nuxeo-platform-11-pppcs
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)

Fix attempt

I figured out that the backup image was only backing up the jobs, not the nodes; see the tar command line.
So I tried to override the backup image with a backup.sh script that also backs up the nodes directory, along with the jobs one:

tar -C ${JENKINS_HOME} -czf "${BACKUP_TMP_DIR}/${backup_number}.tar.gz" --exclude jobs/*/workspace* --no-wildcards-match-slash --anchored --exclude jobs/*/config.xml -c jobs nodes && \
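
For context, a minimal sketch of the overridden script (only the archiving step is shown; the trailing && \ above chains into the rest of the upstream script, whose temp-dir and backup-dir handling is assumed unchanged):

#!/bin/bash -e
# Sketch: the upstream backup step with "nodes" added next to "jobs",
# so agent definitions are included in the archive and survive a restore.
backup_number=$1
tar -C "${JENKINS_HOME}" \
    --exclude 'jobs/*/workspace*' --no-wildcards-match-slash --anchored \
    --exclude 'jobs/*/config.xml' \
    -czf "${BACKUP_TMP_DIR}/${backup_number}.tar.gz" jobs nodes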

This works in the sense that once the backup is restored and Jenkins is up and running, the node appears in the Jenkins master's node list.
Yet, since the restore happens after the Jenkins master is up and running (operator logs: 2021-04-12T08:47:47.236Z), the agent isn't there yet when the JNLP container tries to connect to the Jenkins master (JNLP container logs: 2021-04-12 08:46:47.647+0000).
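
One way to observe this from outside (a sketch; the jenkins-master pod and container names match the operator logs below but may differ in your setup):

# After the restore completes, the nodes directory should be back in JENKINS_HOME.
kubectl exec -n platform-staging jenkins-master -c jenkins-master -- \
  sh -c 'ls "${JENKINS_HOME}/nodes"'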

Additional information

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.0", GitCommit:"70132b0f130acc0bed193d9ba59dd186f0e634cf", GitTreeState:"clean", BuildDate:"2019-12-07T21:20:10Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-gke.2800", GitCommit:"4ef0543cc59f482d118ae33397108118ec80d447", GitTreeState:"clean", BuildDate:"2021-02-20T01:42:15Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}

Jenkins Operator version: v0.5.0, installed through Helm chart 0.4.3.

Logs:

  • Operator, after the Jenkins master pod has been deleted (no problem here: a backup is made, the Jenkins master pod is recreated, and the backup is restored):
2021-04-12T08:44:56.206Z	INFO	controller-jenkins	backuprestore/backuprestore.go:36	Stopping backup trigger for 'platform-staging/master'	{"cr": "master"}
2021-04-12T08:44:56.311Z	INFO	controller-jenkins	backuprestore/backuprestore.go:205	Performing backup '151'	{"cr": "master"}
2021-04-12T08:45:28.409Z	INFO	controller-jenkins	base/pod.go:159	Creating a new Jenkins Master Pod platform-staging/jenkins-master	{"cr": "master"}
2021-04-12T08:47:17.688Z	INFO	controller-jenkins	jenkins/jenkins_controller.go:264	Base configuration phase is complete, took 1m49s	{"cr": "master"}
...
2021-04-12T08:47:47.236Z	INFO	controller-jenkins	backuprestore/backuprestore.go:173	Restoring backup '151'	{"cr": "master"}
2021-04-12T08:47:51.934Z	INFO	controller-jenkins	backuprestore/backuprestore.go:205	Performing backup '152'	{"cr": "master"}
2021-04-12T08:48:00.433Z	INFO	controller-jenkins	backuprestore/backuprestore.go:281	Starting backup trigger	{"cr": "master"}
  • Jenkins master, after the Jenkins master pod has been recreated:
2021-04-12 08:46:36.019+0000 [id=20]	INFO	hudson.WebAppMain$3#run: Jenkins is fully up and running
2021-04-12 08:46:42.498+0000 [id=52]	INFO	h.m.DownloadService$Downloadable#load: Obtained the updated data file for hudson.tasks.Maven.MavenInstaller
2021-04-12 08:46:43.096+0000 [id=52]	INFO	h.m.DownloadService$Downloadable#load: Obtained the updated data file for hudson.tools.JDKInstaller
2021-04-12 08:46:43.096+0000 [id=52]	INFO	hudson.util.Retrier#start: Performed the action check updates server successfully at the attempt #1
2021-04-12 08:46:43.099+0000 [id=52]	INFO	hudson.model.AsyncPeriodicWork#lambda$doRun$0: Finished Download metadata. 8,601 ms
2021-04-12 08:46:47.647+0000 [id=73]	INFO	h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #1 from /10.60.60.3:49734
2021-04-12 08:46:49.795+0000 [id=74]	INFO	o.j.r.p.i.ConnectionHeadersFilterLayer#onRecv: [JNLP4-connect connection from 10.60.60.3/10.60.60.3:49734] Refusing headers from remote: Unknown client name: nuxeo-platform-11-pppcs

The last line is where the problem starts.

  • Logs of the JNLP container in the build pod, when the Jenkins master pod is deleted, then recreated:
Apr 12, 2021 8:44:56 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Apr 12, 2021 8:45:07 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:45:18 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:45:29 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:45:40 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:45:51 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:46:02 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:46:13 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:46:24 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:46:35 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFO: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused (Connection refused)
Apr 12, 2021 8:46:45 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Performing onReconnect operation.
Apr 12, 2021 8:46:45 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter@29eb5ffb
Picked up JAVA_TOOL_OPTIONS: -Dsun.zip.disableMemoryMapping=true -XX:+UseParallelGC -XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: nuxeo-platform-11-pppcs
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Apr 12, 2021 8:46:47 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.3
Apr 12, 2021 8:46:47 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Apr 12, 2021 8:46:47 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins-operator-http-master.platform-staging.svc.cluster.local:8080]
Apr 12, 2021 8:46:47 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Apr 12, 2021 8:46:47 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: jenkins-operator-slave-master.platform-staging.svc.cluster.local
  Agent port:    50000
  Identity:      1b:84:48:f5:c5:e7:4f:74:9b:da:71:cb:49:a4:ba:fd
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-operator-slave-master.platform-staging.svc.cluster.local:50000
Apr 12, 2021 8:46:47 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Apr 12, 2021 8:46:49 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: 1b:84:48:f5:c5:e7:4f:74:9b:da:71:cb:49:a4:ba:fd
Apr 12, 2021 8:46:49 AM org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer onRecv
INFO: [JNLP4-connect connection to jenkins-operator-slave-master.platform-staging.svc.cluster.local/10.63.243.123:50000] Local headers refused by remote: Unknown client name: nuxeo-platform-11-pppcs
Apr 12, 2021 8:46:49 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Protocol JNLP4-connect encountered an unexpected exception
java.util.concurrent.ExecutionException: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: nuxeo-platform-11-pppcs
	at org.jenkinsci.remoting.util.SettableFuture.get(SettableFuture.java:223)
	at hudson.remoting.Engine.innerRun(Engine.java:743)
	at hudson.remoting.Engine.run(Engine.java:518)
Caused by: org.jenkinsci.remoting.protocol.impl.ConnectionRefusalException: Unknown client name: nuxeo-platform-11-pppcs
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.newAbortCause(ConnectionHeadersFilterLayer.java:378)
	at org.jenkinsci.remoting.protocol.impl.ConnectionHeadersFilterLayer.onRecvClosed(ConnectionHeadersFilterLayer.java:433)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:172)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:816)
	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$1500(BIONetworkLayer.java:48)
	at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetworkLayer.java:247)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:117)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.nio.channels.ClosedChannelException
		... 7 more

Apr 12, 2021 8:46:49 AM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: None of the protocols were accepted
java.lang.Exception: The server rejected the connection: None of the protocols were accepted
	at hudson.remoting.Engine.onConnectionRejected(Engine.java:828)
	at hudson.remoting.Engine.innerRun(Engine.java:768)
	at hudson.remoting.Engine.run(Engine.java:518)

After this, the JNLP container shuts down and the pod ends up in an error state.

ataillefer added the bug label on Apr 12, 2021
ataillefer added a commit to nuxeo/platform-ci that referenced this issue Apr 17, 2021
- The operator currently has a big issue due to its backup mechanism: when deleting the Jenkins master pod while some builds are running, these builds get stuck after the new Jenkins master pod gets up and running. See jenkinsci/kubernetes-operator#542.
- The chart seems more "standard", relying on a StatefulSet.
- It's more complete: includes an Ingress template for instance.
- The plugin descriptor is standard, allowing for instance to not specify the version and always fetch the latest.
- It still provides CasC hot reload through a sidecar container.
@SylwiaBrant

Hello. Thank you for bringing this issue to our attention. This is a complex task that would require a great deal of thought and work to align with our core architectural concepts (e.g. Jenkins immutability). We are currently concentrating our efforts on introducing a new schema that will allow us to resolve the most pressing and long-known issues in our community. While this definitely needs to be addressed, it is the first time anyone has brought it up, so we will have to postpone handling it; tackling it now would mean an even longer waiting period for the matters that have already been raised multiple times.

@ataillefer (Author)

Hi @SylwiaBrant, thanks for your answer. I understand perfectly well that there are other priorities.
In the meantime, as this issue is a blocker for us, I've switched to the Jenkins Helm chart, which has been working great so far.

Congrats on the work on the operator; besides this issue it has been working very well, and the documentation is clear!

@NesManrique

I'm also running into this issue at the moment. Is there any workaround that would let us continue using the operator, or is the recommendation to just use the Jenkins Helm chart?


Sig00rd commented Jul 27, 2021

Hello @NesManrique, thanks for the update. We acknowledge the severity of this issue, but we currently have no capacity to work on it, as fixing it properly would require significant architectural changes, and such changes are beyond the scope of the next release, with the new API schema taking priority for the time being.


maslakov commented Feb 7, 2022

Hi @Sig00rd!
We are also facing this problem, and it is critical for us: we have already spent a lot of time adopting the operator and would like to keep using it.
Why close the bug without any proposal or reference to a possible future solution, even one that might only arrive in a while?
It would be fairer to keep the issue open, so everyone knows it exists, and close it once something has been implemented that changes this behaviour: the new API or whatever else.
