This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

jenkins.http’s server IP address could not be found. #4598

Open
ppitonak opened this issue Nov 29, 2018 · 69 comments

@ppitonak (Collaborator)

Issue Overview

When a user navigates to http://jenkins.openshift.io, they see an error page instead of the Jenkins UI.

Expected Behaviour

The Jenkins UI is displayed.

Current Behaviour

An error message is displayed:
[Screenshot: 05-04-jenkins-direct-log]

Steps To Reproduce
  1. Create a space, create a new app, and wait for the pipeline to start (might not be necessary).
  2. Open a new tab and go to http://jenkins.openshift.io
Additional Information

We saw it twice on us-east-1a-beta:

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1634/ Nov 28, 2018 4:35:00 PM UTC
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1635/ Nov 28, 2018 6:35:00 PM UTC

We saw a similar bug for api.openshift.io:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2-released/14962/01-01-afterEach.png Nov 8, 2018 4:28:00 AM UTC
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/14953/01-01-afterEach.png Nov 8, 2018 4:32:00 AM UTC

@piyush-garg (Collaborator)

@ppitonak I am not able to reproduce this issue from my account.

@ppitonak (Collaborator, Author)

Happened again yesterday, again on us-east-1a with the same account.

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1647/ Nov 29, 2018 6:35:00 PM

Out of 22 runs, 3 failed with this error. The job runs every 2 hours.

@ppitonak (Collaborator, Author)

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1656/ Nov 30, 2018 12:35:00 PM

... again the same cluster

@ljelinkova (Collaborator)

I've seen this error on all clusters 1-2 times during the weekend, for example:

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1159/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1152/console

There is one thing common in the oc logs:

oc get all
NAME                   READY     STATUS             RESTARTS   AGE
pod/jenkins-1-deploy   0/1       DeadlineExceeded   0          3h

Usually, there are no events in the log, maybe because the jobs run only every 4 hours. However, in one job, I've seen quite a lot of log events like this:

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/oc-jenkins-logs.txt

 Error creating: pods "jenkins-1-j5j6z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi
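
A quick way to inspect what is pinning that quota (a sketch; the namespace name is a placeholder, each account has its own):

# Show the compute-resources quota and its recorded usage
oc describe resourcequota compute-resources -n <user>-jenkins
# List the pods currently charged against it
oc get pods -n <user>-jenkins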

@piyush-garg (Collaborator) commented Dec 3, 2018

This looks like the idler trying to unidle Jenkins: the new pod did not come up because of the resource quota, and the old pod is stuck in the DeadlineExceeded state.
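
If that is what is happening, a possible way to confirm and recover (a sketch; the namespace is a placeholder) is to find the stuck pod and remove it so its share of the quota is freed:

# Look for the old pod stuck in DeadlineExceeded / Terminating
oc get pods -n <user>-jenkins
# Removing it should let the replication controller bring up the new Jenkins pod
oc delete pod jenkins-1-deploy -n <user>-jenkins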

@ppitonak (Collaborator, Author) commented Dec 3, 2018

What is the next step?

@ljelinkova (Collaborator)

This issue still occurs and affects the E2E tests, so I'd like to see some suggestions on what we could do about it.

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1169/console

@ljelinkova (Collaborator)

This log also looks interesting:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1724/oc-jenkins-logs.txt

Extract:

oc get all
NAME                  READY     STATUS        RESTARTS   AGE
pod/jenkins-1-xz7b7   0/1       Terminating   0          1m
.......
kubelet, ip-172-21-55-225.ec2.internal   Killing container with id docker://jenkins:Need to kill Pod
1m          1m           1         jenkins-1-xz7b7.156da5ae712564e6    Pod                                                   Normal    Scheduled                     default-scheduler                        Successfully assigned osio-ci-e2e-002-jenkins/jenkins-1-xz7b7 to ip-172-21-50-152.ec2.internal
1m          1m           1         jenkins-1.156da5ae6ef05c05          ReplicationController                                 Normal    SuccessfulCreate              replication-controller                   Created pod: jenkins-1-xz7b7
1m          1m           1         jenkins-1.156da5ae77087b9d          ReplicationController                                 Normal    SuccessfulDelete              replication-controller                   Deleted pod: jenkins-1-xz7b7
1m          22m          2         jenkins.156da48c68a94215            DeploymentConfig                                      Normal    ReplicationControllerScaled   deploymentconfig-controller              Scaled replication controller "jenkins-1" from 1 to 0
1m          1m           1         jenkins-1.156da5aea63ff670          ReplicationController                                 Warning   FailedCreate                  replication-controller                   Error creating: pods "jenkins-1-r9xt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

@piyush-garg (Collaborator)

@ljelinkova This is hitting the resource quota for sure. But the log doesn't show any other pod, so I don't know exactly how the quota is being exceeded. This is something we need to debug.
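
One way to narrow it down (a sketch; the namespace is a placeholder) is to replay the event history around an unidle attempt, since the FailedCreate events show exactly what the quota controller saw:

# Recent events in the namespace, oldest first, to see the scale-up / FailedCreate sequence
oc get events -n <user>-jenkins --sort-by=.lastTimestamp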

@ljelinkova (Collaborator)

@piyush-garg The number of failed e2e tests is increasing, so please give this a high priority.

@ppitonak (Collaborator, Author) commented Dec 6, 2018

Happened twice in the last 40 minutes:
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4272/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1181/console

Jenkins seems to unidle properly and start the job:
[Screenshot: osio_ip_1]
But it failed to promote the build:
[Screenshot: osio_ip_2]
And accessing the Jenkins UI directly resulted in what this issue is about:
[Screenshot: osio_ip_3]

Jenkins pod is still running and there are no unusual OpenShift events.

pod/jenkins-1-sv9bp                       1/1       Running     0          21m

This part of the Jenkins pod log could be useful:

INFO: Terminating Kubernetes instance for agent jenkins-slave-2b956-qb9ft
Dec 06, 2018 3:18:50 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: Computer.threadPoolForRemoting [#32] for jenkins-slave-2b956-qb9ft terminated
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
	at hudson.remoting.Channel.close(Channel.java:1450)
	at hudson.remoting.Channel.close(Channel.java:1403)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:821)
	at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:105)
	at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:737)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

ppitonak added the priority/P1 Critical label and removed the priority/P3 Medium label on Dec 6, 2018
@ppitonak (Collaborator, Author) commented Dec 6, 2018

Another failure, the same story. There were three successful runs before this one, so it's probably not caused by something broken in a previous run.
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4274/

@ppitonak (Collaborator, Author) commented Dec 7, 2018

I have more information about what is going on.

  • Dec 6, 2018 14:40:00 build 4273 is started and succeeds, account is reset, nothing unusual in logs
  • Dec 6, 2018 15:40:00 build 4274 is started
  • Dec 6, 2018 15:43:56 new space and project is created and pipeline starts
  • Dec 6, 2018 15:45:23 e2e test opens pipelines page in OSIO
  • Dec 6, 2018 15:53:06 Jenkins pod log shows a stacktrace like the one in my previous comment - WARNING: Computer.threadPoolForRemoting [#34] for jenkins-slave-fzpvb-tvc99 terminated
  • Dec 6, 2018 15:53:23.458 pipeline stops on promote stage, e2e test clicks on "Input required" button
  • Dec 6, 2018 15:53:23.616 first of many similar errors in browser console SEVERE https://jenkins.api.prod-preview.openshift.io/api/jenkins/start - Failed to load resource: the server responded with a status of 500 (Internal Server Error)
  • Dec 6, 2018 16:03:23 e2e test timeout - "Promote" button is not clickable
  • Dec 6, 2018 16:03:24 e2e test navigates to Jenkins log directly by URL - results in what is reported in this issue (Chrome error ERR_NAME_NOT_RESOLVED)
  • Dec 6, 2018 16:03:28 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod looks OK pod/jenkins-1-q9gfr 1/1 Running 0 1h
  • Dec 6, 2018 16:04:01 e2e test successfully resets the account

Then #3802 comes into play.

  • Dec 6, 2018 16:40:00 build 4275 is started, fails with unrelated issue
  • Dec 6, 2018 17:40:00 build 4276 is started
  • Dec 6, 2018 18:05:00 "View log" link on pipeline page is still not available after 21 minutes
  • Dec 6, 2018 18:05:00 e2e test navigates to Jenkins log directly by URL - results in what is reported in this issue (Chrome error ERR_NAME_NOT_RESOLVED)
  • Dec 6, 2018 18:05:11 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod has been live for 3 hours pod/jenkins-1-q9gfr 1/1 Running 0 3h
  • Dec 6, 2018 18:05:36 e2e test successfully resets the account
  • fast-forward in time
  • Dec 7, 2018 9:40:00 build 4292 is started
  • the same scenario as in build 4276
  • Dec 7, 2018 9:54:15 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod has been live for 19 hours pod/jenkins-1-q9gfr 1/1 Running 0 19h

Result:

  • the account is completely unusable
  • the user needs to go to the OpenShift console and manually delete all pipelines (a CLI sketch follows below)
  • (maybe not necessary) reset the account
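
A possible CLI equivalent of that manual cleanup (a sketch; in OSIO the pipelines live as OpenShift builds and build configs in the user's namespace, and the namespace name is a placeholder):

# Delete all pipeline builds, then their build configs, in the user's namespace
oc delete builds --all -n <user>
oc delete buildconfigs --all -n <user>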

@ppitonak (Collaborator, Author) commented Dec 7, 2018

When I click "See additional details in Jenkins", I see the following error in the browser instead of the Jenkins UI.

[Screenshot: osio_promote2]

{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}

@hrishin commented Dec 7, 2018

@ppitonak where are we seeing the following log?

{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}

@hrishin commented Dec 7, 2018

@ppitonak yeah, I can see it now with prod-preview Jenkins.

@bmicklea (Collaborator)

The SEV1 label was downgraded during the IC today because users shouldn't ever be resetting their environment, let alone frequently. We can't remove the reset from the E2E tests, though, because there's nothing that can replace it as a method of providing a clean environment for testing.

Separately the docs team needs to ensure that the onboarding doc sent to users does not reference the reset environment command: #4656

@ldimaggi (Collaborator)

Automated tests have to be able to establish a known (and "clean") starting point to ensure that the test results are not impacted by the environment. The environment reset function is problematic for human users, as it removes ALL of their assets, and problematic for automated tests, as it does not leave the system in a 100% usable state. An enhanced space clean/delete function would be a better approach - see #4657.

@piyush-garg (Collaborator)

As mentioned and explained in previous comments, this is a reset-environment issue and the platform team is taking care of it. I will add the platform team label and remove the build team label. If the build team needs to do something regarding this, please assign it back. Thanks

@MatousJobanek

Just a small update - the "fix" (from the tenant point of view) is done in fabric8-services/fabric8-tenant#714 - now I'm just waiting till the quay database is fixed so I can merge it and deploy it to prod-preview.

@alexeykazakov (Member)

The tenant master build is green again and the fix has been deployed to prod-preview.

@alexeykazakov (Member)

Should be fixed in fabric8-services/fabric8-tenant#714 (in prod-preview now).

@ppitonak (Collaborator, Author) commented Jan 3, 2019

We cannot judge whether the fix helped because of #4670.

@ljelinkova (Collaborator)

#4670 is resolved now.

I can see some failed jobs on prod-preview caused by Jenkins, for example:

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4957/oc-jenkins-logs.txt

However, this could also be caused by #4668. Could anybody have a look at #4668 so that it can be resolved and we can focus on this issue?

@ljelinkova (Collaborator)

#4670 is resolved, and it seems the fix on prod-preview helped to lower the number of occurrences. However, it still happens.

These are the latest logs:

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/5247/oc-jenkins-logs.txt

I can see in the logs that the readiness probe failed:

Readiness probe failed: Get http://10.130.24.191:8080/login: dial tcp 10.130.24.191:8080: connect: connection refused

And then the quota limit:

Error creating: pods "jenkins-1-pbmt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

@ppitonak (Collaborator, Author)

I would just add that there is another type of suspicious OpenShift event:

9m          9m           1         jenkins-slave-6x3x2-1g0z0.157a94c2abcd6233             Pod                                                              Warning   Evicted                       kubelet, ip-172-21-52-86.ec2.internal   The node was low on resource: ephemeral-storage. Container jnlp was using 684Ki, which exceeds its request of 0. Container maven was using 636Ki, which exceeds its request of 0. 

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1431/oc-jenkins-logs.txt
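
If these evictions ever need addressing on the build side, the event suggests the jnlp and maven containers request no ephemeral storage at all; an explicit request in the slave pod template might look like this (a sketch, values illustrative and untested):

# Hypothetical resources stanza for the slave containers
resources:
  requests:
    ephemeral-storage: 100Mi   # event above shows usage in the hundreds of KiB against a request of 0
  limits:
    ephemeral-storage: 1Gi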

@MatousJobanek @alexeykazakov who is working on this? Nobody is assigned to this issue.

@MatousJobanek commented Jan 17, 2019

Well, we (the platform team) were asked to try to resolve this issue by fixing the missing-PVC one. That issue has been fixed (or at least it hasn't been observed for a long time).

These "new" failures and suspicious events reported by @ppitonak as well as by @ljelinkova are of a different kind (if I'm not mistaken). Perhaps someone from the Build team could say what the cause is @piyush-garg @chmouel. If there is anything we could help with, please just let me know.

If there is anything I'm missing or I'm wrong about in my assumption, then I probably just need additional clarification.

@chmouel commented Jan 17, 2019

Those are different issues; I am not sure why they are dumped into this one. The one about eviction happened because the node (server) was over capacity, and we can't do much about that.

@ppitonak (Collaborator, Author)

@chmouel can you do something about the failing readiness probe?

@piyush-garg (Collaborator)

Hey @ppitonak

We can change the value of the readiness probe so that it does not fail: https://github.com/fabric8-services/fabric8-tenant/blob/master/environment/templates/fabric8-tenant-jenkins.yml#L793
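
For illustration, the kind of change meant here might look like this in that template (probe path and port taken from the failure log above; the timing values are hypothetical):

readinessProbe:
  httpGet:
    path: /login
    port: 8080
  initialDelaySeconds: 120   # hypothetical: give Jenkins longer to start before the first probe
  timeoutSeconds: 5
  failureThreshold: 10       # hypothetical: tolerate more failed probes before marking unready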

It may be affected by the Jenkins storage fix we added, which may have increased the Jenkins startup time. But as far as I know it looks like just an event/warning that doesn't affect anything; yes, sure, we can fix that.

cc @chmouel

Thanks

@chmouel commented Jan 18, 2019

I think it's better to fail fast than to just wait longer; the increase in startup time with storage should not be greater than a second.

In the case of that other bug, it happened because a node eviction was going on (out of resources), and just waiting longer would not fix it.

We could potentially handle this in the pipeline by detecting errors and retrying the pipeline, but the detection is going to be fragile and may not help much.

Can I respectfully ask (again) to log a new issue instead of dumping everything on this one, or just rename this issue to something like "OpenShift uber tracking instability issue" with a Cc to the whole OpenShift group in the assignees (this is a sarcastic joke)?

@ppitonak (Collaborator, Author)

@chmouel I am happy to report a new issue when there is a new issue, but from my end-user point of view it's still the same symptom, which could have one or more root causes.

The two links in the issue description contain these clues:

  • jenkins-home PVC problem
  • readiness probe failed
  • exceeded quota

Later we identified the problem with evicted pods, and I agree with you that it's unrelated/not in our hands.

What should I report in a new issue?

@chmouel commented Jan 21, 2019

This seems to be an error where the pod didn't get out of the terminated state and got stuck:

oc get pods --field-selector=status.phase=Running -o name | grep -Ev 'slave|deploy' | grep -m1 jenkins
oc logs pod/jenkins-1-hkf75
Error from server (BadRequest): container "jenkins" in pod "jenkins-1-hkf75" is terminated
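
If the pod really is wedged like that, one possible (heavy-handed) way to clear it, sketched here rather than verified:

# Force-remove the stuck pod so the replication controller can start a fresh one
# (skips graceful termination; use with care)
oc delete pod jenkins-1-hkf75 --grace-period=0 --force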

@ppitonak (Collaborator, Author)

What is the progress on this issue? It's still appearing, e.g. http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1509/
