jenkins.http’s server IP address could not be found. #4598

ppitonak · 2018-11-29T09:46:39Z

Issue Overview

When user navigates to http://jenkins.openshift.io, he sees error page instead of Jenkins UI

Expected Behaviour

Jenkins UI is displayed

Current Behaviour

Error message

Steps To Reproduce

create a space, create a new app, wait for pipeline to start (might not be necessary)
open new tab and open http://jenkins.openshift.io

Additional Information

We saw it twice on us-east-1a-beta

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1634/ Nov 28, 2018 4:35:00 PM UTC
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1635/ Nov 28, 2018 6:35:00 PM UTC

We saw similar bug for api.openshift.io:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2-released/14962/01-01-afterEach.png Nov 8, 2018 4:28:00 AM UTC
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/14953/01-01-afterEach.png Nov 8, 2018 4:32:00 AM UTC

The text was updated successfully, but these errors were encountered:

piyush-garg · 2018-11-30T10:23:41Z

@ppitonak I am not able to reproduce this issue from my account.

ppitonak · 2018-11-30T12:00:23Z

Happened again yesterday, again on us-east-1a with the same account.

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1647/ Nov 29, 2018 6:35:00 PM

Out of 22 runs, 3 failed with this error. The job runs every 2 hours.

ppitonak · 2018-11-30T13:26:41Z

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1656/ Nov 30, 2018 12:35:00 PM

... again the same cluster

ljelinkova · 2018-12-03T11:04:56Z

I've seen this error on all clusters 1-2 times during weekend, for example:

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1b-released/1159/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1152/console

There is one thing commong in oc logs

oc get all
NAME                   READY     STATUS             RESTARTS   AGE
pod/jenkins-1-deploy   0/1       DeadlineExceeded   0          3h

Usually, there are no events in the log, maybe because the jobs run only every 4 hours. However, in one job, I've seen quite a lot of log events like this:

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/oc-jenkins-logs.txt

 Error creating: pods "jenkins-1-j5j6z" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

piyush-garg · 2018-12-03T11:27:58Z

This seems like idle is trying to unidle, but pod did not come up because of resource quota and old pod stuck in DeadlineExceeded state

ppitonak · 2018-12-03T15:29:28Z

what is the next step?

ljelinkova · 2018-12-05T08:01:41Z

This issue still occurs and affects the E2E tests so I'd also like to see some suggestions what we could do about it.

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1169/console

ljelinkova · 2018-12-06T08:52:11Z

@chmouel @stevengutz @ldimaggi

ljelinkova · 2018-12-06T09:22:00Z

This log looks also interesting
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1724/oc-jenkins-logs.txt

Extract:

oc get all
NAME                  READY     STATUS        RESTARTS   AGE
pod/jenkins-1-xz7b7   0/1       Terminating   0          1m
.......
kubelet, ip-172-21-55-225.ec2.internal   Killing container with id docker://jenkins:Need to kill Pod
1m          1m           1         jenkins-1-xz7b7.156da5ae712564e6    Pod                                                   Normal    Scheduled                     default-scheduler                        Successfully assigned osio-ci-e2e-002-jenkins/jenkins-1-xz7b7 to ip-172-21-50-152.ec2.internal
1m          1m           1         jenkins-1.156da5ae6ef05c05          ReplicationController                                 Normal    SuccessfulCreate              replication-controller                   Created pod: jenkins-1-xz7b7
1m          1m           1         jenkins-1.156da5ae77087b9d          ReplicationController                                 Normal    SuccessfulDelete              replication-controller                   Deleted pod: jenkins-1-xz7b7
1m          22m          2         jenkins.156da48c68a94215            DeploymentConfig                                      Normal    ReplicationControllerScaled   deploymentconfig-controller              Scaled replication controller "jenkins-1" from 1 to 0
1m          1m           1         jenkins-1.156da5aea63ff670          ReplicationController                                 Warning   FailedCreate                  replication-controller                   Error creating: pods "jenkins-1-r9xt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

ljelinkova · 2018-12-06T09:22:45Z

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1723/oc-jenkins-logs.txt

piyush-garg · 2018-12-06T09:28:25Z

@ljelinkova This is reaching resource quota for sure. But it's not showing any other pod, so don't exactly know how it's reaching resource quota. This is something we need to debug.

ljelinkova · 2018-12-06T09:36:33Z

@piyush-garg The number of failed e2e tests is increasing so give it a high priority please.

ppitonak · 2018-12-06T15:50:50Z

Happened twice in last 40 minutes:
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/4272/console
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2a-released/1181/console

Jenkins seems to unidle properly and start the job:

But failed to promote the builld:

And accessing the Jenkins UI directly resulted in what is this issue about:

Jenkins pod is still running and there are no unusual OpenShift events.

pod/jenkins-1-sv9bp                       1/1       Running     0          21m

This part of Jenkins pod log could be useful:

INFO: Terminating Kubernetes instance for agent jenkins-slave-2b956-qb9ft
Dec 06, 2018 3:18:50 PM jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed
WARNING: Computer.threadPoolForRemoting [#32] for jenkins-slave-2b956-qb9ft terminated
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:209)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832)
	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200)
	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213)
	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800)
	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173)
	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:314)
	at hudson.remoting.Channel.close(Channel.java:1450)
	at hudson.remoting.Channel.close(Channel.java:1403)
	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:821)
	at hudson.slaves.SlaveComputer.access$800(SlaveComputer.java:105)
	at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:737)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

ppitonak · 2018-12-06T16:11:31Z

Another failure, the same story. Three successful runs before this one so probably not caused by something broken in previous run.
https://ci.centos.org/job/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4274/

ppitonak · 2018-12-07T10:20:31Z

I have more information about what is going on.

Dec 6, 2018 14:40:00 build 4273 is started and succeeds, account is reset, nothing unusual in logs
Dec 6, 2018 15:40:00 build 4274 is started
Dec 6, 2018 15:43:56 new space and project is created and pipeline starts
Dec 6, 2018 15:45:23 e2e test opens pipelines page in OSIO
Dec 6, 2018 15:53:06 Jenkins pod log shows the stacktrace like in my previous comment - WARNING: Computer.threadPoolForRemoting [#34] for jenkins-slave-fzpvb-tvc99 terminated
Dec 6, 2018 15:53:23.458 pipeline stops on promote stage, e2e test clicks on "Input required" button
Dec 6, 2018 15:53:23.616 first of many similar errors in browser console SEVERE https://jenkins.api.prod-preview.openshift.io/api/jenkins/start - Failed to load resource: the server responded with a status of 500 (Internal Server Error)
Dec 6, 2018 16:03:23 e2e test timeout - "Promote" button is not clickable
Dec 6, 2018 16:03:24 e2e test navigates to Jenkins log directly by URL - results in what is reported in this issue (Chrome error ERR_NAME_NOT_RESOLVED)
Dec 6, 2018 16:03:28 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod looks OK pod/jenkins-1-q9gfr 1/1 Running 0 1h
Dec 6, 2018 16:04:01 e2e test successfully resets the account

Then #3802 goes on stage.

Dec 6, 2018 16:40:00 build 4275 is started, fails with unrelated issue
Dec 6, 2018 17:40:00 build 4276 is started
Dec 6, 2018 18:05:00 "View log" link on pipeline page is still not available after 21 minutes
Dec 6, 2018 18:05:00 e2e test navigates to Jenkins log directly by URL - results in what is reported in this issue (Chrome error ERR_NAME_NOT_RESOLVED)
Dec 6, 2018 18:05:11 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod has been live for 3 hours pod/jenkins-1-q9gfr 1/1 Running 0 3h
Dec 6, 2018 18:05:36 e2e test successfully resets the account
fast-forward in time
Dec 7, 2018 9:40:00 build 4292 is started
the same scenario as in build 4276
Dec 7, 2018 9:54:15 e2e test gathers various logs - nothing unusual in last 15 minutes, Jenkins pod has been live for 19 hours pod/jenkins-1-q9gfr 1/1 Running 0 19h

result:

the account is completely unusable
user needs to go to OpenShift console and manually delete all pipelines
(maybe not necessary) reset account

ppitonak · 2018-12-07T12:04:30Z

When I click the " See additional details in jenkins", I see the following error in browser instead of Jenkins UI.

{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}

hrishin · 2018-12-07T12:31:13Z

@ppitonak from where we are seeing the following log?

{"Errors":[{"code":"500","detail":"Error when starting Jenkins: 2: openshift client error: got status 401 Unauthorized (401) from https://api.starter-us-east-2a.openshift.com/oapi/v1/namespaces/ppitonak-preview-jenkins/deploymentconfigs/jenkins"}]}

hrishin · 2018-12-07T12:37:27Z

@ppitonak yeah, can see now with prod-preview Jenkins.

bmicklea · 2018-12-19T13:55:34Z

SEV1 label downgraded during the IC today due to the fact that users shouldn't be doing reset environment ever, let alone frequently. We can't remove it from the E2E tests though because there's nothing that can replace it as a method to provide a clean environment for testing.

Separately the docs team needs to ensure that the onboarding doc sent to users does not reference the reset environment command: #4656

ldimaggi · 2018-12-19T14:29:22Z

Automated tests have to be able to establish a known (and "clean") starting point to ensure that the test results are not impacted by the environment. The environment reset function is problematic for human users as it removes ALL their assets and is problematic for automated tests as it does not leave the system in a 100% usable state. An enhanced space clean/delete function would be a better approach - see: #4657

piyush-garg · 2018-12-20T14:11:26Z

As mentioned in previous comments and also explained, this is reset environment and platform team is taking care/looking at it. I will add the platform team label and remove build team label. If build team needs to do something regarding this, please assign back. Thanks

MatousJobanek · 2018-12-21T17:00:38Z

Just a small update - the "fix" (from the tenant point of view) is done fabric8-services/fabric8-tenant#714 - now I'm just waiting till the quay database is fixed, so I can merge it and deploy it to prod-preview.

alexeykazakov · 2018-12-21T17:17:20Z

The tenant master build is green again and the fix has been deployed to prod-preview.

alexeykazakov · 2019-01-02T15:40:02Z

Should be fixed in fabric8-services/fabric8-tenant#714 (in prod-preview now).

ppitonak · 2019-01-03T16:41:48Z

We cannot judge if the fix helped because of #4670

ljelinkova · 2019-01-04T07:58:05Z

#4670 is resolved now.

I can see some failed jobs on prod-preview caused by Jenkins, for example.

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/4957/oc-jenkins-logs.txt

However, this could be also caused by #4668 . Could anybody have a look at #4668 so that it can be resolved and we can focus on this issue?

ljelinkova · 2019-01-16T09:21:43Z

The #4670 is resolved and it seems that the fix on prod-prev helped to lower the number of occurrences. However, it still happens.

These are the last logs

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-beta/5247/oc-jenkins-logs.txt

I can see in the logs that readiness probe failed

Readiness probe failed: Get http://10.130.24.191:8080/login: dial tcp 10.130.24.191:8080: connect: connection refused

And then the quota limit

Error creating: pods "jenkins-1-pbmt7" is forbidden: exceeded quota: compute-resources, requested: limits.cpu=2,limits.memory=1Gi, used: limits.cpu=2,limits.memory=1Gi, limited: limits.cpu=2,limits.memory=1Gi

ppitonak · 2019-01-17T08:32:31Z

I would just add that there is another type of suspicious OpenShift events:

9m          9m           1         jenkins-slave-6x3x2-1g0z0.157a94c2abcd6233             Pod                                                              Warning   Evicted                       kubelet, ip-172-21-52-86.ec2.internal   The node was low on resource: ephemeral-storage. Container jnlp was using 684Ki, which exceeds its request of 0. Container maven was using 636Ki, which exceeds its request of 0.

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1431/oc-jenkins-logs.txt

@MatousJobanek @alexeykazakov who is working on this? Nobody is assigned to this issue.

MatousJobanek · 2019-01-17T10:10:21Z

Well, we (platform team) were asked to try to resolve this issue by fixing the missing PVC one. That issue has been fixed (or at least it's not been observed for a long time).

These "new" failures and suspicious events reported by @ppitonak as well as by @ljelinkova are of a different kind (if I'm not mistaken). Anyone from the Build team could say what the cause is @piyush-garg @chmouel. If there is anything we could help with, please just let me know.

If there is anything I'm missing or I'm wrong about with my assumption, then I just probably need additional clarification

chmouel · 2019-01-17T22:28:36Z

Those are different issues, I am not sure why they are dumped into this issue, the one about eviction is because the node (server) was at over capacity, we can't do much about this

ppitonak · 2019-01-18T07:11:30Z

@chmouel can you do something about failing readiness probe?

piyush-garg · 2019-01-18T07:29:22Z

Hey @ppitonak

We can change the value of readiness probe so that it should not fail https://github.com/fabric8-services/fabric8-tenant/blob/master/environment/templates/fabric8-tenant-jenkins.yml#L793

It may be affected by the Jenkins storage fix we have added, that may have increased the Jenkins startup time. But that like just like an event/warning not affecting anything in my knowledge, yes, sure we can fix that.

cc @chmouel

Thanks

chmouel · 2019-01-18T07:38:08Z

I think it's better to fail fast than trying to just wait more, the increase startup time with storage should not be greater than a second,

in the case of that other bug, this was because there was a node eviction going on (out of resources) and just waiting more is not fixing it,

we could potentially handle this in the pipeline to detect errors and retry the pipeline but the detection is going to be fragile and may not help much,

can I respectively ask (again) to log a new issue instead of dumping everything on this one or just rename this issue to something like "OpenShift uber tracking instability issue" with a Cc to the whole openshift group in the assignees (this is a sarcastic joke)

ppitonak · 2019-01-18T11:07:10Z

@chmouel I am happy to report a new issue when there is a new issue but from my end point of view, it's still the same symptom which could have one or more root causes.

The two links in issue description contain these clues:

jenkins-home PVC problem
readiness probe failed
exceeded quota

Later we identified the problem with evicted pods and I agree with you that it's unrelated/not in our hands.

What should I report in a new issue?

ljelinkova · 2019-01-21T11:13:54Z

The e2e tests failed on persistentvolumeclaim "jenkins-home" not found both on prod and prod-prev.

http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/2274/oc-jenkins-logs.txt
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-prod-preview.openshift.io-smoketest-pr-us-east-2a-released/5365/oc-jenkins-logs.txt

chmouel · 2019-01-21T11:16:32Z

This seems to be an error where the pod didn't get out of terminated and get stuck :

oc get pods --field-selector=status.phase=Running -o name | grep -v 'slave|deploy' | grep -m1 jenkins
oc logs pod/jenkins-1-hkf75
Error from server (BadRequest): container "jenkins" in pod "jenkins-1-hkf75" is terminated

ppitonak · 2019-01-30T09:46:13Z

What is the progress of this issue? It's still appearing, e.g. http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1509/

ppitonak added SEV2-high type/bug team/service-delivery team/build-cd area/jenkins priority/P3 Medium area/e2e-tests labels Nov 29, 2018

ppitonak added priority/P1 Critical and removed priority/P3 Medium labels Dec 6, 2018

ppitonak added SEV1-urgent and removed SEV2-high labels Dec 7, 2018

ppitonak mentioned this issue Dec 7, 2018

Upgrades idler: improves UserIdlerMap with concururrent safe map openshiftio/saas-openshiftio#1157

Closed

alexeykazakov removed the SEV1-urgent label Dec 19, 2018

ldimaggi mentioned this issue Dec 19, 2018

Space Deletion Function is not reliably deleting OpenShift resources #4657

Open

MatousJobanek mentioned this issue Dec 19, 2018

When there is an error while creating/updating ns create a new ones with a new base-name fabric8-services/fabric8-tenant#710

Closed

piyush-garg added team/platform and removed team/build-cd labels Dec 20, 2018

MatousJobanek mentioned this issue Dec 21, 2018

fix(#710): when creation/update of a namespace fails,create new ones with new base-name fabric8-services/fabric8-tenant#714

Merged

alexeykazakov added team/build-cd and removed team/platform team/service-delivery labels Jan 17, 2019

This was referenced Feb 25, 2019

jenkins-recommender-api-token not registered #4763

Open

Failed to Statfs "/proc/71658/ns/net": no such file or directory #4764

Open

jenkins.http’s server IP address could not be found. #4598

jenkins.http’s server IP address could not be found. #4598

Comments

ppitonak commented Nov 29, 2018

Issue Overview

Expected Behaviour

Current Behaviour

Steps To Reproduce

Additional Information

piyush-garg commented Nov 30, 2018

ppitonak commented Nov 30, 2018

ppitonak commented Nov 30, 2018

ljelinkova commented Dec 3, 2018

piyush-garg commented Dec 3, 2018 • edited Loading

ppitonak commented Dec 3, 2018

ljelinkova commented Dec 5, 2018

ljelinkova commented Dec 6, 2018

ljelinkova commented Dec 6, 2018

ljelinkova commented Dec 6, 2018

piyush-garg commented Dec 6, 2018

ljelinkova commented Dec 6, 2018

ppitonak commented Dec 6, 2018 • edited Loading

ppitonak commented Dec 6, 2018

ppitonak commented Dec 7, 2018 • edited Loading

ppitonak commented Dec 7, 2018

hrishin commented Dec 7, 2018

hrishin commented Dec 7, 2018

bmicklea commented Dec 19, 2018

ldimaggi commented Dec 19, 2018

piyush-garg commented Dec 20, 2018

MatousJobanek commented Dec 21, 2018

alexeykazakov commented Dec 21, 2018

alexeykazakov commented Jan 2, 2019

ppitonak commented Jan 3, 2019

ljelinkova commented Jan 4, 2019

ljelinkova commented Jan 16, 2019

ppitonak commented Jan 17, 2019

MatousJobanek commented Jan 17, 2019 • edited Loading

chmouel commented Jan 17, 2019

ppitonak commented Jan 18, 2019

piyush-garg commented Jan 18, 2019

chmouel commented Jan 18, 2019 • edited Loading

ppitonak commented Jan 18, 2019

ljelinkova commented Jan 21, 2019

chmouel commented Jan 21, 2019

ppitonak commented Jan 30, 2019

piyush-garg commented Dec 3, 2018 •

edited

Loading

ppitonak commented Dec 6, 2018 •

edited

Loading

ppitonak commented Dec 7, 2018 •

edited

Loading

MatousJobanek commented Jan 17, 2019 •

edited

Loading

chmouel commented Jan 18, 2019 •

edited

Loading