jenkins.http’s server IP address could not be found. #4598
@ppitonak I am not able to reproduce this issue from my account. |
Happened again yesterday, again on us-east-1a with the same account. https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1647/ (Nov 29, 2018 6:35:00 PM). Out of 22 runs, 3 failed with this error. The job runs every 2 hours. |
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1656/ Nov 30, 2018 12:35:00 PM ... again the same cluster |
I've seen this error on all clusters 1-2 times during the weekend, for example: https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1683/console There is one thing common in the oc logs
Usually, there are no events in the log, maybe because the jobs run only every 4 hours. However, in one job, I've seen quite a lot of log events like this:
|
This seems like the idler is trying to unidle Jenkins, but the pod did not come up because of the resource quota and the old pod is stuck in |
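A minimal sketch of how this could be confirmed from the tenant's Jenkins project (the `<username>-jenkins` namespace name is an assumption):

```sh
NS="<username>-jenkins"  # placeholder: the tenant's actual Jenkins project

# Show how much of the quota is consumed vs. the configured limits.
oc describe resourcequota -n "$NS"

# List all pods with their phases, including ones stuck in Terminating/Error.
oc get pods -n "$NS" -o wide

# Recent events usually carry the "exceeded quota" / FailedCreate messages.
oc get events -n "$NS" --sort-by=.lastTimestamp | tail -20
```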
what is the next step? |
This issue still occurs and affects the E2E tests, so I'd also like to see some suggestions on what we could do about it. https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-released/1169/console |
This log also looks interesting. Extract:
|
@ljelinkova This is reaching the resource quota for sure. But it's not showing any other pod, so I don't know exactly how it's reaching the resource quota. This is something we need to debug. |
@piyush-garg The number of failed e2e tests is increasing, so please give it a high priority. |
Happened twice in the last 40 minutes: Jenkins seems to unidle properly and start the job: the Jenkins pod is still running and there are no unusual OpenShift events.
This part of the Jenkins pod log could be useful:
|
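For future occurrences it might help to capture the pod log and the surrounding events in one go; a minimal sketch (the namespace and the pod selector are assumptions, not the exact commands the CI job runs):

```sh
NS="<username>-jenkins"  # placeholder: the tenant's Jenkins project

# Pick the Jenkins master pod, skipping slave and deploy pods.
POD=$(oc get pods -n "$NS" -o name | grep -Ev 'slave|deploy' | grep -m1 jenkins)

# Save the pod log and the namespace events next to the other test artifacts.
oc logs -n "$NS" "$POD" > jenkins-pod.log
oc get events -n "$NS" --sort-by=.lastTimestamp > jenkins-events.log
```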
Another failure, the same story. Three successful runs before this one, so it's probably not caused by something broken in the previous run. |
I have more information about what is going on.
Then #3802 comes into play.
result:
|
When I click "See additional details in Jenkins", I see the following error in the browser instead of the Jenkins UI.
|
@ppitonak where are we seeing the following log?
|
@ppitonak yeah, I can see it now with prod-preview Jenkins. |
Separately the docs team needs to ensure that the onboarding doc sent to users does not reference the reset environment command: #4656 |
Automated tests have to be able to establish a known (and "clean") starting point to ensure that the test results are not impacted by the environment. The environment reset function is problematic for human users as it removes ALL their assets and is problematic for automated tests as it does not leave the system in a 100% usable state. An enhanced space clean/delete function would be a better approach - see: #4657 |
As mentioned and explained in previous comments, this is the reset-environment issue and the platform team is taking care of it / looking into it. I will add the platform team label and remove the build team label. If the build team needs to do something regarding this, please assign it back. Thanks |
Just a small update - the "fix" (from the tenant point of view) is done in fabric8-services/fabric8-tenant#714 - now I'm just waiting until the quay database is fixed so I can merge it and deploy it to prod-preview. |
The tenant master build is green again and the fix has been deployed to prod-preview. |
Should be fixed in fabric8-services/fabric8-tenant#714 (in prod-preview now). |
We cannot judge if the fix helped because of #4670 |
#4670 is resolved and it seems that the fix on prod-preview helped to lower the number of occurrences. However, it still happens. These are the last entries I can see in the logs showing that the readiness probe failed
And then the quota limit
|
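A rough way to pull just these two kinds of events out of the namespace (again, the project name is an assumption):

```sh
NS="<username>-jenkins"  # placeholder: the tenant's Jenkins project

# Readiness-probe failures are reported as "Unhealthy" events,
# quota problems as FailedCreate / "exceeded quota" messages.
oc get events -n "$NS" --sort-by=.lastTimestamp \
  | grep -Ei 'unhealthy|failedcreate|exceeded quota'
```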
I would just add that there is another type of suspicious OpenShift event:
@MatousJobanek @alexeykazakov who is working on this? Nobody is assigned to this issue. |
Well, we (the platform team) were asked to try to resolve this issue by fixing the missing-PVC one. That issue has been fixed (or at least it has not been observed for a long time). These "new" failures and suspicious events reported by @ppitonak as well as by @ljelinkova are of a different kind (if I'm not mistaken). Could anyone from the Build team say what the cause is, @piyush-garg @chmouel? If there is anything we could help with, please just let me know. If there is anything I'm missing or I'm wrong about in my assumption, then I probably just need additional clarification. |
Those are different issues; I am not sure why they are dumped into this one. The one about eviction is because the node (server) was over capacity; we can't do much about that. |
@chmouel can you do something about the failing readiness probe? |
Hey @ppitonak, we can change the value of the readiness probe so that it does not fail: https://github.com/fabric8-services/fabric8-tenant/blob/master/environment/templates/fabric8-tenant-jenkins.yml#L793 It may be affected by the Jenkins storage fix we added, which may have increased the Jenkins startup time. But to my knowledge that is just an event/warning not affecting anything; yes, sure, we can fix that. cc @chmouel Thanks |
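As a quick experiment before changing the fabric8-tenant template, the probe could be relaxed on a live deployment; a sketch only, assuming the DeploymentConfig is named `jenkins`, that the readiness check is an HTTP GET on the login page, and with illustrative values rather than the template's actual ones:

```sh
NS="<username>-jenkins"  # placeholder: the tenant's Jenkins project

# Give the Jenkins container more time before the first readiness check
# and tolerate more failed checks before marking the pod unready.
oc set probe dc/jenkins -n "$NS" --readiness \
  --get-url=http://:8080/login \
  --initial-delay-seconds=120 \
  --timeout-seconds=10 \
  --failure-threshold=10
```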
I think it's better to fail fast than to just wait more; the increased startup time with storage should not be greater than a second. In the case of that other bug, this was because there was a node eviction going on (out of resources) and just waiting more is not fixing it. We could potentially handle this in the pipeline to detect errors and retry the pipeline, but the detection is going to be fragile and may not help much. Can I respectfully ask (again) to log a new issue instead of dumping everything on this one, or just rename this issue to something like "OpenShift uber tracking instability issue" with a Cc to the whole openshift group in the assignees (this is a sarcastic joke) |
@chmouel I am happy to report a new issue when there is a new issue, but from my point of view it's still the same symptom, which could have one or more root causes. The two links in the issue description contain these clues:
Later we identified the problem with evicted pods and I agree with you that it's unrelated/not in our hands. What should I report in a new issue? |
The e2e tests failed on http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/2274/oc-jenkins-logs.txt |
This seems to be an error where the pod didn't get out of terminated and got stuck: oc get pods --field-selector=status.phase=Running -o name | grep -Ev 'slave|deploy' | grep -m1 jenkins
oc logs pod/jenkins-1-hkf75
Error from server (BadRequest): container "jenkins" in pod "jenkins-1-hkf75" is terminated |
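If the pod really is wedged in a terminated state, one possible manual cleanup (a sketch; the pod name is the one from the log above and the project name is an assumption):

```sh
NS="<username>-jenkins"  # placeholder: the tenant's Jenkins project

# Remove the wedged pod so the replication controller can replace it.
oc delete pod jenkins-1-hkf75 -n "$NS" --grace-period=0 --force

# Trigger a fresh deployment if nothing comes back on its own.
oc rollout latest dc/jenkins -n "$NS"
```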
What is the progress of this issue? It's still appearing, e.g. http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/1509/ |
Issue Overview
When a user navigates to http://jenkins.openshift.io, they see an error page instead of the Jenkins UI
Expected Behaviour
Jenkins UI is displayed
Current Behaviour
Error message
![05-04-jenkins-direct-log](https://user-images.githubusercontent.com/445100/49212684-ea9a8400-f3c2-11e8-8a50-46a9984993cc.png)
Steps To Reproduce
Additional Information
We saw it twice on us-east-1a-beta
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1634/ Nov 28, 2018 4:35:00 PM UTC
https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-1a-beta/1635/ Nov 28, 2018 6:35:00 PM UTC
We saw a similar bug for api.openshift.io:
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2-released/14962/01-01-afterEach.png Nov 8, 2018 4:28:00 AM UTC
http://artifacts.ci.centos.org/devtools/e2e/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/14953/01-01-afterEach.png Nov 8, 2018 4:32:00 AM UTC
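Since the browser error ("server IP address could not be found") points at name resolution rather than at the route or the pod itself, a rough first check from the test runner could be (a sketch; the tenant project name is an assumption):

```sh
# Does the hostname resolve at all from the machine running the tests?
nslookup jenkins.openshift.io

# Can we reach it over HTTP at all (status line and first headers only)?
curl -sSIL https://jenkins.openshift.io | head -5

# Is the tenant's Jenkins route still present?
oc get route -n "<username>-jenkins" -o wide
```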