OCPBUGS-45924: add a monitor test that detects concurrent installer pods #29382
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tkashem
16844b1 to f6f032f
Job Failure Risk Analysis for sha: f6f032f
f6f032f to d770b9f
Job Failure Risk Analysis for sha: d770b9f
d770b9f to 36711ce
Job Failure Risk Analysis for sha: 36711ce
/payload
/payload 4.18 nightly informing
@tkashem: trigger 68 job(s) of type informing for the nightly release of OCP 4.18
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dfdef9b0-bdbf-11ef-98ca-f9166c462206-0
@tkashem: This pull request references Jira Issue OCPBUGS-45924, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
/retest
/retest
@tkashem: The following tests failed, say
Job Failure Risk Analysis for sha: a051b38
	pods: map[string]*podInfo{},
},
filter: func(interval monitorapi.Interval) bool {
	if ns, ok := interval.Locator.Keys[monitorapi.LocatorNamespaceKey]; !ok || ns != "openshift-etcd" {
imo this should be moved to openshift-etcd and renamed to include "etcd" prefix as we may want to later expand this to other installers like kube-apiserver
	endedAt = info.startedAt
	level = monitorapi.Error
}
if info.lastReason == "Killing" || info.concurrent {
"Killing" here means the installer failed with an error or took longer than the timeout?
Examine the events associated with the installer pods and do the following:
a) construct an e2e timeline
b) detect if installer pods are running concurrently on two nodes, and return a flaking test
We want to know how widespread b is.
Installer pod timeline:
The test will flake if it finds concurrent installer pods on two or more nodes. This is how it would look (simulated, not an actual occurrence):