OCPBUGS-45924: add a monitor test that detects concurrent installer pods #29382

tkashem · 2024-12-17T19:49:10Z

Examine the events associated with the installer pods and do the following:
a) construct an end-to-end timeline
b) detect whether installer pods are running concurrently on two nodes, and if so report the test as a flake

We want to know how widespread (b) is.
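The concurrency check in (b) can be sketched as a pairwise interval-overlap test. This is a minimal illustration and not the code in this PR: the `podRun` type and the `overlaps` and `concurrentOnDifferentNodes` helpers are hypothetical names, and a pod with no observed end time is assumed to still be running.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// podRun is a hypothetical record of one installer pod's lifetime;
// it is not the type used in the PR.
type podRun struct {
	Node, Name string
	Start, End time.Time // a zero End means no end event was observed
}

// overlaps reports whether a and b were running at the same time.
// A run with a zero End is treated as open-ended (assumption).
func overlaps(a, b podRun) bool {
	aOpen, bOpen := a.End.IsZero(), b.End.IsZero()
	return (bOpen || a.Start.Before(b.End)) && (aOpen || b.Start.Before(a.End))
}

// concurrentOnDifferentNodes returns every pair of runs that overlap in
// time while being placed on different nodes, ordered by start time.
func concurrentOnDifferentNodes(runs []podRun) [][2]podRun {
	sorted := append([]podRun(nil), runs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start.Before(sorted[j].Start) })
	var pairs [][2]podRun
	for i := range sorted {
		for j := i + 1; j < len(sorted); j++ {
			if sorted[i].Node != sorted[j].Node && overlaps(sorted[i], sorted[j]) {
				pairs = append(pairs, [2]podRun{sorted[i], sorted[j]})
			}
		}
	}
	return pairs
}

func main() {
	parse := func(s string) time.Time { v, _ := time.Parse(time.RFC3339, s); return v }
	runs := []podRun{
		{Node: "master-0", Name: "installer-9-master-0", Start: parse("2024-12-18T16:11:21Z")},
		{Node: "master-1", Name: "installer-9-master-1", Start: parse("2024-12-18T16:13:07Z")},
	}
	for _, p := range concurrentOnDifferentNodes(runs) {
		fmt.Printf("concurrent: %s and %s\n", p[0].Name, p[1].Name)
	}
}
```

The pairwise loop is quadratic, which should be acceptable for the handful of installer pods seen in a single run; a sweep over sorted start and end times would be the alternative if the number of pods were large.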

installer pod timeline:

The test will flake if it finds concurrent installer pods on two or more nodes. This is what the output would look like (simulated, not an actual occurrence):

: [sig-apimachinery] installer Pods should not run concurrently on two or more node
{  
A(2024-12-18T16:11:21Z -> 0001-01-01T00:00:00Z) B(2024-12-18T16:13:07Z -> 0001-01-01T00:00:00Z):

A: node(ci-op-54qd4d73-03fd1-cl265-master-0) name(installer-9-ci-op-54qd4d73-03fd1-cl265-master-0) namespace(openshift-etcd) reason() started(2024-12-18T16:11:21Z) duration: -2562047h47m16.854775808s
B: node(ci-op-54qd4d73-03fd1-cl265-master-1) name(installer-9-ci-op-54qd4d73-03fd1-cl265-master-1) namespace(openshift-etcd) reason() started(2024-12-18T16:13:07Z) duration: -2562047h47m16.854775808s
}
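The `duration: -2562047h47m16.854775808s` value in the simulated output is what Go's `time.Time.Sub` returns when the end time is still the zero value (`0001-01-01T00:00:00Z`): the difference does not fit in a `time.Duration`, so the result saturates at the minimum duration. The sketch below reproduces that and shows one possible way to guard against it. The `podWindow` type and its `elapsed` method are hypothetical and not part of this PR.

```go
package main

import (
	"fmt"
	"time"
)

// podWindow is a hypothetical pair of timestamps for one installer pod.
type podWindow struct {
	Started, Ended time.Time
}

// elapsed returns how long the pod ran and whether it has finished.
// A zero Ended means no end event was observed, so the duration is
// measured up to now instead of against the zero time.
func (w podWindow) elapsed(now time.Time) (time.Duration, bool) {
	if w.Ended.IsZero() {
		return now.Sub(w.Started), false
	}
	return w.Ended.Sub(w.Started), true
}

func main() {
	w := podWindow{Started: time.Date(2024, 12, 18, 16, 11, 21, 0, time.UTC)}

	// Zero time minus a 2024 timestamp overflows int64 nanoseconds,
	// so Sub saturates at the minimum time.Duration.
	fmt.Println(w.Ended.Sub(w.Started)) // -2562047h47m16.854775808s

	d, done := w.elapsed(w.Started.Add(90 * time.Second))
	fmt.Println(d, done) // 1m30s false
}
```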

openshift-ci · 2024-12-17T19:50:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tkashem
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-trt · 2024-12-18T05:09:51Z

Job Failure Risk Analysis for sha: f6f032f

Job Name	Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn	IncompleteTests Tests for this run (20) are below the historical average (482): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-serial	Medium [sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret is deleted [Suite:openshift/conformance/serial] This test has passed 97.30% of 74 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-serial'] in the last 14 days. --- [sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret expiry annotation is changed [Suite:openshift/conformance/serial] This test has passed 97.30% of 74 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-serial'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial	Low [sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret expiry annotation is changed [Suite:openshift/conformance/serial] This test has passed 78.57% of 14 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:none] in the last week. --- [sig-storage][Feature:Cluster-CSI-Snapshot-Controller-Operator][Serial][apigroup:operator.openshift.io] should restart webhook Pods if csi-snapshot-webhook-secret is deleted [Suite:openshift/conformance/serial] This test has passed 78.57% of 14 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:none] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout	Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 57.14% of 7 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.

openshift-trt · 2024-12-18T19:13:10Z

Job Failure Risk Analysis for sha: d770b9f

Job Name	Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn	IncompleteTests Tests for this run (20) are below the historical average (446): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node	Low [sig-node] static pods should start after being created This test has passed 71.11% of 90 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node'] in the last 14 days.

tkashem · 2024-12-19T00:06:48Z

installer pod timeline:

from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29382/pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade/1869478174232940544

openshift-trt · 2024-12-19T00:13:24Z

Job Failure Risk Analysis for sha: 36711ce

Job Name	Failure Risk
pull-ci-openshift-origin-master-okd-scos-e2e-aws-ovn	IncompleteTests Tests for this run (20) are below the historical average (194): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

tkashem · 2024-12-19T04:14:02Z

/payload

openshift-ci · 2024-12-19T04:14:06Z

@tkashem: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

tkashem · 2024-12-19T04:15:25Z

/payload 4.18 nightly informing

openshift-ci · 2024-12-19T04:15:28Z

@tkashem: trigger 68 job(s) of type informing for the nightly release of OCP 4.18

periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-compact-fips
periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-ha-dualstack-conformance
periodic-ci-openshift-release-master-nightly-4.18-e2e-agent-single-node-ipv6
periodic-ci-openshift-release-master-nightly-4.18-console-aws
periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-aws
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-csi
periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-cgroupsv2
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-fips
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-csi
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-serial
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-techpreview
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node-techpreview-serial
periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-aws-upgrade-ovn-single-node
periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-upgrade-out-of-change
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-upi
periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-azure
periodic-ci-openshift-release-master-nightly-4.18-e2e-azure-csi
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-serial
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-techpreview-serial
periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade-out-of-change
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-driver-toolkit
periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-gcp
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn
periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-csi
periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-rt
periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-serial
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-techpreview-serial
periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.18-e2e-gcp-ovn-upgrade
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-bm-upgrade
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-dualstack-techpreview
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-ipv6-techpreview
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-serial-ipv4
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-serial-virtualmedia
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-techpreview
periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-metal-ipi-ovn-upgrade
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-serial-ovn-ipv6
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-serial-ovn-dualstack
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-upgrade-ovn-ipv6
periodic-ci-openshift-release-master-nightly-4.18-upgrade-from-stable-4.17-e2e-metal-ipi-upgrade-ovn-ipv6
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ovn-assisted
periodic-ci-openshift-release-master-nightly-4.18-metal-ovn-single-node-recert-cluster-rename
periodic-ci-openshift-osde2e-main-nightly-4.18-osd-aws
periodic-ci-openshift-release-master-nightly-4.19-e2e-osd-ccs-gcp
periodic-ci-openshift-osde2e-main-nightly-4.18-osd-gcp
periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-proxy
periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ovn-single-node-live-iso
periodic-ci-openshift-release-master-nightly-4.18-e2e-rosa-sts-ovn
periodic-ci-openshift-osde2e-main-nightly-4.18-rosa-classic-sts
periodic-ci-openshift-release-master-nightly-4.18-e2e-rosa-sts-hypershift-ovn
periodic-ci-openshift-release-master-nightly-4.18-e2e-telco5g
periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-aws-ovn-upgrade
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-csi
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-serial
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-techpreview-serial
periodic-ci-openshift-release-master-ci-4.18-e2e-vsphere-ovn-upgrade
periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-vsphere-ovn-upgrade
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-upi
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-upi-serial
periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-static-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dfdef9b0-bdbf-11ef-98ca-f9166c462206-0

openshift-ci-robot · 2024-12-19T14:08:47Z

@tkashem: This pull request references Jira Issue OCPBUGS-45924, which is invalid:

expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

tkashem · 2024-12-20T21:17:04Z

/retest

tkashem · 2024-12-22T14:29:40Z

/retest

openshift-ci · 2024-12-22T18:04:22Z

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-upgrade	`a051b38`	link	false	`/test e2e-aws-ovn-upgrade`
ci/prow/e2e-aws-ovn-single-node-upgrade	`a051b38`	link	false	`/test e2e-aws-ovn-single-node-upgrade`
ci/prow/e2e-agnostic-ovn-cmd	`a051b38`	link	false	`/test e2e-agnostic-ovn-cmd`
ci/prow/verify	`a051b38`	link	true	`/test verify`
ci/prow/unit	`a051b38`	link	true	`/test unit`
ci/prow/e2e-aws-ovn-single-node-serial	`a051b38`	link	false	`/test e2e-aws-ovn-single-node-serial`
ci/prow/lint	`a051b38`	link	true	`/test lint`
ci/prow/e2e-metal-ipi-ovn-ipv6	`a051b38`	link	true	`/test e2e-metal-ipi-ovn-ipv6`
ci/prow/e2e-gcp-ovn-upgrade	`a051b38`	link	true	`/test e2e-gcp-ovn-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-trt · 2024-12-22T19:24:54Z

Job Failure Risk Analysis for sha: a051b38

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade	Low [sig-network] pods should successfully create sandboxes by other This test has passed 66.44% of 149 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:ha Upgrade:micro] in the last week.

vrutkovs · 2025-01-02T07:42:49Z

pkg/monitortests/kubeapiserver/installerpod/monitortest.go

+			pods: map[string]*podInfo{},
+		},
+		filter: func(interval monitorapi.Interval) bool {
+			if ns, ok := interval.Locator.Keys[monitorapi.LocatorNamespaceKey]; !ok || ns != "openshift-etcd" {


IMO this should be moved to openshift-etcd and renamed to include an "etcd" prefix, as we may later want to expand this to other installers like kube-apiserver.
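One way to keep the filter reusable for other installers would be to pass the namespaces in rather than hard-coding `"openshift-etcd"`. The following is only a sketch under assumptions: `interval` is a simplified stand-in for the monitor interval type, the `"namespace"` key string stands in for `monitorapi.LocatorNamespaceKey`, and `namespaceFilter` is a hypothetical helper. It also assumes the filter returns true for intervals that should be kept.

```go
package main

import "fmt"

// interval is a hypothetical stand-in for a monitor interval; only the
// locator keys needed for filtering are modelled here.
type interval struct {
	Keys map[string]string
}

// namespaceFilter returns a predicate that accepts only intervals whose
// "namespace" locator key matches one of the given namespaces.
func namespaceFilter(namespaces ...string) func(interval) bool {
	allowed := make(map[string]struct{}, len(namespaces))
	for _, ns := range namespaces {
		allowed[ns] = struct{}{}
	}
	return func(i interval) bool {
		ns, ok := i.Keys["namespace"]
		if !ok {
			return false
		}
		_, found := allowed[ns]
		return found
	}
}

func main() {
	etcdOnly := namespaceFilter("openshift-etcd")
	fmt.Println(etcdOnly(interval{Keys: map[string]string{"namespace": "openshift-etcd"}}))
	fmt.Println(etcdOnly(interval{Keys: map[string]string{"namespace": "openshift-kube-apiserver"}}))
}
```

With this shape, covering kube-apiserver installers later would only require adding its namespace to the call.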

vrutkovs · 2025-01-02T07:46:44Z

pkg/monitortests/kubeapiserver/installerpod/monitortest.go

+			endedAt = info.startedAt
+			level = monitorapi.Error
+		}
+		if info.lastReason == "Killing" || info.concurrent {


Does Killing here mean the installer failed with an error, or that it took longer than the timeout?

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2024

openshift-ci bot requested review from p0lyn0mial and sjenning December 17, 2024 19:50

tkashem force-pushed the monitor-installer-pod branch from 16844b1 to f6f032f Compare December 18, 2024 00:44

tkashem force-pushed the monitor-installer-pod branch from f6f032f to d770b9f Compare December 18, 2024 14:30

add a monitor test for installer pod timeline

36711ce

tkashem force-pushed the monitor-installer-pod branch from d770b9f to 36711ce Compare December 18, 2024 20:21

tkashem changed the title ~~[WIP] add a monitor test for installer pod timeline~~ add a monitor test for installer pod timeline Dec 19, 2024

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 19, 2024

tkashem changed the title ~~add a monitor test for installer pod timeline~~ OCPBUGS-45924: add a monitor test for installer pod timeline Dec 19, 2024

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 19, 2024

tkashem changed the title ~~OCPBUGS-45924: add a monitor test for installer pod timeline~~ OCPBUGS-45924: add a monitor test that detects concurrent installer pods Dec 19, 2024

hack, experiment to see if the monitor test sees bootstrap events

a051b38

vrutkovs reviewed Jan 2, 2025
