Add HCO health metric #2204

assafad · 2023-01-11T12:27:27Z

Add a metric which indicates the health of HCO and its secondary resources, based on the aggregated conditions. The proposed metric is exposed both as a Prometheus metric (named kubevirt_hco_system_health_status), and as a field in HCO status (named systemHealthStatus).
Its value is set according to the following logic:

- If at least one of the resources maintained by HCO is degraded or not available:
  kubevirt_hco_system_health_status = 2 and systemHealthStatus = error
- If all resources maintained by HCO are available and not degraded, but at least one of them is progressing or the reconciliation has not completed:
  kubevirt_hco_system_health_status = 1 and systemHealthStatus = warning
- If all resources maintained by HCO are available, not degraded, and not progressing, and the reconciliation has completed:
  kubevirt_hco_system_health_status = 0 and systemHealthStatus = healthy
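
The three rules above can be sketched as a small Go function. This is only an illustration under stated assumptions: the `Conditions` struct and the `systemHealth` function are hypothetical names, not the code in this PR, and the real controller reads these values from the aggregated HCO conditions.

```go
package main

import "fmt"

// Conditions is a hypothetical, simplified view of the aggregated HCO
// conditions; it is not a type from the HCO code base.
type Conditions struct {
	Available         bool
	Degraded          bool
	Progressing       bool
	ReconcileComplete bool
}

// systemHealth returns the metric value and the status string described
// in the PR text: error (2), warning (1), or healthy (0).
func systemHealth(c Conditions) (float64, string) {
	switch {
	case !c.Available || c.Degraded:
		// Degraded or unavailable always wins.
		return 2, "error"
	case c.Progressing || !c.ReconcileComplete:
		// Available and not degraded, but still settling.
		return 1, "warning"
	default:
		return 0, "healthy"
	}
}

func main() {
	cases := []Conditions{
		{Available: false},
		{Available: true, Progressing: true, ReconcileComplete: true},
		{Available: true, ReconcileComplete: true},
	}
	for _, c := range cases {
		v, s := systemHealth(c)
		fmt.Printf("%+v -> metric=%v status=%s\n", c, v, s)
	}
}
```

The order of the cases matters: the error check comes first, so a resource that is both degraded and progressing is reported as an error rather than a warning.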

Signed-off-by: assafad [email protected]

Reviewer Checklist

Reviewers are supposed to review the PR for every aspect below one by one. To check an item means the PR is either "OK" or "Not Applicable" in terms of that item. All items are supposed to be checked before merging a PR.

Release note:

Add HCO health metric

Jira-Ticket: https://issues.redhat.com/browse/CNV-24648

kubevirt-bot · 2023-01-11T12:27:37Z

Hi @assafad. Thanks for your PR.

I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

assafad · 2023-01-11T12:28:43Z

@sradco, @nunnatsa Can you please have a look?

coveralls · 2023-01-11T12:33:01Z

Pull Request Test Coverage Report for Build 3949350243

35 of 36 (97.22%) changed or added relevant lines in 2 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+0.08%) to 85.64%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| controllers/hyperconverged/hyperconverged_controller.go | 31 | 32 | 96.88% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| controllers/hyperconverged/hyperconverged_controller.go | 1 | 81.27% |

Totals
Change from base Build 3940385214:	0.08%
Covered Lines:	4783
Relevant Lines:	5585

💛 - Coveralls

nunnatsa · 2023-01-16T05:58:56Z

api/v1beta1/hyperconverged_types.go

+	// SystemHealthStatus reflects the health of HCO and its secondary resources, based on the aggregated conditions.
+	// +optional
+	SystemHealthStatus string `json:"systemHealthStatus,omitempty"`


I wonder why we need this field. K8s is moving away from single status fields; this is why we have conditions.

The motivation was to have a way for users that don’t use Prometheus to check the operator's health.

@sradco is this field necessary for these users? maybe checking the metrics endpoint (which will include the new metric), without accessing Prometheus UI would be enough?

This is why we have the conditions.

@fabiand - what do you think?

Is there a condition that provides this information (healthy or not)? If there is, then that is enough, IMO.

@fabiand There is no single condition for this
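
For users without Prometheus, reading the value straight from the metrics endpoint could look roughly like the sketch below. It parses text in the Prometheus exposition format using only the standard library. The `readGauge` helper and the sample payload (including its `gauge` type line) are assumptions for illustration; fetching the payload from the operator's endpoint is left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// readGauge scans Prometheus text-format output for an unlabeled sample
// of the given metric and returns its value. It is a hypothetical helper,
// not part of HCO or of any Prometheus library.
func readGauge(body, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, false
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Sample payload, assumed for illustration only.
	sample := "# HELP kubevirt_hco_system_health_status Indicates whether the system health status is healthy (0), warning (1), or error (2)\n" +
		"# TYPE kubevirt_hco_system_health_status gauge\n" +
		"kubevirt_hco_system_health_status 1\n"

	if v, ok := readGauge(sample, "kubevirt_hco_system_health_status"); ok {
		fmt.Println("system health metric:", v)
	}
}
```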

nunnatsa · 2023-01-16T06:00:55Z

controllers/hyperconverged/hyperconverged_controller.go

+	isConditionReconcileCompleteTrue := req.Conditions.IsStatusConditionTrue(hcov1beta1.ConditionReconcileComplete)
+
+	isSystemHealthStatusError := !isConditionAvailableTrue || isConditionDegradedTrue
+	if isSystemHealthStatusError {


I'm not sure about this logic. We may get false negatives, mostly during setup, until the system is fully functional.

Can you elaborate on what would be the reason for this false negative?
Is it related to the conditions' logic, or to the location in the reconciliation in which we update the system health?

HyperConverged won't be available during setup. According to this code, we'll get an error metric in this case. Is that what we want? @sradco

nunnatsa

Added a few comments and questions.

nunnatsa · 2023-01-16T10:01:17Z

@assafad - please note that you have 1 code smell.

nunnatsa · 2023-01-16T10:12:32Z

pkg/metrics/metrics.go

+
+	SystemHealthStatusHealthy = float64(0)
+	SystemHealthStatusWarning = float64(1)
+	SystemHealthStatusError   = float64(2)
 )



I think this will be more robust if you move to a single setter function.

Suggested change

-	SystemHealthStatusHealthy = float64(0)
-	SystemHealthStatusWarning = float64(1)
-	SystemHealthStatusError   = float64(2)
-)
+)
+
+type SystemHealthStatus float64
+
+const (
+	SystemHealthStatusHealthy SystemHealthStatus = iota
+	SystemHealthStatusWarning
+	SystemHealthStatusError
+)

Done without the new type. Let me know what you think
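
A minimal sketch of what the typed constants plus a single setter might look like. The `gauge` type is a hypothetical stand-in for the real Prometheus gauge, and `setSystemHealthStatus` is an illustrative name; neither is taken from this PR, which was finished without the new type.

```go
package main

import "fmt"

// SystemHealthStatus is the typed value from the suggestion above.
type SystemHealthStatus float64

const (
	SystemHealthStatusHealthy SystemHealthStatus = iota // 0
	SystemHealthStatusWarning                           // 1
	SystemHealthStatusError                             // 2
)

// gauge is a hypothetical stand-in for a metrics gauge; it only stores
// the last value that was set.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// setSystemHealthStatus is the single entry point for updating the
// metric. Callers must pass a SystemHealthStatus (or an untyped constant),
// so a plain float64 variable cannot be passed by accident.
func setSystemHealthStatus(g *gauge, s SystemHealthStatus) {
	g.Set(float64(s))
}

func main() {
	g := &gauge{}
	setSystemHealthStatus(g, SystemHealthStatusWarning)
	fmt.Println("gauge value:", g.value)
}
```

The trade-off is one extra type in the package in exchange for compile-time checking at every call site.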

nunnatsa · 2023-01-16T13:28:44Z

/ok-to-test

nunnatsa · 2023-01-16T13:39:30Z

/ok-to-test

kubevirt-bot · 2023-01-17T18:51:30Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-azure

In response to this:

hco-e2e-upgrade-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-azure

hco-bot · 2023-01-17T19:03:26Z

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure
okd-hco-e2e-upgrade-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-upgrade-index-aws
okd-hco-e2e-image-index-aws lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-gcp

kubevirt-bot · 2023-01-17T19:03:31Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure, ci/prow/okd-hco-e2e-image-index-gcp, ci/prow/okd-hco-e2e-upgrade-index-aws

In response to this:

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure
okd-hco-e2e-upgrade-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-upgrade-index-aws
okd-hco-e2e-image-index-aws lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-gcp

hco-bot · 2023-01-17T19:17:17Z

hco-e2e-upgrade-prev-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-azure

kubevirt-bot · 2023-01-17T19:17:20Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-azure

In response to this:

hco-e2e-upgrade-prev-index-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-azure

hco-bot · 2023-01-17T19:23:14Z

hco-e2e-image-index-sno-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-sno-aws

kubevirt-bot · 2023-01-17T19:23:17Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-sno-aws

In response to this:

hco-e2e-image-index-sno-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-sno-aws

dharmit · 2023-01-18T06:17:19Z

pkg/metrics/metrics.go

@@ -80,6 +87,20 @@ var HcoMetrics = func() hcoMetrics {
 				)
 			},
 		},
+		HCOMetricSystemHealthStatus: {
+			fqName:          "kubevirt_hco_system_health_status",
+			help:            "Indicates whether the system health status is healthy (0), warning (1), or error (2)",


Is "HCO system health status" the same as "system health status"? If not, should we add that bit to the help text?

The metric aggregates the health of the system - HCO and its secondary resources. We use hco_ prefix for all HCO metrics, in order to identify and group metrics that are generated by it.
But sure, it is possible to add more information to the help. @sradco WDYT about this description?

@assafad Please update the help text with this information. For example: "The metric aggregates the health of the system - HCO and its secondary resources, based on the aggregated conditions."

hco-bot · 2023-01-18T10:51:57Z

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure

kubevirt-bot · 2023-01-18T10:52:01Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure

In response to this:

hco-e2e-upgrade-prev-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-prev-index-sno-azure

hco-bot · 2023-01-18T11:24:31Z

hco-e2e-upgrade-index-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-aws

kubevirt-bot · 2023-01-18T11:24:34Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-aws

In response to this:

hco-e2e-upgrade-index-azure lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-aws

hco-bot · 2023-01-18T11:31:31Z

hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-gcp
hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

kubevirt-bot · 2023-01-18T11:31:35Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws, ci/prow/hco-e2e-image-index-gcp

In response to this:

hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-gcp
hco-e2e-image-index-azure lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

Add HCO health metric, which indicates the health of HCO and its secondary resources, based on the aggregated conditions Signed-off-by: assafad <[email protected]>

sonarqubecloud · 2023-01-18T13:31:19Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

openshift-ci · 2023-01-18T14:20:17Z

@assafad: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/hco-e2e-image-index-aws | `cdb68e6` | link | true | `/test hco-e2e-image-index-aws` |
| ci/prow/okd-hco-e2e-image-index-aws | `cdb68e6` | link | true | `/test okd-hco-e2e-image-index-aws` |

Full PR test history. Your PR dashboard.

hco-bot · 2023-01-18T14:51:37Z

hco-e2e-upgrade-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-sno-azure

kubevirt-bot · 2023-01-18T14:51:41Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-sno-azure

In response to this:

hco-e2e-upgrade-index-sno-aws lane succeeded.
/override ci/prow/hco-e2e-upgrade-index-sno-azure

hco-bot · 2023-01-18T15:06:48Z

okd-hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-aws

kubevirt-bot · 2023-01-18T15:06:52Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/okd-hco-e2e-image-index-aws

In response to this:

okd-hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/okd-hco-e2e-image-index-aws

hco-bot · 2023-01-18T15:24:23Z

hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

kubevirt-bot · 2023-01-18T15:24:26Z

@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws

In response to this:

hco-e2e-image-index-gcp lane succeeded.
/override ci/prow/hco-e2e-image-index-aws

nunnatsa · 2023-02-08T06:28:37Z

/lgtm
/approve

kubevirt-bot · 2023-02-08T06:28:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [nunnatsa]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kubevirt-bot requested review from orenc1 and tiraboschi January 11, 2023 12:27

kubevirt-bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 11, 2023

kubevirt-bot added the size/L label Jan 11, 2023

assafad changed the title ~~[WIP] Add HCO health metric~~ WIP:Add HCO health metric Jan 11, 2023

assafad changed the title ~~WIP:Add HCO health metric~~ WIP: Add HCO health metric Jan 11, 2023

assafad changed the title ~~WIP: Add HCO health metric~~ Add HCO health metric Jan 12, 2023

kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 12, 2023

kubevirt-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 16, 2023

Add HCO health metric

cdb68e6

Add HCO health metric, which indicates the health of HCO and its secondary resources, based on the aggregated conditions Signed-off-by: assafad <[email protected]>

kubevirt-bot assigned nunnatsa Feb 8, 2023

kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 8, 2023

kubevirt-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 8, 2023

kubevirt-bot merged commit bab7b17 into kubevirt:main Feb 8, 2023

assafad mentioned this pull request Mar 22, 2023

Add assafad to the org kubevirt/project-infra#2671

Merged

7 tasks
