-
Notifications
You must be signed in to change notification settings - Fork 153
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HCO health metric #2204
Add HCO health metric #2204
Conversation
Hi @assafad. Thanks for your PR. I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @assafad. Thanks for your PR. I'm waiting for a kubevirt member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Pull Request Test Coverage Report for Build 3949350243
💛 - Coveralls |
// SystemHealthStatus reflects the health of HCO and its secondary resources, based on the aggregated conditions. | ||
// +optional | ||
SystemHealthStatus string `json:"systemHealthStatus,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder why do we need this field. K8s is getting away from a single status fields. This is why we have conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation was to have a way for users that don’t use Prometheus to check the operator's health.
@sradco is this field necessary for these users? maybe checking the metrics endpoint (which will include the new metric), without accessing Prometheus UI would be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why we have the conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiand - what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a condition which is providing this information (healthy or not). If there is a condition, then this is enough imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiand There is no single condition for this
isConditionReconcileCompleteTrue := req.Conditions.IsStatusConditionTrue(hcov1beta1.ConditionReconcileComplete) | ||
|
||
isSystemHealthStatusError := !isConditionAvailableTrue || isConditionDegradedTrue | ||
if isSystemHealthStatusError { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this logic. The result will be that we'll maybe get false negative, mostly during setup, until the system is fully function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on what would be the reason for this false negative?
Is it related to the conditions' logic, or to the location in the reconciliation in which we update the system health?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HyperConverged won't be available during setup. According to this code, we'll get an error metric in this case. Is that what we want? @sradco
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments and questions.
@assafad - please notice that you have 1 code smell. |
pkg/metrics/metrics.go
Outdated
|
||
SystemHealthStatusHealthy = float64(0) | ||
SystemHealthStatusWarning = float64(1) | ||
SystemHealthStatusError = float64(2) | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be more robust, if you move to one setting function.
SystemHealthStatusHealthy = float64(0) | |
SystemHealthStatusWarning = float64(1) | |
SystemHealthStatusError = float64(2) | |
) | |
) | |
type SystemHealthStatus float64 | |
const ( | |
SystemHealthStatusHealthy SystemHealthStatus = iota | |
SystemHealthStatusWarning | |
SystemHealthStatusError | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done without the new type. Let me know what you think
/ok-to-test |
1 similar comment
/ok-to-test |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-azure In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-upgrade-prev-index-sno-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure, ci/prow/okd-hco-e2e-image-index-gcp, ci/prow/okd-hco-e2e-upgrade-index-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-upgrade-prev-index-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-azure In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-image-index-sno-azure lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-sno-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
pkg/metrics/metrics.go
Outdated
@@ -80,6 +87,20 @@ var HcoMetrics = func() hcoMetrics { | |||
) | |||
}, | |||
}, | |||
HCOMetricSystemHealthStatus: { | |||
fqName: "kubevirt_hco_system_health_status", | |||
help: "Indicates whether the system health status is healthy (0), warning (1), or error (2)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "HCO system health status" same as "system health status"? If not, should we add that bit in the help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric aggregates the health of the system - HCO and its secondary resources. We use hco_
prefix for all HCO metrics, in order to identify and group metrics that are generated by it.
But sure, it is possible to add more information to the help
. @sradco WDYT about this description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@assafad Please update the help text with this information. Like, "The metric aggregates the health of the system - HCO and its secondary resources, based on the aggregated conditions."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
hco-e2e-upgrade-prev-index-sno-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-prev-index-sno-azure In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-upgrade-index-azure lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-image-index-azure lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws, ci/prow/hco-e2e-image-index-gcp In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Add HCO health metric, which indicates the health of HCO and its secondary resources, based on the aggregated conditions Signed-off-by: assafad <[email protected]>
Kudos, SonarCloud Quality Gate passed! 0 Bugs No Coverage information |
@assafad: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
hco-e2e-upgrade-index-sno-aws lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-index-sno-azure In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
okd-hco-e2e-image-index-gcp lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/okd-hco-e2e-image-index-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
hco-e2e-image-index-gcp lane succeeded. |
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-image-index-aws In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: nunnatsa The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Add a metric which indicates the health of HCO and its secondary resources, based on the aggregated conditions. The proposed metric is exposed both as a Prometheus metric (named
kubevirt_hco_system_health_status
), and as a field in HCO status (namedsystemHealthStatus
).Its value is set according to the following logic:
If at least one resource, out of the resources maintained by HCO, is degraded or not available:
kubevirt_hco_system_health_status
= 2 andsystemHealthStatus
= errorIf all resources maintained by HCO are available and not degraded, but either one of the resources is progressing, or the reconciliation wasn't completed:
kubevirt_hco_system_health_status
= 1 andsystemHealthStatus
= warningIf all resources maintained by HCO are available, not degraded, not progressing and the reconciliation was completed:
kubevirt_hco_system_health_status
= 0 andsystemHealthStatus
= healthySigned-off-by: assafad [email protected]
Reviewer Checklist
Release note:
Jira-Ticket: https://issues.redhat.com/browse/CNV-24648