
apps: change query expression for LessKubeletsThanNodes alerts #2411

Open

wants to merge 12 commits into main from haorui/fix-welkin-alerts-triggers-for-autoscaling
Conversation

@HaoruiPeng (Contributor) commented Jan 22, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

This change should reduce the chance of the LessKubeletsThanNodes alerts firing falsely during the autoscaling process.
The old query checks whether any kubelet job is down.
The new query compares the number of running kubelet jobs with the total number of nodes in the cluster, and triggers the alert when the former is less than the latter.
The LessKubeletsThanNodes alert is split into two based on the label node-restriction.kubernetes.io/autoscaled-node-type, which indicates whether a node belongs to an autoscaling group.

  1. LessKubeletsThanNonautoscalableNodes: checks nodes without the mentioned label; the P2 alert has a 5m pending time, the P1 alert a 15m pending time.
  2. LessKubeletsThanAutoscalableNodes: checks nodes with the mentioned label; the P2 alert has a 15m pending time, the P1 alert a 30m pending time.

NOTE: Currently the alerts do not check the value of the label, only whether the label exists.
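
For reference, a minimal sketch of the two split expressions, based on the autoscalable-node expression in the diff further down in this conversation; the non-autoscalable variant is an assumption, inferred by inverting the label matcher:

# Autoscalable nodes (label present), as in the diff in this PR:
count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""})

# Non-autoscalable nodes (label absent or empty); assumed variant, not copied from the diff:
count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=""})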

Information to reviewers

See the comment section of the issue for more details.

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@HaoruiPeng requested a review from a team as a code owner on January 22, 2025 15:35
@HaoruiPeng requested reviews from robinAwallace, a team and davidumea and removed the request for a team on January 22, 2025 15:42
@@ -181,7 +181,7 @@ spec:
description: There are less kubelets than nodes for the last 5m.
runbook_url: {{ .Values.defaultRules.runbookUrl }}kubernetes/LessKubeletsThanNodes
summary: There are less kubelets than nodes for the last 5m.
- expr: up{job="kubelet", metrics_path="/metrics"} == 0
+ expr: count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(kube_node_status_condition{condition="Ready", status="true"} == 1)
Contributor

I'm not sure if this captures what we want, testing it on an affected time period definitely works. But I'm unsure if this alert would ever trigger?
If the kubelet is down I'd argue that the node is always "NotReady" by definition.
However with that said, I can't seem to figure out any way to capture this in a good way.

If the nodes had been labeled by CAPI during scaledown we could "filter" them out from the previous expression. Buut they don't get any...
We could've tried to split the alert up into "autoscaled nodes" and "non-autoscaled nodes" using our naming convention for autoscaled nodes and use a longer timeout for "autoscaled nodes". Buut that seems like an ugly solution.

So I think I'd rather like us to try to get to the source of the slow scale-down on GPU nodes: is there something that CAPI/CAPZ does with the nodes, from the time that a node gets its kubelet shut down/removed until the node object is removed from the cluster, that we can speed up? Buuut we can't get any more quota to test this 🤦 And I'm not sure if the scope of this issue should become greater than it already is.

So I think I'm honestly at a loss. Anyone else that has any ideas?

Contributor Author

I'm not sure if this captures what we want, testing it on an affected time period definitely works. But I'm unsure if this alert would ever trigger?
If the kubelet is down I'd argue that the node is always "NotReady" by definition.
However with that said, I can't seem to figure out any way to capture this in a good way.

I was also thinking about this. Another idea I had is to compare the number of running kubelets to the total number of nodes, regardless of whether they are ready or not. In this way, the alert will be fired in the GPU node autoscaling case (according to the metrics data we have), but should have a lower probability of firing in other cases compared to the old up{job="kubelet", metrics_path="/metrics"} == 0 query.

But I'm also not sure about the other scenarios in ops that we want to capture with this alert (since I didn't encounter this in my own dev clusters), so yes, the changes I made may filter out some events we are supposed to capture. I also had a similar idea to filter out the autoscaled nodes, but as you said, it doesn't seem like a neat solution :/

Contributor

but should have a lower probability of firing in other cases compared to the old up{job="kubelet", metrics_path="/metrics"} == 0 query

Right, so basically this query would be

# Not a real query, just illustrative
count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(nodes)

And that would capture both the case where the node exists but the kubelet scrape target doesn't, AND the case where the target exists and is down, instead of only capturing the case where it exists and is down, right?
That would certainly match the name of the alert better at least, and it feels like a good approach. But, as you said, it wouldn't fix the autoscaling issue.

Contributor

Would splitting this alert into separate alerts per node group be a possible solution? E.g. LessKubeletsThanNodesControlPlane and LessKubeletsThanNodesGPU or something. Then you could have a separate duration configured for a GPU node group if we know they take a longer time to scale down. But that requires that we have specific labels for the GPU nodes.

@HaoruiPeng (Contributor Author) commented Jan 24, 2025

I want to write a summary here of the possible approaches we have brought up or that I have thought of (will add more if we come up with more); please leave some thoughts.

  1. Compare count(up{kubelet}) with count(nodes).
    • Pro: Matches the alert name better, and mitigates the problem of alerts firing due to the weird behavior of the up{job="kubelet", metrics_path="/metrics"} == 0 metric.
    • Con: Does not solve the autoscaling issue with GPU nodes.
  2. Split the alerts into 'non-autoscaled nodes' and 'autoscaled nodes', and use a longer timeout for the latter. The node label node-restriction.kubernetes.io/autoscaled-node-type can be used for splitting (see the sketch after this list).
    • Pro: Kinda solves the con of solution 1.
    • Con: Non-GPU but autoscaled nodes will be affected.
  3. Split the alerts based on the elastisys.io/node-group label and use a longer timeout for GPU nodes.
  4. Split the alerts based on the elastisys.io/node-type label and use a longer timeout for GPU nodes.
    • Pro: Compared to solution 3, this label already exists on our GPU nodes. The only prerequisite is to add it to our Prometheus metricLabelsAllowlist.
    • Con: This feels more like a workaround than a solution? Also a bit ugly for goto-monitoring people?
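
To make the label matchers behind options 2 and 4 concrete, a hedged sketch: kube-state-metrics sanitizes label names by replacing dots, slashes and dashes with underscores (as seen in the expressions later in this PR); the value gpu in the option 4 example is a hypothetical placeholder, not a value taken from this repo.

# Option 2: match on the presence of the autoscaled-node-type label (value not checked)
kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}

# Option 4 (hypothetical value): match on a specific node type
kube_node_labels{label_elastisys_io_node_type="gpu"}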

Contributor

I think for option 2 you can configure it per key-value pair, so it only needs an increased duration for e.g. node-restriction.kubernetes.io/autoscaled-node-type=gpu or similar, and the duration can stay the same as for non-autoscaled nodes, but it is maybe not as clean as option 3. I am not entirely sure if we have a default set up for GPU node groups in CAPI that ensures which label names are used.
Personally I like option 3 if we implement validation in this repo that ensures that nodes have the proper labels, but since that is not in place atm we can maybe wait with implementing this. A short-term solution could be to go with option 1 for now to get this merged and then implement 3 once https://github.com/elastisys/ck8s-issue-tracker/issues/429 is implemented, but I am interested to hear what others think of this @Xartos @elastisys/ops-lead @davidumea

Contributor

I think it would make sense to go for option 2 as the default configuration but make the label configurable. So by default it just checks if the label node-restriction.kubernetes.io/autoscaled-node-type is there, but we can then change it to a specific label that catches just GPU nodes if we want (including both label key and value). That way we do not have to ensure that any specific label is set.
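
A minimal sketch of what such a configurable default could look like; the key names below are hypothetical placeholders, not the actual config/common-config.yaml schema:

# Hypothetical values sketch (key names are assumptions):
prometheus:
  alerts:
    lessKubeletsThanNodes:
      # By default only check that the label key exists on the node.
      autoscaledNodeLabelKey: node-restriction.kubernetes.io/autoscaled-node-type
      # Optionally pin a specific value (e.g. a GPU node type); empty means any value.
      autoscaledNodeLabelValue: ""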

@HaoruiPeng force-pushed the haorui/fix-welkin-alerts-triggers-for-autoscaling branch from d3765a9 to 1e5bada on February 4, 2025 10:28
@HaoruiPeng (Contributor Author)

PR updated with solution 2 suggested in the comment section.

@raviranjanelastisys (Contributor) left a comment

Nice work, LGTM

@HaoruiPeng requested reviews from a team as code owners on February 4, 2025 12:56
@HaoruiPeng requested a review from davidumea on February 4, 2025 12:57
@davidumea (Contributor)

I think you need to rebase or drop this commit 1d72f29

@HaoruiPeng (Contributor Author)

I think you need to rebase or drop this commit 1d72f29

Ouch, I cherry-picked it for another quick test and forgot to drop it. Thanks for the reminder!

@HaoruiPeng force-pushed the haorui/fix-welkin-alerts-triggers-for-autoscaling branch from a7d62c4 to da91076 on February 4, 2025 13:31
@HaoruiPeng removed review requests for robinAwallace and a team on February 4, 2025 13:32
- expr: up{job="kubelet", metrics_path="/metrics"} == 0
description: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
summary: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
+ expr: count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""})
Contributor

As I suggested in the other comment thread, I would like it if it was possible to change the label (including value) this looks at via config.
Perhaps even making it possible to look at several labels, but that might be more complicated than it is worth.
What do you think of that?

Contributor Author

Do you mean something like kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=$value}, where we read the $value from the values file? But we need to pre-configure it in the values file anyway, right?

I was trying to get labels out of the query results, so it tells us which node triggers the alerts, but I didn't manage to do that. To get the instance label information we need to compare the kubelet and kube_node_info metrics for individual nodes instead of summing them up. But in that case the query will drop the nodes returning nil values, which is exactly what we want to capture with the alert. I couldn't think of a way around it. What do you think? @anders-elastisys

Contributor

Yes, exactly. As you say we do need a default value that we put in config/common-config.yaml.

Hmm, yeah I think that it might not be possible to get the missing nodes with the way we are currently calculating this.
The current approach of calculate_up_nodes - calculate_all_registered_nodes would not give us enough context. If we want to know what nodes are missing, then we would need to make some individual comparison for individual nodes between the two metrics. It's probably possible, but would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.
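
If we did want one alert per missing node, a hedged sketch of one possible per-node formulation (not what this PR currently uses) would be to keep the kube_node_info series that have no running kubelet target to join against:

# Nodes that exist but have no up==1 kubelet scrape target; this covers both a down target and an absent target,
# and each returned series carries the node label, so it would fire one alert per missing node.
kube_node_info unless on(node) (up{job="kubelet", metrics_path="/metrics"} == 1)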

Contributor Author

Yes, exactly. As you say we do need a default value that we put in config/common-config.yaml.

Ah ok, I will update it to make it configurable in the common-config.yaml

The current approach of calculate_up_nodes - calculate_all_registered_nodes would not give us enough context. If we want to know what nodes are missing, then we would need to make some individual comparison for individual nodes between the two metrics. It's probably possible, but would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.

Yeah, I was trying to get label information for the missing nodes but didn't manage to do that. The query logic would be different if we want this; maybe we'd better change the alert names 😂, since the old name suggests a comparison between the sums.
I will look into that more and get back with an alternative.

Contributor Author

I did some more digging into this alert; I think we first need to clarify what exactly we want this alert to capture.

  1. The old LessKubeletsThanNodes alert query only evaluates up{job="kubelet"} == 0, which triggers if any kubelet job is not running properly. However, when a node has already been taken down from the cluster after autoscaling, the up{job="kubelet"} query for some reason can still return a 0 value for that non-existent node for a while, and eventually notifies us that the kubelet is not running. We don't want our alert to capture this, because the node has already been taken down, and we don't care whether the metric for that node is still there, returning 0 or 1.

  2. If we change the query to count(up{job="kubelet"}==1) < count(kube_node_info) we avoid the false trigger, and it captures two things:

    1. The kubelet job of an existing node is not running, i.e. up{job="kubelet"} returns a 0 value for that node.
    2. The kubelet target is absent for an existing node; this will also be captured by the KubeletDown alert.
      The con of using this query or the old query is that we don't get any label info about which node is wrong, since we only calculate the counts.
  3. If we compare node by node, the query will look like sum by (node)(kube_node_info) - sum by (node)(up{job="kubelet"}). With this we can get information about which existing nodes' kubelets stopped running, but we won't capture anything when kube_node_info or up{job="kubelet"} is absent for a node, since the query does not calculate on nil values. I think this query is equivalent to kube_node_info and on(node) up{job="kubelet"}, which only checks up{job="kubelet"} on existing nodes: it returns 1 if the kubelet is running, 0 if the kubelet stops, and no data (nil) if the metric is absent.

I'm inclined towards kube_node_info and on(node) up{job="kubelet"}, which ignores absent nodes or absent kubelet jobs, but also won't trigger at all during a normal autoscaling case (not even going into pending status).

Any thoughts? @viktor-f @anders-elastisys

HaoruiPeng and others added 7 commits February 4, 2025 15:38
@HaoruiPeng (Contributor Author)

PR updated; I changed the query to sum by (node, cluster) (kube_node_info and on (node) up{job="kubelet",metrics_path="/metrics"} == 0). Now the alert is triggered per individual node, with a configurable label and label value, and the message will look like this:
(screenshot of the resulting alert message, from 2025-02-05)

I also added back the LessKubeletsThanNodes alert with the new metric; this alert will only be enabled if alert splitting for autoscaled nodes is disabled in the config.

We still need some thought on the alert names and the query, since the logic no longer matches the names if we go this way.
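
For reviewers, a hedged reading of how the updated expression breaks down:

# up{job="kubelet", metrics_path="/metrics"} == 0  -> kubelet scrape targets that are currently down
# kube_node_info and on (node) (...)               -> keep only nodes that still exist in the cluster and have a down kubelet target
# sum by (node, cluster) (...)                     -> one series per affected node, so the alert message can name the node
sum by (node, cluster) (kube_node_info and on (node) up{job="kubelet", metrics_path="/metrics"} == 0)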
