
apps: change query expression for LessKubeletsThanNodes alerts #2411

Open

wants to merge 12 commits into main from haorui/fix-welkin-alerts-triggers-for-autoscaling
Conversation

@HaoruiPeng (Contributor) commented Jan 22, 2025

Warning

This is a public repository, ensure not to disclose:

  • personal data beyond what is necessary for interacting with this pull request, nor
  • business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • kind/feature
  • kind/improvement
  • kind/deprecation
  • kind/documentation
  • kind/clean-up
  • kind/bug
  • kind/other

Optional: Mark one or more of the following that are applicable:

Important

Breaking changes should be marked kind/admin-change or kind/dev-change depending on type
Critical security fixes should be marked with kind/security

  • kind/admin-change
  • kind/dev-change
  • kind/security
  • [kind/adr](set-me)

What does this PR do / why do we need this PR?

This change should reduce the chance of the LessKubeletsThanNodes alerts firing falsely during the autoscaling process.
The old query checks whether any kubelet job is down.
The new query compares the number of running kubelet jobs with the total number of nodes in the cluster, and triggers the alert when the former is less than the latter.
The LessKubeletsThanNodes alert is split into two based on the label node-restriction.kubernetes.io/autoscaled-node-type, which indicates whether a node belongs to an autoscaling group.

  1. LessKubeletsThanNonautoscalableNodes: checks nodes without the mentioned label; the P2 alert has a 5m pending time, the P1 alert a 15m pending time.
  2. LessKubeletsThanAutoscalableNodes: checks nodes with the mentioned label; the P2 alert has a 15m pending time, the P1 alert a 30m pending time.

NOTE: Currently the alerts do not check the value of the label, only whether the label exists.
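
For reference, a minimal sketch of the two split expressions, based on the autoscalable-node expression in the diff further down in this conversation; the non-autoscalable variant is an assumption, inferred by inverting the label matcher:

# Autoscalable nodes (label present), as in the diff in this PR:
count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""})

# Non-autoscalable nodes (label absent or empty); assumed variant, not copied from the diff:
count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=""})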

Information to reviewers

See the comment section of the issue for more details.

Checklist

  • Proper commit message prefix on all commits
  • Change checks:
    • The change is transparent
    • The change is disruptive
    • The change requires no migration steps
    • The change requires migration steps
    • The change updates CRDs
    • The change updates the config and the schema
  • Documentation checks:
  • Metrics checks:
    • The metrics are still exposed and present in Grafana after the change
    • The metrics names didn't change (Grafana dashboards and Prometheus alerts required no updates)
    • The metrics names did change (Grafana dashboards and Prometheus alerts required an update)
  • Logs checks:
    • The logs do not show any errors after the change
  • PodSecurityPolicy checks:
    • Any changed Pod is covered by Kubernetes Pod Security Standards
    • Any changed Pod is covered by Gatekeeper Pod Security Policies
    • The change does not cause any Pods to be blocked by Pod Security Standards or Policies
  • NetworkPolicy checks:
    • Any changed Pod is covered by Network Policies
    • The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • The change does not cause any unnecessary Kubernetes audit events
    • The change requires changes to Kubernetes audit policy
  • Falco checks:
    • The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • The bug fix is covered by regression tests

@HaoruiPeng requested a review from a team as a code owner on January 22, 2025 15:35
@HaoruiPeng requested reviews from robinAwallace, a team and davidumea and removed the request for a team on January 22, 2025 15:42
@@ -181,7 +181,7 @@ spec:
description: There are less kubelets than nodes for the last 5m.
runbook_url: {{ .Values.defaultRules.runbookUrl }}kubernetes/LessKubeletsThanNodes
summary: There are less kubelets than nodes for the last 5m.
- expr: up{job="kubelet", metrics_path="/metrics"} == 0
+ expr: count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(kube_node_status_condition{condition="Ready", status="true"} == 1)
Contributor

I'm not sure if this captures what we want, testing it on an affected time period definitely works. But I'm unsure if this alert would ever trigger?
If the kubelet is down I'd argue that the node is always "NotReady" by definition.
However with that said, I can't seem to figure out any way to capture this in a good way.

If the nodes had been labeled by CAPI during scaledown we could "filter" them out from the previous expression. Buut they don't get any...
We could've tried to split the alert up into "autoscaled nodes" and "non-autoscaled nodes" using our naming convention for autoscaled nodes and use a longer timeout for "autoscaled nodes". Buut that seems like an ugly solution.

So I think I'd rather like us to try to get to the source of the slow scale-down on GPU nodes: is there something that CAPI/CAPZ does with the nodes, from the time that a node gets its kubelet shut down/removed until the node object is removed from the cluster, that we can speed up? Buuut we can't get any more quota to test this 🤦 And I'm not sure if the scope of this issue should become greater than it already is.

So I think I'm honestly at a loss. Anyone else that has any ideas?

Contributor Author

I'm not sure if this captures what we want, testing it on an affected time period definitely works. But I'm unsure if this alert would ever trigger?
If the kubelet is down I'd argue that the node is always "NotReady" by definition.
However with that said, I can't seem to figure out any way to capture this in a good way.

I was also thinking about this. Another idea I had is to compare the number of running kubelets to the total number of nodes, regardless of whether they are ready or not. In this way, the alert will be fired in the GPU node autoscaling case (according to the metrics data we have), but should have a lower probability of firing in other cases compared to the old up{job="kubelet", metrics_path="/metrics"} == 0 query.

But I'm also not sure about the other scenarios in ops that we want to capture with this alert (since I didn't encounter this in my own dev clusters), so yes, the changes I made may filter out some events we are supposed to capture. I also had a similar idea to filter out the autoscaled nodes, but as you said, it doesn't seem like a neat solution :/

Contributor

but should have a lower probability of firing in other cases compared to the old up{job="kubelet", metrics_path="/metrics"} == 0 query

Right, so basically this query would be

# Not a real query, just illustrative
count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(nodes)

And that would capture both the case where the node exists but the kubelet scrape target doesn't, AND the case where the target exists and is down, instead of only capturing the case where it exists and is down, right?
That would certainly match the name of the alert better at least, and it feels like a good approach. But, as you said, it wouldn't fix the autoscaling issue.

Contributor

Would splitting this alert into separate alerts per node group be a possible solution? E.g. LessKubeletsThanNodesControlPlane and LessKubeletsThanNodesGPU or something. Then you could have a separate duration configured for a GPU node group if we know they take a longer time to scale down. But that requires that we have specific labels for the GPU nodes.

@HaoruiPeng (Contributor Author) commented Jan 24, 2025

I want to write a summary here of the possible approaches we have brought up or that I have thought of (will add more if we come up with more); please leave some thoughts.

  1. Compare count(up{kubelet}) with count(nodes).
    • Pro: Matches the alert name better, and mitigates the problem of alerts firing due to the weird behavior of the up{job="kubelet", metrics_path="/metrics"} == 0 metric.
    • Con: Does not solve the autoscaling issue with GPU nodes.
  2. Split the alerts into 'non-autoscaled nodes' and 'autoscaled nodes', and use a longer timeout for the latter. The node label node-restriction.kubernetes.io/autoscaled-node-type can be used for splitting (see the sketch after this list).
    • Pro: Kinda solves the con of solution 1.
    • Con: Non-GPU but autoscaled nodes will be affected.
  3. Split the alerts based on the elastisys.io/node-group label and use a longer timeout for GPU nodes.
  4. Split the alerts based on the elastisys.io/node-type label and use a longer timeout for GPU nodes.
    • Pro: Compared to solution 3, this label already exists on our GPU nodes. The only prerequisite is to add it to our Prometheus metricLabelsAllowlist.
    • Con: This feels more like a workaround than a solution? Also a bit ugly for goto-monitoring people?
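
To make the label matchers behind options 2 and 4 concrete, a hedged sketch: kube-state-metrics sanitizes label names by replacing dots, slashes and dashes with underscores (as seen in the expressions later in this PR); the value gpu in the option 4 example is a hypothetical placeholder, not a value taken from this repo.

# Option 2: match on the presence of the autoscaled-node-type label (value not checked)
kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}

# Option 4 (hypothetical value): match on a specific node type
kube_node_labels{label_elastisys_io_node_type="gpu"}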

Contributor

I think for option 2 you can configure it per key-value pair, so it only needs an increased duration for e.g. node-restriction.kubernetes.io/autoscaled-node-type=gpu or similar, and the duration can stay the same as for non-autoscaled nodes, but it is maybe not as clean as option 3. I am not entirely sure if we have a default set up for GPU node groups in CAPI that ensures which label names are used.
Personally I like option 3 if we implement validation in this repo that ensures that nodes have the proper labels, but since that is not in place atm we can maybe wait with implementing this. A short-term solution could be to go with option 1 for now to get this merged and then implement 3 once https://github.com/elastisys/ck8s-issue-tracker/issues/429 is implemented, but I am interested to hear what others think of this @Xartos @elastisys/ops-lead @davidumea

Contributor

I think it would make sense to go for option 2 as the default configuration but make the label configurable. So by default it just checks if the label node-restriction.kubernetes.io/autoscaled-node-type is there, but we can then change it to a specific label that catches just GPU nodes if we want (including both label key and value). That way we do not have to ensure that any specific label is set.
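
A minimal sketch of what such a configurable default could look like; the key names below are hypothetical placeholders, not the actual config/common-config.yaml schema:

# Hypothetical values sketch (key names are assumptions):
prometheus:
  alerts:
    lessKubeletsThanNodes:
      # By default only check that the label key exists on the node.
      autoscaledNodeLabelKey: node-restriction.kubernetes.io/autoscaled-node-type
      # Optionally pin a specific value (e.g. a GPU node type); empty means any value.
      autoscaledNodeLabelValue: ""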

@HaoruiPeng force-pushed the haorui/fix-welkin-alerts-triggers-for-autoscaling branch from d3765a9 to 1e5bada on February 4, 2025 10:28
@HaoruiPeng (Contributor Author)

PR updated with solution 2 suggested in the comment section.

@raviranjanelastisys (Contributor) left a comment

Nice work, LGTM

@HaoruiPeng requested reviews from a team as code owners on February 4, 2025 12:56
@HaoruiPeng requested a review from davidumea on February 4, 2025 12:57
@davidumea (Contributor)

I think you need to rebase or drop this commit 1d72f29

@HaoruiPeng (Contributor Author)

I think you need to rebase or drop this commit 1d72f29

Ouch, I cherry-picked it for another quick test and forgot to drop it. Thanks for the reminder!

@HaoruiPeng force-pushed the haorui/fix-welkin-alerts-triggers-for-autoscaling branch from a7d62c4 to da91076 on February 4, 2025 13:31
@HaoruiPeng removed review requests for robinAwallace and a team on February 4, 2025 13:32
- expr: up{job="kubelet", metrics_path="/metrics"} == 0
description: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
summary: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
+ expr: count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""})
Contributor

As I suggested in the other comment thread, I would like it if it was possible to change the label (including value) this looks at via config.
Perhaps even making it possible to look at several labels, but that might be more complicated than it is worth.
What do you think of that?

Contributor Author

Do you mean something like kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=$value}, where we read the $value from the values file? But we need to pre-configure it in the values file anyway, right?

I was trying to get labels out of the query results, so it tells us which node triggers the alerts, but I didn't manage to do that. To get the instance label information we need to compare the kubelet and kube_node_info metrics for individual nodes instead of summing them up. But in that case the query will drop the nodes returning nil values, which is exactly what we want to capture with the alert. I couldn't think of a way around it. What do you think? @anders-elastisys

Contributor

Yes, exactly. As you say we do need a default value that we put in config/common-config.yaml.

Hmm, yeah I think that it might not be possible to get the missing nodes with the way we are currently calculating this.
The current approach of calculate_up_nodes - calculate_all_registered_nodes would not give us enough context. If we want to know what nodes are missing, then we would need to make some individual comparison for individual nodes between the two metrics. It's probably possible, but would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.
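
If we did want one alert per missing node, a hedged sketch of one possible per-node formulation (not what this PR currently uses) would be to keep the kube_node_info series that have no running kubelet target to join against:

# Nodes that exist but have no up==1 kubelet scrape target; this covers both a down target and an absent target,
# and each returned series carries the node label, so it would fire one alert per missing node.
kube_node_info unless on(node) (up{job="kubelet", metrics_path="/metrics"} == 1)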

Contributor Author

Yes, exactly. As you say we do need a default value that we put in config/common-config.yaml.

Ah ok, I will update it to make it configurable in the common-config.yaml

The current approach of calculate_up_nodes - calculate_all_registered_nodes would not give us enough context. If we want to know what nodes are missing, then we would need to make some individual comparison for individual nodes between the two metrics. It's probably possible, but would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.

Yeah, I was trying to get label information for the missing nodes but didn't manage to do that. The query logic would be different if we want this; maybe we'd better change the alert names 😂, since the old name suggests a comparison between the sums.
I will look into that more and get back with an alternative.

Contributor Author

I did some more digging into this alert; I think we first need to clarify what exactly we want this alert to capture.

  1. The old LessKubeletsThanNodes alert query only evaluates up{job="kubelet"} == 0, which triggers if any kubelet job is not running properly. However, when a node has already been taken down from the cluster after autoscaling, the up{job="kubelet"} query for some reason can still return a 0 value for that non-existent node for a while, and eventually notifies us that the kubelet is not running. We don't want our alert to capture this, because the node has already been taken down, and we don't care whether the metric for that node is still there, returning 0 or 1.

  2. If we change the query to count(up{job="kubelet"}==1) < count(kube_node_info) we avoid the false trigger, and it captures two things:

    1. The kubelet job of an existing node is not running, i.e. up{job="kubelet"} returns a 0 value for that node.
    2. The kubelet target is absent for an existing node; this will also be captured by the KubeletDown alert.
      The con of using this query or the old query is that we don't get any label info about which node is wrong, since we only calculate the counts.
  3. If we compare node by node, the query will look like sum by (node)(kube_node_info) - sum by (node)(up{job="kubelet"}). With this we can get information about which existing nodes' kubelets stopped running, but we won't capture anything when kube_node_info or up{job="kubelet"} is absent for a node, since the query does not calculate on nil values. I think this query is equivalent to kube_node_info and on(node) up{job="kubelet"}, which only checks up{job="kubelet"} on existing nodes: it returns 1 if the kubelet is running, 0 if the kubelet stops, and no data (nil) if the metric is absent.

I'm inclined towards kube_node_info and on(node) up{job="kubelet"}, which ignores absent nodes or absent kubelet jobs, but also won't trigger at all during a normal autoscaling case (not even going into pending status).

Any thoughts? @viktor-f @anders-elastisys

HaoruiPeng and others added 7 commits February 4, 2025 15:38
@HaoruiPeng (Contributor Author)

PR updated; I changed the query to sum by (node, cluster) (kube_node_info and on (node) up{job="kubelet",metrics_path="/metrics"} == 0). Now the alert is triggered per individual node, with a configurable label and label value, and the message will look like this:
(screenshot of the resulting alert message, from 2025-02-05)

I also added back the LessKubeletsThanNodes alert with the new metric; this alert will only be enabled if alert splitting for autoscaled nodes is disabled in the config.

We still need some thought on the alert names and the query, since the logic no longer matches the names if we go this way.
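
For reviewers, a hedged reading of how the updated expression breaks down:

# up{job="kubelet", metrics_path="/metrics"} == 0  -> kubelet scrape targets that are currently down
# kube_node_info and on (node) (...)               -> keep only nodes that still exist in the cluster and have a down kubelet target
# sum by (node, cluster) (...)                     -> one series per affected node, so the alert message can name the node
sum by (node, cluster) (kube_node_info and on (node) up{job="kubelet", metrics_path="/metrics"} == 0)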
