apps: change query expression for LessKubeletsThanNodes alerts #2411
base: main
Conversation
```diff
@@ -181,7 +181,7 @@ spec:
       description: There are less kubelets than nodes for the last 5m.
       runbook_url: {{ .Values.defaultRules.runbookUrl }}kubernetes/LessKubeletsThanNodes
       summary: There are less kubelets than nodes for the last 5m.
-      expr: up{job="kubelet", metrics_path="/metrics"} == 0
+      expr: count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(kube_node_status_condition{condition="Ready", status="true"} == 1)
```
I'm not sure if this captures what we want; testing it on an affected time period definitely works, but I'm unsure if this alert would ever trigger?
If the kubelet is down I'd argue that the node is always "NotReady" by definition.
However, with that said, I can't seem to figure out any way to capture this in a good way.
If the nodes had been labeled by CAPI during scaledown we could "filter" them out from the previous expression. But they don't get any...
We could have tried to split the alert up into "autoscaled nodes" and "non-autoscaled nodes" using our naming convention for autoscaled nodes and have a longer timeout for "autoscaled nodes". But that seems like an ugly solution.
So I'd rather have us try to get to the source of the slow scale down on GPU nodes: is there something that CAPI/CAPZ does with the nodes, from the time a node gets its kubelet shut down/removed until the node object is removed from the cluster, that we can speed up? But we can't get any more quota to test this 🤦 And I'm not sure if the scope of this issue should become greater than it already is.
So I think I'm honestly at a loss. Anyone else have any ideas?
> I'm not sure if this captures what we want, testing it on an affected time period definitely works. But I'm unsure if this alert would ever trigger?
> If the kubelet is down I'd argue that the node is always "NotReady" by definition.
> However with that said, I can't seem to figure out any way to capture this in a good way.
I was also thinking about this. Another idea I had is to compare the number of running kubelets to the total number of nodes, regardless of whether they are ready or not. This way, the alert will fire in the GPU node autoscaling case (according to the metrics data we have), but should be less likely to fire in other cases compared to the old `up{job="kubelet", metrics_path="/metrics"} == 0` query.
But I'm also not sure about the other scenarios in ops that we want to capture with this alert (since I didn't encounter this in my own dev clusters), so yes, the changes I made may filter out some events we are supposed to capture. I also had similar ideas about filtering out the autoscaled nodes, but as you said, it doesn't seem like a neat solution :/
> but should have less probability to fire in other cases compared to the old `up{job="kubelet", metrics_path="/metrics"} == 0` query
Right, so basically this query would be

```promql
# Not a real query, just illustrative
count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(nodes)
```
And that would capture both the case where the node exists but the kubelet scrape target doesn't AND the case where the target exists and is down, instead of only capturing the case where it exists and is down, right?
That would certainly match the name of the alert better at least, and it feels like a good approach. But, as you said, it wouldn't fix the autoscaling issue.
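For concreteness, a sketch of what that comparison could look like with an actual node count, assuming `kube_node_info` as the stand-in for "all nodes" (an assumption for illustration, not necessarily the final expression):

```promql
# Sketch: fires when fewer kubelet targets report up than there are registered nodes.
count(up{job="kubelet", metrics_path="/metrics"} == 1) < count(kube_node_info)
```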
Would splitting this alert into separate alerts per node group be a possible solution? E.g. `LessKubeletsThanNodesControlPlane` and `LessKubeletsThanNodesGPU` or something. Then you could have a separate duration configured for a GPU node group if we know they take a longer time to scale down. But that requires that we have specific labels for the GPU nodes.
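As an illustration of that split, a rough sketch of two rules with different durations (the alert names, the GPU label selector, and the durations below are assumptions, not something this PR adds):

```yaml
# Sketch: one alert per node group, with a longer duration for GPU nodes.
# Assumes a GPU-identifying node label is exposed on kube_node_labels.
- alert: LessKubeletsThanNodesGPU
  expr: >-
    count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_elastisys_io_node_group="gpu"})
    < count(kube_node_info and on(node) kube_node_labels{label_elastisys_io_node_group="gpu"})
  for: 30m
- alert: LessKubeletsThanNodes
  expr: >-
    count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_elastisys_io_node_group!="gpu"})
    < count(kube_node_info and on(node) kube_node_labels{label_elastisys_io_node_group!="gpu"})
  for: 15m
```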
Want to write a summary here of the possible approaches we have brought up and that I have thought of (will add more if we come up with others); please leave some thoughts.

1. Compare `count(up{kubelet})` with `count(nodes)`.
   - Pro: Matches the alert name better, and mitigates the problem of alerts firing due to the weird behavior of the `up{job="kubelet", metrics_path="/metrics"} == 0` metric.
   - Con: Does not solve the autoscaling issue with GPU nodes.
2. Split the alert into "non-autoscaled nodes" and "autoscaled nodes", and use a longer timeout for the latter. The node label `node-restriction.kubernetes.io/autoscaled-node-type` can be used for splitting.
   - Pro: Kind of solves the con of solution 1.
   - Con: Non-GPU but autoscaled nodes will be affected.
3. Split the alert based on the `elastisys.io/node-group` label and use a longer timeout for GPU nodes.
   - Pro: Seems like a better way of splitting, as we basically filter out GPU nodes from this alert.
   - Con: I don't think we have enforced adding this label to our clusters yet. The labels were added by default if clusters were deployed by devbox (https://github.com/elastisys/ck8s-devbox/issues/374), but GPU nodes did not get labeled `elastisys.io/node-group=gpu` by that fix. This solution could work as expected once all our clusters are enforced to have the labels via these two issues: Add an alert for missing Node Group labels #2014 and https://github.com/elastisys/ck8s-issue-tracker/issues/429 (include GPU nodes).
4. Split the alert based on the `elastisys.io/node-type` label and use a longer timeout for GPU nodes.
   - Pro: Compared to solution 3, this label already exists on our GPU nodes. The only prior fix needed is to add it to our Prometheus `metricLabelsAllowlist` (see the config sketch after this list).
   - Con: This feels more like a workaround than a solution? Also a bit ugly for goto-monitoring people?
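For any of the label-based options (2–4), the chosen node label has to be exposed on `kube_node_labels` by kube-state-metrics. A sketch of what that could look like in the Prometheus stack values (the exact values path depends on how our chart wraps kube-state-metrics, so treat this as an assumption):

```yaml
# Sketch: allow the relevant node labels through kube-state-metrics so they appear on kube_node_labels.
kube-state-metrics:
  metricLabelsAllowlist:
    - nodes=[node-restriction.kubernetes.io/autoscaled-node-type,elastisys.io/node-group,elastisys.io/node-type]
```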
I think for option 2 you can configure it per key-value pair, so it only needs the increased duration for e.g. `node-restriction.kubernetes.io/autoscaled-node-type=gpu` or similar, and let the duration be the same as for non-autoscaled nodes, but it is maybe not as clean as option 3. I am not entirely sure if we have a default set up for GPU node groups in CAPI that ensures which label names are used.
Personally I like option 3 if we implement validation that ensures that nodes have the proper labels in this repo, but since that is not in place atm we can maybe wait with implementing it. A short-term solution could be to go with option 1 for now to get this merged and then implement option 3 once https://github.com/elastisys/ck8s-issue-tracker/issues/429 is implemented, but I am interested to hear what others think of this @Xartos @elastisys/ops-lead @davidumea
I think it would make sense to go for option 2 as the default configuration but make the label configurable. So by default it just checks if the label `node-restriction.kubernetes.io/autoscaled-node-type` is there, but we can then change it to a specific label that catches just GPU nodes if we want (including both label key and value). That way we do not have to ensure that any specific label is set.
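A minimal sketch of that configurability, using a made-up value key (`autoscaledNodeSelector` is not a key that exists in this chart) holding the full label matcher, key and value included:

```yaml
# Hypothetical value in the chart's values file:
#   autoscaledNodeSelector: label_node_restriction_kubernetes_io_autoscaled_node_type=~".+"
expr: >-
  count((up{job="kubelet", metrics_path="/metrics"} == 1)
    and on(node) kube_node_labels{ {{ .Values.autoscaledNodeSelector }} })
  < count(kube_node_info
    and on(node) kube_node_labels{ {{ .Values.autoscaledNodeSelector }} })
```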
Force-pushed from d3765a9 to 1e5bada.
PR updated with solution 2 as suggested in the comment section.
Nice work, LGTM
I think you need to rebase or drop this commit 1d72f29
Ouch, I cherry-picked it for another quick test and forgot to drop it. Thanks for reminding me!
Force-pushed from a7d62c4 to da91076.
```diff
-      expr: up{job="kubelet", metrics_path="/metrics"} == 0
+      description: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
+      summary: There are less running kubelet jobs than existing nodes in an autoscaling enabled node group for the last 15m.
+      expr: count((up{job="kubelet", metrics_path="/metrics"} == 1) and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}) < count(kube_node_info and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""})
```
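For readability, the same new expression broken over several lines (just a reformatted reading of the query above; the label name is kube-state-metrics' sanitized form of `node-restriction.kubernetes.io/autoscaled-node-type`):

```promql
# Kubelet targets that are up AND belong to a node carrying the autoscaled-node-type label...
count(
  (up{job="kubelet", metrics_path="/metrics"} == 1)
  and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}
)
<
# ...compared against all registered nodes carrying that label.
count(
  kube_node_info
  and on(node) kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type!=""}
)
```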
As I suggested in the other comment thread, I would like it if it was possible to change the label (including value) this looks at via config.
Perhaps even making it possible to look at several labels, but that may be more complicated than it is worth.
What do you think of that?
Do you mean something like `kube_node_labels{label_node_restriction_kubernetes_io_autoscaled_node_type=$value}`, where we read the `$value` from the values file? But we need to pre-configure it in the values file anyway, right?
I was trying to get labels out of the query results, so it tells us which node triggers the alert, but I didn't manage to do that. To get the instance label information we need to compare the kubelet and kube_node_info metrics for individual nodes instead of summing them up. But in that case the query will drop the nodes returning nil values, which is exactly what we want to capture with the alert. I couldn't think of any way around it. What do you think? @anders-elastisys
Yes, exactly. As you say, we do need a default value that we put in config/common-config.yaml.
Hmm, yeah, I think it might not be possible to get the missing nodes with the way we are currently calculating this.
The current approach of `calculate_up_nodes - calculate_all_registered_nodes` would not give us enough context. If we want to know which nodes are missing, we would need to make an individual comparison between the two metrics per node. It's probably possible, but it would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.
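One way such a per-node comparison might be expressed (a sketch only, not something tested in this thread or part of this PR) is with PromQL's `unless` operator, which keeps the `node` label on the result and yields one series per node without a healthy kubelet:

```promql
# Sketch: registered nodes that have no kubelet target reporting up.
# Catches both "target exists but is down" and "target is missing entirely", per node.
kube_node_info
  unless on(node) (up{job="kubelet", metrics_path="/metrics"} == 1)
```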
> Yes, exactly. As you say we do need a default value that we put in config/common-config.yaml.

Ah ok, I will update it to make it configurable in common-config.yaml.

> The current approach of `calculate_up_nodes - calculate_all_registered_nodes` would not give us enough context. If we want to know what nodes are missing, then we would need to make some individual comparison for individual nodes between the two metrics. It's probably possible, but would be a bit different. It would then also create one alert per missing node instead of one alert regardless of how many nodes are missing.

Yeah, I was trying to get label information for the missing nodes but didn't manage to do that. The query logic would be different if we want this; maybe we'd better change the alert names 😂, since the old name suggests comparing the sums.
I will look into that more and get back with an alternative.
Did some more digging into this alert. I think we first need to clarify what exactly we want this alert to capture.

1. The old LessKubeletsThanNodes alert query only calculates `count(up{job="kubelet"}==0)`, which triggers if any kubelet job is not running properly. However, when a node has already been taken down from the cluster after autoscaling, the `up{job="kubelet"}` query can for some reason still return a 0 value for that non-existing node for a while, and eventually notifies us that the kubelet is not running. We don't want our alert to capture this, because the node is already taken down and we don't care whether the metric for that node is still there, returning 0 or 1.
2. If we change the query to `count(up{job="kubelet"}==1) < count(kube_node_info)` we avoid that false trigger, and it captures two things:
   - the kubelet job of an existing node is not running, i.e. `up{job="kubelet"}` returns a 0 value on that node;
   - the kubelet is absent on an existing node, which will also be captured by the KubeletDown alert.

   The con of using this query or the old query is that we don't get any label info on which node is wrong, since we only calculate the `count` metric.
3. If we compare it node by node, the query will be something like `sum by (node)(kube_node_info) - sum by (node)(up{job="kubelet"})`. With this we can get information on which existing nodes' kubelet stopped running, but we won't capture anything when `kube_node_info` or `up{job="kubelet"}` is absent for a node, since the query does not calculate on nil values. I think this query is equivalent to `kube_node_info and on(node) up{job="kubelet"}`, which only checks `up{job="kubelet"}` on existing nodes: it returns 1 if the kubelet is running, 0 if the kubelet stops, and no data (nil) if the metric is absent.

I'm inclined towards `kube_node_info and on(node) up{job="kubelet"}`, which ignores an absent node or an absent kubelet job, but also won't trigger at all during a normal autoscaling case (not even in pending status).
Any thoughts? @viktor-f @anders-elastisys
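If we did go the per-node route, a sketch of how that check could be phrased as an alert expression (keeping `up`'s value per existing node; an illustration only, not what this PR implements):

```promql
# Sketch: per existing node, the kubelet's up value; == 0 selects nodes whose kubelet reports down.
# Nodes whose kubelet target is missing entirely produce no series and are ignored, as discussed above.
(up{job="kubelet", metrics_path="/metrics"} and on(node) kube_node_info) == 0
```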
Co-authored-by: Viktor <[email protected]>
…LabelAllowList if alert splitting for autoscaled nodes is enabled
…fig and transform the values for alert templating
…trigger for individual nodes, also add back the lessKubeThanNode alerts for all nodes if alert splitting is not enabled
What does this PR do / why do we need this PR?
This change should reduce the chance of the LessKubeletsThanNodes alerts firing falsely during the autoscaling process.
The old query checks whether any kubelet job is down.
The new query compares the number of running kubelet jobs with the total number of nodes in the cluster, and triggers the alert when the former is less than the latter.
The LessKubeletsThanNodes alert is split into two based on the label `node-restriction.kubernetes.io/autoscaled-node-type`, which indicates whether a node belongs to an autoscaling group or not.
NOTE: Currently the alerts do not check the value of the label, only whether the label exists or not.
Information to reviewers
See the comment section of the issue for more details.