Add kube_pod_completion_time to kube-state-metrics #37206

Closed
rudolf opened this issue Nov 27, 2023 · 4 comments
Assignees
Labels
Metricbeat Metricbeat needs_team Indicates that the issue/PR needs a Team:* label Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better.

Comments

@rudolf
Contributor

rudolf commented Nov 27, 2023

Expose the kube_pod_completion_time metric from kube-state-metrics: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/pod-metrics.md

This metric captures the time a pod was terminated (https://github.com/kubernetes/kube-state-metrics/blob/240cffd908220854a27f7e92d8157eaee4dc8d42/internal/store/pod.go#L103-L115), so it is useful for alerting on pods terminated due to Error or OOMKilled conditions.



@tetianakravchenko
Contributor

Hey @rudolf,

> this is useful for alerting on terminated pods due to Error or OOMKilled conditions.

Could you please explain what you are trying to achieve?

The problem here is that the kube_pod_completion_time metric is only reported if the container status is Terminated: https://github.com/kubernetes/kube-state-metrics/blob/240cffd908220854a27f7e92d8157eaee4dc8d42/internal/store/pod.go#L113
As an example, it could be useful for jobs:

$ kubectl get pods | grep Completed
pi-2kr6q                                     0/1     Completed   0             110s

then kube_pod_completion_time is reported:

$ curl -s kube-state-metrics:8080/metrics | grep time | grep kube_pod_completion_time
kube_pod_completion_time{namespace="kube-system",pod="pi-2kr6q",uid="5673c7fb-95d0-41e1-85a6-3dd642202e8b"} 1.703698181e+09

But deleting the pod does not cause this metric to be reported - issue
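The sample value above is a Unix timestamp in seconds. As a rough sketch of how a consumer might read it, the snippet below parses that one exposition line and converts the value to a UTC datetime. The regular expressions and the `parse_sample` helper are illustrative assumptions, not part of kube-state-metrics or Metricbeat, and they only handle simple lines like this one.

```python
import re
from datetime import datetime, timezone

# Sample line copied from the kube-state-metrics output above.
SAMPLE = (
    'kube_pod_completion_time{namespace="kube-system",pod="pi-2kr6q",'
    'uid="5673c7fb-95d0-41e1-85a6-3dd642202e8b"} 1.703698181e+09'
)

# Minimal pattern for one sample line: metric name, optional {labels}, value.
# Assumption: no trailing timestamp and no escaped quotes in label values.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Return (metric name, labels dict, float value) for one sample line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a metric sample: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(SAMPLE)
# The metric value is seconds since the Unix epoch.
completed_at = datetime.fromtimestamp(value, tz=timezone.utc)
print(labels["pod"], completed_at.isoformat())
```

For the sample line this prints `pi-2kr6q 2023-12-27T17:29:41+00:00`.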

Also, if the container was restarted in the past because of Error or OOMKilled, kube_pod_completion_time will not be reported.
Example: in my cluster I have a pod that was restarted due to an error:

$ kubectl get pods
NAME                                         READY   STATUS      RESTARTS      AGE
...
kube-scheduler-test-control-plane            1/1     Running     1 (84m ago)   4h51m

Container status:

Containers:
  kube-scheduler:
    ...
    State:          Running
      Started:      Wed, 27 Dec 2023 17:05:34 +0100
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 27 Dec 2023 13:38:18 +0100
      Finished:     Wed, 27 Dec 2023 17:05:33 +0100
    Ready:          True
    Restart Count:  1

Metrics reported for the Last State:

kube_pod_container_status_last_terminated_reason{namespace="kube-system",pod="kube-scheduler-test-control-plane",uid="95b333a1-d3ff-419e-aee4-4c014b3f4f4b",container="kube-scheduler",reason="Error"} 1
kube_pod_container_status_last_terminated_exitcode{namespace="kube-system",pod="kube-scheduler-test-control-plane",uid="95b333a1-d3ff-419e-aee4-4c014b3f4f4b",container="kube-scheduler"} 1

The kube_pod_completion_time metric uses cs.State.Terminated.FinishedAt.Unix(): https://github.com/kubernetes/kube-state-metrics/blob/240cffd908220854a27f7e92d8157eaee4dc8d42/internal/store/pod.go#L115
The container status of the pod above does not contain State.Terminated.FinishedAt, because the container is currently in the Running state.

Are you interested in cs.LastTerminationState.Terminated.FinishedAt.Unix() instead?
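To make the difference between the two fields concrete, here is a simplified model of the behavior described in this comment. It is a sketch, not the kube-state-metrics code: the `Terminated` and `ContainerStatus` classes are hypothetical stand-ins for the Kubernetes API types, and taking the latest FinishedAt across containers is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Terminated:
    reason: str
    exit_code: int
    finished_at: datetime


@dataclass
class ContainerStatus:
    name: str
    # Hypothetical stand-in for cs.State.Terminated (None while Running).
    state_terminated: Optional[Terminated] = None
    # Hypothetical stand-in for cs.LastTerminationState.Terminated.
    last_terminated: Optional[Terminated] = None


def completion_time(statuses: List[ContainerStatus]) -> Optional[float]:
    """Latest FinishedAt among containers whose current state is Terminated."""
    times = [
        cs.state_terminated.finished_at.timestamp()
        for cs in statuses
        if cs.state_terminated is not None
    ]
    return max(times) if times else None


def last_termination_time(statuses: List[ContainerStatus]) -> Optional[float]:
    """Latest FinishedAt among the previous (last) terminated states."""
    times = [
        cs.last_terminated.finished_at.timestamp()
        for cs in statuses
        if cs.last_terminated is not None
    ]
    return max(times) if times else None


# The kube-scheduler container from the example: Running now, with a
# previous termination that finished at 17:05:33 +0100 on 27 Dec 2023.
cet = timezone(timedelta(hours=1))
scheduler = ContainerStatus(
    name="kube-scheduler",
    last_terminated=Terminated("Error", 1, datetime(2023, 12, 27, 17, 5, 33, tzinfo=cet)),
)

print(completion_time([scheduler]))        # no current Terminated state
print(last_termination_time([scheduler]))  # timestamp of the earlier failure
```

In this model the running kube-scheduler pod yields no completion time, while the last-termination lookup still returns the time of the earlier failure.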

@rudolf
Contributor Author

rudolf commented Dec 28, 2023

@tetianakravchenko The problem is that we want to alert on pods that are OOMKilled or terminated due to Error. But since the last terminated reason is only cleared once the pod successfully starts, simply matching on pod events with a last_terminated_reason causes a lot of noise. If we could instead query for any events where the pod was terminated in the last 5 minutes, we could alert only once per pod termination, even if it is reported in several event documents.

Providing the last_terminated_reason_timestamp as in elastic/elastic-agent#3802 should solve our problem 👍
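A rough sketch of the alerting rule described above, assuming each event document carries the last terminated reason together with its timestamp. The dictionary keys (`pod_uid`, `container`, `reason`, `finished_at`) and the `new_alerts` helper are hypothetical names chosen for illustration, not actual Metricbeat or Elastic Agent fields.

```python
from datetime import datetime, timedelta, timezone

# Termination reasons we want to alert on, as named in this thread.
ALERT_REASONS = {"Error", "OOMKilled"}


def new_alerts(events, now, window=timedelta(minutes=5)):
    """Return one event per pod termination that finished within the window."""
    seen = set()
    alerts = []
    for event in events:
        if event["reason"] not in ALERT_REASONS:
            continue
        finished_at = event["finished_at"]
        # Skip terminations older than the window (stale last-terminated state).
        if not (now - window <= finished_at <= now):
            continue
        # Several documents can report the same termination; keep only the first.
        key = (event["pod_uid"], event["container"], finished_at)
        if key in seen:
            continue
        seen.add(key)
        alerts.append(event)
    return alerts


now = datetime(2023, 12, 27, 16, 8, 0, tzinfo=timezone.utc)
recent = datetime(2023, 12, 27, 16, 5, 33, tzinfo=timezone.utc)
stale = datetime(2023, 12, 27, 12, 0, 0, tzinfo=timezone.utc)

events = [
    {"pod_uid": "a", "container": "kube-scheduler", "reason": "Error", "finished_at": recent},
    {"pod_uid": "a", "container": "kube-scheduler", "reason": "Error", "finished_at": recent},
    {"pod_uid": "b", "container": "worker", "reason": "OOMKilled", "finished_at": stale},
]

alerts = new_alerts(events, now)
print(len(alerts))
```

Here the two documents for pod `a` collapse into a single alert, and the stale OOMKilled entry for pod `b` is ignored because it finished outside the 5 minute window.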

@lukeelmers
Member

Closing, as @rudolf has indicated this work is done.
