[FEATURE] Report number of job restarts as a metric #122

s-vitaliy · 2024-08-12T11:40:40Z

Description

We need to track streams that may be stopped because their job has exhausted its backoff limit.

Possible solution

On every job change event, get the number of pods in the failed state and calculate the metric from this value.
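A rough sketch of this idea in Python, assuming the job from the change event is available as a plain dict shaped like the Kubernetes Job JSON. `status.failed` and `spec.backoffLimit` are real Job fields (the limit defaults to 6), but the function name `failed_pod_metrics` and the returned keys are hypothetical. Also note that `status.failed` may not exactly match the counter the Job controller uses internally, for example with `restartPolicy: OnFailure`, so this is an approximation.

```python
from typing import Any, Mapping

# Kubernetes default for spec.backoffLimit when the field is not set.
DEFAULT_BACKOFF_LIMIT = 6


def failed_pod_metrics(job: Mapping[str, Any]) -> dict[str, int]:
    """Derive metric values from a Job object given as a dict (hypothetical helper)."""
    spec = job.get("spec") or {}
    status = job.get("status") or {}

    # backoffLimit may legitimately be 0, so only fall back when it is absent.
    backoff_limit = spec.get("backoffLimit")
    if backoff_limit is None:
        backoff_limit = DEFAULT_BACKOFF_LIMIT

    # status.failed is omitted until at least one pod has failed.
    failed = status.get("failed") or 0

    return {
        "failed_pods": failed,
        "backoff_limit": backoff_limit,
        "retries_left": max(backoff_limit - failed, 0),
    }


if __name__ == "__main__":
    example_job = {"spec": {"backoffLimit": 4}, "status": {"failed": 3}}
    print(failed_pod_metrics(example_job))
```

Reporting `retries_left` next to `failed_pods` would let an alert fire before the limit is actually reached, but which values to export is an open design choice.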

Alternatives

Listen to pod events and store a job_id <--> failed_pods_count mapping in memory.
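A possible shape for the in-memory mapping, sketched in Python under a few assumptions: pod events arrive as plain dicts shaped like the Kubernetes Pod JSON, the job is identified by the job name label that the Job controller sets on its pods (`batch.kubernetes.io/job-name`, or the legacy `job-name`), and failed pods are tracked by UID so a repeated event for the same pod is not counted twice. The class `FailedPodTracker` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Mapping, Optional

# Labels the Job controller puts on pods it creates (new and legacy form).
JOB_NAME_LABELS = ("batch.kubernetes.io/job-name", "job-name")


class FailedPodTracker:
    """Keeps a job name -> set of failed pod UIDs mapping in memory (sketch)."""

    def __init__(self) -> None:
        self._failed: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _job_name(pod: Mapping[str, Any]) -> Optional[str]:
        labels = (pod.get("metadata") or {}).get("labels") or {}
        return next((labels[k] for k in JOB_NAME_LABELS if k in labels), None)

    def on_pod_event(self, pod: Mapping[str, Any]) -> None:
        """Record the pod if it belongs to a job and is in the Failed phase."""
        job = self._job_name(pod)
        uid = (pod.get("metadata") or {}).get("uid")
        phase = (pod.get("status") or {}).get("phase")
        if job and uid and phase == "Failed":
            self._failed[job].add(uid)

    def failed_count(self, job: str) -> int:
        return len(self._failed.get(job, ()))

    def forget(self, job: str) -> None:
        """Drop the entry once the job is deleted, so the mapping does not grow forever."""
        self._failed.pop(job, None)
```

The mapping is lost when the process restarts, so the counts would have to be rebuilt from a pod list on startup; that trade-off is the main difference from reading the value off the job on each change event.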

Context

The backoffLimit algorithm is explained here: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy


s-vitaliy · 2024-08-14T14:00:12Z

Creating such a metric does not seem to be easy. We first need to investigate how the Kubernetes Job controller actually manages the backoff counter. As a result, the task size has been raised to XL and the priority has been lowered.

s-vitaliy added the code/new-feature New feature or request label Aug 12, 2024

s-vitaliy self-assigned this Aug 12, 2024

s-vitaliy added this to Arcane Aug 12, 2024

s-vitaliy moved this to Backlog in Arcane Aug 14, 2024

s-vitaliy removed their assignment Mar 5, 2025
