This repository was archived by the owner on Apr 26, 2024. It is now read-only.

When a Job can't schedule a Pod, log the Job's recent events #88

Merged: 3 commits into main from more-insight-into-scheduling on Aug 29, 2023

@chrisguidry commented Aug 28, 2023

In cases where the Job definition itself is fine but the Job is unable to
schedule any Pods due to scheduling constraints (resources, node availability,
etc.), we weren't able to give users much more information than that the Pod
for their flow never started. With this change, we inspect any events related
to that Job and include them in the logs. These events include things like
scheduling constraint violations, in enough detail to help someone diagnose the
issue without going back to the cluster.
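
As a rough illustration of the idea (not the code merged in this PR), the sketch below lists the events whose involved object is the Job and writes the most recent ones to a logger. It assumes a client object that behaves like `kubernetes.client.CoreV1Api`; the function name `log_job_events`, the logger name, and the stand-in client with its sample events are all hypothetical and exist only so the sketch runs without a cluster.

```python
import logging
from datetime import datetime, timezone
from types import SimpleNamespace

logger = logging.getLogger("job-events")  # hypothetical logger name


def log_job_events(core_client, job_name, namespace, limit=10):
    """Log and return the most recent events whose involved object is the Job.

    `core_client` is assumed to behave like kubernetes.client.CoreV1Api.
    """
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    events = core_client.list_namespaced_event(
        namespace=namespace, field_selector=selector
    ).items

    def when(event):
        # last_timestamp may be unset on some events; fall back to event_time.
        return event.last_timestamp or event.event_time

    # Undated events sort first so the dated, most recent ones end up last.
    events = sorted(events, key=lambda e: (when(e) is not None, when(e)))

    lines = []
    for event in events[max(len(events) - limit, 0):]:
        line = f"Job event {event.reason!r} at {when(event)}: {event.message}"
        # Surface Warning events more loudly than Normal ones.
        level = logging.WARNING if event.type == "Warning" else logging.INFO
        logger.log(level, line)
        lines.append(line)
    return lines


# Stand-in client and made-up events so the sketch runs without a cluster.
fake_events = [
    SimpleNamespace(
        type="Warning",
        reason="FailedCreate",
        message="pods is forbidden: exceeded quota",
        last_timestamp=datetime(2023, 8, 28, 22, 1, tzinfo=timezone.utc),
        event_time=None,
    ),
    SimpleNamespace(
        type="Normal",
        reason="SuccessfulCreate",
        message="Created pod: flow-abc",
        last_timestamp=datetime(2023, 8, 28, 22, 0, tzinfo=timezone.utc),
        event_time=None,
    ),
]
fake_client = SimpleNamespace(
    list_namespaced_event=lambda namespace, field_selector: SimpleNamespace(
        items=fake_events
    )
)
lines = log_job_events(fake_client, "flow-abc", "default")
```

With a real cluster, the stand-in would be replaced by a configured `CoreV1Api` instance; the field selector keeps the query scoped to the one Job rather than pulling every event in the namespace.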

Closes #87

Screenshots

image

image

Checklist

  • References any related issue by including "Closes #" or "Closes ".
    • If no issue exists and your change is not a small fix, please create an issue first.
  • Includes tests or only affects documentation.
  • Passes pre-commit checks.
    • Run pre-commit install && pre-commit run --all locally for formatting and linting.
  • Includes screenshots of documentation updates.
    • Run mkdocs serve to view documentation locally.
  • Summarizes PR's changes in CHANGELOG.md

@chrisguidry requested a review from a team as a code owner August 28, 2023 22:00

@desertaxle left a comment

LGTM!

@chrisguidry merged commit 9ccee0a into main Aug 29, 2023
@chrisguidry deleted the more-insight-into-scheduling branch August 29, 2023 13:40
chrisguidry added a commit that referenced this pull request Aug 29, 2023
In #87/#88, we logged additional information from the Kubernetes Event API when
a Job couldn't create a Pod.  In this change, we expand that to log additional
event information when a Pod can't start running.  This covers cases like
`ErrImagePull` or pod scheduling constraint failures that will prevent the Pod
from going into a Running state.
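
A minimal sketch of that Pod-level check, under the same assumptions as before: the client is assumed to behave like `kubernetes.client.CoreV1Api`, and the helper names, the stand-in client, and the sample events are hypothetical. It is an illustration of the approach described here, not the commit's actual code.

```python
from types import SimpleNamespace


def pod_warning_events(core_client, pod_name, namespace):
    """Return 'reason: message' strings for Warning events on the Pod.

    `core_client` is assumed to behave like kubernetes.client.CoreV1Api.
    """
    selector = f"involvedObject.kind=Pod,involvedObject.name={pod_name}"
    events = core_client.list_namespaced_event(
        namespace=namespace, field_selector=selector
    ).items
    return [f"{e.reason}: {e.message}" for e in events if e.type == "Warning"]


def explain_pending_pod(core_client, pod_name, namespace):
    """If the Pod is still Pending, return its Warning events; else nothing."""
    pod = core_client.read_namespaced_pod(name=pod_name, namespace=namespace)
    if pod.status.phase != "Pending":
        return []
    return pod_warning_events(core_client, pod_name, namespace)


# Stand-in client and made-up events so the sketch runs without a cluster.
fake_pod = SimpleNamespace(status=SimpleNamespace(phase="Pending"))
fake_pod_events = [
    SimpleNamespace(
        type="Normal",
        reason="Scheduled",
        message="Successfully assigned default/flow-abc to node-1",
    ),
    SimpleNamespace(type="Warning", reason="Failed", message="Error: ErrImagePull"),
]
fake_pod_client = SimpleNamespace(
    read_namespaced_pod=lambda name, namespace: fake_pod,
    list_namespaced_event=lambda namespace, field_selector: SimpleNamespace(
        items=fake_pod_events
    ),
)
problems = explain_pending_pod(fake_pod_client, "flow-abc", "default")
```

Filtering on the `Warning` type keeps routine events such as `Scheduled` out of the output, so only the entries likely to explain why the Pod never reached Running are reported.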

Closes #90
chrisguidry added a commit that referenced this pull request Aug 30, 2023
* Extending the Kubernetes event logging to cover Pod events

* Python 3.8 type hint compatibility
Development

Successfully merging this pull request may close these issues.

Request: detect when jobs are blocked from starting due to k8s limits