This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Request: detect when jobs are blocked from starting due to k8s limits #87

Closed
ShardPhoenix opened this issue Nov 11, 2021 · 4 comments · Fixed by #88

Comments

@ShardPhoenix

Description

My k8s namespace has resource limits (a maximum total CPU and RAM). If starting a Prefect job pod would exceed those limits, the pod simply never starts and the flow hangs indefinitely in the UI. The agent logs still say "Completed deployment of flow run". When I increased the limits, the job pod started immediately and the flow completed as normal.

Expected Behavior

It would be nice if this scenario could be detected and result in a useful error on the UI or agent level.

@zanieb zanieb self-assigned this Nov 11, 2021
@zanieb zanieb removed their assignment Dec 5, 2022
@chrisguidry
Contributor

I was able to reproduce this on a local minikube running Prefect's Kubernetes Worker by adding a ResourceQuota to the namespace:

namespace-quotas.yaml 
apiVersion: v1
kind: ResourceQuota
metadata:
  namespace: prefect
  name: mem-cpu-quota
spec:
  hard:
    requests.cpu: "250m"
    requests.memory: 50Mi
    limits.cpu: "500m"
    limits.memory: 100Mi

My work pool doesn't specify CPU/memory requests for the pods it creates, so the flow run is stuck in Pending. What's concerning is that there's no indication of an error in my worker's logs.
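
For anyone reproducing this, one way to confirm the quota is the culprit is to compare each ResourceQuota's used values against its hard limits. A minimal sketch, assuming the official kubernetes Python client and a reachable cluster (the namespace name is illustrative):

# Compare a namespace's ResourceQuota usage against its hard limits.
from kubernetes import client, config

def print_quota_headroom(namespace: str = "prefect") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    for quota in core.list_namespaced_resource_quota(namespace).items:
        print(f"ResourceQuota {quota.metadata.name!r} in namespace {namespace!r}:")
        hard = quota.status.hard or {}
        used = quota.status.used or {}
        for resource, limit in hard.items():
            print(f"  {resource}: used {used.get(resource, '0')} of {limit}")

if __name__ == "__main__":
    print_quota_headroom()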

@chrisguidry
Contributor

Correction, after 5 minutes (my pod watch timeout), I did get a notice that the flow run had crashed because the pod never started:

prefect-worker-774d5b666-7smb2 prefect-worker 18:16:45.799 | ERROR   | prefect.flow_runs.worker - Job 'classic-limpet-cmmkp': Pod never started.
prefect-worker-774d5b666-7smb2 prefect-worker 18:16:45.937 | INFO    | prefect.flow_runs.worker - Reported flow run '07e54c68-2098-4c0b-accd-f71b44d23e7d' as crashed: Flow run infrastructure exited with non-zero status code -1.

@chrisguidry
Contributor

With the Kubernetes worker, we are getting the behavior we want: if a pod can't be scheduled, the flow run is reported as crashed after the work pool's pod watch timeout, and the relevant worker logs are included as well.

However, the logs are exceedingly vague:

For pods that couldn't be scheduled at all:

Flow run infrastructure exited with non-zero status code -1.

For pods that start but are then OOM-killed:

Flow run infrastructure exited with non-zero status code 137.

I'd like to expose more details from the Kubernetes worker when there's a scheduling failure, if possible.
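
A minimal sketch of the kind of event lookup that could surface those details, assuming the official kubernetes Python client (the job name below is illustrative; this is not the worker's actual implementation):

# List the events Kubernetes recorded for a specific Job, e.g. quota violations.
from kubernetes import client, config

def job_events(job_name: str, namespace: str = "prefect") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    for event in core.list_namespaced_event(namespace, field_selector=selector).items:
        # e.g. "Warning FailedCreate: ... exceeded quota: mem-cpu-quota ..."
        print(f"{event.type} {event.reason}: {event.message}")

if __name__ == "__main__":
    job_events("classic-limpet-cmmkp")

Surfacing output like this in the crash report would tell users exactly which quota or constraint blocked the pod, without a trip back to the cluster.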

@chrisguidry
Contributor

[Note, this issue should move to the prefect-kubernetes collection]

@chrisguidry chrisguidry transferred this issue from PrefectHQ/prefect Aug 28, 2023
chrisguidry added a commit that referenced this issue Aug 28, 2023
In cases where the Job definition itself is fine but the Job is unable to
schedule any Pods due to scheduling constraints (resources, node availability,
etc), we weren't able to give users much more information than that the Pod for
their flow never started.  With this change, we'll go inspect any events related
to that Job and include them in the logs.  These events include things like
scheduling constraint violations, in enough detail to help someone diagnose the
issue without going back to the cluster.

Closes #87
chrisguidry added a commit that referenced this issue Aug 29, 2023
When a Job can't schedule a Pod, log the Job's recent events

In cases where the Job definition itself is fine but the Job is unable to
schedule any Pods due to scheduling constraints (resources, node availability,
etc), we weren't able to give users much more information than that the Pod for
their flow never started.  With this change, we'll go inspect any events related
to that Job and include them in the logs.  These events include things like
scheduling constraint violations, in enough detail to help someone diagnose the
issue without going back to the cluster.

Closes #87
chrisguidry added a commit that referenced this issue Aug 29, 2023
In #87/#88, we logged additional information from the Kubernetes Event API when
a Job couldn't create a Pod.  In this change, we expand that to log additional
event information when a Pod can't start running.  This covers cases like
`ErrImagePull` or pod scheduling constraint failures that will prevent the Pod
from going into a Running state.

Closes #90
chrisguidry added a commit that referenced this issue Aug 30, 2023
* Extending the Kubernetes event logging to cover Pod events

In #87/#88, we logged additional information from the Kubernetes Event API when
a Job couldn't create a Pod.  In this change, we expand that to log additional
event information when a Pod can't start running.  This covers cases like
`ErrImagePull` or pod scheduling constraint failures that will prevent the Pod
from going into a Running state.

Closes #90

* Python 3.8 type hint compatibility
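
The Pod-level lookup described in the last two commits could take a similar shape; a rough sketch, again assuming the official kubernetes Python client (the pod name is illustrative):

# List Warning events for a specific Pod, e.g. ErrImagePull or FailedScheduling.
from kubernetes import client, config

def pod_warning_events(pod_name: str, namespace: str = "prefect") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = f"involvedObject.kind=Pod,involvedObject.name={pod_name},type=Warning"
    for event in core.list_namespaced_event(namespace, field_selector=selector).items:
        print(f"{event.reason}: {event.message}")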