Request: detect when jobs are blocked from starting due to k8s limits #87
I was able to reproduce this locally. My work pool doesn't specify cpu/memory requests for the pods it creates, so the flow run gets stuck and never starts.
Correction: after 5 minutes (my pod watch timeout), I did get a notice that the flow run had crashed because the pod never started.
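For reference, here is a minimal sketch of the standard Kubernetes container `resources` fragment that a work pool's base job template could declare so pods carry explicit requests. The values, and the exact place this belongs in your template, are assumptions to adapt, not anything prescribed by the worker:

```python
import json

# Standard Kubernetes container "resources" fragment (values are placeholders).
# Merge into the container spec of your work pool's base job template so the
# scheduler sees explicit requests/limits for flow-run pods.
resources_fragment = {
    "resources": {
        "requests": {"cpu": "500m", "memory": "512Mi"},
        "limits": {"cpu": "1", "memory": "1Gi"},
    }
}

if __name__ == "__main__":
    # Print as JSON for pasting into the job manifest's container spec.
    print(json.dumps(resources_fragment, indent=2))
```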
With the Kubernetes worker, we get the behavior we want: if a pod can't be scheduled, then after the work pool's pod watch timeout the flow run is reported as crashed, and the relevant logs from the worker are included. However, those logs are exceedingly vague, both for pods that couldn't be scheduled at all and for pods that start but then OOM. I'd like to expose more details from the Kubernetes worker when there's a scheduling failure, if possible.
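To make the "pod watch timeout" concrete, here is a minimal sketch of watching for a flow run's pod with the official `kubernetes` Python client. The namespace, label selector, and timeout are placeholders; this is not the worker's actual implementation:

```python
from kubernetes import client, config, watch

# Sketch: wait for a flow-run pod to reach Running, giving up after a timeout
# similar in spirit to the work pool's pod watch timeout.
config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "prefect"                      # hypothetical namespace
LABEL_SELECTOR = "prefect.io/flow-run-id"  # hypothetical label on the job's pods
TIMEOUT_SECONDS = 300

started = False
w = watch.Watch()
for event in w.stream(
    core.list_namespaced_pod,
    namespace=NAMESPACE,
    label_selector=LABEL_SELECTOR,
    timeout_seconds=TIMEOUT_SECONDS,
):
    pod = event["object"]
    if pod.status.phase == "Running":
        started = True
        w.stop()

if not started:
    # This is the point where the logs are currently vague: all we know is
    # that the pod never reached Running before the watch timed out.
    print(f"Pod did not start within {TIMEOUT_SECONDS}s")
```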
[Note, this issue should move to the …]
When a Job can't schedule a Pod, log the Job's recent events

In cases where the Job definition itself is fine but the Job is unable to schedule any Pods due to scheduling constraints (resources, node availability, etc.), we weren't able to give users much more information than that the Pod for their flow never started. With this change, we'll go inspect any events related to that Job and include them in the logs. These events include things like scheduling constraint violations, in enough detail to help someone diagnose the issue without going back to the cluster. Closes #87
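A rough sketch of the kind of lookup described above, using the official `kubernetes` Python client. The namespace and job name are placeholders, and this is not necessarily the exact code in the change:

```python
from kubernetes import client, config

# Sketch: read the events associated with a Job so they can be surfaced in
# the flow run's logs.
config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "prefect"         # hypothetical
JOB_NAME = "my-flow-run-job"  # hypothetical

events = core.list_namespaced_event(
    NAMESPACE,
    field_selector=f"involvedObject.kind=Job,involvedObject.name={JOB_NAME}",
)
for e in events.items:
    # e.reason is a short code (e.g. FailedCreate); e.message carries the
    # scheduling-constraint detail that helps diagnose the failure.
    print(f"{e.type} {e.reason}: {e.message}")
```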
Extending the Kubernetes event logging to cover Pod events

In #87/#88, we logged additional information from the Kubernetes Event API when a Job couldn't create a Pod. In this change, we expand that to log additional event information when a Pod can't start running. This covers cases like `ErrImagePull` or pod scheduling constraint failures that will prevent the Pod from going into a Running state. Closes #90

* Python 3.8 type hint compatibility
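The same idea extended to Pod-level events, again as a hedged sketch with placeholder names rather than the PR's actual implementation:

```python
from kubernetes import client, config

# Sketch: surface Warning events (e.g. ErrImagePull, FailedScheduling) for a
# pod that never reached Running.
config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "prefect"             # hypothetical
POD_NAME = "my-flow-run-pod-abc"  # hypothetical

events = core.list_namespaced_event(
    NAMESPACE,
    field_selector=f"involvedObject.kind=Pod,involvedObject.name={POD_NAME}",
)
for e in events.items:
    if e.type == "Warning":
        # These messages typically explain exactly why the pod can't start,
        # e.g. an unpullable image or unsatisfiable scheduling constraints.
        print(f"{e.reason}: {e.message}")
```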
Description
My k8s namespace has resource limits (max total CPU and RAM). If starting a Prefect job pod would go over those limits, it just never starts, and the flow on the UI hangs indefinitely. The agent logs still say `Completed deployment of flow run`. When I increased the limits, the job pod started immediately and the flow completed as normal.

Expected Behavior
It would be nice if this scenario could be detected and surfaced as a useful error at the UI or agent level.
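One way such detection could work (an illustration only, not an existing Prefect feature) is to report the namespace's ResourceQuota usage when a job pod fails to appear:

```python
from kubernetes import client, config

# Sketch: print the namespace's ResourceQuota usage so a blocked flow run can
# be explained instead of hanging silently. Namespace is a placeholder.
config.load_kube_config()
core = client.CoreV1Api()

NAMESPACE = "prefect"  # hypothetical

for quota in core.list_namespaced_resource_quota(NAMESPACE).items:
    hard = quota.status.hard or {}
    used = quota.status.used or {}
    print(f"ResourceQuota {quota.metadata.name}:")
    for resource, limit in hard.items():
        # e.g. "requests.cpu: 4 used of 4" shows the quota is exhausted and
        # explains why a new job pod is never scheduled.
        print(f"  {resource}: {used.get(resource, '0')} used of {limit}")
```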