This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Request: detect when jobs are blocked from starting due to k8s limits #87

Closed
ShardPhoenix opened this issue Nov 11, 2021 · 4 comments · Fixed by #88

Comments

@ShardPhoenix

Description

My k8s namespace has resource limits (a maximum total CPU and RAM). If starting a Prefect job pod would exceed those limits, the pod simply never starts and the flow hangs indefinitely in the UI. The agent logs still say "Completed deployment of flow run". When I increased the limits, the job pod started immediately and the flow completed as normal.

Expected Behavior

It would be nice if this scenario could be detected and result in a useful error on the UI or agent level.

@zanieb zanieb self-assigned this Nov 11, 2021
@zanieb zanieb removed their assignment Dec 5, 2022
@chrisguidry
Contributor

I was able to reproduce this on a local minikube running Prefect's Kubernetes Worker by adding a ResourceQuota to the namespace:

namespace-quotas.yaml 
apiVersion: v1
kind: ResourceQuota
metadata:
  namespace: prefect
  name: mem-cpu-quota
spec:
  hard:
    requests.cpu: "250m"
    requests.memory: 50Mi
    limits.cpu: "500m"
    limits.memory: 100Mi

My work pool doesn't specify CPU/memory requests for the pods it creates, so the flow run is stuck in Pending. What's concerning is that there's no indication of an error in my worker's logs.
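
For anyone reproducing this, one way to confirm the quota is the culprit is to compare each ResourceQuota's used values against its hard limits. A minimal sketch, assuming the official kubernetes Python client and a reachable cluster (the namespace name is illustrative):

# Compare a namespace's ResourceQuota usage against its hard limits.
from kubernetes import client, config

def print_quota_headroom(namespace: str = "prefect") -> None:
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    for quota in core.list_namespaced_resource_quota(namespace).items:
        print(f"ResourceQuota {quota.metadata.name!r} in namespace {namespace!r}:")
        hard = quota.status.hard or {}
        used = quota.status.used or {}
        for resource, limit in hard.items():
            print(f"  {resource}: used {used.get(resource, '0')} of {limit}")

if __name__ == "__main__":
    print_quota_headroom()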

@chrisguidry
Contributor

Correction, after 5 minutes (my pod watch timeout), I did get a notice that the flow run had crashed because the pod never started:

prefect-worker-774d5b666-7smb2 prefect-worker 18:16:45.799 | ERROR   | prefect.flow_runs.worker - Job 'classic-limpet-cmmkp': Pod never started.
prefect-worker-774d5b666-7smb2 prefect-worker 18:16:45.937 | INFO    | prefect.flow_runs.worker - Reported flow run '07e54c68-2098-4c0b-accd-f71b44d23e7d' as crashed: Flow run infrastructure exited with non-zero status code -1.

@chrisguidry
Contributor

With the Kubernetes worker, we are getting the behavior we want: if a pod can't be scheduled, the flow run is reported as crashed after the work pool's pod watch timeout, and the relevant worker logs are included as well.

However, the logs are exceedingly vague:

For pods that couldn't be scheduled at all:

Flow run infrastructure exited with non-zero status code -1.

For pods that start but are then OOM-killed:

Flow run infrastructure exited with non-zero status code 137.

I'd like to expose more details from the Kubernetes worker when there's a scheduling failure, if possible.
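
A minimal sketch of the kind of event lookup that could surface those details, assuming the official kubernetes Python client (the job name below is illustrative; this is not the worker's actual implementation):

# List the events Kubernetes recorded for a specific Job, e.g. quota violations.
from kubernetes import client, config

def job_events(job_name: str, namespace: str = "prefect") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = f"involvedObject.kind=Job,involvedObject.name={job_name}"
    for event in core.list_namespaced_event(namespace, field_selector=selector).items:
        # e.g. "Warning FailedCreate: ... exceeded quota: mem-cpu-quota ..."
        print(f"{event.type} {event.reason}: {event.message}")

if __name__ == "__main__":
    job_events("classic-limpet-cmmkp")

Surfacing output like this in the crash report would tell users exactly which quota or constraint blocked the pod, without a trip back to the cluster.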

@chrisguidry
Contributor

[Note, this issue should move to the prefect-kubernetes collection]

@chrisguidry chrisguidry transferred this issue from PrefectHQ/prefect Aug 28, 2023
chrisguidry added a commit that referenced this issue Aug 28, 2023
In cases where the Job definition itself is fine but the Job is unable to
schedule any Pods due to scheduling constraints (resources, node availability,
etc), we weren't able to give users much more information than that the Pod for
their flow never started.  With this change, we'll go inspect any events related
to that Job and include them in the logs.  These events include things like
scheduling constraint violations, in enough detail to help someone diagnose the
issue without going back to the cluster.

Closes #87
chrisguidry added a commit that referenced this issue Aug 29, 2023
When a Job can't schedule a Pod, log the Job's recent events

In cases where the Job definition itself is fine but the Job is unable to
schedule any Pods due to scheduling constraints (resources, node availability,
etc), we weren't able to give users much more information than that the Pod for
their flow never started.  With this change, we'll go inspect any events related
to that Job and include them in the logs.  These events include things like
scheduling constraint violations, in enough detail to help someone diagnose the
issue without going back to the cluster.

Closes #87
chrisguidry added a commit that referenced this issue Aug 29, 2023
In #87/#88, we logged additional information from the Kubernetes Event API when
a Job couldn't create a Pod.  In this change, we expand that to log additional
event information when a Pod can't start running.  This covers cases like
`ErrImagePull` or pod scheduling constraint failures that will prevent the Pod
from going into a Running state.

Closes #90
chrisguidry added a commit that referenced this issue Aug 30, 2023
* Extending the Kubernetes event logging to cover Pod events

In #87/#88, we logged additional information from the Kubernetes Event API when
a Job couldn't create a Pod.  In this change, we expand that to log additional
event information when a Pod can't start running.  This covers cases like
`ErrImagePull` or pod scheduling constraint failures that will prevent the Pod
from going into a Running state.

Closes #90

* Python 3.8 type hint compatibility
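
The Pod-level lookup described in the last two commits could take a similar shape; a rough sketch, again assuming the official kubernetes Python client (the pod name is illustrative):

# List Warning events for a specific Pod, e.g. ErrImagePull or FailedScheduling.
from kubernetes import client, config

def pod_warning_events(pod_name: str, namespace: str = "prefect") -> None:
    config.load_kube_config()
    core = client.CoreV1Api()
    selector = f"involvedObject.kind=Pod,involvedObject.name={pod_name},type=Warning"
    for event in core.list_namespaced_event(namespace, field_selector=selector).items:
        print(f"{event.reason}: {event.message}")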