Agent not detecting flow crash when EC2 spot instance revoked #9246
Comments
This issue is stale because it has been open 30 days with no activity. To keep this issue open remove stale label or comment.
Have there been any updates which would have addressed this?
Can you provide a reproduction that does not rely on spot instance eviction? We will need to be able to test changes to resolve this. Ideally the example would not require AWS. A possible solution is to report flow runs as CRASHED if the infrastructure cannot be found to report a status.
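To make the proposal above concrete, here is a minimal sketch of the idea, not Prefect's actual implementation. `RunState`, `InfrastructureNotFound`, and `resolve_state` are hypothetical names invented for illustration; the only behaviour taken from this thread is "report CRASHED when the infrastructure cannot be found".

```python
from enum import Enum
from typing import Callable, Optional


class RunState(Enum):
    """Hypothetical subset of flow run states, for illustration only."""
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CRASHED = "CRASHED"


class InfrastructureNotFound(Exception):
    """Hypothetical error raised when the job's pod/container no longer exists."""


def resolve_state(lookup_exit_code: Callable[[], Optional[int]]) -> RunState:
    """Map the result of an infrastructure status lookup to a run state.

    `lookup_exit_code` is an assumed callable that returns the exit code,
    returns None while the job is still running, or raises
    InfrastructureNotFound when the infrastructure has disappeared.
    """
    try:
        exit_code = lookup_exit_code()
    except InfrastructureNotFound:
        # The infrastructure is gone, so no status can be reported: treat as a crash.
        return RunState.CRASHED
    if exit_code is None:
        return RunState.RUNNING
    # In this sketch any nonzero exit code is also treated as a crash.
    return RunState.COMPLETED if exit_code == 0 else RunState.CRASHED


def evicted_lookup() -> Optional[int]:
    """Simulates a pod that was removed when its node was drained."""
    raise InfrastructureNotFound("job container not found")
```

Because the lookup is injected as a callable, the eviction case can be simulated without AWS or a cluster: `resolve_state(evicted_lookup)` stands in for a drained node.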
Unfortunately, I cannot provide a reproduction outside of spot instance eviction. All of our EKS clusters use spot instances exclusively for ETL jobs to cut costs, so this is entirely representative of our workloads. We also had another instance of this last night, which proved to be very disruptive since some external systems rely on accurate flow run state.
Sure thing.
Despite #10125, there are still reports of flow runs not being marked as Crashed correctly when spot instances are revoked. We are continuing to investigate.
Possibly connected to #10141.
After testing internally, we think the remaining issue after #10125 is that the pod status does not have termination information, which results in an error.
The fix in PrefectHQ/prefect-kubernetes#85 should resolve the issue, and we will backport it to Prefect Agents too.
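For readers following along, the failure mode described above (a pod status with no termination information) can be illustrated with a small sketch. This is not the change made in PrefectHQ/prefect-kubernetes#85. The `Terminated` and `ContainerState` classes are hypothetical stand-ins that loosely mirror the shape of a Kubernetes container state, and the `-1` sentinel is an assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Terminated:
    """Hypothetical stand-in for a container's termination record."""
    exit_code: int


@dataclass
class ContainerState:
    """Hypothetical stand-in for a container state; `terminated` may be absent."""
    terminated: Optional[Terminated] = None


def exit_code_or_crash(state: Optional[ContainerState], crash_code: int = -1) -> int:
    """Return the container's exit code, or `crash_code` if it cannot be determined.

    Reading `state.terminated.exit_code` directly raises AttributeError when the
    pod was evicted before a termination record was written, so both missing
    pieces are checked explicitly.
    """
    if state is None or state.terminated is None:
        # No termination information: report a sentinel the caller can treat as a crash.
        return crash_code
    return state.terminated.exit_code
```

A caller that maps any nonzero result to a crashed state would then handle an evicted pod the same way as a container that exited with an error.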
This issue should be resolved with the release of Prefect 2.11.4 today. Please let us know if you still experience issues!
First check
Bug summary
We run Prefect on an EKS cluster made up primarily of EC2 spot instances. After receiving a BidEvictedEvent, the aws-node-termination-handler will drain the node gracefully, killing any Prefect job pods which may be running on it. Even though the Prefect agent raises an error that the job container cannot be found, Prefect Cloud leaves the flow run in a Running state instead of marking it as Crashed.
The flow run is using a Kubernetes infrastructure block.
Reproduction
Logs
Versions
Additional context
No response