
Agent not detecting flow crash when EC2 spot instance revoked #9246

Closed
paulinjo opened this issue Apr 17, 2023 · 10 comments · Fixed by PrefectHQ/prefect-kubernetes#85
Labels: bug (Something isn't working), needs:design (Blocked by a need for an implementation outline), needs:mre (Needs minimal reproduction)

Comments

@paulinjo

paulinjo commented Apr 17, 2023

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

We run Prefect on an EKS cluster composed primarily of EC2 spot instances. After receiving a BidEvictedEvent, the aws-node-termination-handler drains the node gracefully, killing any Prefect job pods that may be running on it.

Even though the Prefect agent raises an error that the job container cannot be found, Prefect Cloud leaves the flow run in a Running state instead of marking it as Crashed.

The flow run uses a Kubernetes infrastructure block.
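
For reference, the block is configured roughly like the sketch below. The namespace and image match the logs further down; the block name, TTL, and watch timeout are illustrative placeholders rather than our exact production settings.

from prefect.infrastructure import KubernetesJob

# Minimal sketch of the infrastructure block used by the affected deployments.
# Values here are illustrative placeholders, not the exact production config.
k8s_job = KubernetesJob(
    image="650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury:data-prefect.prefect-runtime",
    namespace="prefect-orion",
    finished_job_ttl=300,            # clean up finished jobs after 5 minutes
    job_watch_timeout_seconds=None,  # watch the job until it completes
)
k8s_job.save("prefect-job-block", overwrite=True)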

Reproduction

N/A

Logs

[
    {
        "@timestamp": "2023-04-17 12:03:34.902",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "log": "2023-04-17T12:03:34.902154444Z stdout F 2023/04/17 12:03:34 INF Adding new event to the event store event={\"AutoScalingGroupName\":\"\",\"Description\":\"Spot ITN received. Instance will be interrupted at 2023-04-17T12:05:31Z \\n\",\"EndTime\":\"0001-01-01T00:00:00Z\",\"EventID\":\"spot-itn-aca1aaae362f8bf5b28dcf1b0912c5ea65982e0dd63e647c80a2f78678d55334\",\"InProgress\":false,\"InstanceID\":\"\",\"IsManaged\":false,\"Kind\":\"SPOT_ITN\",\"Monitor\":\"SPOT_ITN_MONITOR\",\"NodeLabels\":null,\"NodeName\":\"ip-10-160-20-154.ec2.internal\",\"NodeProcessed\":false,\"Pods\":null,\"ProviderID\":\"\",\"StartTime\":\"2023-04-17T12:05:31Z\",\"State\":\"\"}"
        },
        "@logStream": "/fluentbit-default",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:03:36.168",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "hostname": "ip-10-160-20-154.ec2.internal",
            "message": "I0417 12:03:36.167940    4497 kuberuntime_container.go:702] \"Killing container with a grace period\" pod=\"prefect-orion/electric-pigeon-hmbp6-69q59\" podUID=66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4 containerName=\"prefect-job\" containerID=\"containerd://a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\" gracePeriod=30",
            "systemd_unit": "kubelet.service"
        },
        "@logStream": "kubelet.service-ip-10-160-20-154.ec2.internal",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:04:05.782",
        "@message": {
            "kubernetes": {
                "container_hash": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury@sha256:5426e855ad378c8c7be0cfd6c2cabe850a3b4879b5118f6f8ed791d8b539c62d",
                "container_image": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury:data-prefect.prefect-runtime",
                "container_name": "prefect-job",
                "docker_id": "a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
                "host": "ip-10-160-20-154.ec2.internal",
                "labels": {
                    "controller-uid": "00456970-bc69-4346-ba7a-b7a5529004bb",
                    "job-name": "electric-pigeon-hmbp6"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4",
                "pod_name": "electric-pigeon-hmbp6-69q59"
            },
            "log": "2023-04-17T12:04:05.782652673Z stderr F 12:04:05.781 | INFO    | Task run 'extract_signups_and_revisions-149' - Finished in state Completed()"
        },
        "@logStream": "electric-pigeon-hmbp6-69q59_prefect-orion_prefect-job-a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    },
    {
        "@timestamp": "2023-04-17 12:04:25.712",
        "@message": {
            "kubernetes": {
                "container_hash": "docker.io/prefecthq/prefect@sha256:e9f83df992b718a1f1a03c2567f3dbba120e6ef70ae9ba62efcdbbc0ef1a37d3",
                "container_image": "docker.io/prefecthq/prefect:2.9.0-python3.9",
                "container_name": "agent",
                "docker_id": "528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
                "host": "ip-10-160-10-242.ec2.internal",
                "labels": {
                    "app.kubernetes.io/instance": "prefect-agent-orion",
                    "app.kubernetes.io/name": "prefect-orion-agent",
                    "pod-template-hash": "8c7b86d78"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "b805558a-85fd-429a-a013-2867847b1b30",
                "pod_name": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x"
            },
            "log": "2023-04-17T12:04:25.712695276Z stdout F rpc error: code = NotFound desc = an error occurred when try to find container \"a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\": not found"
        },
        "@logStream": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x_prefect-orion_agent-528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    }
]

Versions

2.10.4

Additional context

No response

@paulinjo paulinjo added bug Something isn't working status:triage labels Apr 17, 2023
@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity. To keep this issue open, remove the stale label or leave a comment.

@paulinjo
Author

Have there been any updates that would address this?

@zanieb
Contributor

zanieb commented May 17, 2023

Can you provide a reproduction that does not rely on spot instance eviction? We will need to be able to test changes to resolve this. Ideally the example would not require AWS.

A possible solution is to report flow runs as CRASHED if the infrastructure cannot be found to report a status.
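
Roughly, that agent-side handling could look something like the sketch below (an illustrative sketch only, not a committed design; the function name and message text are hypothetical):

from prefect import get_client
from prefect.states import Crashed

async def report_missing_infrastructure(flow_run_id):
    # Sketch: if the job/container backing the run can no longer be found,
    # propose a Crashed state so the run does not sit in Running forever.
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id,
            state=Crashed(message="Infrastructure for this flow run was not found."),
        )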

@zanieb zanieb added needs:design Blocked by a need for an implementation outline status:accepted needs:mre Needs minimal reproduction and removed status:stale labels May 17, 2023
@paulinjo
Author

Unfortunately I cannot provide a reproduction outside of spot instance eviction.

All of our EKS clusters use exclusively spot instances for ETL jobs to cut back on cost, so this is entirely representative of our workloads.

We also had another instance of this last night which proved to be very disruptive, since some external systems rely on accurate flow run state.

@zangell44
Collaborator

@paulinjo I think the handling added for STOPPED jobs and missing containers in #10125 should resolve the issue you're seeing.

After the release today, could you try upgrading your agent and runtime environment to 2.10.21?

@paulinjo
Author

Sure thing.

@zangell44
Collaborator

Despite #10125, there are still reports of flow runs not being marked as Crashed correctly when spot instances are revoked. We are continuing to investigate.

@zhen0
Member

zhen0 commented Aug 3, 2023

Possibly connected to #10141

@zangell44
Collaborator

After testing internally, we think the remaining issue after #10125 is that the pod status does not have termination information, resulting in this error:

An error occurred while monitoring flow run '3baff259-cfac-4685-a4b0-2fc33504685e'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 530, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 857, in _watch_job
    return first_container_status.state.terminated.exit_code
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'exit_code'

The fix in PrefectHQ/prefect-kubernetes#85 should resolve the issue, and we will backport it to Prefect agents as well.
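
For anyone following along, the crash above happens because state.terminated can be None on the watched container's status. The approach is essentially a defensive check along these lines (an illustrative sketch of the idea, not the exact patch; the helper name and its pod, logger, and job_name parameters are hypothetical stand-ins for the surrounding worker code):

def _get_exit_code(pod, logger, job_name):
    # Sketch of the defensive handling; parameters stand in for worker state.
    first_container_status = pod.status.container_statuses[0]
    if first_container_status.state.terminated is None:
        # No termination info (e.g. the node was reclaimed out from under the pod),
        # so report a failure exit code instead of raising AttributeError.
        logger.error(f"Could not determine exit code for job {job_name!r}.")
        return -1
    return first_container_status.state.terminated.exit_code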

@zangell44
Collaborator

This issue should be resolved with the release of Prefect 2.11.4 today. Please let us know if you still experience issues!
