
Agent not detecting flow crash when EC2 spot instance revoked #9246

Closed
paulinjo opened this issue Apr 17, 2023 · 10 comments · Fixed by PrefectHQ/prefect-kubernetes#85
Labels: bug (Something isn't working), needs:design (Blocked by a need for an implementation outline), needs:mre (Needs minimal reproduction)

Comments

@paulinjo

paulinjo commented Apr 17, 2023

First check

  • I added a descriptive title to this issue.
  • I used the GitHub search to find a similar issue and didn't find it.
  • I searched the Prefect documentation for this issue.
  • I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

We run Prefect on an EKS cluster composed primarily of EC2 spot instances. After receiving a BidEvictedEvent, the aws-node-termination-handler drains the node gracefully, killing any Prefect job pods that may be running on it.

Even though the Prefect agent raises an error that the job container cannot be found, Prefect Cloud leaves the flow run in a Running state instead of marking it as Crashed.

The flow run uses a Kubernetes infrastructure block.
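
For reference, the block is configured roughly like the sketch below. The namespace and image match the logs further down; the block name, TTL, and watch timeout are illustrative placeholders rather than our exact production settings.

from prefect.infrastructure import KubernetesJob

# Minimal sketch of the infrastructure block used by the affected deployments.
# Values here are illustrative placeholders, not the exact production config.
k8s_job = KubernetesJob(
    image="650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury:data-prefect.prefect-runtime",
    namespace="prefect-orion",
    finished_job_ttl=300,            # clean up finished jobs after 5 minutes
    job_watch_timeout_seconds=None,  # watch the job until it completes
)
k8s_job.save("prefect-job-block", overwrite=True)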

Reproduction

N/A

Logs

[
    {
        "@timestamp": "2023-04-17 12:03:34.902",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "log": "2023-04-17T12:03:34.902154444Z stdout F 2023/04/17 12:03:34 INF Adding new event to the event store event={\"AutoScalingGroupName\":\"\",\"Description\":\"Spot ITN received. Instance will be interrupted at 2023-04-17T12:05:31Z \\n\",\"EndTime\":\"0001-01-01T00:00:00Z\",\"EventID\":\"spot-itn-aca1aaae362f8bf5b28dcf1b0912c5ea65982e0dd63e647c80a2f78678d55334\",\"InProgress\":false,\"InstanceID\":\"\",\"IsManaged\":false,\"Kind\":\"SPOT_ITN\",\"Monitor\":\"SPOT_ITN_MONITOR\",\"NodeLabels\":null,\"NodeName\":\"ip-10-160-20-154.ec2.internal\",\"NodeProcessed\":false,\"Pods\":null,\"ProviderID\":\"\",\"StartTime\":\"2023-04-17T12:05:31Z\",\"State\":\"\"}"
        },
        "@logStream": "/fluentbit-default",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:03:36.168",
        "@message": {
            "az": "us-east-1b",
            "ec2_instance_id": "i-0bfbe6ed3e71b9d24",
            "hostname": "ip-10-160-20-154.ec2.internal",
            "message": "I0417 12:03:36.167940    4497 kuberuntime_container.go:702] \"Killing container with a grace period\" pod=\"prefect-orion/electric-pigeon-hmbp6-69q59\" podUID=66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4 containerName=\"prefect-job\" containerID=\"containerd://a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\" gracePeriod=30",
            "systemd_unit": "kubelet.service"
        },
        "@logStream": "kubelet.service-ip-10-160-20-154.ec2.internal",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/dataplane"
    },
    {
        "@timestamp": "2023-04-17 12:04:05.782",
        "@message": {
            "kubernetes": {
                "container_hash": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury@sha256:5426e855ad378c8c7be0cfd6c2cabe850a3b4879b5118f6f8ed791d8b539c62d",
                "container_image": "650551417061.dkr.ecr.us-east-1.amazonaws.com/mercury:data-prefect.prefect-runtime",
                "container_name": "prefect-job",
                "docker_id": "a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
                "host": "ip-10-160-20-154.ec2.internal",
                "labels": {
                    "controller-uid": "00456970-bc69-4346-ba7a-b7a5529004bb",
                    "job-name": "electric-pigeon-hmbp6"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "66249ddb-ac0a-4d30-bde3-33e4e5cf2bb4",
                "pod_name": "electric-pigeon-hmbp6-69q59"
            },
            "log": "2023-04-17T12:04:05.782652673Z stderr F 12:04:05.781 | INFO    | Task run 'extract_signups_and_revisions-149' - Finished in state Completed()"
        },
        "@logStream": "electric-pigeon-hmbp6-69q59_prefect-orion_prefect-job-a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    },
    {
        "@timestamp": "2023-04-17 12:04:25.712",
        "@message": {
            "kubernetes": {
                "container_hash": "docker.io/prefecthq/prefect@sha256:e9f83df992b718a1f1a03c2567f3dbba120e6ef70ae9ba62efcdbbc0ef1a37d3",
                "container_image": "docker.io/prefecthq/prefect:2.9.0-python3.9",
                "container_name": "agent",
                "docker_id": "528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
                "host": "ip-10-160-10-242.ec2.internal",
                "labels": {
                    "app.kubernetes.io/instance": "prefect-agent-orion",
                    "app.kubernetes.io/name": "prefect-orion-agent",
                    "pod-template-hash": "8c7b86d78"
                },
                "namespace_name": "prefect-orion",
                "pod_id": "b805558a-85fd-429a-a013-2867847b1b30",
                "pod_name": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x"
            },
            "log": "2023-04-17T12:04:25.712695276Z stdout F rpc error: code = NotFound desc = an error occurred when try to find container \"a56248c957372eeef0ce3fa7d26de3725d98ce21f4d34ea4b37a92f46836b2e0\": not found"
        },
        "@logStream": "prefect-agent-orion-prefect-orion-agent-8c7b86d78-th67x_prefect-orion_agent-528412c4de7d466e22dd71f54072d6e02be599c8606fd6289eb82f3c9c5c1365",
        "@log": "650551417061:/aws/containerinsights/atropos-butter-prod/application"
    }
]

Versions

2.10.4

Additional context

No response

@paulinjo paulinjo added bug Something isn't working status:triage labels Apr 17, 2023
@github-actions
Contributor

This issue is stale because it has been open for 30 days with no activity. To keep this issue open, remove the stale label or leave a comment.

@paulinjo
Author

Have there been any updates that would address this?

@zanieb
Contributor

zanieb commented May 17, 2023

Can you provide a reproduction that does not rely on spot instance eviction? We will need to be able to test changes to resolve this. Ideally the example would not require AWS.

A possible solution is to report flow runs as CRASHED if the infrastructure cannot be found to report a status.
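
Roughly, that agent-side handling could look something like the sketch below (an illustrative sketch only, not a committed design; the function name and message text are hypothetical):

from prefect import get_client
from prefect.states import Crashed

async def report_missing_infrastructure(flow_run_id):
    # Sketch: if the job/container backing the run can no longer be found,
    # propose a Crashed state so the run does not sit in Running forever.
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id,
            state=Crashed(message="Infrastructure for this flow run was not found."),
        )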

@zanieb zanieb added needs:design Blocked by a need for an implementation outline status:accepted needs:mre Needs minimal reproduction and removed status:stale labels May 17, 2023
@paulinjo
Author

Unfortunately I cannot provide a reproduction outside of spot instance eviction.

All of our EKS clusters use exclusively spot instances for ETL jobs to cut back on cost, so this is entirely representative of our workloads.

We also had another instance of this last night which proved to be very disruptive, since some external systems rely on accurate flow run state.

@zangell44
Collaborator

@paulinjo I think the handling added for STOPPED jobs and missing containers in #10125 should resolve the issue you're seeing.

After the release today, could you try upgrading your agent and runtime environment to 2.10.21?

@paulinjo
Author

Sure thing.

@zangell44
Collaborator

Despite #10125, there are still reports of flow runs not being marked as Crashed correctly when spot instances are revoked. We are continuing to investigate.

@zhen0
Member

zhen0 commented Aug 3, 2023

Possibly connected to #10141

@zangell44
Collaborator

After testing internally, we think the remaining issue after #10125 is that the pod status does not have termination information, resulting in this error:

An error occurred while monitoring flow run '3baff259-cfac-4685-a4b0-2fc33504685e'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 530, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 857, in _watch_job
    return first_container_status.state.terminated.exit_code
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'exit_code'

The fix in PrefectHQ/prefect-kubernetes#85 should resolve the issue, and we will backport it to Prefect agents as well.
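
For anyone following along, the crash above happens because state.terminated can be None on the watched container's status. The approach is essentially a defensive check along these lines (an illustrative sketch of the idea, not the exact patch; the helper name and its pod, logger, and job_name parameters are hypothetical stand-ins for the surrounding worker code):

def _get_exit_code(pod, logger, job_name):
    # Sketch of the defensive handling; parameters stand in for worker state.
    first_container_status = pod.status.container_statuses[0]
    if first_container_status.state.terminated is None:
        # No termination info (e.g. the node was reclaimed out from under the pod),
        # so report a failure exit code instead of raising AttributeError.
        logger.error(f"Could not determine exit code for job {job_name!r}.")
        return -1
    return first_container_status.state.terminated.exit_code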

@zangell44
Collaborator

This issue should be resolved with the release of Prefect 2.11.4 today. Please let us know if you still experience issues!
