
Pod OOM sometimes causes a workflow to get stuck in "Running" state #8456

Closed
3 tasks done
acj opened this issue Apr 22, 2022 · 21 comments · Fixed by #8478
Labels: area/controller, type/bug, type/regression

Comments


acj commented Apr 22, 2022

Checklist

  • Double-checked my configuration.
  • Tested using the latest version.
  • Used the Emissary executor.

Summary

This follows up on #7457.

What happened/what you expected to happen?

A small percentage of our pods fail with an OOMKilled status and should be restarted (per our workflow config) but aren't. The pod (and workflow) hangs around indefinitely without appearing to do anything. It seems like the controller loses track of the pod until it's either manually deleted or its hosting node is shut down. If either of those things happens, then the pod is restarted and the workflow usually succeeds.

It's expected that sometimes we'll run out of memory when running this workload. What's surprising is that most OOMKilled pods restart automatically, but some don't.

For a better sense of proportion, we run roughly 90,000 workflows per day. A few dozen of them get stuck in this state and have to be fixed manually.
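For reference, the manual fix is simply deleting the stuck pod, after which it is recreated and the retry proceeds (pod name illustrative):

$ kubectl -n argo-workflows delete pod my-workflow-8acc183e-ffrf2-1452349024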

The expected behavior is that Argo restarts the OOMKilled pod automatically until the retry limit is reached.

What version are you running?

3.3.2 for both the controller and executor.

Diagnostics

We're running on EKS 1.21 and use cluster-autoscaler. I don't think the problem is related to scaling, though, since we see the issue in pools of one node as well as pools of many nodes.

The stuck pods look like this:

$ kubectl -n argo-workflows get po | grep OOMKilled
my-workflow-8acc183e-ffrf2-1452349024        1/2     OOMKilled         0          38m

Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.

The below workflow matches our config as closely as possible -- TTL, retries, etc. The resource limits are slightly lower. If you run it enough times, it should be possible to reproduce the bug.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  namespace: argo-workflows
  name: oom-test
spec:
  ttlStrategy:
    secondsAfterSuccess: 60
    secondsAfterFailure: 300
  entrypoint: oom-test
  templates:
    - name: oom-test
      retryStrategy:
        limit: "3"
        retryPolicy: Always
      container:
        image: ruby:3.0-alpine
        imagePullPolicy: IfNotPresent
        command: ["ruby", "-e", "foo = []; while true; foo << \"abcdefgh\"; end"]
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
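One way to run it enough times is to create the template and submit it in a loop (assumes the argo CLI; the file name is a placeholder for the template above):

$ kubectl apply -n argo-workflows -f oom-test.yaml   # hypothetical file containing the template above
$ for i in $(seq 1 100); do argo submit -n argo-workflows --from workflowtemplate/oom-test; done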
# Logs from the workflow controller:
$ kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 
Found 3 pods, using pod/workflow-controller-6dd5db7b67-7wbkr
time="2022-04-22T19:21:28.908Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.908Z" level=info msg="node unchanged" nodeID=my-workflow-8acc183e-ffrf2-1452349024
time="2022-04-22T19:21:28.909Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2
time="2022-04-22T19:21:28.909Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=my-workflow-8acc183e-ffrf2

# The workflow's pods that are problematic:
$ kubectl -n argo-workflows describe po my-workflow-8acc183e-ffrf2-1452349024
Name:         my-workflow-8acc183e-ffrf2-1452349024
Namespace:    argo-workflows
Priority:     0
Node:         ip-10-161-69-226.ec2.internal/10.161.69.226
Start Time:   Fri, 22 Apr 2022 14:39:19 -0400
Labels:       workflows.argoproj.io/completed=false
              workflows.argoproj.io/workflow=my-workflow-8acc183e-ffrf2
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              cni.projectcalico.org/containerID: 4295e8d5db4862b7ec687add0197b23e86795f2cc52885eb279976b00a64d98c
              cni.projectcalico.org/podIP: 192.168.80.248/32
              cni.projectcalico.org/podIPs: 192.168.80.248/32
              kubectl.kubernetes.io/default-container: main
              kubernetes.io/psp: eks.privileged
              workflows.argoproj.io/node-id: my-workflow-8acc183e-ffrf2-1452349024
              workflows.argoproj.io/node-name: my-workflow-8acc183e-ffrf2(0)
Status:       Running
IP:           192.168.80.248
IPs:
  IP:           192.168.80.248
Controlled By:  Workflow/my-workflow-8acc183e-ffrf2
Init Containers:
  init:
    Container ID:  containerd://c7096309368e2dfc3ec809ec7dfbe9f5c164b3780b7fe0fe125023efddf93d1b
    Image:         quay.io/argoproj/argoexec:v3.3.2
    Image ID:      quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      init
      --loglevel
      info
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 22 Apr 2022 14:39:48 -0400
      Finished:     Fri, 22 Apr 2022 14:39:48 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:                      my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:    emissary
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 my-workflow-8acc183e-ffrf2
      ARGO_CONTAINER_NAME:                init
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Containers:
  wait:
    Container ID:  containerd://51b5e17ea47c919f6d4cc60ef359587c1da93d3e007b23312e43637c88abb78e
    Image:         quay.io/argoproj/argoexec:v3.3.2
    Image ID:      quay.io/argoproj/argoexec@sha256:817a790f1bbd2fd31b71c2126e5489e06b49a98936352e4851fd0f325e3332c3
    Port:          <none>
    Host Port:     <none>
    Command:
      argoexec
      wait
      --loglevel
      info
    State:          Running
      Started:      Fri, 22 Apr 2022 14:39:55 -0400
    Ready:          True
    Restart Count:  0
    Environment:
      ARGO_POD_NAME:                      my-workflow-8acc183e-ffrf2-1452349024 (v1:metadata.name)
      ARGO_POD_UID:                        (v1:metadata.uid)
      ARGO_CONTAINER_RUNTIME_EXECUTOR:    emissary
      GODEBUG:                            x509ignoreCN=0
      ARGO_WORKFLOW_NAME:                 my-workflow-8acc183e-ffrf2
      ARGO_CONTAINER_NAME:                wait
      ARGO_TEMPLATE:                      ...
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /mainctrfs/tmp from ephemeral-work-volume (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
  main:
    Container ID:  containerd://6abf9a62c12400543e2e358c074af5bbe0249f2781af699da2025fd19dee61f1
    Image:         ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e
    Image ID:      ....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images@sha256:8bc8de084174cd09b7f9271e0468416d8354057f5227dea72028c656e79be981
    Port:          <none>
    Host Port:     <none>
    Command:
      /var/run/argo/argoexec
      emissary
      --
      ...
    Args:
      ...
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Fri, 22 Apr 2022 14:39:55 -0400
      Finished:     Fri, 22 Apr 2022 14:41:01 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  512Mi
    Environment:
      AWS_REGION:                         us-east-1
      AWS_ACCOUNT_NUMBER:                 ...
      AWS_SCHEME:                         https://
      ENVIRONMENT:                        production
      LOG_LEVEL:                          info
      RUNNING_ON_ARGO_WORKER:             true
      ARGO_CONTAINER_NAME:                main
      ARGO_TEMPLATE:                      [...]
      ARGO_NODE_ID:                       my-workflow-8acc183e-ffrf2-1452349024
      ARGO_INCLUDE_SCRIPT_OUTPUT:         false
      ARGO_DEADLINE:                      0001-01-01T00:00:00Z
      ARGO_PROGRESS_FILE:                 /var/run/argo/progress
      ARGO_PROGRESS_PATCH_TICK_DURATION:  1m0s
      ARGO_PROGRESS_FILE_TICK_DURATION:   3s
    Mounts:
      /tmp from ephemeral-work-volume (rw)
      /var/run/argo from var-run-argo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-p6tnt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  var-run-argo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  ephemeral-work-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-p6tnt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              ...
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  39m   default-scheduler  Successfully assigned argo-workflows/my-workflow-8acc183e-ffrf2-1452349024 to ip-10-161-69-226.ec2.internal
  Normal  Pulled     39m   kubelet            Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
  Normal  Created    39m   kubelet            Created container init
  Normal  Started    39m   kubelet            Started container init
  Normal  Pulled     39m   kubelet            Container image "quay.io/argoproj/argoexec:v3.3.2" already present on machine
  Normal  Created    39m   kubelet            Created container wait
  Normal  Started    39m   kubelet            Started container wait
  Normal  Pulled     39m   kubelet            Container image "....dkr.ecr.us-east-1.amazonaws.com/argo-workflow-images:my-app-my-workflow-default-8acc183e" already present on machine
  Normal  Created    39m   kubelet            Created container main
  Normal  Started    39m   kubelet            Started container main

# Logs from your workflow's wait container:
$ kubectl -n argo-workflows logs my-workflow-8acc183e-ffrf2-1452349024 wait
time="2022-04-22T18:39:55.721Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.3.2
time="2022-04-22T18:39:55.724Z" level=info msg="Creating a emissary executor"
time="2022-04-22T18:39:55.725Z" level=info msg="Using executor retry strategy" Duration=1s Factor=1.6 Jitter=0.5 Steps=5
time="2022-04-22T18:39:55.725Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=false namespace=argo-workflows podName=my-workflow-8acc183e-ffrf2-1452349024 template="..."
time="2022-04-22T18:39:55.725Z" level=info msg="Starting deadline monitor"
time="2022-04-22T18:44:55.726Z" level=info msg="Alloc=4954 TotalAlloc=10471 Sys=19922 NumGC=5 Goroutines=7"
time="2022-04-22T18:49:55.726Z" level=info msg="Alloc=4976 TotalAlloc=10582 Sys=19922 NumGC=7 Goroutines=7"
time="2022-04-22T18:54:55.726Z" level=info msg="Alloc=4957 TotalAlloc=10692 Sys=19922 NumGC=10 Goroutines=7"
time="2022-04-22T18:59:55.726Z" level=info msg="Alloc=4978 TotalAlloc=10800 Sys=19922 NumGC=12 Goroutines=7"
time="2022-04-22T19:04:55.726Z" level=info msg="Alloc=4958 TotalAlloc=10909 Sys=19922 NumGC=15 Goroutines=7"
time="2022-04-22T19:09:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11017 Sys=19922 NumGC=17 Goroutines=7"
time="2022-04-22T19:14:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11126 Sys=19922 NumGC=20 Goroutines=7"
time="2022-04-22T19:19:55.725Z" level=info msg="Alloc=4979 TotalAlloc=11235 Sys=19922 NumGC=22 Goroutines=7"
time="2022-04-22T19:24:55.726Z" level=info msg="Alloc=4959 TotalAlloc=11343 Sys=19922 NumGC=25 Goroutines=7"
time="2022-04-22T19:29:55.726Z" level=info msg="Alloc=4979 TotalAlloc=11451 Sys=19922 NumGC=27 Goroutines=7"
time="2022-04-22T19:34:55.725Z" level=info msg="Alloc=4959 TotalAlloc=11560 Sys=19922 NumGC=30 Goroutines=7"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.


alexec commented Apr 25, 2022

#7457


alexec commented Apr 25, 2022

I think this is an emissary-only issue:

  • The main container is OOMKilled.
  • The wait container is waiting for /var/run/argo/ctr/main/exitcode to appear, which never happens (see the check below).
  • The controller should SIGTERM the wait container when the main container terminates.
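A quick way to confirm this on a stuck pod is to check for the exit-code file from inside the wait container (pod name illustrative):

$ kubectl -n argo-workflows exec my-workflow-8acc183e-ffrf2-1452349024 -c wait -- ls /var/run/argo/ctr/main/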

alexec added the area/controller and type/regression labels Apr 25, 2022

alexec commented Apr 25, 2022

fix(controller): Only clean-up pod when both main and wait containers have terminated. Fixes #5981 (#6033)

alexec self-assigned this Apr 25, 2022
alexec added a commit to alexec/argo-workflows that referenced this issue Apr 25, 2022
alexec added a commit to alexec/argo-workflows that referenced this issue Apr 25, 2022

hadim commented Apr 26, 2022

We are also seeing this issue with 3.2.9. Is it possible/planned to backport the fix in #8478 to 3.2.x?


alexec commented Apr 26, 2022

We are also seeing this issue with 3.2.9. Is it possible/planned to backport the fix in #8478 to 3.2.x?

Would you be interested in testing the fix? Given that it happens irregularly, it would be great if someone could soak-test it.


hadim commented Apr 26, 2022

It has only happened once so far, after a very long run (~12h). I checked the OOM kill by looking at the memory-vs-time graph, and I can confirm the process was stopped during a memory spike at 100%.

I don't think it will be easy to reproduce, and our only cluster is in prod currently (so I can't really test a fix easily).

I will post here if I can reproduce it with a script that triggers the bug quickly.


hadim commented Apr 26, 2022

In case it helps with reproduction, the workload was a bash script calling a Python script.

alexec added a commit that referenced this issue Apr 26, 2022

hadim commented Apr 26, 2022

@alexec I am planning to work on a minimal reproducible example. Is there any plan to cherry-pick that PR for 3.2.x?


hadim commented Apr 30, 2022

For the record, I tried to reproduce it with the following template:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memory-oom-killed-
  labels:
    organization: internal
    project: pantagruel-tests
spec:
  entrypoint: entrypoint
  templates:
    - name: entrypoint
      script:
        image: mambaorg/micromamba:0.23.0
        command: [bash]
        source: |
          set -e

          eval "$(micromamba shell hook --shell=bash)"
          micromamba activate
          micromamba install -q -y "python=3.9" pip git numpy joblib -c conda-forge

          PYTHON_CODE=$(cat <<END

          import numpy as np
          from joblib import Parallel, delayed

          def leak():
            a = np.zeros(1_000_000_000_000)
            for i in range(1_000_000_000_000):
                a[i] = i

          Parallel(n_jobs=2)(delayed(leak)() for i in range(10))

          END
          )

          echo "START"
          python -c "$PYTHON_CODE"
          echo "DONE"

        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "64Mi"
            cpu: "250m"

But I can't reproduce the bug: the workflow is correctly stopped with an "OOMKilled" error and marked as Failed.

I'll keep monitoring this and try to reproduce it.

alexec mentioned this issue May 3, 2022
alexec added a commit that referenced this issue May 3, 2022
alexec added a commit that referenced this issue May 3, 2022

acj commented May 5, 2022

@alexec I was going to try 3.3.5 today since the fix was included, but the changelog also mentions that it was reverted (93cb050). Are you going to reintroduce the fix later in 3.3.x?


hadim commented May 6, 2022

We have been able to reproduce the bug (but we can't share the workflow). We are currently experiencing the bug on 3.2.11. I am happy to test it with 3.3.* if it contains the fix.

Let me know.


alexec commented May 6, 2022

This fix is on master, not on v3.3. You'll need to test that.


hadim commented May 6, 2022

Thanks @alexec. I'll keep you updated here. Do you build Docker images from master so I can test it?


In case it helps, I might have a guess (at least for my use case): I cannot reproduce the issue by simply running a Python script that saturates the memory. In that case the executor behaves as expected and detects the OOM.

The error happens when we run deeply nested Python code that uses nested child processes consuming a lot of memory. My guess is that since the OOM kill lands on a child process, the executor cannot detect it.
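As a rough illustration (not our actual code), the shape of the workload is a parent Python process that fans out to child processes which each allocate until the container limit is hit, so the OOM killer picks off a child rather than the process the executor is watching:

import multiprocessing as mp

def child():
    # Grow a buffer until the container's memory limit is exceeded;
    # the kernel OOM killer then kills this child, not the parent.
    data = []
    while True:
        data.append(bytearray(10 * 1024 * 1024))  # 10 MiB per iteration

if __name__ == "__main__":
    procs = [mp.Process(target=child) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()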


alexec commented May 6, 2022

Tip of master is published as :latest. You can run that.
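For example, to point an existing install at the images built from master (a sketch; adjust the namespace and container name to your install, and note the executor image must follow as well, e.g. via the controller's --executor-image argument):

$ kubectl -n argo set image deployment/workflow-controller workflow-controller=quay.io/argoproj/workflow-controller:latest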


hadim commented May 6, 2022

I still see the issue using :latest. I will try to reproduce it in a workflow I can share.


hadim commented May 7, 2022

Continuing my investigation (let me know if I should open a new issue).

I am still working on a reproducible example. In the meantime, I noticed that when the workflow is stuck in the Running phase (while the logs show /argo/staging/script: line 52: 492 Killed), I can get a shell into the main container and see zombie python processes. If I kill all of those leftover processes, Argo detects the failure and moves the workflow to the Failed phase with the correct error: OOMKilled (exit code 137).
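The manual intervention looks roughly like this (pod name and PID are illustrative; assumes ps is available in the main image):

$ kubectl -n argo-workflows exec <stuck-pod> -c main -- ps -ef     # find the leftover python processes
$ kubectl -n argo-workflows exec <stuck-pod> -c main -- kill <pid> # kill them; Argo then detects the failure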

It only happens with the emissary executor; we never saw this issue with the docker executor. This is a major backward incompatibility for us.

I'll try to come with a workflow you can use to reproduce (but it's not that easy apparently).


hadim commented May 8, 2022

It seems like my issue is similar to, but not exactly the same as, the original one here. I have opened a new one: #8680.

@rajaie-sg

FYI we are also running into this with 3.4.1 using a containerSet where the first container that runs (not the init, but the first defined by us) gets OOMKilled and then the Workflow just hangs in the Running state until the activeDeadlineSeconds timeout is reached. This happens when I set a memory request/limit of 30Mi for the container.

If I set that container's memory request/limit to 60Mi, it still gets OOMKilled, but Argo accurately detects that and marks the Workflow as "Failed".
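For context, a minimal template with the same shape looks roughly like this (an illustrative sketch, not our actual workflow; images and commands are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: containerset-oom-
spec:
  entrypoint: main
  activeDeadlineSeconds: 3600
  templates:
    - name: main
      containerSet:
        containers:
          - name: first
            image: python:3.11-alpine
            command: ["python", "-c", "x = []\nwhile True: x.append(bytearray(10_000_000))"]
            resources:
              requests:
                memory: 30Mi
              limits:
                memory: 30Mi
          - name: second
            image: alpine:3.18
            command: ["echo", "done"]
            dependencies:
              - first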

@louisnow

@alexec why was the fix reverted in 3.3.5? (93cb050)

Is there any particular reason? We are currently on 3.3 and are facing some challenges upgrading to 3.4.x.
If there are no negative consequences, can we run a 3.3.x version with that patch?


ebuildy commented Jan 29, 2024

An awesome feature would be to retry X times, increasing memory each time!

@agilgur5

@ebuildy that's already possible per #12482 (comment).
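For anyone landing here later, the pattern referenced there bumps the memory limit on each retry via podSpecPatch and the retries variable. A rough, untested sketch (the exact expression syntax may vary across Argo versions; image and command are placeholders):

  templates:
    - name: main
      retryStrategy:
        limit: "3"
        retryPolicy: OnFailure
      podSpecPatch: |
        containers:
          - name: main
            resources:
              limits:
                memory: "{{= (sprig.int(retries) + 1) * 128 }}Mi"
      container:
        image: my-image:latest        # placeholder
        command: ["run-the-workload"] # placeholder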

But this issue was entirely unrelated to retries; it's about a Workflow getting stuck when an OOM occurs.
