Pod OOM sometimes causes a workflow to get stuck in "Running" state #8456
Comments
I think this is an emissary only issue:
We are also seeing this issue with 3.2.9. Is it possible/planned to backport the fix in #8478 to 3.2.x?
Would you be interested in testing the fix? Given that it happens irregularly, it would be great if someone could soak-test it.
It has only happened once so far, after a very long run (~12h). I checked for the OOM kill by looking at the memory-vs-time graph and confirmed the process was stopped during a memory spike at 100%. I don't think it will be easy to reproduce, and our only cluster is currently in prod (so I can't easily test a fix). I will post here if I manage to reproduce it with a script that triggers the bug quickly.
In case it helps for reproducing: the workload was a bash script calling a Python script.
@alexec I am planning to work on a minimal reproducible example. Is there any plan to cherry-pick that PR for 3.2.x?
For the record, I tried to reproduce it with the following template:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memory-oom-killed-
  labels:
    organization: internal
    project: pantagruel-tests
spec:
  entrypoint: entrypoint
  templates:
    - name: entrypoint
      script:
        image: mambaorg/micromamba:0.23.0
        command: [bash]
        source: |
          set -e
          eval "$(micromamba shell hook --shell=bash)"
          micromamba activate
          micromamba install -q -y "python=3.9" pip git numpy joblib -c conda-forge
          PYTHON_CODE=$(cat <<END
          import numpy as np
          from joblib import Parallel, delayed
          def leak():
              a = np.zeros(1_000_000_000_000)
              for i in range(1_000_000_000_000):
                  a[i] = i
          Parallel(n_jobs=2)(delayed(leak)() for i in range(10))
          END
          )
          echo "START"
          python -c "$PYTHON_CODE"
          echo "DONE"
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "64Mi"
            cpu: "250m"
```

But I can't reproduce the bug: the workflow is correctly stopped with an "OOM Killed" error and marked as Failed. I'll keep monitoring this and try to reproduce it.
We have been able to reproduce the bug (but we can't share the workflow). We are currently experiencing it on 3.2.11. I am happy to test with 3.3.* if it contains the fix. Let me know.
This fix is on master, not on v3.3. You'll need to test that.
Thanks @alexec. I'll keep you updated here. Do you build docker images on

In case it helps, I might have a guess (at least for my use case): I cannot reproduce the issue by simply running a Python script that saturates the memory; there the executor behaves as expected and catches the OOM signal. The error happens when we run deeply nested Python code that uses nested child processes consuming a lot of memory. My guess is that since the OOM signal comes from a child process, the executor cannot detect it.
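If it helps, here is a rough sketch of a reproduction attempt matching that guess (the image, sizes, and limits are assumptions, not the actual workload): the memory pressure comes from child worker processes forked from the main script, so the OOM kill should land on a child rather than on the container's top-level process.

```yaml
# Hypothetical reproduction sketch: worker processes, not PID 1, exhaust the
# memory limit, so the kernel OOM killer targets a child of the main script.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: child-process-oom-
spec:
  entrypoint: main
  templates:
    - name: main
      script:
        image: python:3.11-slim
        command: [bash]
        source: |
          set -e
          python - <<'EOF'
          from multiprocessing import Pool

          def leak(_):
              chunks = []
              while True:
                  # Keep allocating until the kernel OOM killer steps in.
                  chunks.append(bytearray(50 * 1024 * 1024))

          if __name__ == "__main__":
              with Pool(2) as pool:
                  pool.map(leak, range(2))
          EOF
          echo "DONE"
        resources:
          requests:
            memory: 128Mi
          limits:
            memory: 128Mi
```

If the wait/exit handling only watches the top-level process, a shape like this might be what exposes the difference.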
Tip of master is published as
I still see the issue using
Continuing my investigation (let me know if I should open a new issue): I am still working on a reproducible example. In the meantime, I noticed that when the workflow is stuck in the Running phase (while the logs show

It only happens with the

I'll try to come up with a workflow you can use to reproduce it (but apparently it's not that easy).
It seems like my issue is similar to, but not exactly the same as, the original one here. I have opened a new one: #8680
FYI, we are also running into this with 3.4.1 using a containerSet: the first container that runs (not the init container, but the first one defined by us) gets OOMKilled, and the Workflow just hangs in the Running state until the activeDeadlineSeconds timeout is reached. This happens when I set a memory request/limit of 30Mi for the container. If I set that container's memory request/limit to 60Mi, it still gets OOMKilled, but Argo accurately detects that and marks the Workflow as "Failed".
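For reference, a minimal sketch of the shape described above (only the containerSet structure and the 30Mi limit come from the comment; the image, command, deadline, and other values are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: containerset-oom-
spec:
  entrypoint: main
  activeDeadlineSeconds: 900     # the hang described above lasts until this deadline
  templates:
    - name: main
      containerSet:
        containers:
          - name: first          # first user-defined container, gets OOMKilled
            image: python:3.11-slim
            command: [python, -c]
            args: ["b = bytearray(256 * 1024 * 1024)"]   # allocate well past the limit
            resources:
              requests:
                memory: 30Mi
              limits:
                memory: 30Mi
          - name: second
            image: alpine:3.18
            command: [sh, -c]
            args: ["echo done"]
            dependencies: [first]
```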
An awesome feature would be to retry X times while increasing memory!
@ebuildy that's already possible, per #12482 (comment). But this issue was entirely unrelated to retries; it's about a Workflow getting stuck when an OOM occurs.
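For anyone landing here for that, the approach referenced in that comment is roughly the following (a sketch, not the exact example from #12482; the expression syntax, image, and numbers are assumptions and may vary by Argo version): combine a retryStrategy with a podSpecPatch that uses the retries variable to raise the memory limit on each attempt.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-more-memory-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"
      # Each retry attempt gets a larger memory limit: 128Mi, 256Mi, 384Mi, ...
      podSpecPatch: |
        containers:
          - name: main
            resources:
              limits:
                memory: "{{= (sprig.int(retries) + 1) * 128 }}Mi"
      container:
        image: python:3.11-slim
        command: [python, -c]
        args: ["b = bytearray(200 * 1024 * 1024)"]   # likely to OOM on the first attempt
```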
Checklist
Summary
This follows up on #7457.
What happened/what you expected to happen?
A small percentage of our pods fail with an OOMKilled status and should be restarted (per our workflow config), but aren't. The pod (and workflow) hangs around indefinitely without appearing to do anything. It seems like the controller loses track of the pod until it's either manually deleted or its hosting node is shut down. If either of those happens, the pod is restarted and the workflow usually succeeds.

It's expected that we'll sometimes run out of memory when running this workload. What's surprising is that most OOMKilled pods restart automatically, but some don't.
For a better sense of proportion, we run roughly 90,000 workflows per day. A few dozen of them get stuck in this state and have to be fixed manually.
The expected behavior is that Argo restarts the OOMKilled pod automatically until the retry limit is reached.
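For context, a retry configuration of the kind referred to above looks roughly like this (values are illustrative, not the reporter's actual config); an OOMKilled main container normally surfaces as a failed or errored node, which the retryStrategy is expected to pick up:

```yaml
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "2"              # retry up to twice before giving up
        retryPolicy: Always     # retry on Failed as well as Error node results
      container:
        image: python:3.11-slim
        command: [python, -c]
        args: ["b = bytearray(256 * 1024 * 1024)"]   # exceeds the limit below
        resources:
          limits:
            memory: 64Mi
```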
What version are you running?
3.3.2 for both the controller and executor.
Diagnostics
We're running on EKS 1.21 and use cluster-autoscaler. I don't think the problem is related to scaling, though, since we see the issue in pools of one node as well as pools of many nodes.
The stuck pods look like this:
Paste the smallest workflow that reproduces the bug. We must be able to run the workflow.
The below workflow matches our config as closely as possible -- TTL, retries, etc. The resource limits are slightly lower. If you run it enough times, it should be possible to reproduce the bug.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.