Huge memory leak and processes do not restart automatically #4345
I have similar issues with memory usage. Roughly 60
Dask should be managing memory itself. You can find more in the documentation here. If everything is stalled, that suggests something is wrong with the way this is happening. A reproducer would be really helpful in tracking this down. You may also be interested in reading the best practices guide on submitting tasks.
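As a concrete illustration of the thresholds that documentation describes, here is a minimal sketch (not from this thread; the worker count, memory limit, and fractions are illustrative assumptions) of how the spill/pause/terminate behaviour can be configured:

```python
import dask
from dask.distributed import Client, LocalCluster

# Fractions of each worker's memory_limit at which Dask acts.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling data to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # Nanny kills and restarts the worker
})

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, processes=True, memory_limit="4GiB")
    client = Client(cluster)
```

With this configuration the Nanny is expected to kill and restart any worker whose process memory exceeds 95% of its `memory_limit`.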
@jacobtomlinson I believe this would be a reproducer (see the sketch below). I think there are at least two problems: the severe memory leaks caused by many common workloads, and the particular deadlock that happens in this instance.
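A hypothetical stand-in for such a reproducer (this is not the original code; every name and size here is an assumption) would be a workload whose tasks keep references to large allocations, so worker memory grows without bound:

```python
import numpy as np
from dask.distributed import Client, LocalCluster

_leak = []  # module-level list keeps results alive inside each worker process


def leaky_task(i):
    # Allocate roughly 80 MB and keep a reference so it is never freed.
    _leak.append(np.ones((10_000_000,), dtype="float64"))
    return i


if __name__ == "__main__":
    client = Client(LocalCluster(n_workers=2, processes=True, memory_limit="2GiB"))
    futures = client.map(leaky_task, range(1000))
    # Workers hit their memory limit long before all tasks finish.
    client.gather(futures)
```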
Thanks, that's helpful. So something outside of Dask is allocating memory and leaking; it makes sense that Dask cannot manage memory correctly then. I'm a little surprised the OOM Killer doesn't kick in and kill the worker. That is typically what I've seen in the past with workers running out of memory.
The OOM Killer does indeed kick in if one allocates enough memory to exhaust the physical RAM. But I don't believe the way the memory is leaked affects this specific deadlock. I think it has to do with this piece of code: `distributed/distributed/worker.py`, line 2649 at `76ef459`.
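For context, here is a paraphrased sketch of the kind of check that code performs (this is not the actual Dask source; the function name and thresholds are assumptions): the worker periodically compares its process RSS against fractions of `memory_limit` and pauses or spills accordingly.

```python
import psutil


def memory_monitor_step(worker_status, memory_limit,
                        pause_fraction=0.8, spill_fraction=0.7):
    # Measure the worker process's resident memory.
    rss = psutil.Process().memory_info().rss
    frac = rss / memory_limit

    if frac > pause_fraction:
        # Stop accepting new tasks until memory drops again.
        worker_status = "paused"
    elif worker_status == "paused":
        worker_status = "running"

    # Caller would move data to disk when this is True.
    spill = frac > spill_fraction
    return worker_status, spill
```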
Sounds like this is similar to #6110. In that case, so much memory was used that (we think) the OS started swapping heavily to disk, degrading performance so much that basically nothing could run—including the Nanny's check for whether the worker process was over its memory limit. I think #6177 would resolve this issue for you.
See #6110 (comment) for an explanation of why this isn't happening. If you're not actually fully OOM, just really, really close to being OOM, things get bad.
I have a very long-running job with thousands of tasks being sent to the workers. Since there are known problems related to memory leakage, at some point the RAM is full, and at that point the `dask` workers are still alive but idle; they are stuck.

Looking into the source code I see that, since I use `processes=True`, the workers should be automatically created as `Nanny` processes, and there is a parameter called `auto_restart=True`, meaning that when a process reaches its memory limit it should be restarted automatically. However, this is not happening, as explained above.

Any idea why? (I'll try to work on a minimal reproducible example ASAP.)

Many thanks,
Gio
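For reference, a hedged sketch of the setup described above (the worker count and memory limit are assumptions, not taken from the issue): with `processes=True` each worker is supervised by a `Nanny`, which is expected to restart it once the process exceeds the terminate fraction of its `memory_limit`.

```python
from dask.distributed import Client

if __name__ == "__main__":
    # Kwargs are forwarded to LocalCluster when no scheduler address is given.
    client = Client(processes=True, n_workers=4, memory_limit="4GiB")

    # Confirm the workers are Nanny-supervised: the scheduler records a
    # nanny address for each worker it manages.
    info = client.scheduler_info()
    print({addr: w["nanny"] for addr, w in info["workers"].items()})
```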