
Huge memory leak and processes do not restart automatically #4345

Open
gioxc88 opened this issue Dec 10, 2020 · 8 comments

Comments

@gioxc88

gioxc88 commented Dec 10, 2020

I have a very long-running job with thousands of tasks being sent to the workers.
Since there are known problems related to memory leaks, at some point the RAM fills up, and at that point the Dask workers are still alive but idle: they are stuck.

Looking into the source code I see that, since I use processes=True, the workers should be created automatically as Nanny processes, and there is a parameter called auto_restart=True, meaning that when a process reaches its memory limit it should be restarted automatically.
However this is not happening, as I explained above.

Any idea of why? (I'll try to work on a minimal reproducible example asap)
Many thanks
Gio
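
For illustration, a minimal sketch of the kind of setup described above (the work function, worker count, and memory limit are placeholders, not the author's actual code):

```python
# Illustrative sketch: a process-based LocalCluster whose Nanny-managed
# workers are expected to be restarted when they exceed their memory limit.
from dask.distributed import Client, LocalCluster

def work(i):
    # Placeholder for the real task; assumed to leak / accumulate memory.
    return sum(range(i * 1000))

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,
        processes=True,        # workers run under Nanny processes
        memory_limit="4GB",    # Nanny should restart a worker that exceeds this
    )
    client = Client(cluster)
    futures = client.map(work, range(100_000))
    results = client.gather(futures)
```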

@crayonfu

crayonfu commented Jan 5, 2021

I have similar issues with memory usage: roughly 60~80 GB of memory consumption per hour with 200 worker processes and 500k~1M tasks. Initially Dask can restart individual workers approaching the limit (4 GB/worker), but as memory fills up, Dask eventually reports no errors and just idles there (maybe it can't find anywhere to move data to). Interestingly, if I use a multiprocessing pool on a single machine, memory usage is stable at around 40~50 GB for more than 24 hours. I tried gc.collect() and aggressively deleting futures, but it didn't help.
I also tried --lifetime, but found it is not completely graceful: there is a small chance the cluster will wait for lost futures indefinitely after a scheduled worker restart.
For now I split my tasks into chunks before memory fills up, dump intermediate results to disk, restart the whole cluster, and start fresh. Fortunately my tasks are embarrassingly parallel, so chunking is easy and the efficiency lost to dump/reload is tiny. I'm still wondering what the proper way to handle this is, or what to do in scenarios where dumping intermediate results is not feasible.
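
A rough sketch of that chunking workaround (function names, chunk size, and file paths are illustrative, not the commenter's actual code): process the task list in chunks, persist each chunk's results to disk, then tear down and recreate the cluster so leaked memory is released before the next chunk.

```python
# Sketch of the "chunk, dump to disk, restart cluster" workaround described above.
import pickle
from dask.distributed import Client, LocalCluster

def work(x):
    return x * x  # placeholder for the real task

def run_chunk(chunk, path):
    # Fresh cluster per chunk so any leaked memory is reclaimed on teardown.
    cluster = LocalCluster(n_workers=8, processes=True, memory_limit="4GB")
    client = Client(cluster)
    try:
        results = client.gather(client.map(work, chunk))
        with open(path, "wb") as f:
            pickle.dump(results, f)
    finally:
        client.close()
        cluster.close()

if __name__ == "__main__":
    tasks = list(range(1_000_000))
    chunk_size = 50_000
    for i in range(0, len(tasks), chunk_size):
        run_chunk(tasks[i:i + chunk_size], f"results_{i}.pkl")
```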

@jacobtomlinson
Member

Dask should be managing memory itself. You can find more in the documentation here. If everything is stalled, that suggests something is going wrong with that mechanism. A reproducer would be really helpful in tracking this down.

You may also be interested to read the best practices guide on submitting tasks.

@Zaharid

Zaharid commented Jan 6, 2021

@jacobtomlinson I believe this would be a reproducer:

#2835 (comment)

I think there are at least two problems: severe memory leaks caused by many common workloads, and the particular deadlock that happens in this instance.

@jacobtomlinson
Member

Thanks, that's helpful.

So something outside of Dask is allocating memory and leaking. It makes sense, then, that Dask cannot manage memory correctly.

I'm a little surprised the OOM Killer doesn't kick in and kill the worker. This is typically what I've seen in the past with workers running out of memory.

@quasiben
Member

quasiben commented Jan 6, 2021

@crayonfu @gioxc88 you might be interested in this PR:
#4221

@Zaharid

Zaharid commented Jan 6, 2021

The OOM Killer does indeed kick in if one allocates enough memory to exhaust the physical RAM. But I don't believe the way the memory is leaked affects this specific deadlock.

I think it has to do with this piece of code:

If we rise above 80% memory use, stop execution of new tasks
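
For context, the pause behaviour quoted above is driven by the worker memory thresholds in the Dask configuration. A sketch of overriding them, assuming the standard distributed.worker.memory.* keys (the fractions shown are illustrative, not recommendations):

```python
# Sketch: adjust the worker memory thresholds that drive spill/pause/terminate.
# Set before the cluster is created (or in the dask YAML config).
import dask
from dask.distributed import Client, LocalCluster

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause execution of new tasks
    "distributed.worker.memory.terminate": 0.95,  # Nanny restarts the worker
})

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, processes=True, memory_limit="4GB")
    client = Client(cluster)
```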

@gioxc88
Author

gioxc88 commented Jan 20, 2021

@crayonfu @gioxc88 you might be interested in this PR:
#4221

Thanks for the suggestion; I already looked into this PR a while ago.
Unfortunately, I don't think I have the time to work on it at the moment.

@gjoseph92
Collaborator

Sounds like this is similar to #6110. In that case, so much memory was used that (we think) the OS started swapping heavily to disk, degrading performance so much that basically nothing could run—including the Nanny's check for whether the worker process was over its memory limit.

I think #6177 would resolve this issue for you.

I'm a little surprised the OOM Killer doesn't kick in and kill the worker

See #6110 (comment) for an explanation of why this isn't happening. If you're not actually fully OOM, just really, really close to it, things get bad.
