
Huge memory leak and processes do not restart automatically #4345

Open
gioxc88 opened this issue Dec 10, 2020 · 8 comments

Comments

@gioxc88

gioxc88 commented Dec 10, 2020

I have a very long-running job with thousands of tasks being sent to the workers.
Since there are known problems related to memory leaks, at some point the RAM fills up, and at that point the Dask workers are still alive but idle: they are stuck.

Looking into the source code I see that, since I use processes=True, the workers should be created automatically as Nanny processes, and there is a parameter called auto_restart=True, meaning that when a process reaches its memory limit it should be restarted automatically.
However this is not happening, as I explained above.

Any idea of why? (I'll try to work on a minimal reproducible example asap)
Many thanks
Gio
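
For illustration, a minimal sketch of the kind of setup described above (the work function, worker count, and memory limit are placeholders, not the author's actual code):

```python
# Illustrative sketch: a process-based LocalCluster whose Nanny-managed
# workers are expected to be restarted when they exceed their memory limit.
from dask.distributed import Client, LocalCluster

def work(i):
    # Placeholder for the real task; assumed to leak / accumulate memory.
    return sum(range(i * 1000))

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=4,
        processes=True,        # workers run under Nanny processes
        memory_limit="4GB",    # Nanny should restart a worker that exceeds this
    )
    client = Client(cluster)
    futures = client.map(work, range(100_000))
    results = client.gather(futures)
```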

@crayonfu

crayonfu commented Jan 5, 2021

I have similar issues with memory usage: roughly 60~80 GB of memory consumption per hour with 200 worker processes and 500k~1M tasks. Initially Dask can restart individual workers approaching the limit (4 GB/worker), but as memory fills up, Dask eventually reports no errors and just idles there (maybe it can't find anywhere to move data to). Interestingly, if I use a multiprocessing pool on a single machine, memory usage is stable at around 40~50 GB for more than 24 hours. I tried gc.collect() and aggressively deleting futures, but it didn't help.
I also tried --lifetime, but found it is not completely graceful: there is a small chance the cluster will wait for lost futures indefinitely after a scheduled worker restart.
For now I split my tasks into chunks before memory fills up, dump intermediate results to disk, restart the whole cluster, and start fresh. Fortunately my tasks are embarrassingly parallel, so chunking is easy and the efficiency lost to dump/reload is tiny. I'm still wondering what the proper way to handle this is, or what to do in scenarios where dumping intermediate results is not feasible.
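
A rough sketch of that chunking workaround (function names, chunk size, and file paths are illustrative, not the commenter's actual code): process the task list in chunks, persist each chunk's results to disk, then tear down and recreate the cluster so leaked memory is released before the next chunk.

```python
# Sketch of the "chunk, dump to disk, restart cluster" workaround described above.
import pickle
from dask.distributed import Client, LocalCluster

def work(x):
    return x * x  # placeholder for the real task

def run_chunk(chunk, path):
    # Fresh cluster per chunk so any leaked memory is reclaimed on teardown.
    cluster = LocalCluster(n_workers=8, processes=True, memory_limit="4GB")
    client = Client(cluster)
    try:
        results = client.gather(client.map(work, chunk))
        with open(path, "wb") as f:
            pickle.dump(results, f)
    finally:
        client.close()
        cluster.close()

if __name__ == "__main__":
    tasks = list(range(1_000_000))
    chunk_size = 50_000
    for i in range(0, len(tasks), chunk_size):
        run_chunk(tasks[i:i + chunk_size], f"results_{i}.pkl")
```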

@jacobtomlinson
Member

Dask should be managing memory itself. You can find more in the documentation here. If everything is stalled, that suggests something is going wrong with that mechanism. A reproducer would be really helpful in tracking this down.

You may also be interested to read the best practices guide on submitting tasks.

@Zaharid

Zaharid commented Jan 6, 2021

@jacobtomlinson I believe this would be a reproducer:

#2835 (comment)

I think there are at least two problems: severe memory leaks caused by many common workloads, and the particular deadlock that happens in this instance.

@jacobtomlinson
Member

Thanks, that's helpful.

So something outside of Dask is allocating memory and leaking. It makes sense, then, that Dask cannot manage memory correctly.

I'm a little surprised the OOM Killer doesn't kick in and kill the worker. This is typically what I've seen in the past with workers running out of memory.

@quasiben
Member

quasiben commented Jan 6, 2021

@crayonfu @gioxc88 you might be interested in this PR:
#4221

@Zaharid

Zaharid commented Jan 6, 2021

The OOM Killer does indeed kick in if one allocates enough memory to exhaust the physical RAM. But I don't believe the way the memory is leaked affects this specific deadlock.

I think it has to do with this piece of code:

If we rise above 80% memory use, stop execution of new tasks
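
For context, the pause behaviour quoted above is driven by the worker memory thresholds in the Dask configuration. A sketch of overriding them, assuming the standard distributed.worker.memory.* keys (the fractions shown are illustrative, not recommendations):

```python
# Sketch: adjust the worker memory thresholds that drive spill/pause/terminate.
# Set before the cluster is created (or in the dask YAML config).
import dask
from dask.distributed import Client, LocalCluster

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause execution of new tasks
    "distributed.worker.memory.terminate": 0.95,  # Nanny restarts the worker
})

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, processes=True, memory_limit="4GB")
    client = Client(cluster)
```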

@gioxc88
Author

gioxc88 commented Jan 20, 2021

@crayonfu @gioxc88 you might be interested in this PR:
#4221

Thanks for the suggestion; I already looked into this PR a while ago.
Unfortunately, I don't think I have the time to work on it at the moment.

@gjoseph92
Collaborator

Sounds like this is similar to #6110. In that case, so much memory was used that (we think) the OS started swapping heavily to disk, degrading performance so much that basically nothing could run—including the Nanny's check for whether the worker process was over its memory limit.

I think #6177 would resolve this issue for you.

I'm a little surprised the OOM Killer doesn't kick in and kill the worker

See #6110 (comment) for an explanation of why this isn't happening. If you're not actually fully OOM, just really, really close to it, things get bad.
