This graph is not parallel. It's an incremental, serial reduction. Each reducer requires the previous reducer to finish before it can run. I've set up the tasks so that reducers are significantly slower than data producers.
Therefore, there's no need to load all the inputs into memory up front. It will be a long time before the final input task's output is actually needed; if we load it right away, it just sits there taking up memory.
As you can see, even though the `load` tasks were queued, far more data was loaded into memory than we can process at once.
With larger data sizes, or if some other computation were running at the same time, this could easily have killed the cluster.
This was motivated by playing around with dask-ml and incremental training. AFAIU, the point of incremental training is to be able to train on a larger-than-memory dataset by fitting it chunk by chunk. But this scheduling behavior seems to defeat the purpose, since all the data will end up loaded into distributed memory anyway (as long as training is slower than data loading, which is quite possible with a big ML model). Hopefully spilling will save you in the real world, but it still doesn't seem like great behavior.
No ideas yet on how to address this; it's just interesting to think about in the context of other scheduling questions like #7531.
Minimal reproducer:
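Something along these lines (the `load`/`reduce` names, chunk sizes, sleep time, and cluster settings below are illustrative):

```python
import time

import numpy as np
import dask
from distributed import Client


@dask.delayed
def load(i):
    # Fast "producer": returns a sizeable chunk almost immediately.
    return np.random.random((1_000, 1_000))  # ~8 MiB per chunk


@dask.delayed
def reduce(acc, chunk):
    # Slow "reducer": each step needs the previous result, so the
    # reduction is strictly serial, and it runs much slower than `load`.
    time.sleep(1)
    return acc + chunk.sum()


if __name__ == "__main__":
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="1 GiB")

    acc = 0
    for i in range(50):
        acc = reduce(acc, load(i))

    # Only one chunk is needed at a time, but nothing stops the scheduler
    # from running many `load` tasks far ahead of the serial reduction,
    # so worker memory fills with chunks that won't be consumed for a
    # long time.
    print(acc.compute())
    client.close()
```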
Dask-ml example:
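A rough sketch of the dask-ml side: `Incremental` calls `partial_fit` on one chunk at a time, so training is the serial reducer and the lazily-produced chunks are the producers. The synthetic dataset, sizes, and estimator here are illustrative:

```python
import dask.array as da
from dask_ml.wrappers import Incremental
from distributed import Client
from sklearn.linear_model import SGDClassifier

if __name__ == "__main__":
    client = Client(n_workers=2, threads_per_worker=1, memory_limit="2 GiB")

    # Lazily-generated, chunked dataset standing in for a
    # larger-than-memory one loaded from storage.
    X = da.random.random((1_000_000, 100), chunks=(10_000, 100))
    y = (da.random.random(1_000_000, chunks=10_000) > 0.5).astype(int)

    # Incremental trains chunk by chunk via partial_fit; each chunk can
    # only be consumed after the previous one, but the chunks themselves
    # can all be materialized long before training gets to them.
    clf = Incremental(SGDClassifier())
    clf.fit(X, y, classes=[0, 1])

    client.close()
```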