Device memory spill support #35
We could also keep this in the `distributed` namespace if we wanted to. It might be nice to have all of the worker config together.
I have no strong opinions/objections. I think we can keep it in the same namespace, but we need a different file to prevent dask-cuda and dask-distributed from overwriting one another's configurations. Would you like me to change it to the `distributed` namespace?
This looks great. I didn't know that Numba could do this.
Also, for general awareness, there is also this: https://github.com/gpuopenanalytics/pynvml/, though I haven't used it much myself.
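For reference, a minimal sketch of the kind of Numba query being referred to here (assuming `numba.cuda` and a visible GPU; the exact call used in the PR may differ):

```python
# Query free and total device memory through Numba's CUDA context.
# Requires numba and a CUDA-capable GPU to actually run.
from numba import cuda

free_bytes, total_bytes = cuda.current_context().get_memory_info()
print(f"free: {free_bytes / 1e9:.2f} GB, total: {total_bytes / 1e9:.2f} GB")
```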
Didn't know about that one. I don't have a strong opinion, but since Numba is more likely to already be installed and we only check memory for the time being, I think it makes more sense to just use Numba for now.
I have a question: what data is stored in `CUDAWorker.data.device.fast`? I was trying to persist a dataframe bigger than the GPU memory, with individual chunks that comfortably fit in device memory, but it seems to get paused and throws the following warning on both workers:

Also, is there a way to access the `CUDAWorker` object from the `client`/`cluster` object? I tried `client.cluster.workers[0].Worker`, but it does not seem to contain the `self.data` attribute and other attributes.
`CUDAWorker.data.device.fast` is a dict that stores `dask.array` device chunks. Similarly, `CUDAWorker.data.device.slow` (== `CUDAWorker.data.host.fast`) is a dict that stores `dask.array` host chunks or device chunks that have been spilled to host, and `CUDAWorker.data.host.slow` contains the path to all chunks on disk.

IMO, this is one of the trickiest parts. We can't guarantee that `CUDAWorker` has full control over device memory; depending on what you're using to create/load your data, it may have its own memory pool, for example. This is why I had to explicitly disable CuPy's memory pooling to write a reliable test in https://github.com/rapidsai/dask-cuda/pull/35/files#diff-fa83f19df929827ec523fd00a21599b8R38.

I think this isn't possible. Why would you like to access that directly? I don't know if this is by design or just something that was never implemented; @mrocklin, could you clarify this?
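For illustration, a minimal sketch of what disabling CuPy's memory pooling looks like (an assumption about the referenced test, not a quote of it):

```python
# Make CuPy allocate directly from the CUDA driver instead of its pooling
# allocators, so frees become visible to external device-memory checks.
import cupy

cupy.cuda.set_allocator(None)                # disable the device memory pool
cupy.cuda.set_pinned_memory_allocator(None)  # disable the pinned-host pool
```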
Correct. The workers are in separate processes, so there is no way to access them directly from Python. You can ask Dask to run functions on them to inspect state with `Client.run` (see the `run` docstring for more information).
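A hypothetical sketch of that kind of inspection (the function name and what it returns are illustrative, not from the PR):

```python
# Run a function on every worker; the special `dask_worker` keyword
# argument gives it access to the Worker (here CUDAWorker) instance.
from dask.distributed import Client

client = Client(cluster)  # assumes an existing LocalCUDACluster `cluster`

def inspect(dask_worker):
    # e.g. list the keys currently held in the worker's data store
    return list(dask_worker.data.keys())

print(client.run(inspect))  # {worker_address: [keys, ...], ...}
```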
You could also try the `start_ipython_workers` function with the keyword `qtconsole=True`, but I wouldn't be surprised if it has fallen out of use with recent Tornado library upgrades.
Note that when @pentschev says "`dask.array` device chunks" he also means any piece of GPU-allocated data, which could be a dask array device chunk as he's dealing with, or a cudf dataframe as you're dealing with.
Thanks a lot @mrocklin for the function to inspect state. I used it to debug what's happening. It seems that the data is being evicted from `data.device.fast`, but the GPU memory is still not being cleared.

Error:

Maybe the `del` here is not clearing memory, I don't know: https://github.com/rapidsai/dask-cuda/pull/35/files#diff-c87f0866b277f959dc7c5d1e4b0ff015R243

Will add a small minimal reproducible example here soon.
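A hedged illustration of how this can happen when a pooling allocator (e.g. CuPy's) is in play; whether the same applies to the allocator behind cudf here is an assumption:

```python
# Dropping the Python reference returns memory to the library's pool, not
# to the CUDA driver, so device-memory checks still report it as in use
# until the pool is explicitly drained.
import cupy

x = cupy.zeros(int(1e8))  # ~800 MB of float64 on the device
del x                     # freed into CuPy's memory pool, not the driver
cupy.get_default_memory_pool().free_all_blocks()  # now released to the driver
```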
I agree that this is a major issue; that's why I'm concerned with it. In particular, I think pausing is something that can't be enabled under these circumstances: if it is, then when Dask spills memory to host but can't really release it, the worker will get stuck.

I have two proposals (not necessarily mutually exclusive) until we come up with something else:

1. `CUDAWorker.device_memory_monitor` triggering memory managers' release of memory.
2. We can also disable pausing by default, which I'm inclined to think should be the default to prevent this sort of situation.
For 1, see dask/distributed#2453.

Disabling pausing by default seems fine to me. This is just a config value change at this point, yes?
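As an aside, a hedged sketch of what such a config change looks like; the exact key for the device-memory variant in this PR is an assumption, so only the standard host-memory key is shown:

```python
# Turn off the pause threshold for Dask workers via configuration.
import dask

dask.config.set({"distributed.worker.memory.pause": False})
```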
Yes, that would be something similar, if not exactly that (sorry, I can't understand all the details without diving in a bit deeper).

Any thoughts on item 2 as well?

Yes. I just don't know whether disabling pause has other consequences. On the host, I guess this is to prevent the host from running out of memory and eventually getting killed, is that right?
I can see how it would solve the problem. I guess I'm hoping that medium term it's not necessary. My inclination is to wait until we have a real-world problem that needs this before adding it. I won't be surprised if that problem occurs quickly, but I'd still rather put it off and get this in.
No objections from me. That said, I have no further changes to add; from my side, it's ready for more reviews or merging.
I would prefer to avoid this until someone asks for it, if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this, I think we would probably provide the full namespace of the class and then try to import it.

However, there is enough uncertainty here that, for maintenance reasons, I'd prefer that we not promise anything until someone is asking us for this.
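A sketch of the "full namespace" approach being described (the example path and helper name are hypothetical):

```python
# Accept a fully qualified class path and import the class at runtime.
import importlib

def import_worker_class(path):
    module_name, _, class_name = path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

# e.g. import_worker_class("dask_cuda.worker.CUDAWorker")  # hypothetical path
```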
For context, the `worker_class` keyword exists because different groups have their own custom Worker classes that they use. Exposing a keyword like this but making no move to allow them to plug in seems against the spirit of the keyword.
That's not true. We only disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory, and this is why I needed to subclass it.

Unfortunately, this is necessary for us to re-enable the old `CUDA_VISIBLE_DEVICES` test. Using `CUDAWorker`, which was hardcoded before, prevents us from launching a process on a single-GPU machine to mock-test as if it had several GPUs and check the ordering of GPUs for each process.
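For reference, a hedged sketch of the kind of per-worker ordering that test checks (the helper is illustrative, not the PR's code):

```python
# Each worker gets CUDA_VISIBLE_DEVICES rotated so its own GPU comes first,
# e.g. worker 2 of 4 -> "2,3,0,1"; a test can then assert on the ordering
# without needing multiple physical GPUs.
def cuda_visible_devices(worker_index, n_gpus):
    return ",".join(str((worker_index + j) % n_gpus) for j in range(n_gpus))

assert cuda_visible_devices(2, 4) == "2,3,0,1"
```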
Why do we need to monitor device memory externally from the use of `sizeof`? If objects reliably tell us how much device memory they take up, then we may not need to track device memory separately.

Dask workers operated this way for a long time before we finally needed to give in and have a separate periodic callback that tracked system memory. I'm inclined to try the simpler approach first and see if it breaks down or not.
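To make the `sizeof` approach concrete, a minimal sketch of registering a size measure for a hypothetical device-backed type (the type is a stand-in, not from this PR):

```python
# Dask asks `sizeof` how large each stored object is and compares the total
# against its memory thresholds; per-object accounting of device memory
# would work the same way.
from dask.sizeof import sizeof

class MyDeviceArray:  # hypothetical stand-in for a GPU-backed object
    def __init__(self, nbytes):
        self.nbytes = nbytes

@sizeof.register(MyDeviceArray)
def sizeof_my_device_array(obj):
    return obj.nbytes
```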
Sure, they do track how much memory they take. However, tracking the device memory lets us decide when it's time to spill memory. Isn't that what `memory_target_fraction` and `memory_spill_fraction` (prepend `device_` for the cases in this PR) are for?

The block https://github.com/rapidsai/dask-cuda/pull/35/files#diff-a77f0c6f19d8d34d59aede5e31455719R282 controls the spilling, and this is why we needed to subclass `Worker`.
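Roughly, the block linked above implements logic along these lines (a sketch under assumptions; names, defaults, and the plain-dict stores are not the PR's code):

```python
# When device usage crosses the spill fraction, move device chunks into the
# host store until usage drops back under the target fraction.
def spill_device_to_host(device_fast, host_store, used, total,
                         device_memory_spill_fraction=0.7,
                         device_memory_target_fraction=0.6):
    """`device_fast` maps key -> (chunk, nbytes); `host_store` is any mapping."""
    if used / total <= device_memory_spill_fraction:
        return used
    for key in list(device_fast):
        if used / total <= device_memory_target_fraction:
            break
        chunk, nbytes = device_fast.pop(key)
        host_store[key] = chunk  # the real implementation moves/serializes the data
        used -= nbytes
    return used
```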
Yes, something like that diff. You'll also want to add the `data=` keyword to `LocalCUDACluster`.
Thanks for the explanation on the memory monitor.

Yes, and I also want to create the object before. :)

But ok, I can probably have it done quickly, by tomorrow. There are a few more things that need to be ported to allow it to work (like finding out how much memory the device has in total), and also some test(s), which shouldn't be too difficult now that there's already one that works with the monitoring mechanism and I can base it on that.
It's actually valid to pass just the class; Dask will construct it. I think this is explained in the `Worker` docstring. This is better because you're using a Nanny and don't want to pass the object through a process boundary.

I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).
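A minimal sketch of what "pass just the class" means for the `data=` keyword (the scheduler address is a placeholder and actually running this requires a live scheduler):

```python
# Passing a MutableMapping *class* (here plain `dict`) instead of an instance
# lets the worker build its data store on its own side of the Nanny's process
# boundary; in this PR the class would be the device/host/disk spilling mapping.
from dask.distributed import Worker

async def start_worker(scheduler_address="tcp://127.0.0.1:8786"):
    worker = await Worker(scheduler_address, data=dict)
    return worker
```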
Ok, I'll check that.

We need to identify how much memory is available on the device regardless. I can probably use the same Numba code from before.
Right, mostly I want to say let's not add a new `device_memory_foo=` keyword yet if we can avoid it.