Device memory spill support #35
Conversation
In principle this seems like a fine design. I'm excited about trying it out.
At some point it would be useful to develop some tests. Perhaps some computations that persist an array larger than our amount of device memory and then sum it, just to make sure that we can push things down and bring them back up.
My guess is that you've already done this manually. How did it perform?
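A hedged sketch of the kind of test described above, assuming a CuPy-backed Dask array and a device_memory_limit keyword on LocalCUDACluster (that keyword and the sizes below are assumptions chosen to force spilling, not confirmed API):

    import cupy
    import dask.array as da
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Keep the device limit well below the array size so spilling must happen.
    cluster = LocalCUDACluster(device_memory_limit=2e9)  # assumed keyword
    client = Client(cluster)

    # ~20 GB of float64, backed by CuPy blocks of ~200 MB each.
    x = da.random.random((50000, 50000), chunks=(5000, 5000)).map_blocks(cupy.asarray)
    x = x.persist()           # exceeds device memory, so some blocks spill down
    print(x.sum().compute())  # the reduction forces spilled blocks back up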
dask_cuda/cuda_worker.py (outdated)
from numba import cuda

def get_device_used_memory():
    """ Return used memory of CUDA device from current context """
    memory_info = cuda.current_context().get_memory_info()  # (free, total)
    return memory_info[1] - memory_info[0]
This looks great. I didn't know that Numba could do this.
Also, for general awareness, there is also this: https://github.com/gpuopenanalytics/pynvml/ though I haven't used it much myself.
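For completeness, a rough sketch of what the pynvml route would look like (not used in this PR, and untested here):

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # exposes .total, .free, .used in bytes
    print(mem.used)
    pynvml.nvmlShutdown()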
Didn't know about that one. I don't have a strong opinion, but since Numba is a package more likely to be already installed and we only check memory for the time being, I think it makes more sense to just use Numba for now.
Also cc @randerzander
Yes, I did it manually, but it was a trial-and-error approach: messing with configurations and array sizes to make sure that memory would eventually be spilled, and monitoring the worker directories to verify that files were being created/deleted at runtime. Performance is greatly reduced due to device/host communication and disk IO, which is unavoidable. I'm not sure how we can account objectively for the performance loss.

As for adding full system tests (scheduler + worker + client code), I'm not sure how we can do that; it's probably not possible to do it deterministically. As of now, I couldn't come up with a way to actually verify it works correctly other than adding a unit test.

Also, any suggestions or ideas on how to do a full system test and/or how to measure performance objectively are greatly appreciated!
See #35 (comment)
We actually keep track of all storage times, so this is something that we can quantify if we want to.
@mrocklin to test this, we'll need the unreleased change from dask/distributed#2625; any chance we could get a Dask distributed release anytime soon?
I can push something out at the end of the week. In the meantime, I recommend that we test against git master.
I think that you can probably add something here:
https://github.com/rapidsai/dask-cuda/blob/branch-0.7/ci/gpu/build.sh
Oh yeah, this is what I was looking for, thanks for the direction! :)
* Install upstream packages after conda environment activation
* Install NumPy/Numba packages with conda recipes
When trying to run tests I run into errors like the following:

    def serialize_bytes(x, **kwargs):
        L = serialize_bytelist(x, **kwargs)
        if PY2:
            L = [bytes(y) for y in L]
    >       return b"".join(L)
    E       TypeError: sequence item 2: expected a bytes-like object, cupy.core.core.ndarray found

Are there some other libraries that I'm supposed to have around in order for things to pass here? I'm using the
@pentschev I'd be happy to take a look at the non-deterministic failure to see if I can see something (sometimes it's nice to have a second perspective). Which test is it?
Yeah, I've been working on tests. The tests in the current branch are buggy; I have literally just now fixed the last bug that I know of. I'll push the fixes in some time, but I'll also need to push one fix to dask/distributed, related to the 64MB limit I mentioned last week on our call: https://github.com/dask/distributed/blob/master/distributed/protocol/serialize.py#L476 returns only the first frame, while it should return a join of frames. Since buffers are broken down into 64MB frames, chunks larger than that will simply be ignored.

Also note that for this PR you'll need all the following unreleased commits: https://github.com/rapidsai/dask-cuda/pull/35/files#diff-e19d03ed8e05824bbc937b07c4fddd6fR45
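For reference, a minimal sketch of the frame-joining idea described above (the real function in distributed has more responsibilities; this only illustrates why returning frames[0] alone drops data):

    def join_frames(frames):
        # Buffers larger than the 64 MB frame limit arrive split across several
        # frames; returning only frames[0] silently discards the remainder.
        if len(frames) == 1:
            return frames[0]
        return b"".join(bytes(f) for f in frames)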
@mrocklin I believe all issues have been addressed. Please take a look and let me know otherwise.
Question: does it make sense to use the mainline Worker class, drop CUDAWorker/CUDANanny, and just pass the DeviceHostFile mapping in through the Worker(..., data=) keyword? This would reduce code complexity and keep us close to mainline, but we would lose several features. What are those features, and how much do we care about them?

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily and because it may just not be necessary with the kind of data used on GPUs. For CPU data in Python we need the memory monitor because our estimates of the size of data are poor, largely because some workloads have highly irregular Python data structures for which our calls to sizeof are incorrect. On the GPU we're currently able to estimate things much more accurately, just due to the kinds of libraries that we see using the GPU.

We lose some setup logic, but maybe that can go into deployment tools like LocalCUDACluster and dask-cuda-worker.

Thoughts on this @pentschev? What else do we lose?
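To make the question concrete, a minimal sketch of the mapping protocol that Worker(..., data=) expects (an illustration only, not the PR's DeviceHostFile; the class name and the spilling policy are placeholders):

    from collections.abc import MutableMapping

    class TinyDeviceHostFile(MutableMapping):  # hypothetical, illustration only
        def __init__(self, device_store, host_store):
            self.device = device_store   # fast store for device objects
            self.host = host_store       # spill target (host memory or disk)

        def __setitem__(self, key, value):
            self.device[key] = value     # real code would spill above a device limit

        def __getitem__(self, key):
            return self.device[key] if key in self.device else self.host[key]

        def __delitem__(self, key):
            (self.device if key in self.device else self.host).pop(key)

        def __iter__(self):
            yield from self.device
            yield from self.host

        def __len__(self):
            return len(self.device) + len(self.host)

    # The instance (or the class itself) could then be handed to the mainline worker:
    # Worker(scheduler_address, data=TinyDeviceHostFile({}, {}))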
help="The Worker class to be used. " | ||
"Choosing a non-default worker may result in limited functionality, " | ||
"such as no device memory spilling support.", | ||
) |
I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.
However there is enough uncertainty here that, for maintenance reasons, I'd prefer that we not promise anything until someone is asking us for this.
For context, the worker_class keyword exists because different groups have their own custom Worker classes that they use. Exposing a keyword like this but making no move to allow them to plug in seems against the spirit of the keyword.
> For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.

That's not true. We only disabled pausing the worker; to control spilling from the device we still need to monitor device memory, and this is why I needed to subclass it.

> I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.

Unfortunately, this is necessary for us to re-enable the old CUDA_VISIBLE_DEVICES test. Using CUDAWorker, which was hardcoded before, prevents us from launching a process on a single-GPU machine to mock-test as if it had several GPUs and check the GPU ordering of each process.
> > For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.
>
> That's not true. We only disabled pausing the worker; to control spilling from the device we still need to monitor device memory, and this is why I needed to subclass it.

Why do we need to monitor device memory externally from the use of sizeof? If objects reliably tell us how much device memory they take up, then we may not need to track device memory separately.

Dask workers operated this way for a long time before we finally needed to give in and add a separate periodic callback that tracked system memory. I'm inclined to try the simpler approach first and see if it breaks down or not.
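As an illustration of the sizeof-based approach being argued for here, device objects can report their size through dask.sizeof, so the worker's existing accounting sees device bytes directly (the CuPy registration below mirrors the usual pattern and is a sketch, not this PR's code):

    from dask.sizeof import sizeof

    @sizeof.register_lazy("cupy")
    def register_cupy():
        import cupy

        @sizeof.register(cupy.ndarray)
        def sizeof_cupy_ndarray(x):
            # Report the device bytes held by the array, so spill thresholds
            # can act on accurate numbers without a separate memory monitor.
            return int(x.nbytes)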
Sure, they do track how much memory they take. However, tracking the device memory lets us decide when it's time to spill. Isn't that what memory_target_fraction and memory_spill_fraction (prepend device_ for the cases in this PR) are for?

The block https://github.com/rapidsai/dask-cuda/pull/35/files#diff-a77f0c6f19d8d34d59aede5e31455719R282 controls the spilling, and this is why we needed to subclass Worker.
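As a hedged illustration of what that monitoring block does conceptually (not the PR's code; the store interfaces and the to_host helper are assumptions), a periodic callback can compare used device memory against a device_memory_target_fraction-style threshold and evict until usage drops below it:

    from numba import cuda

    def to_host(obj):
        # Hypothetical helper: real code would copy or serialize device data to host memory.
        return cuda.as_cuda_array(obj).copy_to_host() if cuda.is_cuda_array(obj) else obj

    def maybe_spill(device_store, host_store, device_memory_target_fraction=0.6):
        free, total = cuda.current_context().get_memory_info()
        while device_store and (total - free) / total > device_memory_target_fraction:
            key, value = device_store.popitem()   # eviction policy is arbitrary in this sketch
            host_store[key] = to_host(value)
            free, total = cuda.current_context().get_memory_info()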
Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.
Thanks for the explanation on the memory monitor.

> Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Yes, and I also want to create the object before. :)

But ok, I can probably have it quickly done by tomorrow. There are a few more things that need to be ported to allow it to work (like finding out how much memory the device has in total), and also some test(s), which shouldn't be too difficult now that there's already one that works with the monitoring mechanism and I can base it on that.
> Yes, and I also want to create the object before. :)

It's actually valid to pass just the class; Dask will construct it. I think this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

> like finding out how much memory the device has in total)

I recommend that we start with just using the full memory or a config value by default, and not mess with any user inputs (which will get messy).
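A hedged sketch of the class-passing form mentioned above, assuming data= accepts a (class, kwargs) pair as described in the distributed.Worker docstring (the DeviceHostFile import path and constructor arguments are assumptions):

    from distributed import Worker
    from dask_cuda.device_host_file import DeviceHostFile  # assumed module path

    # Passing the class plus kwargs lets the mapping be constructed inside the
    # worker process, so nothing needs to cross the Nanny's process boundary.
    w = Worker(
        "tcp://scheduler:8786",
        data=(DeviceHostFile, {"device_memory_limit": 4e9, "memory_limit": 16e9}),
    )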
> It's actually valid to pass just the class; Dask will construct it. I think this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

Ok, I'll check that.

> I recommend that we start with just using the full memory or a config value by default, and not mess with any user inputs (which will get messy).

We need to identify how much memory there is available for the device regardless. I can probably use the same Numba code from before.
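A sketch of that, reusing the same Numba call shown earlier (get_memory_info() returns a (free, total) pair):

    from numba import cuda

    def get_device_total_memory():
        """ Return total memory of the CUDA device from the current context """
        return cuda.current_context().get_memory_info()[1]  # (free, total)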
Right, mostly I want to say let's not add a new device_memory_foo= keyword yet if we can avoid it.
I've been negligent in reviewing this. I plan to take another look at this tomorrow. Building on #35 (review), I think that my first attempt will be to pull out just the DeviceHostFile approach and merge that in to see if it solves short-term needs around spilling during merge computations. I hope that it will. If so, I plan to leave the rest of this in limbo for a bit and wait until the simpler approach fails and we need more active monitoring.
And why do this?
If it solves the problem with fewer moving pieces then it might have fewer complications and require less maintenance in the future. Also, did you happen to catch the comment earlier here:
Yes, I responded in #35 (comment). I realize I didn't give you a direct answer to the Question block, but I justified why that can't be done. Unless I misunderstood something, the short answer is: no.
NOTE: This PR won't be merged for now, in favor of the simplified mechanism from #51.
This PR adds the ability to spill device<->host<->disk memory. As of now, I didn't add support for multiple levels on the disk side (e.g., nvme<->ssd<->NFS), but adding more layers isn't difficult; we only need to generalize instantiation and configuration of such levels.
Before merging, there are a few more things that need to be done, notably:

* Nanny, just like with CUDAWorker

Note: requires dask/dask#4715.
@mrocklin @kkraus14 FYI