
Device memory spill support #35

Closed
wants to merge 47 commits

Conversation

@pentschev (Member) commented Apr 18, 2019

NOTE: This PR won't be merged for now, in favor of the simplified mechanism from #51.

This PR adds the ability to spill device<->host<->disk memory. As of now, I haven't added support for multiple levels on the disk side (e.g., NVMe<->SSD<->NFS), but adding more layers isn't difficult; we only need to generalize how such levels are instantiated and configured.
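For illustration, here is a rough sketch of such a layered mapping built on zict primitives; the names and limits are placeholders, and the device<->host conversion step is omitted, so this is not this PR's actual code:

```python
from zict import Buffer, File, Func
from distributed.protocol import serialize_bytes, deserialize_bytes
from dask.sizeof import sizeof

device_memory_limit = 8e9   # placeholder limits
host_memory_limit = 32e9

# host <-> disk: serialize values that no longer fit in host memory to files
host = Buffer(
    fast={},
    slow=Func(serialize_bytes, deserialize_bytes, File("dask-worker-space/storage")),
    n=host_memory_limit,
    weight=lambda k, v: sizeof(v),
)

# device <-> host: evict device objects toward the host layer once the device
# limit is exceeded (the real implementation also converts device objects,
# e.g. CuPy arrays, to their host counterparts on the way down)
device_host_disk = Buffer(
    fast={},
    slow=host,
    n=device_memory_limit,
    weight=lambda k, v: sizeof(v),
)
```

Keys inserted into device_host_disk cascade down toward host memory and disk as each layer's limit is reached, and are brought back up on access.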

Before merging, there are a few more things that need to be done, notably:

  • Add missing documentation to some functions
  • Write tests
  • Decide whether we want to terminate if CUDA device memory usage is too high (requires subclassing Nanny, just like with CUDAWorker)
  • Define if/how to compute device memory use reliably (e.g., how to separate Dask use from CuPy memory pools?)

Note: requires dask/dask#4715.

@mrocklin @kkraus14 FYI

@mrocklin (Contributor) left a comment

In principle this seems like a fine design. I'm excited about trying it out.

At some point it would be useful to develop some tests. Perhaps some computations that persist an array larger than our device memory and then sum it, just to make sure that we can push things down and bring them back up.

My guess is that you've already done this manually. How did it perform?
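For reference, a sketch of the kind of check described above; the sizes and the device_memory_limit keyword on LocalCUDACluster are assumptions here:

```python
import cupy
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# A limit deliberately smaller than the array below, to force spilling.
cluster = LocalCUDACluster(device_memory_limit="4 GB")
client = Client(cluster)  # registers as the default client

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(1_000_000, 1_000), chunks=(25_000, 1_000))  # ~8 GB of float64
x = x.persist()                        # pushes blocks down: device -> host -> disk
assert float(x.sum().compute()) > 0    # and brings them back up intact
```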

from numba import cuda

def get_device_used_memory():
    """ Return used memory of CUDA device from current context """
    memory_info = cuda.current_context().get_memory_info()  # (free, total)
    return memory_info[1] - memory_info[0]
Contributor

This looks great. I didn't know that Numba could do this.

Also, for general awareness, there is this: https://github.com/gpuopenanalytics/pynvml/ though I haven't used it much myself.
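For comparison, the same number via pynvml would look roughly like this (not used in this PR):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # has .total, .free, .used
used_memory = mem.used
pynvml.nvmlShutdown()
```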

Member Author

Didn't know about that one. I don't have a strong opinion, but since Numba is more likely to already be installed and we only check memory for the time being, I think it makes more sense to just use Numba for now.

@mrocklin (Contributor)

Also cc @randerzander

@pentschev (Member Author)

At some point it would be useful to develop some tests. Perhaps some computations that persist an array larger than our device memory and then sum it, just to make sure that we can push things down and bring them back up.

My guess is that you've already done this manually. How did it perform?

Yes, I did it manually, but it was a trial-and-error approach: messing with configurations and array sizes to make sure that memory would eventually be spilled, and monitoring the worker directories to verify that files were being created/deleted at runtime.

Performance is greatly reduced due to device/host communication and disk IO, which is unavoidable. I'm not sure how we can account objectively for the performance loss.

As for adding full system tests (scheduler + worker + client code), I'm not sure how we can do it; it's probably not possible to do deterministically. As of now, I couldn't come up with a way to actually verify it works correctly other than adding a unit test for DeviceHostFile. I haven't looked yet at existing Dask distributed tests; I'm hoping those will give me ideas.

Also, any suggestions or ideas on how to do a full system test and/or how to measure performance objectively are greatly appreciated!

@mrocklin (Contributor)

Also, any suggestions or ideas on how to do a full system test and/or how to measure performance objectively are greatly appreciated!

See #35 (comment)

Performance is greatly reduced due to device/host communication and disk IO, which is unavoidable. I'm not sure how we can account objectively for the performance loss.

We actually keep track of all storage times, so this is something that we can quantify if we want to.

@pentschev (Member Author)

@mrocklin to test this, we'll need the unreleased change from dask/distributed#2625; any chance we could get a Dask distributed release anytime soon?

@mrocklin (Contributor) commented Apr 23, 2019 via email

@mrocklin (Contributor) commented Apr 23, 2019 via email

@pentschev (Member Author) commented Apr 23, 2019

I think that you can probably add something here: https://github.com/rapidsai/dask-cuda/blob/branch-0.7/ci/gpu/build.sh

Oh yeah, this is what I was looking for, thanks for the direction! :)

@mrocklin (Contributor)

When trying to run tests I run into errors like the following:

    def serialize_bytes(x, **kwargs):
        L = serialize_bytelist(x, **kwargs)
        if PY2:
            L = [bytes(y) for y in L]
>       return b"".join(L)
E       TypeError: sequence item 2: expected a bytes-like object, cupy.core.core.ndarray found

Are there some other libraries that I'm supposed to have around in order for things to pass here? I'm using the quasiben/tom-ucx branch of distributed and dask/master of dask.

@mrocklin (Contributor)

@pentschev I'd be happy to take a look at the non-deterministic failure to see if I can see something (sometimes it's nice to have a second perspective). Which test is it?

@pentschev (Member Author)

Yeah, I've been working on tests. The tests in the current branch are buggy; I have literally just now fixed the last bug that I know of. I'll push the fixes in some time, but I'll also need to push one fix to dask/distributed, which is related to the 64 MB limit I mentioned last week on our call.

This https://github.com/dask/distributed/blob/master/distributed/protocol/serialize.py#L476 returns only the first frame, while it should return a join of frames: they're broken down into 64 MB pieces, so chunks larger than that will simply be ignored.
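To illustrate the problem (the names below are illustrative, not the actual distributed code): once a payload has been split into 64 MB frames, only a join of all frames reconstructs it.

```python
FRAME_SPLIT_SIZE = 64 * 1024 * 1024  # the 64 MB limit mentioned above

def split_frames(data: bytes):
    # Large payloads are broken into <= 64 MB frames during serialization.
    return [data[i:i + FRAME_SPLIT_SIZE] for i in range(0, len(data), FRAME_SPLIT_SIZE)]

def join_frames(frames):
    # Returning only frames[0] silently drops everything past the first
    # 64 MB; the payload has to be reassembled from all frames.
    return b"".join(frames)
```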

Also note that for this PR, you'll need all the following unreleased commits: https://github.com/rapidsai/dask-cuda/pull/35/files#diff-e19d03ed8e05824bbc937b07c4fddd6fR45

@pentschev (Member Author)

@mrocklin I believe all issues have been addressed. Please take a look and let me know otherwise.

@mrocklin mrocklin changed the base branch from branch-0.7 to branch-0.8 May 8, 2019 14:48
@mrocklin (Contributor) left a comment

Question: Does it make sense to use the mainline Worker class, drop CUDAWorker/CUDANanny, and just pass in the DeviceHostFile mapping through the Worker(..., data=) keyword? This would reduce code complexity and keep us close to mainline, but we would lose several features. What are those features, and how much do we care about them?

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs. For CPU data in Python we need the memory monitor because our estimates of the size of data are poor, largely because some workloads have highly irregular Python data structures, for which our calls to sizeof are incorrect. On the GPU, currently we're able to estimate things much more accurately, just due to the kinds of libraries that we see using the GPU.

We lose some setup logic, but maybe that can go into deployment tools like LocalCUDACluster and dask-cuda-worker.

Thoughts on this @pentschev ? What else do we lose?
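A minimal sketch of the alternative I have in mind; the DeviceHostFile import path and constructor arguments are assumptions:

```python
from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

data = DeviceHostFile(
    device_memory_limit=8e9,       # placeholder limits
    memory_limit=32e9,
    local_dir="dask-worker-space",
)

# Mainline Worker, no CUDAWorker/CUDANanny subclass needed:
w = Worker("tcp://scheduler:8786", data=data)
```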

help="The Worker class to be used. "
"Choosing a non-default worker may result in limited functionality, "
"such as no device memory spilling support.",
)
Contributor

I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.

However there is enough uncertainty here that, for maintenance reasons, I'd prefer that we not promise anything until someone is asking us for this.

Contributor

For context, the worker_class keyword exists because different groups have their own custom Worker classes that they use. Exposing a keyword like this but making no move to allow them to plug in seems against the spirit of the keyword.

Member Author

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.

That's not true. We only disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory. And this is why I needed to subclass it.

I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.

Unfortunately, this is necessary for us to re-enable the old CUDA_VISIBLE_DEVICES test. Using CUDAWorker, which was hardcoded before, prevents us from launching processes on a single-GPU machine to mock-test as if it had several GPUs and check the GPU ordering of each process.
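For context, the ordering that test checks is roughly the following rotation (an illustrative helper, not the actual test code):

```python
def cuda_visible_devices(i, n_devices):
    # Worker i sees GPU i first, followed by the remaining GPUs in order,
    # e.g. with 4 GPUs worker 1 gets "1,2,3,0".
    devices = list(range(n_devices))
    return ",".join(str(d) for d in devices[i:] + devices[:i])

assert cuda_visible_devices(1, 4) == "1,2,3,0"
```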

Contributor

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.

That's not true. We only disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory. And this is why I needed to subclass it.

Why do we need to monitor device memory externally from the use of sizeof? If objects reliably tell us how much device memory they take up, then we may not need to track device memory separately.

Dask workers operated this way for a long time before we finally needed to give in and have a separate periodic callback that tracked system memory. I'm inclined to try the simpler approach first and see if it breaks down or not.
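The per-object accounting I'm referring to comes from Dask's sizeof dispatch; the CuPy registration below is only an illustration of the mechanism (Dask has similar registrations built in):

```python
from dask.sizeof import sizeof

@sizeof.register_lazy("cupy")
def register_cupy():
    import cupy

    @sizeof.register(cupy.ndarray)
    def sizeof_cupy_ndarray(x):
        # Device bytes actually held by the array, reported per object.
        return int(x.nbytes)
```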

Member Author

Sure, they do track how much memory they take. However, tracking the device memory lets us decide when it's time to spill memory. Isn't that what memory_target_fraction and memory_spill_fraction (prepend device_ for the cases in this PR) are for?

The block https://github.com/rapidsai/dask-cuda/pull/35/files#diff-a77f0c6f19d8d34d59aede5e31455719R282 controls the spilling, and this is why we needed to subclass Worker.
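A rough paraphrase of what that block does (the helper and method names below are hypothetical; the actual logic lives in the subclassed Worker's monitor):

```python
def check_device_memory(worker, device_memory_limit, device_memory_spill_fraction=0.7):
    used = get_device_used_memory()
    if used > device_memory_spill_fraction * device_memory_limit:
        # Ask the DeviceHostFile to move device objects down to the host
        # layer until usage drops back under the threshold.
        worker.data.evict_device(  # hypothetical method name
            target=device_memory_spill_fraction * device_memory_limit
        )
```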

Contributor

Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Member Author

Thanks for the explanation on the memory monitor.

Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Yes, and I also want to create the object before. :)

But ok, I can probably have it quickly done by tomorrow. There are a few more things that need to be ported to allow it to work (like finding out how much memory the device has in total), and also some test(s), which shouldn't be too difficult now that there's already one that works with the monitoring mechanism and I can base it on that.

Contributor

Yes, and I also want to create the object before. :)

It's actually valid to pass just the class. Dask will construct it. I think that this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

like finding out how much memory the device has in total)

I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).
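A sketch of both suggestions (pass the class itself, and read any limit from configuration); the import path and the config key shown are assumptions:

```python
import dask
from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

# Passing the class rather than an instance lets it be constructed inside the
# worker process, so nothing crosses the Nanny's process boundary.
w = Worker("tcp://scheduler:8786", data=DeviceHostFile)

# Total device memory from a config value instead of a new user-facing keyword.
device_memory_limit = dask.config.get("device-memory-limit", default=None)
```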

Member Author

It's actually valid to pass just the class. Dask will construct it. I think that this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

Ok, I'll check that.

I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).

We need to identify how much memory is available for the device, regardless. I can probably use the same Numba code from before.
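The total is available from the same Numba context call used earlier for used memory:

```python
from numba import cuda

def get_device_total_memory():
    """ Return total memory of CUDA device from current context """
    free, total = cuda.current_context().get_memory_info()
    return total
```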

Contributor

Right, mostly I want to say let's not add a new device_memory_foo= keyword yet if we can avoid it.

@mrocklin (Contributor)

I've been negligent in reviewing this. I plan to take another look at this tomorrow.

Building on #35 (review), I think that my first attempt will be to pull out just the DeviceHostFile approach and merge that in to see if it solves short-term needs around spilling during merge computations. I hope that it will. If so, I plan to leave the rest of this in limbo for a bit and wait until the simpler approach fails and we need more active monitoring.

@pentschev (Member Author)

And why do this?

@mrocklin (Contributor)

If it solves the problem with fewer moving pieces then it might have fewer complications and require less maintenance in the future. Also, did you happen to catch the comment earlier here:

Question: Does it make sense to use the mainline Worker class, drop CUDAWorker/CUDANanny, and just pass in the DeviceHostFile mapping through the Worker(..., data=) keyword? This would reduce code complexity and keep us close to mainline, but we would lose several features. What are those features, and how much do we care about them?

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs. For CPU data in Python we need the memory monitor because our estimates of the size of data are poor, largely because some workloads have highly irregular Python data structures, for which our calls to sizeof are incorrect. On the GPU, currently we're able to estimate things much more accurately, just due to the kinds of libraries that we see using the GPU.

We lose some setup logic, but maybe that can go into deployment tools like LocalCUDACluster and dask-cuda-worker.

Thoughts on this @pentschev ? What else do we lose?

@pentschev (Member Author)

Yes, I responded in #35 (comment)

I realize I didn't give you a direct answer to the Question block, but I justified why that can't be done. Unless I misunderstood something, the short answer is: no.
