
Device memory spill support (LRU-based only) #51

Merged · 16 commits · May 15, 2019

Conversation

@pentschev (Member) commented May 14, 2019

Based on #35, but provides only the simple LRU mechanism (i.e., no device memory monitoring).
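For illustration, here is a minimal, purely hypothetical sketch of the LRU spill idea (the class, names, and structure below are not the actual DeviceHostFile implementation): values stay in a device-side mapping until their total size exceeds a fixed device memory limit, after which the least recently used entries are moved to a host-side store.

```python
from collections import OrderedDict


class LRUSpillBuffer:
    """Toy LRU buffer (illustration only): keeps values in a "device" mapping
    until a byte limit is exceeded, then spills the least recently used
    entries to a "host" mapping."""

    def __init__(self, device_memory_limit, sizeof):
        self.device = OrderedDict()  # fast store: device objects, LRU-ordered
        self.host = {}               # slow store: spilled host copies
        self.limit = device_memory_limit
        self.sizeof = sizeof         # callable returning an object's size in bytes
        self.used = 0

    def __setitem__(self, key, value):
        if key in self.device:
            self.used -= self.sizeof(self.device[key])
        self.device[key] = value
        self.device.move_to_end(key)  # most recently used goes to the end
        self.used += self.sizeof(value)
        while self.used > self.limit and len(self.device) > 1:
            k, v = self.device.popitem(last=False)  # evict the LRU entry
            self.used -= self.sizeof(v)
            self.host[k] = v  # a real implementation would copy device -> host here

    def __getitem__(self, key):
        if key in self.device:
            self.device.move_to_end(key)  # accessing refreshes LRU order
            return self.device[key]
        return self.host[key]  # a real implementation would move it back to device
```

Usage would look something like `buf = LRUSpillBuffer(8 * 2**30, sizeof=lambda x: x.nbytes)` for an 8 GB device memory budget.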

@pentschev (Member Author)

@mrocklin let's move the conversation here.

> Right, mostly I want to say let's not add a new device_memory_foo= keyword yet if we can avoid it.

We need to handle it somehow: that information has to reach DeviceHostFile, and we can create the object before passing it to Worker/Nanny (which is fine, IMO). But the user can already pass --nthreads to dask-cuda-worker, or n_workers to LocalCUDACluster, so we should at least handle that appropriately. I'll see how this can be handled in a minimalistic way, probably by just dividing memory evenly, as sketched below.
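As a rough illustration of the even-split idea (a hypothetical helper, not code from this PR): with N workers sharing a device memory budget, each worker's DeviceHostFile could be given total_device_memory / N as its limit.

```python
def per_worker_device_memory_limit(total_device_memory, n_workers):
    """Split a device memory budget evenly across workers (illustrative helper,
    not part of this PR)."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return total_device_memory // n_workers


# e.g. 32 GB of device memory shared by 4 workers -> 8 GB per worker
assert per_worker_device_memory_limit(32 * 2**30, 4) == 8 * 2**30
```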

@mrocklin (Contributor)

Under the assumption that we'll have one worker per GPU, I think we probably don't need to divide at all: we just take the maximum available device memory of the current GPU and use that.
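For example, one way to query that number is via pynvml (a sketch; not necessarily what this PR ends up doing):

```python
import pynvml

pynvml.nvmlInit()
# Assumes one worker per GPU; a real worker would derive the index from
# CUDA_VISIBLE_DEVICES rather than hard-coding device 0.
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total_device_memory = pynvml.nvmlDeviceGetMemoryInfo(handle).total
print("device memory limit:", total_device_memory)
pynvml.nvmlShutdown()
```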

@pentschev (Member Author)

I know you said not to add --device-memory-limit, @mrocklin, but it lets us create the DeviceHostFile object without querying GPUs for memory, which in turn allows the CUDA_VISIBLE_DEVICES test to work properly.

@mrocklin (Contributor) left a comment


Thanks for doing this, @pentschev. I've left a few minor comments, but I suspect we can probably agree on things quickly and merge this soon.

xx = x.persist()
yield wait(xx)

print(worker.data.device_buffer)
@mrocklin (Contributor):

We might want to remove the print statements for now.

@pentschev (Member Author):

Yes, that was accidental.

memory_limit=parse_memory_limit(
    memory_limit, nthreads, total_cores=nprocs
),
),
@mrocklin (Contributor):

I recommend the following instead:

data=(DeviceHostFile, {...})

Otherwise we create the data object immediately here, and then need to pass it down to the worker through a process boundary.

This is implemented here, but it looks like it's not documented anywhere (my mistake):

https://github.com/dask/distributed/blob/8e449d392e91eff0a3454ee98ef362de8f78cc4f/distributed/worker.py#L500-L501

Also, if the user specifies device_memory_limit=0 then we might want something simpler. I can imagine wanting to turn off this behavior if things get complex.

We probably also want the same treatment in LocalCUDACluster, though as a first pass we can also include this ourselves.
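For illustration, the shape of that suggestion might look like the following (the DeviceHostFile import path and keyword names here are assumptions, not the exact signature):

```python
from distributed import Worker

from dask_cuda.device_host_file import DeviceHostFile  # import path assumed

# Worker accepts data=(cls, kwargs) and constructs the mapping itself, so the
# object does not have to be created up front and shipped across a process
# boundary by the Nanny.
worker = Worker(
    "tcp://scheduler-address:8786",
    data=(
        DeviceHostFile,
        {
            "device_memory_limit": 8 * 2**30,   # keyword names are assumptions
            "memory_limit": 32 * 2**30,
            "local_directory": "dask-worker-space",
        },
    ),
)
```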

@pentschev (Member Author):

I was in the process of doing that and only saw your comment now. Indeed, this is a better solution. Either way, the downside is that I had to create the work directory in dask-cuda-worker, which I'm not particularly happy with, so if you have a suggestion on how to avoid that, it would be great!

I'm already working on LocalCUDACluster; I'm not done yet, but I'll push those changes soon.

@mrocklin (Contributor):

> the downside is that I had to create the work directory in dask-cuda-worker, which I'm not particularly happy with, so if you have a suggestion on how to avoid that, it would be great!

Hrm, yes, I can see how that would be a concern. Unfortunately, I don't have a good solution at the moment, but I'll think about it.

@pentschev (Member Author)

@mrocklin I think this is good now, and tests are in place too. IMO, the only thing left to change is how the worker directory gets created, if we can think of a better way; besides that, it's probably ready for review.

@mrocklin (Contributor)

Looks good to me. Thank you for putting in the effort here, @pentschev. I'm looking forward to using this :)

@mrocklin mrocklin merged commit 6c4004b into rapidsai:branch-0.8 May 15, 2019
@pentschev (Member Author)

Thanks for the review @mrocklin!
