
Device memory spill support #35

Closed
wants to merge 47 commits

Conversation

@pentschev (Member) commented Apr 18, 2019

NOTE: This PR won't be merged for now, in favor of the simplified mechanism from #51.

This PR adds the ability to spill device<->host<->disk memory. As of now, I haven't added support for multiple levels on the disk side (e.g., NVMe<->SSD<->NFS), but adding more layers isn't difficult; we only need to generalize how such levels are instantiated and configured.
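For illustration, here is a rough sketch of such a layered mapping built on zict primitives; the names and limits are placeholders, and the device<->host conversion step is omitted, so this is not this PR's actual code:

```python
from zict import Buffer, File, Func
from distributed.protocol import serialize_bytes, deserialize_bytes
from dask.sizeof import sizeof

device_memory_limit = 8e9   # placeholder limits
host_memory_limit = 32e9

# host <-> disk: serialize values that no longer fit in host memory to files
host = Buffer(
    fast={},
    slow=Func(serialize_bytes, deserialize_bytes, File("dask-worker-space/storage")),
    n=host_memory_limit,
    weight=lambda k, v: sizeof(v),
)

# device <-> host: evict device objects toward the host layer once the device
# limit is exceeded (the real implementation also converts device objects,
# e.g. CuPy arrays, to their host counterparts on the way down)
device_host_disk = Buffer(
    fast={},
    slow=host,
    n=device_memory_limit,
    weight=lambda k, v: sizeof(v),
)
```

Keys inserted into device_host_disk cascade down toward host memory and disk as each layer's limit is reached, and are brought back up on access.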

Before merging, there are a few more things that need to be done, notably:

  • Add missing documentation to some functions
  • Write tests
  • Decide whether we want to terminate if CUDA device memory usage is too high (requires subclassing Nanny, just like with CUDAWorker)
  • Define if/how to compute device memory use reliably (e.g., how to separate Dask use from CuPy memory pools?)

Note: requires dask/dask#4715.

@mrocklin @kkraus14 FYI

@mrocklin (Contributor) left a comment

In principle this seems like a fine design. I'm excited about trying it out.

At some point it would be useful to develop some tests. Perhaps some computations that persist an array larger than our device memory and then sum it, just to make sure that we can push things down and bring them back up.

My guess is that you've already done this manually. How did it perform?
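For reference, a sketch of the kind of check described above; the sizes and the device_memory_limit keyword on LocalCUDACluster are assumptions here:

```python
import cupy
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# A limit deliberately smaller than the array below, to force spilling.
cluster = LocalCUDACluster(device_memory_limit="4 GB")
client = Client(cluster)  # registers as the default client

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, size=(1_000_000, 1_000), chunks=(25_000, 1_000))  # ~8 GB of float64
x = x.persist()                        # pushes blocks down: device -> host -> disk
assert float(x.sum().compute()) > 0    # and brings them back up intact
```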

from numba import cuda

def get_device_used_memory():
    """ Return used memory of CUDA device from current context """
    memory_info = cuda.current_context().get_memory_info()  # (free, total)
    return memory_info[1] - memory_info[0]
Contributor

This looks great. I didn't know that Numba could do this.

Also, for general awareness, there is this: https://github.com/gpuopenanalytics/pynvml/ though I haven't used it much myself.
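For comparison, the same number via pynvml would look roughly like this (not used in this PR):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # has .total, .free, .used
used_memory = mem.used
pynvml.nvmlShutdown()
```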

Member Author

Didn't know about that one. I don't have a strong opinion, but since Numba is more likely to already be installed and we only check memory for the time being, I think it makes more sense to just use Numba for now.

@mrocklin (Contributor)

Also cc @randerzander

@pentschev (Member Author)

At some point it would be useful to develop some tests. Perhaps some computations that persist an array larger than our device memory and then sum it, just to make sure that we can push things down and bring them back up.

My guess is that you've already done this manually. How did it perform?

Yes, I did it manually, but it was a trial-and-error approach: messing with configurations and array sizes to make sure that memory would eventually be spilled, and monitoring the worker directories to verify that files were being created/deleted at runtime.

Performance is greatly reduced due to device/host communication and disk IO, which is unavoidable. I'm not sure how we can account objectively for the performance loss.

As for adding full system tests (scheduler + worker + client code), I'm not sure how we can do it; it's probably not possible to do deterministically. As of now, I couldn't come up with a way to actually verify it works correctly other than adding a unit test for DeviceHostFile. I haven't looked yet at existing Dask distributed tests; I'm hoping those will give me ideas.

Also, any suggestions or ideas on how to do a full system test and/or how to measure performance objectively are greatly appreciated!

@mrocklin (Contributor)

Also, any suggestions or ideas on how to do a full system test and/or how to measure performance objectively are greatly appreciated!

See #35 (comment)

Performance is greatly reduced due to device/host communication and disk IO, which is unavoidable. I'm not sure how we can account objectively for the performance loss.

We actually keep track of all storage times, so this is something that we can quantify if we want to.

@pentschev (Member Author)

@mrocklin to test this, we'll need the unreleased change from dask/distributed#2625; any chance we could get a Dask distributed release anytime soon?

@mrocklin (Contributor) commented Apr 23, 2019 via email

@mrocklin (Contributor) commented Apr 23, 2019 via email

@pentschev (Member Author) commented Apr 23, 2019

I think that you can probably add something here: https://github.com/rapidsai/dask-cuda/blob/branch-0.7/ci/gpu/build.sh

Oh yeah, this is what I was looking for, thanks for the direction! :)

@mrocklin (Contributor)

When trying to run tests I run into errors like the following:

    def serialize_bytes(x, **kwargs):
        L = serialize_bytelist(x, **kwargs)
        if PY2:
            L = [bytes(y) for y in L]
>       return b"".join(L)
E       TypeError: sequence item 2: expected a bytes-like object, cupy.core.core.ndarray found

Are there some other libraries that I'm supposed to have around in order for things to pass here? I'm using the quasiben/tom-ucx branch of distributed and dask/master of dask.

@mrocklin (Contributor)

@pentschev I'd be happy to take a look at the non-deterministic failure to see if I can see something (sometimes it's nice to have a second perspective). Which test is it?

@pentschev (Member Author)

Yeah, I've been working on tests. The tests in the current branch are buggy; I have literally just now fixed the last bug that I know of. I'll push the fixes in some time, but I'll also need to push one fix to dask/distributed, which is related to the 64 MB limit I mentioned last week on our call.

This https://github.com/dask/distributed/blob/master/distributed/protocol/serialize.py#L476 returns only the first frame, while it should return a join of frames: they're broken down into 64 MB pieces, so chunks larger than that will simply be ignored.
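To illustrate the problem (the names below are illustrative, not the actual distributed code): once a payload has been split into 64 MB frames, only a join of all frames reconstructs it.

```python
FRAME_SPLIT_SIZE = 64 * 1024 * 1024  # the 64 MB limit mentioned above

def split_frames(data: bytes):
    # Large payloads are broken into <= 64 MB frames during serialization.
    return [data[i:i + FRAME_SPLIT_SIZE] for i in range(0, len(data), FRAME_SPLIT_SIZE)]

def join_frames(frames):
    # Returning only frames[0] silently drops everything past the first
    # 64 MB; the payload has to be reassembled from all frames.
    return b"".join(frames)
```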

Also note that for this PR, you'll need all the following unreleased commits: https://github.com/rapidsai/dask-cuda/pull/35/files#diff-e19d03ed8e05824bbc937b07c4fddd6fR45

@pentschev (Member Author)

@mrocklin I believe all issues have been addressed. Please take a look and let me know otherwise.

@mrocklin mrocklin changed the base branch from branch-0.7 to branch-0.8 May 8, 2019 14:48
@mrocklin (Contributor) left a comment

Question: Does it make sense to use the mainline Worker class, drop CUDAWorker/CUDANanny, and just pass in the DeviceHostFile mapping through the Worker(..., data=) keyword? This would reduce code complexity and keep us close to mainline, but we would lose several features. What are those features, and how much do we care about them?

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs. For CPU data in Python we need the memory monitor because our estimates of the size of data are poor, largely because some workloads have highly irregular Python data structures, for which our calls to sizeof are incorrect. On the GPU, currently we're able to estimate things much more accurately, just due to the kinds of libraries that we see using the GPU.

We lose some setup logic, but maybe that can go into deployment tools like LocalCUDACluster and dask-cuda-worker.

Thoughts on this @pentschev ? What else do we lose?
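A minimal sketch of the alternative I have in mind; the DeviceHostFile import path and constructor arguments are assumptions:

```python
from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

data = DeviceHostFile(
    device_memory_limit=8e9,       # placeholder limits
    memory_limit=32e9,
    local_dir="dask-worker-space",
)

# Mainline Worker, no CUDAWorker/CUDANanny subclass needed:
w = Worker("tcp://scheduler:8786", data=data)
```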

help="The Worker class to be used. "
"Choosing a non-default worker may result in limited functionality, "
"such as no device memory spilling support.",
)
Contributor

I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.

However there is enough uncertainty here that, for maintenance reasons, I'd prefer that we not promise anything until someone is asking us for this.

Contributor

For context, the worker_class keyword exists because different groups have their own custom Worker classes that they use. Exposing a keyword like this but making no move to allow them to plug in seems against the spirit of the keyword.

Member Author

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.

That's not true. We only disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory. And this is why I needed to subclass it.

I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.

Unfortunately, this is necessary for us to re-enable the old CUDA_VISIBLE_DEVICES test. Using CUDAWorker, which was hardcoded before, prevents us from launching processes on a single-GPU machine to mock-test as if it had several GPUs and check the GPU ordering of each process.
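For context, the ordering that test checks is roughly the following rotation (an illustrative helper, not the actual test code):

```python
def cuda_visible_devices(i, n_devices):
    # Worker i sees GPU i first, followed by the remaining GPUs in order,
    # e.g. with 4 GPUs worker 1 gets "1,2,3,0".
    devices = list(range(n_devices))
    return ",".join(str(d) for d in devices[i:] + devices[:i])

assert cuda_visible_devices(1, 4) == "1,2,3,0"
```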

Contributor

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs.

That's not true. We only disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory. And this is why I needed to subclass it.

Why do we need to monitor device memory externally from the use of sizeof? If objects reliably tell us how much device memory they take up, then we may not need to track device memory separately.

Dask workers operated this way for a long time before we finally needed to give in and have a separate periodic callback that tracked system memory. I'm inclined to try the simpler approach first and see if it breaks down or not.
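The per-object accounting I'm referring to comes from Dask's sizeof dispatch; the CuPy registration below is only an illustration of the mechanism (Dask has similar registrations built in):

```python
from dask.sizeof import sizeof

@sizeof.register_lazy("cupy")
def register_cupy():
    import cupy

    @sizeof.register(cupy.ndarray)
    def sizeof_cupy_ndarray(x):
        # Device bytes actually held by the array, reported per object.
        return int(x.nbytes)
```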

Member Author

Sure, they do track how much memory they take. However, tracking the device memory lets us decide when it's time to spill memory. Isn't that what memory_target_fraction and memory_spill_fraction (prepend device_ for the cases in this PR) are for?

The block https://github.com/rapidsai/dask-cuda/pull/35/files#diff-a77f0c6f19d8d34d59aede5e31455719R282 controls the spilling, and this is why we needed to subclass Worker.
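A rough paraphrase of what that block does (the helper and method names below are hypothetical; the actual logic lives in the subclassed Worker's monitor):

```python
def check_device_memory(worker, device_memory_limit, device_memory_spill_fraction=0.7):
    used = get_device_used_memory()
    if used > device_memory_spill_fraction * device_memory_limit:
        # Ask the DeviceHostFile to move device objects down to the host
        # layer until usage drops back under the threshold.
        worker.data.evict_device(  # hypothetical method name
            target=device_memory_spill_fraction * device_memory_limit
        )
```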

Contributor

Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Member Author

Thanks for the explanation on the memory monitor.

Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Yes, and I also want to create the object before. :)

But ok, I can probably have it quickly done by tomorrow. There are a few more things that need to be ported to allow it to work (like finding out how much memory the device has in total), and also some test(s), which shouldn't be too difficult now that there's already one that works with the monitoring mechanism and I can base it on that.

Contributor

Yes, and I also want to create the object before. :)

It's actually valid to pass just the class. Dask will construct it. I think that this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

like finding out how much memory the device has in total)

I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).
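A sketch of both suggestions (pass the class itself, and read any limit from configuration); the import path and the config key shown are assumptions:

```python
import dask
from distributed import Worker
from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

# Passing the class rather than an instance lets it be constructed inside the
# worker process, so nothing crosses the Nanny's process boundary.
w = Worker("tcp://scheduler:8786", data=DeviceHostFile)

# Total device memory from a config value instead of a new user-facing keyword.
device_memory_limit = dask.config.get("device-memory-limit", default=None)
```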

Member Author

It's actually valid to pass just the class. Dask will construct it. I think that this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.

Ok, I'll check that.

I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).

We need to identify how much memory is available for the device, regardless. I can probably use the same Numba code from before.
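The total is available from the same Numba context call used earlier for used memory:

```python
from numba import cuda

def get_device_total_memory():
    """ Return total memory of CUDA device from current context """
    free, total = cuda.current_context().get_memory_info()
    return total
```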

Contributor

Right, mostly I want to say let's not add a new device_memory_foo= keyword yet if we can avoid it.

@mrocklin (Contributor)

I've been negligent in reviewing this. I plan to take another look at this tomorrow.

Building on #35 (review), I think that my first attempt will be to pull out just the DeviceHostFile approach and merge that in to see if it solves short-term needs around spilling during merge computations. I hope that it will. If so, I plan to leave the rest of this in limbo for a bit and wait until the simpler approach fails and we need more active monitoring.

@pentschev (Member Author)

And why do this?

@mrocklin (Contributor)

If it solves the problem with fewer moving pieces then it might have fewer complications and require less maintenance in the future. Also, did you happen to catch the comment earlier here:

Question: Does it make sense to use the mainline Worker class, drop CUDAWorker/CUDANanny, and just pass in the DeviceHostFile mapping through the Worker(..., data=) keyword? This would reduce code complexity and keep us close to mainline, but we would lose several features. What are those features, and how much do we care about them?

For things like the memory monitor, this may not be an issue, both because we have disabled it temporarily, and because it may just not be necessary with the kind of data used on GPUs. For CPU data in Python we need the memory monitor because our estimates of the size of data are poor, largely because some workloads have highly irregular Python data structures, for which our calls to sizeof are incorrect. On the GPU, currently we're able to estimate things much more accurately, just due to the kinds of libraries that we see using the GPU.

We lose some setup logic, but maybe that can go into deployment tools like LocalCUDACluster and dask-cuda-worker.

Thoughts on this @pentschev ? What else do we lose?

@pentschev (Member Author)

Yes, I responded in #35 (comment)

I realize I didn't give you a direct answer to the Question block, but I justified why that can't be done. Unless I misunderstood something, the short answer is: no.
