
[DISCUSSION] Add deferrable memory management enhancement proposal #112

Conversation

pentschev
Member

This PR is an RMM enhancement proposal introducing deferrable memory management: it allows C++ code to expose its CUDA buffers to Python so that an application on the Python side (e.g., Dask) can later take control of them and spill that memory to another storage medium (such as host memory).

The motivation for this proposal comes from several use cases where the memory available on a device isn't sufficient for large problems, and where the device memory allocated in C++ far exceeds that allocated in Python. We have a couple of issues that describe the problem:

The proposal contains a high-level overview and a C++ implementation suggestion. I preferred to delay details on the Python implementation in favor of getting a discussion started more quickly.

cc @harrism @kkraus14 @VibhuJawa @randerzander @galipremsagar @mrocklin

@jrhemstad
Contributor

So to be clear, are you proposing that libraries that depend on RMM, such as libcudf, would require wrapping all kernel invocations in something functionally equivalent to:

template<typename Lambda, typename... Targs>
void lockAndLaunchAsync(Lambda&& func, cudaStream_t stream, Targs... Fargs)
{
    // Ensure all (possibly spilled) deferrable buffers used by this launch
    // are resident and locked for the duration of the asynchronous work.
    retrieveAndLockDeferrablesStream(stream, Fargs...);
    std::forward<Lambda>(func)(stream, Fargs...);
}

?

@pentschev
Member Author

So to be clear, are you proposing that libraries that depend on RMM, such as libcudf, would require wrapping all kernel invocations in something functionally equivalent to:

template<typename Lambda, typename... Targs>
void lockAndLaunchAsync(Lambda&& func, cudaStream_t stream, Targs... Fargs)
{
    // Ensure all (possibly spilled) deferrable buffers used by this launch
    // are resident and locked for the duration of the asynchronous work.
    retrieveAndLockDeferrablesStream(stream, Fargs...);
    std::forward<Lambda>(func)(stream, Fargs...);
}

?

Yes, if they are to allow memory to be controlled by a Python application, such as Dask. Libraries that don't opt in would have to do spilling themselves, or not at all.

@jrhemstad
Contributor

jrhemstad commented Aug 6, 2019

Yes, if they are to allow memory to be controlled by a Python application, such as Dask. Libraries that don't opt in would have to do spilling themselves, or not at all.

In that case, I do not think this is a tenable proposal at the level it is currently being proposed.

Primarily, it would be a violation of separation of concerns and pushes too much complexity too low down the stack.

For example, consider libcudf which depends on RMM. It is designed with the fundamental assumption that all functions are running on a single GPU, and that all data fits in GPU memory. It is unconcerned about multi-GPU, memory spilling, etc. The current proposal would break those fundamental assumptions.

Furthermore, this would push a significant new level of complexity down into the internals of the library (i.e., wrapping every kernel invocation), and only in support of some of the library's end users (i.e., only the users who are using a system that provides memory spilling, like Dask). For users who don't care about memory spilling, there's no way to opt out of the extra complexity. The same would be true of libcuml and libcugraph.

As such, this functionality should be handled by a layer above RMM or libcudf/cuml/cugraph. Instead of wrapping every internal kernel call of these libraries with a function to ensure all data is available, it should be the responsibility of the caller of these libraries to wrap the external APIs to ensure the necessary input data is available before the function is invoked. In this way, the core libraries are isolated from being concerned about memory spilling as it is instead the concern of a higher layer.

@pentschev
Member Author

Furthermore, this would push a significant new level of complexity down into the internals of the library (i.e., wrapping every kernel invocation), and only in support of some of the library's end users (i.e., only the users who are using a system that provides memory spilling, like Dask). For users who don't care about memory spilling, there's no way to opt out of the extra complexity. The same would be true of libcuml and libcugraph.

Indeed, this would incur additional complexity, but I disagree with the opt-out part. We could certainly add compile-time check(s) to disable this functionality completely, or even runtime ones that would basically just ignore the possibility of data being deferred, for example.
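
As an illustration only, a minimal sketch of what such a compile-time opt-out could look like; the RMM_ENABLE_DEFERRABLE macro and the retrieveAndLockDeferrablesStream hook are hypothetical names taken from this discussion, not an existing API:

#include <cuda_runtime.h>

// Hypothetical compile-time switch (illustration only): when
// RMM_ENABLE_DEFERRABLE is not defined, the locking hook compiles down to a
// no-op, so wrapped kernel launches add no overhead for users who don't use
// spilling.
#ifdef RMM_ENABLE_DEFERRABLE
template <typename... Targs>
void retrieveAndLockDeferrablesStream(cudaStream_t stream, Targs&... args);
// Real implementation (elided): bring any spilled buffers in `args` back to
// the device and keep them locked while work on `stream` uses them.
#else
// Compiled-out version: a no-op that the optimizer removes entirely.
template <typename... Targs>
inline void retrieveAndLockDeferrablesStream(cudaStream_t, Targs&...) {}
#endif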

As such, this functionality should be handled by a layer above RMM or libcudf/cuml/cugraph. Instead of wrapping every internal kernel call of these libraries with a function to ensure all data is available, it should be the responsibility of the caller of these libraries to wrap the external APIs to ensure the necessary input data is available before the function is invoked. In this way, the core libraries are isolated from being concerned about memory spilling as it is instead the concern of a higher layer.

Maybe a bit more context on the motivation for this proposal would help here. We currently have cases (I think mostly or entirely in nvstrings functions) where the amount of memory consumed by cuDF (through internal allocations, possibly for temporary memory) goes beyond the device limit, even after Dask has spilled all the Python-side memory to the host. The proposal here would allow even memory that is used for temporary buffers on the C++ side to be spilled to disk, helping applications be more resilient to memory shortages (at the cost of slowing them down due to spilling).

We currently have some open issues where, on the Python side alone, we can't do anything other than let the application crash and force the user to chunk the data into smaller pieces:

rapidsai/dask-cuda#57
rapidsai/dask-cuda#99
rapidsai/cudf#2321

To be honest, I don't want to claim that the proposal here is the best or only solution, but currently I can't think of a different one. The fact is that on the Python side of things we really can't do more than what we already do, but I am open to discussing further alternatives, should anyone think of any.

@jrhemstad
Contributor

We currently have cases (I think mostly or entirely in nvstrings functions)

If the issue is limited to NVStrings functions, then we should wait until NVStrings is refactored into libcudf before making any decisions.

The proposal here would allow even memory that is used for temporary buffers on the C++ side to be spilled to disk

Okay, so the reason you're saying it's insufficient to keep the spilling logic a layer above these libraries is that you want to be able to spill temporary memory. However, I'm not sure that's actually useful.

The nature of temporary memory allocations is that they're usually allocated and then used immediately. So let's imagine you're at a point where a temp allocation will cause OOM. As you've said, you've already spilled everything you can. There are two cases:

  1. Temp memory cannot be spilled (status quo)
    • The allocation fails, throwing a bad_alloc exception (this may not be 100% the case today, but it eventually will be).
  2. Temp memory can be spilled
    • The temp allocation is "deferred", doesn't actually allocate anything, and succeeds.
    • The temp allocation is needed for some kernel, so we call lockAndLaunchAsync to make sure the data we need is actually available and then launch the kernel.
    • Calling retrieveAndLockDeferrablesStream on the temp allocation will attempt to actually allocate the memory, which will then cause OOM (like you said, everything that can be spilled already has been).
    • You're back in the same situation as case 1.

Perhaps this is orthogonal, but if you're interested in memory spilling, why not just try using RMM in managed memory mode (pool or no-pool)?

@pentschev
Member Author

The proposal in itself is not about deferring allocations, but about deferring valid memory buffers, holding data, that are not currently in use (just like Dask does: keep an LRU cache and start spilling the least recently used buffers).

To exemplify, imagine a complex function that needs an input (d_in), two temporary buffers (d_tmp1, d_tmp2) and an output (d_out): it takes the input, runs two different kernels (each storing to a different temporary buffer) and finally merges their outputs. Assume we have 16GB of device memory and that each buffer consumes 5GB:

DeferrableMemory complex_function(DeferrableMemory& d_in)
{
    // 5GB used by d_in

    DeferrableMemory d_tmp1; // 10GB used
    lockAndLaunch(k1, d_tmp1, d_in);

    DeferrableMemory d_tmp2; // 15GB used
    lockAndLaunch(k2, d_tmp2, d_in);

    DeferrableMemory d_out; // would require 20GB, exceeding the 16GB device
    // If we track buffers with an LRU, d_in (the least recently used) gets
    // spilled and device consumption stays at 15GB.
    lockAndLaunch(k_merge, d_out, d_tmp1, d_tmp2);

    return d_out;
}

I hope this makes things clearer. In a situation like this, I don't think a memory pool would help.

@jrhemstad
Contributor

I hope this makes things clearer. In a situation like this, I don't think a memory pool would help.

Yes, that is clearer.

In a situation like this, I don't think a memory pool would help.

Correct, a memory pool would not help. However, if everything was allocated with managed memory, then your example would "just work" and the GPU would take care of paging the memory as necessary.

I wonder how much we should be relying on Dask to attempt to do memory spilling vs. just using managed memory to do the spilling for us.

@pentschev
Member Author

However, if everything was allocated with managed memory, then your example would "just work" and the GPU would take care of paging the memory as necessary.

By paging here, are you saying that RMM could potentially move that memory to host memory, is that correct? If so, is this implemented today? I looked through the code about a month ago and I don't think I saw that.

I wonder how much we should be relying on Dask to attempt to do memory spilling vs. just using managed memory to do the spilling for us.

This could be on either end. I only fear that once we have memory managed on the Python side too, we could end up with both sides not knowing of the tracking and policies from the other. In other words, it would be best to have a single memory manager, or memory managers would have to be able to communicate with each other and sync on a policy to do things like memory spilling, which could be even more complex.

@jrhemstad
Contributor

jrhemstad commented Aug 6, 2019

By paging here, are you saying that RMM could potentially move that memory to host memory, is that correct?

What I'm talking about is independent of RMM; it's cudaMallocManaged, which goes by a few names: managed memory, UVM, unified memory.

The basic idea is that you allocate some memory that is accessible to both host/device, and it will automatically be paged back and forth between host/device memory. It also provides the option for explicit user control over paging. It has negative performance implications, but so does the manual spilling being done in Dask.

More info:
https://devblogs.nvidia.com/beyond-gpu-memory-limits-unified-memory-pascal/
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
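
If it helps to visualize the oversubscription behavior, here's a minimal sketch (not part of the proposal; it assumes a Pascal-or-newer GPU on Linux and simply over-allocates and touches managed memory):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Deliberately allocate more than the device physically has; with
    // cudaMallocManaged this succeeds and pages migrate between host and
    // device on demand.
    size_t n = (total_bytes / sizeof(float)) * 2;
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMallocManaged failed\n");
        return 1;
    }

    // Touching pages on the host faults them into host memory; a kernel
    // launched afterwards would fault the pages it touches back to the GPU.
    for (size_t i = 0; i < n; i += 4096 / sizeof(float)) {
        data[i] = 1.0f;
    }

    cudaFree(data);
    return 0;
}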

@pentschev
Member Author

But will cudaMallocManaged migrate the data automatically if the device is running out of memory? What I know is that it will move data from GPU to CPU when the CPU tries to access it; in our case the CPU will not actively try to access that data, so will it still be moved? The first link gives me the impression that it does, but I've never tried using it in that fashion.

If the case above is true, should nvstrings not crash if we just enable managed memory?

@jrhemstad
Contributor

But will cudaMallocManaged migrate the data automatically if the device is running out of memory? What I know is that it will move data from GPU to CPU when the CPU tries to access it; in our case the CPU will not actively try to access that data, so will it still be moved? The first link gives me the impression that it does, but I've never tried using it in that fashion.

Yes, it will page out unused memory allocated with cudaMallocManaged to make room for new memory on a per-page basis, enabling oversubscription. The CPU does not need to be involved.

It's just like standard virtual memory w/ CPUs where you can allocate and use more memory than the actual RAM that you have. (Hence UVM, Unified Virtual Memory).

@pentschev
Member Author

It's just like standard virtual memory w/ CPUs where you can allocate and use more memory than the actual RAM that you have. (Hence UVM, Unified Virtual Memory).

Yes, but even on the CPU this implies that memory was allocated but not yet written to (or, more broadly, not yet paged in). The example I posted before assumes that the memory has already been paged in (after each kernel call). I guess cudaMallocManaged wouldn't work like that, would it? Note that spilling moves data from already-populated device memory to the host and then back to the device when requested; it doesn't rely solely on some page-fault mechanism.

@pentschev
Member Author

Thanks for the links @jrhemstad. In the meantime, I was also looking at the documentation which states:

Managed memory on such GPUs may be evicted from device memory to host memory at any time by the Unified Memory driver in order to make room for other allocations.

This indeed sounds like what I wanted, and it may well improve our case. Let's wait and see some results. :)

@harrism
Member

harrism commented Aug 6, 2019

@pentschev you seem to have put a lot of work into this proposal, and it's really well documented. Thanks for that. However, I think in the future you should probably start a discussion like this first, to save yourself the effort. Jake and I both had the same reaction to the proposal: virtual memory is the ideal solution here; adding complexity to every kernel launch in every library is not really an option.

Note one limitation of Unified Memory is you can't IPC addresses allocated with cudaMallocManaged today. Because of this it's an open issue to create the ability for RMM to support multiple pools that use different underlying allocators. Presumably IPC buffers could be kept small enough so that most memory could be allocated to the managed pool.

@pentschev
Member Author

@pentschev you seem to have put a lot of work into this proposal, and it's really well documented. Thanks for that. However, I think in the future you should probably start a discussion like this first, to save yourself the effort.

I think the effort of putting a proposal in place to start a discussion is worth it, even if it ends up being rejected. It makes the issues and goals clearer to everyone, and it presents a potential solution too.

Jake and I both had the same reaction to the proposal: virtual memory is the ideal solution here; adding complexity to every kernel launch in every library is not really an option.

In general I agree: virtual memory is a good and simple option. In the memory spilling case it has some shortcomings, such as not being able to eventually spill to disk, but these may never pose a problem, so let's worry about them only if they do.

Note one limitation of Unified Memory is you can't IPC addresses allocated with cudaMallocManaged today. Because of this it's an open issue to create the ability for RMM to support multiple pools that use different underlying allocators. Presumably IPC buffers could be kept small enough so that most memory could be allocated to the managed pool.

I am guessing here that you're talking about IPC because of the concern I raised earlier in #112 (comment), where I mentioned Python and C++ will not know about each other's memory consumption, is that right?

In that case, I'm not necessarily talking about IPC to transfer memory, but rather just saying that they need to know there's other work consuming device memory; otherwise, the easy assumption is that they have the device entirely to themselves. On the dask-cuda side, we can actively watch memory consumption and start spilling/evicting once the device reaches a threshold, but that may still have effects on the C++ side if it's much faster than dask-cuda.

Regardless, I'll try out memory pooling today for the issues I know of and see how that behaves. Hopefully it will just handle our case beautifully and no further work will be necessary.

Thanks both @jrhemstad and @harrism for the comments!

@jrhemstad
Contributor

jrhemstad commented Aug 7, 2019

I am guessing here that you're talking about IPC because of the concern I raised earlier in #112 (comment), where I mentioned Python and C++ will not know about each other's memory consumption, is that right?

@harrism is just saying that managed memory (today) cannot be shared via IPC. So if for whatever reason you want to share some memory with another process via IPC, that cannot be done if the memory was allocated as managed memory; it would need to be copied to non-managed memory before it can be IPC'ed. With the usage of UCX, I do not think IPC is as important as it once was, but I may be mistaken.

Regardless, I'll try out memory pooling today for the issues I know of and see how that behaves. Hopefully it will just handle our case beautifully and no further work will be necessary.

Using managed memory should "just work" for oversubscription; however, if you just rely on the GPU to do all the page faulting for you, performance can be quite poor. In general, I suspect we'll want to build logic into Dask to explicitly prefetch pages to/from device/host memory in order to get the best performance (like you're doing now). However, instead of doing it manually with cudaMemcpys, it's done using cudaMemPrefetchAsync.
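
As a rough sketch of what that could look like (the helper names are hypothetical; they simply wrap cudaMemPrefetchAsync and assume the buffers were allocated with cudaMallocManaged):

#include <cuda_runtime.h>

// Migrate the pages backing a managed buffer to the given device, typically
// just before kernels on `stream` will use it.
void prefetch_to_device(const void* ptr, size_t bytes, int device, cudaStream_t stream)
{
    cudaMemPrefetchAsync(ptr, bytes, device, stream);
}

// Evict a managed buffer to host memory to make room on the device;
// cudaCpuDeviceId tells the driver the destination is the host.
void evict_to_host(const void* ptr, size_t bytes, cudaStream_t stream)
{
    cudaMemPrefetchAsync(ptr, bytes, cudaCpuDeviceId, stream);
}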

@pentschev
Member Author

@harrism is just saying that managed memory (today) cannot be shared via IPC. So if for whatever reason you want to share some memory with another process via IPC, that cannot be done if the memory was allocated as managed memory; it would need to be copied to non-managed memory before it can be IPC'ed. With the usage of UCX, I do not think IPC is as important as it once was, but I may be mistaken.

I am also not absolutely sure, but I tend to agree that IPC won't be as important after UCX, and we're mostly converging to using UCX anyway. That said, I think IPC is not as big a concern at the moment.

Using managed memory should "just work" for oversubscription; however, if you just rely on the GPU to do all the page faulting for you, performance can be quite poor. In general, I suspect we'll want to build logic into Dask to explicitly prefetch pages to/from device/host memory in order to get the best performance (like you're doing now). However, instead of doing it manually with cudaMemcpys, it's done using cudaMemPrefetchAsync.

Yes, Dask does this explicitly, but it doesn't directly control how the copy occurs; that is done by whatever library produces the device array (usually Numba or CuPy). In that sense, Dask also won't ever know or care about page faulting, but the libraries it builds upon will (or should).

Is there any mechanism to improve page faulting in libraries such as cuDF when using RMM's managed memory? If not, is this something that is doable?

@jrhemstad
Contributor

Is there any mechanism to improve page faulting in libraries such as cuDF when using RMM's managed memory? If not, is this something that is doable?

cuDF et al. should remain agnostic to the spilling/page faulting. In order to ensure the best performance, the caller of a cuDF function (in this case, Dask) should ensure that the input memory is resident on the device before calling the function. This can be done via the cudaMemPrefetchAsync API I mentioned. If at any time GPU memory is too full, you can evict memory from the GPU to the host by also using cudaMemPrefetchAsync, prefetching from GPU to CPU.
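
A hedged sketch of that caller-side pattern, where the library call itself is a hypothetical placeholder and the buffers are assumed to be managed allocations:

#include <cuda_runtime.h>

// Caller-side wrapper: make the (managed) input buffer resident on the device
// before invoking a library function, and optionally evict the output
// afterwards if device memory is tight.
void call_with_resident_input(const void* in, size_t in_bytes,
                              void* out, size_t out_bytes,
                              int device, cudaStream_t stream)
{
    // Prefetch input pages to the device so the library's kernels don't pay
    // for page faults.
    cudaMemPrefetchAsync(in, in_bytes, device, stream);

    // some_library_function(in, out, stream);  // hypothetical library call

    // If needed, migrate the output back to host memory until it is used again.
    cudaMemPrefetchAsync(out, out_bytes, cudaCpuDeviceId, stream);
}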

@pentschev
Member Author

cuDF et al. should remain agnostic to the spilling/page faulting. In order to ensure the best performance, the caller of a cuDF function (in this case, Dask) should ensure that the input memory is resident on the device before calling the function.

Right, you were talking about the caller; I was thinking of cuDF's internal memory. Yeah, I think we can assume the Python caller will already pass a device pointer, so we don't really have to worry about page faults in that situation. Thanks for clarifying.

@nsakharnykh

@pentschev let us know how the Unified Memory testing is going. The last presentation referenced by Jake, from GTC'19, is exactly about using RMM with managed memory for RAPIDS workloads. We would love to hear any feedback, and to see how we can improve performance in scenarios that oversubscribe GPU memory, with prefetching and other hints as necessary.

@harrism harrism changed the title Add deferrable memory management enhancement proposal [DISCUSSION] Add deferrable memory management enhancement proposal Aug 8, 2019
@harrism harrism added the question Further information is requested label Aug 8, 2019
@harrism
Member

harrism commented Dec 6, 2019

@pentschev is this PR still needed / relevant?

@pentschev
Member Author

I don't think so. I've tested RMM managed memory with dask-cuda and that seemed to solve the issue. Unfortunately, I haven't heard back from users on whether they've had the chance to test it as well. Thanks everyone for the input here!

@pentschev pentschev closed this Dec 6, 2019