[DISCUSSION] Add deferrable memory management enhancement proposal #112
Conversation
Includes a high-level overview and a suggested C++ implementation; details on the Python implementation are still missing.
So to be clear, are you proposing that libraries that depend on RMM, such as libcudf, would require wrapping all kernel invocations in something functionally equivalent to a lock-and-launch helper like the sketch below?
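A minimal sketch of what such a wrapper might look like, assuming a hypothetical `DeferrableMemory` type and `lockAndLaunch` helper (illustrative names, not existing RMM APIs) that pin all operands in device memory for the duration of a launch:

```cpp
// Hypothetical type standing in for the proposal's deferrable buffer;
// the real proposal would define something like this inside RMM.
class DeferrableMemory {
 public:
  void lock()   { /* page the buffer into device memory and pin it */ }
  void unlock() { /* allow the buffer to be spilled again */ }
  void* data()  { return ptr_; }  // raw device pointer, valid while locked
 private:
  void* ptr_ = nullptr;
};

// Illustrative wrapper: every kernel invocation in the library would have to
// go through something like this so that all operands are device-resident and
// cannot be spilled while the kernel is running.
template <typename Kernel, typename... Buffers>
void lockAndLaunch(Kernel&& kernel, Buffers&... buffers)
{
  (buffers.lock(), ...);      // make every operand resident and pinned
  kernel(buffers.data()...);  // actual launch configuration omitted for brevity
  (buffers.unlock(), ...);    // operands become spillable again
}
```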
Yes, if they are to allow memory to be controlled by a Python application such as Dask. Libraries that don't opt in would have to do spilling themselves, or not at all.
In that case, I do not think this is a tenable proposal at the level at which it is currently being proposed. Primarily, it would be a violation of separation of concerns and would push too much complexity too low down the stack.

For example, consider libcudf: this would push a significant new level of complexity down into the internals of the library (i.e., wrapping every kernel invocation), and only in support of some of the library's end users (i.e., only those using a system that provides memory spilling, like Dask). For users who don't care about memory spilling, there's no way to opt out of the extra complexity. The same would be true of libcuml and libcugraph.

As such, this functionality should be handled by a layer above RMM or libcudf/cuml/cugraph. Instead of wrapping every internal kernel call of these libraries with a function that ensures all data is available, it should be the responsibility of the caller of these libraries to wrap the external APIs and ensure the necessary input data is available before the function is invoked. In this way, the core libraries are isolated from any concern about memory spilling, which becomes the concern of a higher layer instead.
Indeed, this would incur additional complexity, but I disagree with the opt-out part. We could certainly add compile-time check(s) to disable this functionality completely, or even runtime ones that would basically just ignore the possibility of data being deferred, for example.
Maybe a bit more context on the motivation for this proposal would be helpful here. We currently have cases (I think mostly or entirely in nvstrings functions) where the amount of memory consumed by cuDF through internal allocations, possibly for temporary memory, goes beyond the device limit, even after Dask has spilled all Python-side memory to the host. The proposal here would allow even memory used for temporary buffers on the C++ side to be spilled to disk, helping applications be more resilient to memory shortages (at the cost of slowing them down due to spilling). We currently have some issues open where, just from the Python side, we can't do anything other than let the application crash and force the user to chunk data into smaller pieces: rapidsai/dask-cuda#57

To be honest, I don't want to claim that the proposal here is the best or only solution, but currently I can't think of a different one. The fact is that on the Python side of things we really can't do more than what we already do, but I am open to discussing further alternatives, should anyone think of any.
If the issue is limited to NVStrings functions, then we should wait until NVStrings is refactored into libcudf before making any decisions.
Okay, so the reason you're saying it's insufficient to keep the spilling logic a layer above these libraries is that you want to be able to spill temporary memory. However, I'm not sure that's actually useful. The nature of temporary memory allocations is that they're usually allocated and then used immediately. So let's imagine you're at a point where a temp allocation will cause OOM. As you've said, you've already spilled everything you can. There are two cases:
Perhaps this is orthogonal, but if you're interested in memory spilling, why not just try using RMM in managed memory mode (pool or no-pool)?
The proposal in itself is not about deferring allocation, but about deferring valid memory buffers holding data that is not currently in use (just like Dask does: keep an LRU cache and start spilling the least recently used). To exemplify, imagine a complex function that needs an input (`d_in`), allocates two temporary buffers, and returns an output:

```cpp
DeferrableMemory complex_function(DeferrableMemory& d_in)
{
  // 5GB used by d_in
  DeferrableMemory d_tmp1; // 10GB used
  lockAndLaunch(k1, d_tmp1, d_in);
  DeferrableMemory d_tmp2; // 15GB used
  lockAndLaunch(k2, d_tmp2, d_in);
  DeferrableMemory d_out; // 20GB used
  // If we track memory with an LRU, d_in gets spilled and consumption is only 15GB.
  lockAndLaunch(k_merge, d_out, d_tmp1, d_tmp2);
  return d_out;
}
```

I hope this makes things clearer. In a situation like this, I don't think a memory pool would help.
Yes, that is clearer.
Correct, a memory pool would not help. However, if everything were allocated with managed memory, then your example would "just work" and the GPU would take care of paging the memory as necessary. I wonder how much we should be relying on Dask to attempt to do memory spilling vs. just using managed memory to do the spilling for us.
By paging here, are you saying that RMM could potentially move that memory to host memory? If so, is this implemented today? I looked through the code about a month ago and I don't think I saw that.
This could be on either end. I only fear that once we have memory managed on the Python side too, we could end up with both sides not knowing about each other's tracking and policies. In other words, it would be best to have a single memory manager; otherwise the memory managers would have to be able to communicate with each other and sync on a policy for things like memory spilling, which could be even more complex.
What I'm talking about is independent of RMM; it's using `cudaMallocManaged` (CUDA Unified Memory). The basic idea is that you allocate some memory that is accessible to both host and device, and it will automatically be paged back and forth between host and device memory. It also provides the option for explicit user control over paging. It has negative performance implications, but so does the manual spilling being done in Dask. More info:
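As an aside (not from the thread), a minimal sketch of the oversubscription behavior described above, assuming a Pascal-or-later GPU on Linux, where managed memory can exceed the device's capacity and is paged on demand:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  // Deliberately allocate more than a typical GPU's memory (adjust to taste);
  // with managed memory the driver pages data between host and device on demand.
  const size_t n_bytes = 32ULL << 30;  // 32 GiB, for illustration
  char* buf = nullptr;
  cudaError_t err = cudaMallocManaged(&buf, n_bytes);
  if (err != cudaSuccess) {
    std::printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Touching the memory from the host creates the pages in host memory.
  buf[0] = 1;
  buf[n_bytes - 1] = 2;

  // A kernel dereferencing buf would fault the needed pages into device
  // memory, evicting older pages back to the host if the GPU fills up.
  cudaFree(buf);
  return 0;
}
```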
But will `cudaMallocManaged` also page out memory that is not currently in use? If the case above is true, should nvstrings not crash if we just enable managed memory?
Yes, it will page out unused memory allocated with `cudaMallocManaged`. It's just like standard virtual memory with CPUs, where you can allocate and use more memory than the actual RAM that you have (hence UVM, Unified Virtual Memory).
Yes, but even on the CPU, this implies that memory was allocated but not yet written to (or, more broadly, not yet paged in). The example I posted before assumes that the memory has already been paged in (after each kernel call).
@pentschev You may enjoy these wonderful presentations from @nsakharnykh to learn more about managed memory:
Thanks for the links @jrhemstad. In the meantime, I was also looking at the documentation which states:
This indeed sounds like what I wanted, and possibly also improves our case. Let's wait to see some results. :)
@pentschev you seem to have put a lot of work into this proposal, and it's really well documented. Thanks for that. However, I think in the future you should probably start a discussion like this first to save yourself the effort. Jake and I both had the same reaction to the proposal: virtual memory is the ideal solution here; adding complexity to every kernel launch in every library is not really an option.

Note that one limitation of Unified Memory is that you can't IPC addresses allocated with cudaMallocManaged today. Because of this, it's an open issue to create the ability for RMM to support multiple pools that use different underlying allocators. Presumably IPC buffers could be kept small enough that most memory could be allocated from the managed pool.
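To make the IPC limitation concrete, a small sketch (not from the thread): exporting a managed allocation through CUDA IPC is expected to fail, while a regular `cudaMalloc` allocation can be exported:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  void* managed = nullptr;
  cudaMallocManaged(&managed, 1 << 20);

  // IPC handles cannot be created for managed allocations today, so this
  // call should report an error rather than succeed.
  cudaIpcMemHandle_t handle;
  cudaError_t err = cudaIpcGetMemHandle(&handle, managed);
  std::printf("managed allocation: %s\n", cudaGetErrorString(err));

  // A plain device allocation can be exported and opened by another process.
  void* device = nullptr;
  cudaMalloc(&device, 1 << 20);
  err = cudaIpcGetMemHandle(&handle, device);
  std::printf("device allocation:  %s\n", cudaGetErrorString(err));

  cudaFree(managed);
  cudaFree(device);
  return 0;
}
```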
I think the effort of putting a proposal in place to start a discussion is worth it, even if it ends up being rejected. It makes the issues and goals clearer to everyone, and presents a potential solution too.
In general I agree: virtual memory is a good and simple option. For memory spilling it has some shortcomings, such as not being able to eventually spill to disk, but those may never matter in practice, so let's worry about them if they ever do.
I am guessing here you're talking about IPC because of the concern I raised earlier in #112 (comment), where I mentioned Python and C++ will not know about each other's memory consumption, is that right? In that case, I'm not necessarily talking about IPC to transfer memory, but rather just saying that they need to know there's other work taking device memory; otherwise the easy assumption is that they have the device entirely to themselves. On the dask-cuda side, we can actively watch memory consumption and start spilling/evicting once the device reaches a threshold, but that still may have effects on the C++ side if it's much faster than dask-cuda. Regardless, I'll try out memory pooling today for the issues I know of and see how that behaves. Hopefully it will just handle our case beautifully and no further work will be necessary. Thanks to both @jrhemstad and @harrism for the comments!
@harrism is just saying that managed memory (today) cannot be shared via IPC. So if you for whatever reason want to share some memory with another process via IPC, it cannot be done if it was allocated as managed memory; it would need to be copied to non-managed memory before being able to be IPC'ed. With the usage of UCX, I do not think IPC is as important as it once was, but I may be mistaken.
Using managed memory should "just work" for oversubscription; however, if you just rely on the GPU to do all the page faulting for you, performance can be quite poor. In general, I suspect we'll want to build logic into Dask to explicitly prefetch pages to/from device/host memory in order to get the best performance (like you're doing now). However, instead of doing it manually with device/host copies, that control could come from Unified Memory prefetch hints.
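For illustration, a sketch (not from the thread) of the kind of explicit control being suggested, using `cudaMemPrefetchAsync` to move managed pages to the device before a computation and back to the host afterwards; `process_on_device` is a hypothetical helper and `buf` is assumed to come from `cudaMallocManaged`:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

void process_on_device(char* buf, std::size_t n_bytes, cudaStream_t stream)
{
  int device = 0;
  cudaGetDevice(&device);

  // Explicitly migrate the managed pages to the GPU ahead of time instead of
  // relying on demand paging when a kernel first touches them.
  cudaMemPrefetchAsync(buf, n_bytes, device, stream);

  // ... enqueue kernels that read/write buf on `stream` ...

  // Once the data is no longer needed on the GPU, push it back to host memory
  // so other allocations have room (this replaces a manual spill/copy).
  cudaMemPrefetchAsync(buf, n_bytes, cudaCpuDeviceId, stream);
}
```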
I am also not absolutely sure, but I tend to agree that IPC won't be as important after UCX, and we're mostly converging to using UCX anyway. That said, I think IPC is not as big a concern at the moment.
Yes, Dask does this explicitly, but it doesn't directly control how the copy occurs; that is done by whatever library produces the device array (usually Numba or CuPy). In that sense, Dask also won't ever know or care about page faulting, but the libraries that it builds upon will (or should). Is there any mechanism to improve page faulting in libraries such as cuDF when using RMM's managed memory? If not, is this something that is doable?
Right, you were talking about the caller; I was thinking of cuDF's internal memory. Yeah, I think we can assume the Python caller will already pass a device pointer, so we don't really have to worry about page faults in that situation. Thanks for clarifying.
@pentschev let us know how the Unified Memory testing is going. The last presentation referenced by Jake from GTC’19 is exactly about using RMM with managed memory for RAPIDS workloads. Would love to hear any feedback, and see how we can improve performance in scenarios oversubscribing GPU memory with prefetching and other hints as necessary.
@pentschev is this PR still needed / relevant?
I don't think so. I've tested RMM managed memory with dask-cuda and that seemed to solve the issue. Unfortunately I haven't heard back from users on whether they've had the chance to test that as well. Thanks everyone for the input here!
This PR is an RMM enhancement proposal introducing deferrable memory management, which allows C++ code to expose all of its CUDA buffers to Python so that an application on the Python side (e.g., Dask) can later control them and spill the memory to another storage medium (such as host memory).
The motivation for this proposal comes from several use cases where the memory available on a device doesn't suffice for large enough problems, and the device memory allocated in C++ far outsizes that allocated in Python. We have a couple of issues describing the problem:
The proposal contains a high-level overview and a suggested C++ implementation. I preferred to defer the details of the Python implementation in favor of getting a discussion started more quickly.
cc @harrism @kkraus14 @VibhuJawa @randerzander @galipremsagar @mrocklin