Device memory spill support #35
Changes from 38 commits
@@ -1 +1,2 @@
from .local_cuda_cluster import LocalCUDACluster
from . import config
@@ -0,0 +1,14 @@
import yaml
import os

import dask

config = dask.config.config


fn = os.path.join(os.path.dirname(__file__), "cuda.yaml")
dask.config.ensure_file(source=fn)
with open(fn) as f:
    dask_cuda_defaults = yaml.safe_load(f)

dask.config.update_defaults(dask_cuda_defaults)
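As an aside (not part of the diff): once this module has been imported, the merged defaults are visible through dask's normal config machinery. A minimal sketch, assuming the key names from the cuda.yaml shown in the next hunk:

    import dask
    import dask_cuda  # importing the package runs the config module above and merges cuda.yaml

    # Read the registered device-memory fractions; `default=` guards against
    # running without the dask-cuda defaults installed.
    target = dask.config.get("distributed.worker.device-memory.target", default=None)
    spill = dask.config.get("distributed.worker.device-memory.spill", default=None)
    print(target, spill)  # 0.6 0.7 with the defaults below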
@@ -0,0 +1,8 @@
distributed:
  worker:
    # Fractions of device memory at which we take action to avoid memory blowup
    # Set any of the lower three values to False to turn off the behavior entirely
    device-memory:
      target: 0.60   # target fraction to stay below
      spill: 0.70    # fraction at which we spill to host
      pause: 0.0     # fraction at which we pause worker threads; disabled (0.0) by default, to prevent the worker from stalling when third-party memory managers are in use and cannot be disabled
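These defaults behave like any other dask configuration value, so they can be overridden without editing the YAML file. A small sketch (illustrative value, not part of the diff):

    import dask

    # Raise the spill threshold to 80% of device memory for code run inside
    # this block; outside it the cuda.yaml default (0.70) applies.
    with dask.config.set({"distributed.worker.device-memory.spill": 0.80}):
        ...  # build the cluster / run work here

Editing the copy of cuda.yaml that ensure_file installs into the user's dask config directory makes such an override permanent.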
@@ -0,0 +1,104 @@
from zict import Buffer, File, Func
from zict.common import ZictBase
from distributed.protocol import deserialize_bytes, serialize_bytes
from distributed.worker import weight

from functools import partial
import os


def _is_device_object(obj):
    """
    Check if obj is a device object, by checking if it has a
    __cuda_array_interface__ attribute
    """
    return hasattr(obj, "__cuda_array_interface__")


def _serialize_if_device(obj):
    """ Serialize an object if it's a device object """
    if _is_device_object(obj):
        return serialize_bytes(obj, on_error="raise")
    else:
        return obj


def _deserialize_if_device(obj):
    """ Deserialize an object if it's an instance of bytes """
    if isinstance(obj, bytes):
        return deserialize_bytes(obj)
    else:
        return obj


class DeviceHostFile(ZictBase):
""" Manages serialization/deserialization of objects. | ||
|
||
Three LRU cache levels are controlled, for device, host and disk. | ||
Each level takes care of serializing objects once its limit has been | ||
reached and pass it to the subsequent level. Similarly, each cache | ||
may deserialize the object, but storing it back in the appropriate | ||
cache, depending on the type of object being deserialized. | ||
|
||
Parameters | ||
---------- | ||
device_memory_limit: int | ||
Number of bytes of CUDA device memory for device LRU cache, | ||
spills to host cache once filled. | ||
memory_limit: int | ||
Number of bytes of host memory for host LRU cache, spills to | ||
disk once filled. | ||
local_dir: path | ||
Path where to store serialized objects on disk | ||
""" | ||
|
||
def __init__( | ||
self, device_memory_limit=None, memory_limit=None, local_dir="dask-worker-space" | ||
): | ||
path = os.path.join(local_dir, "storage") | ||
|
||
self.host_func = dict() | ||
self.disk_func = Func( | ||
partial(serialize_bytes, on_error="raise"), deserialize_bytes, File(path) | ||
) | ||
self.host_buffer = Buffer( | ||
self.host_func, self.disk_func, memory_limit, weight=weight | ||
) | ||
|
||
self.device_func = dict() | ||
self.device_host_func = Func( | ||
_serialize_if_device, _deserialize_if_device, self.host_buffer | ||
) | ||
self.device_buffer = Buffer( | ||
self.device_func, self.device_host_func, device_memory_limit, weight=weight | ||
) | ||
|
||
self.device = self.device_buffer.fast | ||
self.host = self.host_buffer.fast | ||
self.disk = self.host_buffer.slow | ||
It looks like these are LRU objects. Maybe:

    self.device = self.device_buffer.fast.d
    self.host = self.host_buffer.fast.d
    self.disk = self.host_buffer.slow

I'm sorry, I thought the LRU is what we wanted to give access to the users, no? They can give more information, such as the used and total memory from the LRU, not just what objects are in there.

By the way, print …

I think that most users will be somewhat confused by the … Open to counter-arguments though. I don't have strong confidence in what will happen. Users are confused today by …

I see your point, and I think there are two benefits of keeping it the way it is:

Getting the LRU if they only have the dict is not possible, therefore I would be more fond of having another alias for the dict (…).

I understand that now (there was no documentation for this). In that sense, I would suggest having the same aliases (…).

I think it's unlikely that the average user will know to add … If they want LRU information (which I think will be the uncommon case) then they can go up to the full …

Agreed (though I would rename …). Want to submit a short PR or issue?

Fine by me, I will do the change.

That's what I had in mind, more or less the same as I did here.

Sure, let me first finish the changes here.
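For reference, a small sketch of the LRU-versus-dict distinction discussed in this thread, assuming zict's Buffer/LRU API (not part of the PR; the diff resumes after the sketch). Buffer.fast is an LRU object that carries bookkeeping such as the total stored weight, while its .d attribute is the plain dict of items it currently holds:

    from zict import Buffer

    fast, slow = dict(), dict()
    # Keep at most 8 "bytes" of weight in fast storage, overflow into slow
    buff = Buffer(fast, slow, 8, weight=lambda k, v: len(v))
    buff["x"] = b"abcd"

    lru = buff.fast          # zict.LRU wrapping the fast dict
    print(lru.total_weight)  # bookkeeping only the LRU provides (4 here)
    print(lru.d)             # the underlying dict: {'x': b'abcd'}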
    def __setitem__(self, key, value):
        if _is_device_object(value):
            self.device_buffer[key] = value
        else:
            self.host_buffer[key] = value

    def __getitem__(self, key):
        if key in self.host_buffer:
            obj = self.host_buffer[key]
            del self.host_buffer[key]
            self.device_buffer[key] = _deserialize_if_device(obj)

        if key in self.device_buffer:
            return self.device_buffer[key]
        else:
            raise KeyError

    def __len__(self):
        return len(self.device_buffer)

    def __iter__(self):
        return iter(self.device_buffer)

    def __delitem__(self, i):
        del self.device_buffer[i]
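A usage sketch for the class above (not part of the diff; it assumes a CUDA-capable GPU, a library exposing __cuda_array_interface__ such as CuPy, and a hypothetical module path for the import):

    import numpy as np
    import cupy

    from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

    # Small limits make the spilling behaviour easy to exercise
    dhf = DeviceHostFile(
        device_memory_limit=1e6, memory_limit=1e6, local_dir="dask-worker-space"
    )

    dhf["a"] = cupy.arange(1000)  # has __cuda_array_interface__ -> device buffer
    dhf["b"] = np.arange(1000)    # host object -> host buffer

    print("a" in dhf.device)      # True
    print("b" in dhf.host)        # True

    x = dhf["a"]                  # reads always go through the device buffer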
@@ -0,0 +1,14 @@
from distributed import Nanny

from .worker import CUDAWorker


class CUDANanny(Nanny):
    """ A process to manage CUDAWorker processes

    This is a subclass of Nanny, with the only difference
    being worker_class=CUDAWorker.
    """

    def __init__(self, *args, worker_class=CUDAWorker, **kwargs):
        Nanny.__init__(self, *args, worker_class=worker_class, **kwargs)
I would prefer to avoid this until someone asks for it if possible. I'm somewhat against an API like this because it forces us to enumerate the possible options in code. If we were to do something like this I think that we would probably provide the full namespace of the class and then try to import it.
However, there is enough uncertainty here that, for maintenance reasons, I'd prefer that we not promise anything until someone is asking us for this.

For context, the worker_class keyword exists because different groups have their own custom Worker classes that they use. Exposing a keyword like this but making no move to allow them to plug in seems against the spirit of the keyword.

That's not true. We solely disabled pausing the worker; to control spilling from the device, we still need to monitor the device memory. And this is why I needed to subclass it.
Unfortunately, this is necessary for us to reenable the old CUDA_VISIBLE_DEVICES test. Using CUDAWorker, which was hardcoded before, prevents us from launching a process on a single-GPU machine to mock test as if it had several and check for the ordering of GPUs of each process.

Why do we need to monitor device memory externally from the use of sizeof? If objects reliably tell us how much device memory they take up then we may not need to track device memory separately.
Dask workers operated this way for a long time before we finally needed to give in and have a separate periodic callback that tracked system memory. I'm inclined to try the simpler approach first and see if it breaks down or not.

Sure, they do track how much memory they take. However, tracking the device memory lets us decide when it's time to spill memory. Isn't that what memory_target_fraction and memory_spill_fraction (prepend device_ for the cases in this PR) are for?
The block https://github.com/rapidsai/dask-cuda/pull/35/files#diff-a77f0c6f19d8d34d59aede5e31455719R282 controls the spilling, and this is why we needed to subclass Worker.

Yes, something like that diff. You'll also want to add the data= keyword to LocalCUDACluster.

Thanks for the explanation on the memory monitor.
Yes, and I also want to create the object before. :)
But ok, I can probably have it quickly done by tomorrow. There are a few more things that need to be ported to allow it to work (like finding out how much memory the device has in total), and also some test(s), which shouldn't be too difficult now that there's already one that works with the monitoring mechanism and I can base it on that.

It's actually valid to pass just the class. Dask will construct it. I think that this is explained in the Worker docstring. This is better because you're using a Nanny and don't want to pass it through a process boundary.
I recommend that we start with just using the full memory or a config value by default and not mess with any user inputs (which will get messy).

Ok, I'll check that.
We need to identify how much memory there is available for the device, regardless. I can probably use the same numba code from before.

Right, mostly I want to say let's not add a new device_memory_foo= keyword yet if we can avoid it.
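To tie the thread together, a rough sketch of the wiring being discussed (hypothetical, not the code in this PR): distributed's Worker accepts a MutableMapping class for its data= argument and constructs it inside the worker process, so once LocalCUDACluster forwards that keyword, the DeviceHostFile above can be plugged in without crossing a process boundary.

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from dask_cuda.device_host_file import DeviceHostFile  # module path assumed

    # Hypothetical: relies on LocalCUDACluster forwarding `data=` to its
    # workers, as suggested in the review above. Passing the class (rather
    # than an instance) lets each worker build its own mapping locally.
    cluster = LocalCUDACluster(data=DeviceHostFile)
    client = Client(cluster)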