Add distributed ndarray #7881
Conversation
cupy/_core/_reduction.pyx (Outdated)
@@ -581,6 +581,10 @@ cdef class _SimpleReductionKernel(_AbstractReductionKernel):
    def __call__(self, object a, axis=None, dtype=None, _ndarray_base out=None,
                 bint keepdims=False):

        if hasattr(a, '__cupy_override_reduction_kernel__'):
This is currently only working for SimpleReductionKernel. There are other, more generic reduction kernels; I wonder if we can easily support them?
__cupy_override_reduction_kernel__ has to be called here, before the type checks which rule out DistributedArray. I believe we can support ReductionKernel in the same way, by adding this hook in its __call__ method.
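For readers skimming the thread, a minimal sketch of the hook-based dispatch being discussed; only the attribute name __cupy_override_reduction_kernel__ comes from the diff above, while the class and call signature below are hypothetical.

# Hypothetical sketch: delegate the reduction to the operand before any type
# checks that would reject non-ndarray inputs such as DistributedArray.
class SimpleReductionKernelSketch:
    name = 'max'

    def __call__(self, a, axis=None, dtype=None, out=None, keepdims=False):
        hook = getattr(a, '__cupy_override_reduction_kernel__', None)
        if hook is not None:
            # The operand (e.g. a distributed array) performs the reduction
            # itself, chunk by chunk, and returns the result.
            return hook(self, axis, dtype, out, keepdims)
        # ... the regular single-GPU reduction path would follow here ...
        raise NotImplementedError('single-GPU path omitted in this sketch')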
cupy/_core/core.pyx (Outdated)
@@ -995,6 +995,9 @@ cdef class _ndarray_base:
           :meth:`numpy.ndarray.max`

        """
        if hasattr(self, '__cupy_override_reduction_kernel__'):
I would like to have this check only in the ReductionKernel machinery. Do you think it would be possible to do that? Thanks!
If we remove this check, cupy._core._routines_statistics._ndarray_max tries CUB and cuTENSOR before falling back to cupy.max. So this check should be done before reaching cupy.max.__call__; I wonder where it should be placed.
https://github.com/shino16/cupy/blob/main/cupy/_core/_routines_statistics.pyx#L25-L43
That's a good call. Maybe we should rewrite those checks to make it feasible, e.g. by having an attribute in the ndarray that's overridden by derived classes, like array.support_cub_for_routines.
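A rough illustration of that idea; only the attribute name support_cub_for_routines comes from the comment above, everything else is hypothetical.

# Hypothetical sketch: derived classes override a class attribute so the
# CUB / cuTENSOR fast paths are skipped for them.
class NdarrayLikeSketch:
    support_cub_for_routines = True   # base ndarray: fast paths allowed


class DistributedArrayLikeSketch(NdarrayLikeSketch):
    support_cub_for_routines = False  # distributed arrays opt out


def ndarray_max_sketch(a, axis=None, out=None, keepdims=False):
    if getattr(a, 'support_cub_for_routines', False):
        pass  # try the CUB / cuTENSOR accelerated reductions here first
    # otherwise (or on fallback) use the generic cupy.max path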
Now DistributedArray overrides all the methods of ndarray, including max/min/sum/prod, so those hooks in cupy._core.core.ndarray are removed. Still, your idea sounds reasonable.
cupyx/distributed/_nccl_comm.py (Outdated)
@@ -43,6 +43,18 @@
_nccl_ops = {}


def _get_nccl_dtype_and_count(array, count=None):
nice!
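For context, a hedged sketch of what a helper with this name typically does; the dtype table below is illustrative and not the actual one in _nccl_comm.py.

# Illustrative only: map an array's dtype to an NCCL dtype constant and work
# out the element count to pass to a collective call.
import numpy
from cupy.cuda import nccl

_SKETCH_NCCL_DTYPES = {
    numpy.dtype(numpy.float32): nccl.NCCL_FLOAT32,
    numpy.dtype(numpy.float64): nccl.NCCL_FLOAT64,
    numpy.dtype(numpy.int32): nccl.NCCL_INT32,
    numpy.dtype(numpy.int64): nccl.NCCL_INT64,
}


def get_nccl_dtype_and_count_sketch(array, count=None):
    try:
        nccl_dtype = _SKETCH_NCCL_DTYPES[array.dtype]
    except KeyError:
        raise TypeError(f'dtype {array.dtype} is not supported in this sketch')
    if count is None:
        count = array.size
    return nccl_dtype, count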
cupyx/distributed/array/_array.py (Outdated)
        shape (tuple of ints): Length of axes.
        dtype: Data type. It must be an argument of :class:`numpy.dtype`.
        mode (str or mode object): Mode that determines how overlaps of
            chunks are interpreted.
I would like a description of the available modes in the docstring under a .. note:: section :)
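One possible shape for that note; 'replica' is the mode shown elsewhere in this PR, and the second entry is an assumption about how non-replica modes behave, so the wording would need checking against the implementation.

.. note::
    The ``mode`` argument determines how overlapping regions of chunks are
    interpreted:

    * ``'replica'``: overlapping chunks hold identical copies of the data.
    * ``'sum'`` (assumed): the logical value of an element is the sum of the
      values held by the chunks that cover it, which is useful for holding
      partial results.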
This pull request is now in conflicts. Could you fix it @shino16? 🙏
/test mini
/test mncore
Quick drive-by comments 🙂
examples/distributed/elementwise.py (Outdated)
INDEX_MAP_A = {
    0: (slice(200)),        # arr[:200]
    1: (slice(200, None)),  # arr[200:]
}

INDEX_MAP_B = {
    0: (slice(None), slice(None, None, 2)),  # arr[:, ::2]
    1: (slice(None), slice(1, None, 2)),     # arr[:, 1::2]
}
It'd be nice to comment on why we need these index maps; currently this sample assumes that users have the concept of sharding, which may or may not be the case.
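One way to address this is a short comment block above the maps, for example (wording is just a suggestion):

# An index map describes how the global array is sharded across GPUs: it maps
# a CUDA device id to the part of the full array that the device holds.
# Here device 0 holds rows :200 of A and device 1 holds rows 200:, while B is
# split column-wise into even and odd columns.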
examples/distributed/elementwise.py (Outdated)
with Device(0):
    s0 = Stream()
    s0.use()

with Device(1):
    s1 = Stream()
    s1.use()
Quality of life change: this sample assumes there are 2 GPUs; we need a device count check at the top with an early exit.
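A minimal version of that check could look like the snippet below; cupy.cuda.runtime.getDeviceCount is the usual way to query the device count, and the same pattern (with a count of four) would also cover the matmul example discussed further down.

import sys

import cupy

# Exit early when fewer than two GPUs are visible, since this example shards
# the arrays across devices 0 and 1.
if cupy.cuda.runtime.getDeviceCount() < 2:
    print('This example requires at least two GPUs.')
    sys.exit(0)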
examples/distributed/matmul.py (Outdated)
#       0      M/3      M
#    0  +-------+-------+
#       |  0 1  |   2   |
#  N/2  +-------+-------+
#       |  0 3  |  1 2  |
#    N  +-------+-------+
Q: Does this sample require 4 GPUs?
Yeah, we need to add a runtime check here! great catch :)
cupyx/distributed/array/_array.py
Outdated
corresponding to slices of the original array. Note that one device can | ||
hold multiple chunks. | ||
|
||
This direct constructor is designed for internal calls. Users should create |
Suggested change:
- This direct constructor is designed for internal calls. Users should create
+ This direct constructor is intended as a private API. Users should create
cupyx/distributed/array/_array.py (Outdated)

        obj._streams = {}
        obj._comms = comms if comms is not None else {}
Several of the assignments here could be moved to __init__.
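For illustration, the kind of split being suggested; the attribute names come from the snippet above, the rest of the class is hypothetical.

# Hypothetical sketch: keep __new__ limited to constructing the instance and
# move plain attribute assignments into __init__.
class DistributedArraySketch:
    def __new__(cls, *args, **kwargs):
        # only build the object here; no per-instance bookkeeping
        return super().__new__(cls)

    def __init__(self, comms=None):
        self._streams = {}
        self._comms = comms if comms is not None else {}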
cupyx/distributed/array/_array.py (Outdated)
def distributed_array(
    array: ArrayLike,
    index_map: dict[int, Any],
    mode: str = 'replica',
maybe better to use enum?
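If the enum route is taken, a hedged sketch of what it could look like; only 'replica' is confirmed by the diff above, the other member and the helper are assumptions.

import enum


class Mode(enum.Enum):
    # 'replica' comes from the diff above; 'sum' is an assumed example of a
    # non-replica mode.
    REPLICA = 'replica'
    SUM = 'sum'


def distributed_array_sketch(array, index_map, mode=Mode.REPLICA):
    # Accept plain strings too, so existing call sites keep working.
    if isinstance(mode, str):
        mode = Mode(mode)
    return array, index_map, mode  # placeholder body for this sketch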
cupyx/distributed/array/_array.py
Outdated
|
||
if isinstance(array, (numpy.ndarray, ndarray)): | ||
if mode != 'replica': | ||
array = array.copy() |
TODO: If the array is already on the device, making a copy can cause an OOM. In some cases we may be able to copy only the chunk that lives on that device, if the array is contiguous.
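A very rough sketch of the chunk-wise alternative mentioned in the TODO, shown here for a host-side (NumPy) source array; all names are hypothetical, and the GPU-to-GPU case would need extra care.

import numpy
import cupy


def copy_chunks_sketch(array, index_map):
    # Copy only each device's slice instead of the whole array, so no device
    # has to hold a full extra copy of the source.
    chunks = {}
    for device_id, idx in index_map.items():
        with cupy.cuda.Device(device_id):
            chunks[device_id] = cupy.asarray(numpy.ascontiguousarray(array[idx]))
    return chunks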
cupyx/distributed/array/_chunk.py (Outdated)


class _ArrayPlaceholder:
    # Mocks ndarray
Self note: this is used while we are waiting for chunks to arrive at the current device.
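A toy version of that idea, not the actual _ArrayPlaceholder implementation: record just the shape and device of a chunk whose data has not arrived yet, and allocate a real ndarray only when it is needed.

import cupy


class ArrayPlaceholderSketch:
    # Hypothetical stand-in for a chunk whose data is still in transit.
    def __init__(self, shape, device_id):
        self.shape = shape
        self.device_id = device_id

    def materialize(self, dtype):
        # Allocate the backing buffer on the owning device only when required.
        with cupy.cuda.Device(self.device_id):
            return cupy.zeros(self.shape, dtype=dtype)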
/test full
@shino16 can you repush your changes and reopen the PR? thanks!
Adds array.DistributedArray to cupyx.distributed. It provides initial support of a distributed ndarray, elementwise operations (ufunc, ElementwiseKernel), and matrix multiplication (matmul) in a multi-GPU setting.
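A hedged usage sketch based on the pieces shown in this conversation (distributed_array with an index_map and the 'replica' mode); the exact import path and semantics should be checked against the merged API.

import cupy
from cupyx.distributed.array import distributed_array

# Shard a (400, 400) array row-wise over two GPUs, mirroring the elementwise
# example discussed above.
a = cupy.arange(400 * 400, dtype=cupy.float32).reshape(400, 400)
index_map = {
    0: slice(200),        # device 0 owns rows :200
    1: slice(200, None),  # device 1 owns rows 200:
}
d_a = distributed_array(a, index_map, mode='replica')
d_b = d_a * 2.0  # elementwise operations run chunk by chunk on each device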