addressing the case when output region for repeat operation is too big #386
base: branch-24.03
Conversation
```python
if repeats < 0:
    return ValueError(
        "'repeats' should not be negative: {}".format(repeats)
```
Please use f-strings in new code
And use `raise` to raise an exception, instead of `return`.
```python
# Check if the size of the output array is less than 8GB. In this case
# we can use output regions; otherwise we will use a statically
# allocated array.
if size < 8589934592 / 2:
```
A bunch of comments about this, going from lower- to higher-level:

- This seems to be testing for 4GB, not 8GB?
- This should be a named constant, and ideally written as `8 * (2 ** 30)`, so it's obvious we're talking about 8GB.
- This limit does not consider the available memory. 8GB may be too large or too little depending on the machine; this number should probably be a percentage of the relevant available memory.
- This considers the full size of the array, not the size of each chunk. E.g. 16GB may be totally fine if split across 8 GPUs.
- It seems to me that the only real decision we're making here is whether to perform the operation using an eager output or a deferred output. Therefore, we also want to be querying the (relative) sizes of the eager and deferred pools. Ideally we would also consider the current/projected load on each pool, which is not possible right now, but might be possible in the future, if legate.core takes over more instance-management responsibilities.
- Finally, AFAIK the unification of the eager and deferred pools is on the Legion roadmap. If that happens, then we could safely always use the more efficient eager implementation. @lightsighter, how far in the future do you think this is? If nobody has complained about this issue, we may want to wait until unification lands.
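The first two points could be sketched as follows. Note the memory-fraction heuristic here is hypothetical (real code would query the runtime for the relevant pool sizes, which the review notes is not fully possible today):

```python
# Named constant, written so the magnitude is obvious at a glance.
EAGER_OUTPUT_LIMIT = 8 * (2 ** 30)  # 8 GiB

def use_eager_output(result_size_bytes: int, available_bytes: int,
                     fraction: float = 0.5) -> bool:
    # Hypothetical heuristic: cap the eager output at a fraction of the
    # relevant available memory, never exceeding the fixed limit.
    budget = min(EAGER_OUTPUT_LIMIT, int(available_bytes * fraction))
    return result_size_bytes < budget
```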
```python
for i in range(a._shape[axis]):
    a_index[axis] = slice(i, i + 1)
    res_index[axis] = slice(offset, offset + repeats)
    result[res_index] = array[a_index]
    offset += repeats
```
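For reference, this loop computes the same result as `np.repeat` with a scalar `repeats` along `axis`. A self-contained NumPy sketch of the slice-assignment approach (the original mixes the names `a` and `array`; the sketch assumes both refer to the same input):

```python
import numpy as np

def repeat_by_slices(a: np.ndarray, repeats: int, axis: int) -> np.ndarray:
    # Allocate the output with the repeated extent along `axis`.
    shape = list(a.shape)
    shape[axis] = shape[axis] * repeats
    result = np.empty(shape, dtype=a.dtype)
    a_index = [slice(None)] * a.ndim
    res_index = [slice(None)] * a.ndim
    offset = 0
    # One slice assignment per input index along `axis`; the width-1
    # input slice broadcasts across the width-`repeats` output slice.
    for i in range(a.shape[axis]):
        a_index[axis] = slice(i, i + 1)
        res_index[axis] = slice(offset, offset + repeats)
        result[tuple(res_index)] = a[tuple(a_index)]
        offset += repeats
    return result
```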
I don't love this. We are emitting a separate operation for each slice. It would be more efficient if we could manually partition the `result` array on `axis` using a custom coloring, following a partitioning of `a` on the same axis:

```python
p_a = axis_equal_partition(a, axis, colors)
for c in colors:
    result_coloring[c] = Rect(
        lo=p_a[c].domain.lo * repeats,
        hi=p_a[c].domain.hi * repeats,
    )
p_result = partition_by_coloring(result, result_coloring)
for c in colors:
    repeat(p_result[c], p_a[c])  # index launch
```

However, this would require support for manual-coloring partitioning from the core. @magnatelee is this something that's reasonable?
I don't think anything that does coloring would be any more scalable than the original code, and it is in some sense worse, as it takes away the core's ability to reason about the partitioning. For example, if the core saw multiple tasks operating on disjoint parts of the same array, it could potentially partition and map them in a way that distributes those tasks in a balanced manner.
@ipdemes What is the status of this PR?
The 'slices' implementation is taken from CuPy.