[PoC]: Implement cuda::experimental::uninitialized_async_buffer (#1854)
Conversation
🟨 CI finished in 12h 37m: Pass: 99%/361 | Total: 3d 04h | Avg: 12m 40s | Max: 59m 34s | Hits: 62%/520475
Modified projects: libcu++, CUDA Experimental (with dependencies: libcu++, CUB, Thrust, CUDA Experimental)
Runners (361 jobs): 264 linux-amd64-cpu16, 52 linux-amd64-gpu-v100-latest-1, 24 linux-arm64-cpu16, 21 windows-amd64-cpu16
🟨 CI finished in 8h 46m: Pass: 99%/361 | Total: 2d 15h | Avg: 10m 34s | Max: 1h 03m | Hits: 77%/520475
Modified projects: libcu++, CUDA Experimental (with dependencies: libcu++, CUB, Thrust, CUDA Experimental)
Runners (361 jobs): 264 linux-amd64-cpu16, 52 linux-amd64-gpu-v100-latest-1, 24 linux-arm64-cpu16, 21 windows-amd64-cpu16
Force-pushed: b587381 → 717c085
Changed the title: cuda::uninitialized_async_buffer → cuda::experimental::uninitialized_async_buffer
Force-pushed: 717c085 → 38a2151
Force-pushed: 2f2f049 → 41ee97a
🟩 CI finished in 7h 20m: Pass: 100%/55 | Total: 2h 23m | Avg: 2m 37s | Max: 8m 06s | Hits: 95%/1748
Modified projects: CUDA Experimental only
Runners (55 jobs): 41 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Force-pushed: 38a2151 → 8981c0d
Force-pushed: 41ee97a → 4cf40b9
🟩 CI finished in 1h 15m: Pass: 100%/55 | Total: 2h 25m | Avg: 2m 38s | Max: 8m 05s | Hits: 91%/1748
Modified projects: CUDA Experimental only
Runners (55 jobs): 41 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Force-pushed: 4cf40b9 → e9bfaca
Force-pushed: 8981c0d → 061ae52
Force-pushed: 061ae52 → 1a23a28
🟨 CI finished in 7h 36m: Pass: 99%/417 | Total: 3d 15h | Avg: 12m 38s | Max: 1h 09m | Hits: 44%/34308
Modified projects: libcu++, CUDA Experimental (with dependencies: libcu++, CUB, Thrust, CUDA Experimental, pycuda)
Runners (417 jobs): 305 linux-amd64-cpu16, 61 linux-amd64-gpu-v100-latest-1, 28 linux-arm64-cpu16, 23 windows-amd64-cpu16
Force-pushed: c301037 → 853f325
🟩 CI finished in 1h 05m: Pass: 100%/55 | Total: 2h 57m | Avg: 3m 13s | Max: 12m 10s | Hits: 69%/126
Modified projects: CUDA Experimental (with dependencies: CUDA Experimental, pycuda)
Runners (55 jobs): 40 linux-amd64-cpu16, 9 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Force-pushed: 853f325 → d4d8247
🟩 CI finished in 2h 49m: Pass: 100%/55 | Total: 3h 00m | Avg: 3m 16s | Max: 11m 49s | Hits: 71%/126
Modified projects: CUDA Experimental (with dependencies: CUDA Experimental, pycuda)
Runners (55 jobs): 40 linux-amd64-cpu16, 9 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Force-pushed: d4d8247 → f5b852d
🟩 CI finished in 1h 33m: Pass: 100%/54 | Total: 2h 39m | Avg: 2m 56s | Max: 9m 08s | Hits: 80%/206
Modified projects: CUDA Experimental only
Runners (54 jobs): 40 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Force-pushed: f5b852d → b6d93e8
🟩 CI finished in 6h 43m: Pass: 100%/58 | Total: 2h 56m | Avg: 3m 02s | Max: 10m 05s | Hits: 82%/206
Modified projects: CUDA Experimental only
Runners (58 jobs): 44 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Looks good, a few more doc fixes.
Resolved (outdated) review threads:
cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh (5 threads)
cudax/include/cuda/experimental/__memory_resource/async_memory_resource.cuh (1 thread)
```cpp
//! @brief Causes the buffer to be treated as a span when passed to cudax::launch.
//! @pre The buffer must have the cuda::mr::device_accessible property.
_CCCL_NODISCARD_FRIEND _CUDA_VSTD::span<_Tp>
__cudax_launch_transform(::cuda::stream_ref, uninitialized_async_buffer& __self) noexcept
```
I am unsure: in case the streams are different, do we want to synchronize here, or in a central place?
Seems like this could lead to unnecessary extra synchronization. You don't know whether the stream the buffer was last allocated/written on still needs to be synchronized; it may have been already. E.g.:

```cpp
auto buf = buffer(size, stream_a);
launch(kernel, stream_a, buf); // initialize buffer with computation in kernel (no sync)
stream_a.wait();               // sync stream_a

// Launch 4 instances of kernel to operate on 4 different buffers on 4 streams.
// All kernels read `buf` as an input.
// The suggested sync in `__cudax_launch_transform()` would synchronize all 4
// streams before launching, yet no streams need to be synced in this loop.
for (int i = 0; i < 4; i++) {
  launch(kernel, streams[i], buffers[i], buf);
}
```
Yeah, but that is the same discussion as the one about lifetimes. We don't know whether a resource might go out of scope, so we need to do the safe thing.
Do you have a good example where the user of the buffer is unable to synchronize themselves? If one chooses to use an async buffer, they should be aware that they may need to do some synchronization. If we assume the user doesn't know what they are doing, then we don't give them the ability to hit SOL (speed-of-light performance).
I think the protocol for __cudax_launch_transform is to return a wrapper here that can convert to a span, then synchronize here and synchronize back in the destructor of that wrapper.
We also definitely need an opt-out of the synchronization, but I'm not sure what it would look like. Something like cudax::skip_sync(buffer); we should try to come up with something generic for other similar cases.
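A minimal sketch of what such a wrapper could look like; the type name, member layout, and exact sync points below are invented for illustration and are not part of the actual cudax implementation:

```cpp
#include <cuda/std/span>
#include <cuda/stream_ref>
#include <cuda_runtime_api.h>

// Hypothetical guard returned by __cudax_launch_transform: syncs the
// buffer's stream before the launch and the launch stream afterwards.
template <class T>
struct launch_span_guard
{
  cudaStream_t launch_stream; // stream the kernel is launched on
  cudaStream_t buffer_stream; // stream the buffer's allocation is ordered on
  cuda::std::span<T> view;

  launch_span_guard(cuda::stream_ref launch, cuda::stream_ref buffer, cuda::std::span<T> span)
      : launch_stream(launch.get()), buffer_stream(buffer.get()), view(span)
  {
    // "Synchronize here": ensure pending work on the buffer's stream has
    // finished before a kernel on the launch stream may touch the span.
    if (launch_stream != buffer_stream)
    {
      cudaStreamSynchronize(buffer_stream);
    }
  }

  // "Synchronize back": keep later work enqueued on the buffer's stream
  // from racing with the launch that consumed the span.
  ~launch_span_guard()
  {
    if (launch_stream != buffer_stream)
    {
      cudaStreamSynchronize(launch_stream);
    }
  }

  operator cuda::std::span<T>() const noexcept { return view; }
};
```

An opt-out such as the suggested cudax::skip_sync(buffer) could then simply bypass constructing this guard and hand out the span directly.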
🟨 CI finished in 2h 59m: Pass: 96%/58 | Total: 2h 43m | Avg: 2m 49s | Max: 6m 56s
Modified projects: CUDA Experimental only
Runners (58 jobs): 44 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
🟩 CI finished in 52m 05s: Pass: 100%/58 | Total: 2h 52m | Avg: 2m 57s | Max: 8m 19s | Hits: 80%/208
Modified projects: CUDA Experimental only
Runners (58 jobs): 44 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
Resolved review threads:
cudax/include/cuda/experimental/__container/uninitialized_async_buffer.cuh (2 threads)
This uninitialized buffer provides a stream-ordered allocation of N elements of type T utilizing a cuda::mr::async_resource to allocate the storage. The buffer takes care of alignment and deallocation of the storage. The user is required to ensure that the lifetime of the memory resource exceeds the lifetime of the buffer.
Co-authored-by: Mark Harris <[email protected]>
Force-pushed: a92af5d → 8779ce6
🟩 CI finished in 4h 29m: Pass: 100%/58 | Total: 2h 38m | Avg: 2m 43s | Max: 13m 04s | Hits: 84%/208
Modified projects: CUDA Experimental only
Runners (58 jobs): 44 linux-amd64-cpu16, 8 linux-amd64-gpu-v100-latest-1, 4 linux-arm64-cpu16, 2 windows-amd64-cpu16
The uninitialized_async_buffer provides a stream-ordered allocation of N elements of type T utilizing a cuda::mr::async_resource to allocate the storage. The buffer takes care of alignment and deallocation of the storage. The user is required to ensure that the lifetime of the memory resource exceeds the lifetime of the buffer.
Note this is based on #1637
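For reference, a minimal usage sketch of the buffer as described above. The include paths are the files touched by this PR; the resource class name and the constructor argument order (resource, stream, count) are assumptions for illustration and may differ from the final cudax API:

```cpp
#include <cuda/experimental/__container/uninitialized_async_buffer.cuh>
#include <cuda/experimental/__memory_resource/async_memory_resource.cuh>
#include <cuda/stream_ref>

namespace cudax = cuda::experimental;

void example(cuda::stream_ref stream)
{
  // Assumed resource type from async_memory_resource.cuh: a stream-ordered
  // allocator whose (de)allocations are enqueued on the given stream.
  cudax::mr::async_memory_resource resource{};

  // Stream-ordered, uninitialized storage for 1024 floats, tagged as
  // device accessible. The buffer handles alignment and deallocation;
  // `resource` must outlive `buffer`.
  cudax::uninitialized_async_buffer<float, cuda::mr::device_accessible> buffer{
    resource, stream, 1024};

  // The elements are uninitialized: a kernel launched on `stream` (e.g. via
  // cudax::launch, where the buffer is passed as a span) must write them
  // before anything reads them.
}
```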