[RFE] Allow init_value to be a device pointer in cub::DeviceScan::ExclusiveScan
#294
Comments
I assume this is to support a usecase where |
@allisonvacanti Yes, correct. |
Gotcha. I'm not sure when something like this would happen; it would need to go into a new overload and require some invasive changes to the implementation, but I'll keep it in mind as we start updating algorithm implementations in the coming months. |
The use case in PyTorch is to manually split tensors larger than 2^31 elements into small tensors. In this use case, we will run the exclusive scan on the first tensor, and use the last element in the first tensor as the |
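For readers following along, here is a minimal host-side sketch of the chunked scan described above. It is not PyTorch's or CUB's code: `chunked_exclusive_sum` and its `chunk_size` parameter are hypothetical names, and `std::exclusive_scan` stands in for `cub::DeviceScan::ExclusiveScan`. The point is only that each chunk's initial value depends on the result of the previous chunk, which on a GPU lives in device memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: exclusive prefix sum computed one chunk at a time.
std::vector<long long> chunked_exclusive_sum(const std::vector<long long>& in,
                                             std::size_t chunk_size) {
  assert(chunk_size > 0);
  std::vector<long long> out(in.size());
  long long carry = 0;  // initial value for the current chunk
  for (std::size_t begin = 0; begin < in.size(); begin += chunk_size) {
    const std::size_t end = std::min(begin + chunk_size, in.size());
    std::exclusive_scan(in.begin() + begin, in.begin() + end,
                        out.begin() + begin, carry);
    // The last output of an exclusive scan omits the last input,
    // so add it back to get the initial value for the next chunk.
    carry = out[end - 1] + in[end - 1];
  }
  return out;
}
```

On the host the carry is just a local variable. In the device case it is produced by the previous scan, so passing it by value would force a device-to-host copy and a synchronization between chunks.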
Turns out we can use a custom type for
As far as I could tell inside the kernel |
Thanks for pointing this out @ngimel! I think that's a much cleaner solution that works well within the design of the library. Closing this issue. The suggestion above is preferable to adding a new overload for this usecase. |
@allisonvacanti this solution relies on cub internal implementation, i.e., that the first and last thing it does with |
Now that you mention it... this usecase may actually have been broken in CUB 1.11.0 / CTK 11.3 by #201. That PR changes the behavior of the scan to match the suggestion in https://wg21.link/P0571, which uses I'll reopen this; we'll need to think about this some more. |
Yeah, I was using 11.0 toolkit and cub version that comes with it. |
Oops, yeah 😅 Looks like that change may have broken your usecase... Pinging @brycelelbach for visibility. |
I agree that accumulating in InitValueT is the standard-compliant thing to do, but that brings us back to the original question: what can we do to use a device ptr to get the initial value? |
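To illustrate what "accumulating in InitValueT" means in practice, here is a small host-side sketch using `std::exclusive_scan` rather than CUB. The helper name `exclusive_sum_with_init` is hypothetical, and the behavior shown is what I would expect from the common standard library implementations, where the running sum is held in the type of the initial value.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: scan doubles, with the accumulator type chosen by init.
template <typename InitT>
std::vector<InitT> exclusive_sum_with_init(const std::vector<double>& in,
                                           InitT init) {
  std::vector<InitT> out(in.size());
  // The running sum is stored in an InitT, so an int init truncates
  // every partial sum back to an integer.
  std::exclusive_scan(in.begin(), in.end(), out.begin(), init);
  return out;
}
```

With input `{0.5, 0.5, 0.5}`, an `int` initial value of `0` should give all zeros, while a `double` initial value of `0.0` gives `{0.0, 0.5, 1.0}`. By the same rule, a custom wrapper type passed as the initial value becomes the accumulator type, which is why a trick that depends on how the wrapper is used internally is fragile.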
This doesn't quite solve the immediate problem, but for similar problems I've used a combination of a
Problem is that CUB scan expects the initial value to be passed by value and not as an iterator/pointer-like thing. |
Can we add a special class to cub

    template<typename InitValueT>
    struct InitValueFromDevicePointer {
      InitValueT *ptr;
      InitValueFromDevicePointer(InitValueT *ptr): ptr(ptr) {}
      __device__ operator InitValueT() {
        return *ptr;
      }
    };

And internally when

    template<typename InitValueT>
    struct IsDevicePointer {
      static constexpr bool VALUE = false;
      using TYPE = InitValueT;
    };
    template<typename InitValueT>
    struct IsDevicePointer<InitValueFromDevicePointer<InitValueT>> {
      static constexpr bool VALUE = true;
      using TYPE = InitValueT;
    };

and

    using RealInitValueT = typename IsDevicePointer<InitValueT>::TYPE;
    // The output value type -- used as the intermediate accumulator
    // Per https://wg21.link/P0571, use InitValueT if provided, otherwise the
    // input iterator's value type.
    using OutputT =
      typename If<Equals<InitValueT, NullType>::VALUE, InputT, RealInitValueT>::Type;

See the prototype PR: #305 |
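A host-only sketch of how the proposed wrapper and trait could fit together. This is not the code from #305: the `__device__` qualifier is dropped so it compiles as plain C++, and `exclusive_sum` is a hypothetical stand-in for the scan kernel. It assumes the accumulator uses the unwrapped type and that the pointer is only dereferenced once the scan actually runs.

```cpp
#include <vector>

// Host-only analog of the proposed wrapper (no __device__ qualifier).
template <typename InitValueT>
struct InitValueFromDevicePointer {
  const InitValueT* ptr;
  explicit InitValueFromDevicePointer(const InitValueT* p) : ptr(p) {}
  operator InitValueT() const { return *ptr; }  // deferred read
};

// Trait from the proposal: recovers the real value type behind the wrapper.
template <typename InitValueT>
struct IsDevicePointer {
  static constexpr bool VALUE = false;
  using TYPE = InitValueT;
};
template <typename InitValueT>
struct IsDevicePointer<InitValueFromDevicePointer<InitValueT>> {
  static constexpr bool VALUE = true;
  using TYPE = InitValueT;
};

// Hypothetical stand-in for the scan: accumulates in the unwrapped type.
template <typename InputT, typename InitValueT>
std::vector<typename IsDevicePointer<InitValueT>::TYPE>
exclusive_sum(const std::vector<InputT>& in, InitValueT init) {
  using AccumT = typename IsDevicePointer<InitValueT>::TYPE;
  std::vector<AccumT> out;
  out.reserve(in.size());
  AccumT acc = init;  // for the wrapper, the pointer is dereferenced here
  for (const InputT& x : in) {
    out.push_back(acc);
    acc = acc + x;
  }
  return out;
}
```

Because the conversion happens inside `exclusive_sum`, a value written through the pointer after the wrapper is constructed but before the scan runs is still picked up. That is the property the chunked use case needs, and the accumulator stays a plain `InitValueT` rather than the wrapper type.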
@zasdfgbnm I like this idea! I think we should generalize the implementation a bit, since this would probably be useful in other contexts. I'll start a discussion on the PR. |
This was fixed in #305. |
per title