Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

[RFE] Allow init_value to be a device pointer in cub::DeviceScan::ExclusiveScan #294

Closed
zasdfgbnm opened this issue May 3, 2021 · 15 comments
Labels
helps: pytorch Helps or needed by PyTorch. type: enhancement New feature or request.

Comments

@zasdfgbnm
Copy link
Contributor

zasdfgbnm commented May 3, 2021

per title

@alliepiper alliepiper added type: enhancement New feature or request. helps: pytorch Helps or needed by PyTorch. labels May 4, 2021
@alliepiper
Copy link
Collaborator

I assume this is to support a usecase where init_value isn't yet computed when launching the scan kernel? That would definitely be nice to have.

@zasdfgbnm
Copy link
Contributor Author

@allisonvacanti Yes, correct.

@alliepiper
Copy link
Collaborator

Gotcha. I'm not sure when something like this would happen, it would need to go into a new overload and require some invasive changes to the implementation, but I'll keep it in mind when as we start updating algorithm implementations in the coming months.

@zasdfgbnm
Copy link
Contributor Author

The use case in PyTorch is to manually split tensors larger than 2^31 elements into small tensors. In this use case, we will run the exclusive scan on the first tensor, and use the last element in the first tensor as the init_value of the second tensor.

@ngimel
Copy link

ngimel commented May 7, 2021

Turns out we can use custom type for InitValueT to do what we want

template <typename T>
struct init_val {
    __host__ __device__ init_val(T* init_ptr){
        init_p = init_ptr;
    }
    __host__ __device__ operator T() const {return *init_p;}
    T * init_p;
};

As far as I could tell inside the kernel InitValueT is implicitly converted to OutputValueT so the above struct makes it possible.

@alliepiper
Copy link
Collaborator

Thanks for pointing this out @ngimel! I think that's a much cleaner solution that works well within the design of the library.

Closing this issue. The suggestion above is preferable to adding a new overload for this usecase.

@ngimel
Copy link

ngimel commented May 10, 2021

@allisonvacanti this solution relies on cub internal implementation, i.e., that the first and last thing it does with init_value is implicitly or explicitly convert it to OutputValueT. It seems to work now, but cub could choose to do something internally that would break the above code, so it would be good to explicitly list the requirements that you have for InitValueT.
And yes, thanks for making InitValueT and OutputValueT different template types, that was a good design decision that makes cub much more flexible!

@alliepiper
Copy link
Collaborator

Now that you mention it...this usecase may have actually been broken in a CUB 1.11.0 / CTK 11.3 by #201. That PR changes the behavior of the scan to match the suggestion by https://wg21.link/P0571, which uses InitTypeT to accumulate the intermediate scan results instead of OutputValueT. Are you using an earlier version of CUB?

I'll reopen this, we'll need to think about this some more.

@alliepiper alliepiper reopened this May 11, 2021
@ngimel
Copy link

ngimel commented May 11, 2021

Yeah, I was using 11.0 toolkit and cub version that comes with it.

@alliepiper
Copy link
Collaborator

Oops, yeah 😅 Looks like that change may have broken your usecase...

Pinging @brycelelbach for visibility.

@ngimel
Copy link

ngimel commented May 11, 2021

I agree that accumulating in InitValueT is the standard-compliant thing to do, but that's back to original question for us - what can we do to use device ptr to get initial value?

@jrhemstad
Copy link
Collaborator

I agree that accumulating in InitValueT is the standard-compliant thing to do, but that's back to original question for us - what can we do to use device ptr to get initial value?

This doesn't quite solve the immediate problem, but for similar problems I've used a combination of a thrust::permutation_iterator and thrust::constant_iterator like:

int * p = // device pointer to some to-be-computed value
auto iter = thrust::make_permutation_iterator(p, thrust::make_constant_iterator(0));

// Dereferencing `iter` will return p[0], aka `*p`

Problem is that CUB scan expects the initial value to be passed by value and not as a iterator/pointer-like thing.

@zasdfgbnm
Copy link
Contributor Author

zasdfgbnm commented May 12, 2021

Can we add a special class to cub cub::InitValueFromDevicePointer, something like:

template<typename InitValueT>
struct InitValueFromDevicePointer {
    InitValueT *ptr;
    InitValueFromDevicePointer(InitValueT *ptr): ptr(ptr) {}
    __device__ operator InitValueT() {
        return *ptr;
    }
};

And internally when InitValueT is an InitValueFromDevicePointer, we extract the real type for computation? Something like:

template<typename InitValueT>
struct IsDevicePointer {
    static constexpr bool VALUE = false;
    using TYPE = InitValueT;
};

template<typename InitValueT>
struct IsDevicePointer<InitValueFromDevicePointer<InitValueT>> {
    static constexpr bool VALUE = true;
    using TYPE = InitValueT;
};

and

    using RealInitValueT = typename IsDevicePointer<InitValueT>::TYPE;

    // The output value type -- used as the intermediate accumulator
    // Per https://wg21.link/P0571, use InitValueT if provided, otherwise the
    // input iterator's value type.
    using OutputT =
      typename If<Equals<InitValueT, NullType>::VALUE, InputT, RealInitValueT>::Type;

See the prototype PR: #305

@alliepiper
Copy link
Collaborator

@zasdfgbnm I like this idea! I think we should generalize the implementation a bit, since this would probably be useful in other contexts.

I'll start a discussion on the PR.

@alliepiper
Copy link
Collaborator

This was fixed in #305.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
helps: pytorch Helps or needed by PyTorch. type: enhancement New feature or request.
Projects
None yet
Development

No branches or pull requests

4 participants