Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added warp::shfl functionality. #1273

Merged
merged 5 commits into from
Mar 29, 2021
Merged

Conversation

frobnitzem
Copy link
Contributor

These changes implement calls to:

I only added __shfl for now, but there are also __shfl_up, etc. which could be cut/paste in a similar way. Apparently these were also needed by #18 .

include/alpaka/warp/WarpSingleThread.hpp Show resolved Hide resolved
include/alpaka/warp/WarpSingleThread.hpp Outdated Show resolved Hide resolved
include/alpaka/warp/WarpUniformCudaHipBuiltIn.hpp Outdated Show resolved Hide resolved
test/unit/warp/src/Shfl.cpp Outdated Show resolved Hide resolved
Copy link
Member

@sbastrakov sbastrakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

Looking at the HIP and CUDA documentation, they also have the last parameter width which is by default = warpSize. What is the reason for omitting it from your API? If you need access to this value, there is getSize() in Traits.hpp.

One thing that we discussed quite a bit when bringing in the support for warp functions #1003 (and also intrinsics #1004, #1018 ) was to go for fix-width types vs. just int. Here it's always a bit awkward as alpaka tries using fixed-width everywhere, but the backend APIs like CUDA operate with ints. @psychocoderHPC what do you think?

@frobnitzem
Copy link
Contributor Author

Thanks for the PR!

Looking at the HIP and CUDA documentation, they also have the last parameter width which is by default = warpSize. What is the reason for omitting it from your API? If you need access to this value, there is getSize() in Traits.hpp.

One thing that we discussed quite a bit when bringing in the support for warp functions #1003 (and also intrinsics #1004, #1018 ) was to go for fix-width types vs. just int. Here it's always a bit awkward as alpaka tries using fixed-width everywhere, but the backend APIs like CUDA operate with ints. @psychocoderHPC what do you think?

These can be set to int32_t, since the CUDA documentation specifies "Template type T" with 4 or 8-byte values, while HIP says int or float. Templates seem to give awful errors from inputs other than int or float, however, since the casting is not unique.

With regard to warpsize, I had not thought about using that feature, but it is possible to use groups of consecutive threads smaller than the warpsize as "mini-warps", so I'll add it.

@psychocoderHPC
Copy link
Member

One thing that we discussed quite a bit when bringing in the support for warp functions #1003 (and also intrinsics #1004, #1018 ) was to go for fix-width types vs. just int. Here it's always a bit awkward as alpaka tries using fixed-width everywhere, but the backend APIs like CUDA operate with ints. @psychocoderHPC what do you think?

Yes please expose always fixed with types to the user. If needed please cast the input values to whatever the native API function call requires. Fixed with types are required to guarentee portability between different compiler and platforms.

@psychocoderHPC
Copy link
Member

With regard to warpsize, I had not thought about using that feature, but it is possible to use groups of consecutive threads smaller than the warpsize as "mini-warps", so I'll add it.

Yes CUDA allows shuffling between less than warp size threads. The problem in apaka would be that our CPU backends has a warpsize of one. One case where it is useful to shuffle between 4 lanes is if you run some kind of emulated sse4 instructions on the GPU e.g. for cryptographic algorithms. This would not be portable to alpaka CPU backends, therefore I would maybe restrict alpaka to support for shuffling in a full warp only.

//-----------------------------------------------------------------------------
//! Broadcasts data from one thread to all members of the warp.
//! Similar to MPI_Bcast, but using srcLane instead of root.
//!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we need to add to the documentation that this function shfl is collective what means all threads need to call the function and also from the same code branch.
The reason is that for CUDA the implementation is using activemask and for HIP all threads in a warp needs to call the function. Using activemask means if threads from the if and else branch call the function they will not see each other.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated these docs to include this warning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I forgot to add a similar warning to the previously existing warp collectives. You comment also alllies to those, right @psychocoderHPC ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sbastrakov Yes this should be added to other warp functions too. Currently, only CUDA allows calling warp functions from different branches. It is fine if all threads of the warp are in the same branch but as soon as the threads diverge the behavior is undefined (for HIP and CUDA devices before sm_70) .

Comment on lines 35 to 37
float ans = alpaka::warp::shfl(acc, 3.3f, 0);
float expect = 3.3;
ALPAKA_CHECK(*success, CAST(ans) == CAST(expect));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that may work on most compilers, it is certainly undefined behavior because we are reading a float value as an int32. At that point a memcmp is probably the more correct solution.

However, I think I prefer disabling the warning here. I think we know what we are doing here and we expect the exact same float to be shuffled around, so the comparison should not need an epsilon.

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
ALPAKA_CHECK(*success, ans == expect);
#pragma GCC diagnostic pop

I don't know if we need a similar escape hatch for the other compilers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compiler specific pragmas should be guarded: see

#if BOOST_COMP_GNUC
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wconversion"
#endif

@frobnitzem frobnitzem force-pushed the shfl branch 2 times, most recently from 722fd44 to 3bd10d3 Compare March 18, 2021 22:50
test/unit/warp/src/Shfl.cpp Show resolved Hide resolved
@frobnitzem
Copy link
Contributor Author

It seems one of the tests spuriously failed on a network error.

@psychocoderHPC
Copy link
Member

It seems one of the tests spuriously failed on a network error.

We mostly need to ignore these tests. Github is not providing us with a feature to restart only a single job :-(

@sbastrakov
Copy link
Member

So should we merge this one then?

@psychocoderHPC
Copy link
Member

@frobnitzem Could you please address #1273 (comment) IMO it should be a separate function else it looks like a line with many magic numbers.

Copy link
Member

@psychocoderHPC psychocoderHPC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to block this PR until #1273 (comment) is implemented.

Remove magic number which was epsilon by `std::numeric_limits` eplsilon.
@bernhardmgruber bernhardmgruber merged commit 77220d9 into alpaka-group:develop Mar 29, 2021
@sbastrakov sbastrakov mentioned this pull request Apr 20, 2021
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants