Added warp::shfl functionality. #1273
Conversation
Thanks for the PR!
Looking at the HIP and CUDA documentation, they also have a last parameter `width`, which defaults to `warpSize`. What is the reason for omitting it from your API? If you need access to this value, there is `getSize()` in Traits.hpp.
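For context, a minimal CUDA sketch of what `width` does on the backend side (illustrative usage only, not the PR's code):

// CUDA's native shuffle takes width as a final parameter (default warpSize).
// It partitions the warp into independent segments of `width` lanes; each
// segment reads srcLane relative to its own start.
__device__ int broadcastWithinQuad(int value)
{
    // Every group of 4 consecutive lanes receives the value held by
    // lane 0 of its own group.
    return __shfl_sync(0xffffffffu, value, 0 /* srcLane */, 4 /* width */);
}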
One thing that we discussed quite a bit when bringing in the support for warp functions #1003 (and also intrinsics #1004, #1018) was whether to go for fixed-width types vs. just `int`. Here it's always a bit awkward, as alpaka tries to use fixed-width types everywhere, but the backend APIs like CUDA operate with `int`s. @psychocoderHPC what do you think?
These can be set to `int32_t`, since the CUDA documentation specifies "Template type T" with 4- or 8-byte values, while HIP says `int` or `float`. Templates seem to give awful errors for inputs other than `int` or `float`, however, since the casting is not unique. With regard to warp size, I had not thought about using that feature, but it is possible to use groups of consecutive threads smaller than the warp size as "mini-warps", so I'll add it.
Yes, please always expose fixed-width types to the user. If needed, please cast the input values to whatever the native API function call requires. Fixed-width types are required to guarantee portability between different compilers and platforms.
Yes, CUDA allows shuffling between fewer than warp-size threads. The problem in alpaka would be that our CPU backends have a warp size of one. One case where it is useful to shuffle between 4 lanes is if you run some kind of emulated SSE4 instructions on the GPU, e.g. for cryptographic algorithms. This would not be portable to the alpaka CPU backends, therefore I would maybe restrict alpaka to supporting shuffling in a full warp only.
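A minimal sketch of the casting approach requested above, assuming a hypothetical CUDA-backend helper (the name `shflImpl` is illustrative, not alpaka's actual internals):

#include <cstdint>

// The public API takes std::int32_t for portability; the native intrinsic
// takes int, so we cast at the call boundary.
__device__ auto shflImpl(std::int32_t value, std::int32_t srcLane) -> std::int32_t
{
    // 0xffffffff: all lanes participate; default width = full warp,
    // matching the full-warp-only restriction suggested above.
    return static_cast<std::int32_t>(
        __shfl_sync(0xffffffffu, static_cast<int>(value), static_cast<int>(srcLane)));
}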
//-----------------------------------------------------------------------------
//! Broadcasts data from one thread to all members of the warp.
//! Similar to MPI_Bcast, but using srcLane instead of root.
//!
IMO we need to add to the documentation that this function `shfl` is collective, which means all threads need to call the function, and from the same code branch. The reason is that for CUDA the implementation uses `activemask`, and for HIP all threads in a warp need to call the function. Using `activemask` means that if threads from the if and else branches call the function, they will not see each other.
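A sketch of the pitfall being described (`laneId` and `laneValue` are illustrative names):

// BAD: lanes call shfl from divergent branches. With an activemask-based
// implementation, each call only sees the lanes that took the same branch,
// so the two halves never exchange data.
if(laneId % 2 == 0)
    laneValue = alpaka::warp::shfl(acc, laneValue, 0); // even lanes only
else
    laneValue = alpaka::warp::shfl(acc, laneValue, 0); // odd lanes only

// GOOD: every lane of the warp reaches the same call site.
laneValue = alpaka::warp::shfl(acc, laneValue, 0);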
I updated these docs to include this warning.
I think I forgot to add a similar warning to the previously existing warp collectives. Your comment also applies to those, right @psychocoderHPC?
@sbastrakov Yes, this should be added to the other warp functions too. Currently, only CUDA allows calling warp functions from different branches. It is fine if all threads of the warp are in the same branch, but as soon as the threads diverge, the behavior is undefined (for HIP, and for CUDA devices before sm_70).
test/unit/warp/src/Shfl.cpp
float ans = alpaka::warp::shfl(acc, 3.3f, 0);
float expect = 3.3;
ALPAKA_CHECK(*success, CAST(ans) == CAST(expect));
While that may work on most compilers, it is certainly undefined behavior, because we are reading a float value as an int32. At that point a `memcmp` is probably the more correct solution.
However, I think I prefer disabling the warning here. We know what we are doing here and we expect the exact same float to be shuffled around, so the comparison should not need an epsilon.
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfloat-equal"
ALPAKA_CHECK(*success, ans == expect);
#pragma GCC diagnostic pop
I don't know if we need a similar escape hatch for the other compilers.
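For reference, the `memcmp` alternative mentioned above could look like this (a sketch; whether `memcmp` is available in device code depends on the backend):

// Comparing the object representations bit for bit avoids both
// -Wfloat-equal and the type-punning UB of CAST.
bool const sameBits = memcmp(&ans, &expect, sizeof(float)) == 0;
ALPAKA_CHECK(*success, sameBits);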
Compiler-specific pragmas should be guarded; see alpaka/include/alpaka/atomic/Op.hpp, lines 31 to 34 (at a78a70a):
#if BOOST_COMP_GNUC
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wconversion"
#endif
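Applying that guard to the snippet above would give (a sketch combining the two fragments in this thread):

#if BOOST_COMP_GNUC
# pragma GCC diagnostic push
# pragma GCC diagnostic ignored "-Wfloat-equal"
#endif
ALPAKA_CHECK(*success, ans == expect);
#if BOOST_COMP_GNUC
# pragma GCC diagnostic pop
#endif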
(Branch force-pushed from 722fd44 to 3bd10d3.)
It seems one of the tests spuriously failed on a network error.
We mostly need to ignore these tests. GitHub is not providing us with a feature to restart only a single job :-(
So should we merge this one then?
@frobnitzem Could you please address #1273 (comment)? IMO it should be a separate function, else it looks like a line with many magic numbers.
I would like to block this PR until #1273 (comment) is implemented.
Remove the magic number which was epsilon, replacing it with `std::numeric_limits` epsilon.
These changes implement calls to `__shfl` only for now, but there are also `__shfl_up`, etc., which could be added in a similar way. Apparently these were also needed by #18.
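For example, kernel-side usage might look like this (a sketch based on the unit test above; the exact overload set is whatever the PR defines):

// Broadcast lane 0's value to all lanes of the warp, for both
// supported element types.
std::int32_t const i = alpaka::warp::shfl(acc, std::int32_t{42}, 0);
float const f = alpaka::warp::shfl(acc, 3.3f, 0);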