-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Hip-Cpu support initial #233
Conversation
On the margin of not being the biggest fan of When some users The same spirit was followed when trying to auto-detect in HIP-CPU whether someone is compiling with GCC or Clang and using libstdc++ instead of libc++, because the implicit dependence on TBB is STL specific, not compiler specific. The symbol check we introduced in HIP-CPU for GCC 9 is already broken using GCC 10, hence here we moved to just checking for libstdc++ regardless of versions or internals and link to TBB and keep our fingers crossed. Keeping these up to date is tedious, breaks often and (IMHO) doesn't provide much value to users/customers. Managing dependencies should be done by projects dedicated to doing just that. (Why GCC doesn't default to a command-line that compiles the STL itself is beyond me, but it's what we have to live with.) I'll maintain |
First, many thanks for doing this, I think it's extremely nice (then again, I'm biased so...). In what regards the above, that's not actually guaranteed, it just so happens that our current HW does that... maybe... sometimes. In practice, and per the CUDA spec that HIP tries to stay in tune with, the only guarantee you get for shared is that it's uninitialised, and thus it's UB to rely on it having any particular value that you did not explicitly write in there yourself. Thus, rocPRIM's reliance on this behaviour should be corrected, as it's a latent bug waiting to manifest strangely. |
My statement was slightly strong, the compiler warns about potential use of uninitialized storage. Having taken a closer look, I couldn't 100% outrule that some elements of the array remain uncopied to. It may be a false positive, given how we declare the shared array, and invoke a copy on the next line which takes a non-const pointer to said storage as an argument, and this is what the compiler warns about. For the time being I added a memset operation until I can confidently rule out that no storage remains uncopied to. The API surface is large and so are the internals. Once it became apparent that the changeset cannot be dragged until the port matures, I stopped updating all uses of shared and deferred investigation til algos are revisited on a case-by-case basis to make the unit tests, the compielrs (and us devs) happy. |
Rebased the branch and changed the hip-cpu url.
|
I noticed the gfx1030 job was failing. Two things I fixed (hopefully): fully adding the gfx1030 target to CMakeLists.txt, and also I noticed there was compilation failure with device_segmented_radix_sort, namely there was still a hardcoded reference to warp size 64U, so I switched it to device_warp_size(). Feel free to change my fix if it is not appropriate. Hopefully this gets CI passing and then I will merge this in. |
Thanks for the fix. |
@MathiasMagnus :
I have implemented experimental HIP-CPU support for rocPRIM, which has the following properties currently:
The change set is massive, history is already cleaned up and mostly speaks for itself. Some noteworthy changes:
DISABLE_WERROR
option from the build script and moved it over to the CI script. If AMD wants is badly, it can be reinstantiated, but I would argue it's better this way.cl : Command line warning D9025 : overriding '/W3' with '/W4'
type messages.)cli.parse()
methods withsize_t
args. (this specialization doesn't exist in our parser)__shared__
memory on GPUs are scratchpads and by default zero initialized (something we actively build on) whereas this isn't true for uninitialized CPU memory.Dependency.cmake
received a massive facelift, partly due to how annoyingly hard it is to auto-magically compile parallelSTL code with libstdc++.libm
andlibstdfs++
for the STL filesystem library) but a version which doesn't build using CMake whichDownloadProject.cmake
is mostly centered around. The version which builds with CMake is missing required types. The suitable version exposes a module which helps to build it, hence that's what I use in our scripts.Dependency.cmake
to the ground. It's the source of so much aggravation.The reason why most of the tests are passing is due to a limitation of HIP-CPU which causes some kernel codes to be updated when using this back-end: namely HIP-CPU lacks the lock-step execution of warps which device execution exhibits. This is a subtle difference, yet very important. When warp operations appear in divergent control-flow, HIP-CPU breaks. A typical patch would look like this:
We're executing warp-level algos in divergent control-flow. In HIP-CPU all warp-level intrinsics (
__shfl()
, etc.) also act as block-level sync instructions (their implementation issues__sycnthreads()
), therefore they synchronize at a higher level than just a warp and some threads missing a sync-instruction is "bad, mkay"? HIP-CPU doesn't crash, but the kernel goes out-of-sync and starts returning garbage results.This is the current status of tests using HIP-CPU: (green: pass, red: fail, grey: hang)
All block and warp-level algos need to be revised one after the other and be patched up. This MR does not address any of these issues, as finding out this limitation introduced a whole bunch of work we currently don't have the capacity to implement. However, the HIP-CPU team was approached by a customer who have code directly using rocPRIM (yaaaay!) and wish to use it with HIP-CPU. They would be willing to put in the extra effort and fixup the algos they use. (It's free work for us to leverage from the community.)
Carrying this set of patches in a fork is a lot of work, so I would like to get this upstreamed as soon as possible.