Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Hip-Cpu support initial #233

Merged
merged 75 commits into from
Jun 19, 2021
Merged

Add Hip-Cpu support initial #233

merged 75 commits into from
Jun 19, 2021

Conversation

neon60
Copy link
Collaborator

@neon60 neon60 commented Jun 4, 2021

@MathiasMagnus :
I have implemented experimental HIP-CPU support for rocPRIM, which has the following properties currently:

  • Supports Linux and Windows
  • Supports GCC, Clang and MSVC host compilers
  • All tests and benchmarks build, simple tests also pass

The change set is massive, history is already cleaned up and mostly speaks for itself. Some noteworthy changes:

  • I have removed the DISABLE_WERROR option from the build script and moved it over to the CI script. If AMD wants is badly, it can be reinstantiated, but I would argue it's better this way.
    • This was done in order to avoid code bloat for an facility with (arguably) minimal value which would double with the increase of supported compilers. (Handling clang, g++, cl.exe, clang-cl.exe gets hairy and is needless bloat.)
    • Moreover it easily conflicts with user-provided values on the command-line and toolchain files. (Can easily result in unsilence-able cl : Command line warning D9025 : overriding '/W3' with '/W4' type messages.)
  • Needed to move to CUDA style indexing, as HIP's docs are deemed outdated in referring to CUDA indexing as something that's still supported. The symbols are simply missing from HIP-CPU
  • Introduced new macros for controlling (no)unroll pragmas, as they aren't uniformly recognized by all compilers.
  • Many ISO conformance tweaks around:
    • RNG distributions being undefined for 8-bit integral types
    • type alignment specification
    • the overload control (metaprogram) used in the sorting tests for the comparator implementation was Clang-only (both GCC and MSVC choked on it)
    • ambiguous call to cli.parse() methods with size_t args. (this specialization doesn't exist in our parser)
  • Some GPU/CPU related differences:
    • uninitialized __shared__ memory on GPUs are scratchpads and by default zero initialized (something we actively build on) whereas this isn't true for uninitialized CPU memory.
    • dpp primitives missing (front-end couldn't even parse the template bodies)
  • Some compiler differences
    • many Clang built-ins missing or being called different in GCC/MSVC
    • attributes not allowed on function definitions
    • working around some SFINAE bugs in MSVC with auto-deduced return types
  • Dependency.cmake received a massive facelift, partly due to how annoyingly hard it is to auto-magically compile parallelSTL code with libstdc++.
    • The PSTL implementation used in GCC 9-10 uses TBB as an implicit dependency (GCC won't add it to the linker flags, just like libm and libstdfs++ for the STL filesystem library) but a version which doesn't build using CMake which DownloadProject.cmake is mostly centered around. The version which builds with CMake is missing required types. The suitable version exposes a module which helps to build it, hence that's what I use in our scripts.
    • If anyone asks, I want to burn Dependency.cmake to the ground. It's the source of so much aggravation.

The reason why most of the tests are passing is due to a limitation of HIP-CPU which causes some kernel codes to be updated when using this back-end: namely HIP-CPU lacks the lock-step execution of warps which device execution exhibits. This is a subtle difference, yet very important. When warp operations appear in divergent control-flow, HIP-CPU breaks. A typical patch would look like this:

    // Scan the warp reduction results to calculate warp prefixes
    if(flat_id < warps_no)
    {
        unsigned int prefix = storage_.warp_prefixes[flat_id];
        warp_scan_prefix_type().inclusive_scan(prefix, prefix, ::rocprim::plus<unsigned int>());
        storage_.warp_prefixes[flat_id] = prefix;
    }
#ifdef __HIP_CPU_RT__
    else
    {
        // HIP-CPU doesn't implement lockstep behavior. Need to invoke the same number sync ops in divergent branch.
        empty_type empty;
        ::rocprim::detail::warp_scan_crosslane<empty_type, detail::next_power_of_two(warps_no)>().inclusive_scan(empty, empty, empty_binary_op{});
    }
#endif
    ::rocprim::syncthreads();

We're executing warp-level algos in divergent control-flow. In HIP-CPU all warp-level intrinsics (__shfl(), etc.) also act as block-level sync instructions (their implementation issues __sycnthreads()), therefore they synchronize at a higher level than just a warp and some threads missing a sync-instruction is "bad, mkay"? HIP-CPU doesn't crash, but the kernel goes out-of-sync and starts returning garbage results.

This is the current status of tests using HIP-CPU: (green: pass, red: fail, grey: hang)

hip-cpu-tests

All block and warp-level algos need to be revised one after the other and be patched up. This MR does not address any of these issues, as finding out this limitation introduced a whole bunch of work we currently don't have the capacity to implement. However, the HIP-CPU team was approached by a customer who have code directly using rocPRIM (yaaaay!) and wish to use it with HIP-CPU. They would be willing to put in the extra effort and fixup the algos they use. (It's free work for us to leverage from the community.)

Carrying this set of patches in a fork is a lot of work, so I would like to get this upstreamed as soon as possible.

@MathiasMagnus
Copy link
Contributor

On the margin of not being the biggest fan of Dependencies.cmake: I understand the sentiment of wanting to provide a 'clone-build-run' experience to users, but in the case of having to support Linux/Windows with hipcc/Clang/GCC/MSVC and having to handle GTest, GBench, HIP-CPU, TBB, pthreads... some apt installed, some user provided... it starts to get messy and turns into maintaining a poor man's version of Vcpkg and/or Conan. The maintenance cost may be higher than the value provided.

When some users apt install their deps, others build them on their own (usually via Vcpkg or Conan) having to support all combinations becomes a lot. In CI we test going through Dependencies.cmake only, but developers who don't want to rebuild the world for every clean CMake configure. There's a lot of conflict. The DownloadProject.cmake we rely on isn't multi-config generator friendly, so people can't use Ninja Multi-config or Visual Studio build files at all, it will try to link debug deps to every build. (Debug and Release builds of these libs are link incompatible on Windows.) Once hipcc comes to Windows, the GBench compiler override currently in place will become a mini-project of it's own, overriding the compiler to something which may not even be on the PATH or installed at all (Clang + libc++).

The same spirit was followed when trying to auto-detect in HIP-CPU whether someone is compiling with GCC or Clang and using libstdc++ instead of libc++, because the implicit dependence on TBB is STL specific, not compiler specific. The symbol check we introduced in HIP-CPU for GCC 9 is already broken using GCC 10, hence here we moved to just checking for libstdc++ regardless of versions or internals and link to TBB and keep our fingers crossed.

Keeping these up to date is tedious, breaks often and (IMHO) doesn't provide much value to users/customers. Managing dependencies should be done by projects dedicated to doing just that. (Why GCC doesn't default to a command-line that compiles the STL itself is beyond me, but it's what we have to live with.) I'll maintain Dependencies.cmake for as long as requested, but I feel I have to mention that it may have outlived it's usefulness or scope, at least in its current form.

@AlexVlx AlexVlx requested a review from bensander June 4, 2021 11:48
@AlexVlx
Copy link

AlexVlx commented Jun 4, 2021

  • uninitialized __shared__ memory on GPUs are scratchpads and by default zero initialized (something we actively build on) whereas this isn't true for uninitialized CPU memory.

First, many thanks for doing this, I think it's extremely nice (then again, I'm biased so...). In what regards the above, that's not actually guaranteed, it just so happens that our current HW does that... maybe... sometimes. In practice, and per the CUDA spec that HIP tries to stay in tune with, the only guarantee you get for shared is that it's uninitialised, and thus it's UB to rely on it having any particular value that you did not explicitly write in there yourself. Thus, rocPRIM's reliance on this behaviour should be corrected, as it's a latent bug waiting to manifest strangely.

@MathiasMagnus
Copy link
Contributor

MathiasMagnus commented Jun 5, 2021

Thus, rocPRIM's reliance on this behaviour should be corrected, as it's a latent bug waiting to manifest strangely.

My statement was slightly strong, the compiler warns about potential use of uninitialized storage. Having taken a closer look, I couldn't 100% outrule that some elements of the array remain uncopied to. It may be a false positive, given how we declare the shared array, and invoke a copy on the next line which takes a non-const pointer to said storage as an argument, and this is what the compiler warns about. For the time being I added a memset operation until I can confidently rule out that no storage remains uncopied to.

The API surface is large and so are the internals. Once it became apparent that the changeset cannot be dragged until the port matures, I stopped updating all uses of shared and deferred investigation til algos are revisited on a case-by-case basis to make the unit tests, the compielrs (and us devs) happy.

@neon60
Copy link
Collaborator Author

neon60 commented Jun 13, 2021

Rebased the branch and changed the hip-cpu url.
I will create a separate issue for this internally:

  • Zero the uninitialized shared memory on GPUs when it's necessary and do not build on the default zero initializatiuon.

@stanleytsang-amd
Copy link
Collaborator

I noticed the gfx1030 job was failing. Two things I fixed (hopefully): fully adding the gfx1030 target to CMakeLists.txt, and also I noticed there was compilation failure with device_segmented_radix_sort, namely there was still a hardcoded reference to warp size 64U, so I switched it to device_warp_size(). Feel free to change my fix if it is not appropriate. Hopefully this gets CI passing and then I will merge this in.

@stanleytsang-amd stanleytsang-amd merged commit 7f2bba5 into develop Jun 19, 2021
@neon60
Copy link
Collaborator Author

neon60 commented Jun 20, 2021

I noticed the gfx1030 job was failing. Two things I fixed (hopefully): fully adding the gfx1030 target to CMakeLists.txt, and also I noticed there was compilation failure with device_segmented_radix_sort, namely there was still a hardcoded reference to warp size 64U, so I switched it to device_warp_size(). Feel free to change my fix if it is not appropriate. Hopefully this gets CI passing and then I will merge this in.

Thanks for the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants