Add Hip-Cpu support initial #233

neon60 · 2021-06-04T07:24:58Z

@MathiasMagnus :
I have implemented experimental HIP-CPU support for rocPRIM, which has the following properties currently:

- Supports Linux and Windows
- Supports GCC, Clang and MSVC host compilers
- All tests and benchmarks build, simple tests also pass

The change set is massive, history is already cleaned up and mostly speaks for itself. Some noteworthy changes:

- I have removed the DISABLE_WERROR option from the build script and moved it over to the CI script. If AMD wants it badly, it can be reinstated, but I would argue it's better this way.
  - This was done in order to avoid code bloat for a facility with (arguably) minimal value, bloat which would double as the number of supported compilers increases. (Handling clang, g++, cl.exe, clang-cl.exe gets hairy and is needless bloat.)
  - Moreover, it easily conflicts with user-provided values on the command line and in toolchain files. (It can easily result in unsilenceable messages of the type cl : Command line warning D9025 : overriding '/W3' with '/W4'.)
- Needed to move to CUDA-style indexing, as HIP's docs are deemed outdated in referring to the HIP-specific indexing as something that's still supported. The symbols are simply missing from HIP-CPU.
- Introduced new macros for controlling (no)unroll pragmas, as they aren't uniformly recognized by all compilers.
- Many ISO conformance tweaks around:
  - RNG distributions being undefined for 8-bit integral types
  - type alignment specification
  - the overload control (metaprogram) used in the sorting tests for the comparator implementation was Clang-only (both GCC and MSVC choked on it)
  - ambiguous call to cli.parse() methods with size_t args (this specialization doesn't exist in our parser)
- Some GPU/CPU related differences:
  - uninitialized __shared__ memory on GPUs are scratchpads and by default zero initialized (something we actively build on) whereas this isn't true for uninitialized CPU memory.
  - dpp primitives missing (front-end couldn't even parse the template bodies)
- Some compiler differences:
  - many Clang built-ins missing or being called differently in GCC/MSVC
  - attributes not allowed on function definitions
  - working around some SFINAE bugs in MSVC with auto-deduced return types
- Dependencies.cmake received a massive facelift, partly due to how annoyingly hard it is to auto-magically compile parallelSTL code with libstdc++.
  - The PSTL implementation used in GCC 9-10 has TBB as an implicit dependency (GCC won't add it to the linker flags, just like libm, or libstdc++fs for the STL filesystem library), and it needs a TBB version which doesn't build using CMake, which is what DownloadProject.cmake is mostly centered around. The version which builds with CMake is missing required types. The suitable version exposes a module which helps to build it, hence that's what I use in our scripts.
  - If anyone asks, I want to burn Dependencies.cmake to the ground. It's the source of so much aggravation.
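To make the "RNG distributions being undefined for 8-bit integral types" item concrete: the standard only defines std::uniform_int_distribution for short, int, long, long long and their unsigned counterparts, so char-sized types need a detour. Below is a minimal sketch of one common workaround, not necessarily what this change set does; wide_int_t and random_ints are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical helper: maps 1-byte integer types to a (unsigned) short so that
// the distribution is only ever instantiated with a type the standard permits.
template <class T>
using wide_int_t = typename std::conditional<
    sizeof(T) == 1,
    typename std::conditional<std::is_signed<T>::value, short, unsigned short>::type,
    T>::type;

// Hypothetical helper: generates n values in [lo, hi], drawing in the wider
// type and narrowing each result back to T.
template <class T>
std::vector<T> random_ints(std::size_t n, T lo, T hi, unsigned int seed)
{
    assert(lo <= hi);
    std::mt19937 engine(seed);
    std::uniform_int_distribution<wide_int_t<T>> dist(lo, hi);
    std::vector<T> out(n);
    for (auto& v : out)
    {
        v = static_cast<T>(dist(engine)); // always fits, since lo and hi are of type T
    }
    return out;
}
```

Calling random_ints<std::int8_t>(100, -5, 5, 42u) yields 100 values in the closed range [-5, 5] without ever instantiating the distribution on an 8-bit type.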

The reason why most of the tests are not passing is a limitation of HIP-CPU which requires some kernel code to be updated when using this back-end: namely, HIP-CPU lacks the lock-step execution of warps which device execution exhibits. This is a subtle difference, yet a very important one. When warp operations appear in divergent control-flow, HIP-CPU breaks. A typical patch would look like this:

    // Scan the warp reduction results to calculate warp prefixes
    if(flat_id < warps_no)
    {
        unsigned int prefix = storage_.warp_prefixes[flat_id];
        warp_scan_prefix_type().inclusive_scan(prefix, prefix, ::rocprim::plus<unsigned int>());
        storage_.warp_prefixes[flat_id] = prefix;
    }
#ifdef __HIP_CPU_RT__
    else
    {
        // HIP-CPU doesn't implement lockstep behavior. Need to invoke the same number of sync ops in the divergent branch.
        empty_type empty;
        ::rocprim::detail::warp_scan_crosslane<empty_type, detail::next_power_of_two(warps_no)>().inclusive_scan(empty, empty, empty_binary_op{});
    }
#endif
    ::rocprim::syncthreads();

We're executing warp-level algos in divergent control-flow. In HIP-CPU all warp-level intrinsics (__shfl(), etc.) also act as block-level sync instructions (their implementation issues __syncthreads()), therefore they synchronize at a higher level than just a warp, and some threads missing a sync instruction is "bad, mkay"? HIP-CPU doesn't crash, but the kernel goes out-of-sync and starts returning garbage results.
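The need for matching sync counts can be reproduced outside HIP-CPU with a toy model. The sketch below is plain C++ with std::thread, not HIP-CPU code; block_barrier, warp_op, empty_op and run_block are hypothetical names, and the only assumption carried over from the text is that every warp-level op implies one block-wide sync.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier standing in for a block-wide sync.
class block_barrier
{
public:
    explicit block_barrier(std::size_t count) : threshold_(count), waiting_(0), generation_(0) {}

    void sync()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        const std::size_t gen = generation_;
        if (++waiting_ == threshold_)
        {
            // Last arrival: open the barrier and start a new generation.
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        }
        else
        {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex              mutex_;
    std::condition_variable cv_;
    std::size_t             threshold_, waiting_, generation_;
};

// Emulated warp op: in this model it costs exactly one block-wide sync.
inline void warp_op(block_barrier& barrier, int& value) { barrier.sync(); value += 1; }

// Counterpart for the divergent branch: same number of syncs, no work.
inline void empty_op(block_barrier& barrier) { barrier.sync(); }

// Runs one "block" in which only the first `active` threads do real warp work.
std::vector<int> run_block(std::size_t block_size, std::size_t active)
{
    assert(active <= block_size);
    block_barrier            barrier(block_size);
    std::vector<int>         data(block_size, 0);
    std::vector<std::thread> threads;
    for (std::size_t id = 0; id < block_size; ++id)
    {
        threads.emplace_back([&, id] {
            if (id < active)
                warp_op(barrier, data[id]);
            else
                empty_op(barrier); // keeps the sync count equal on both branches
            barrier.sync();        // stands in for ::rocprim::syncthreads()
        });
    }
    for (auto& t : threads) t.join();
    return data;
}
```

If the empty_op call were removed, the inactive threads would reach the final sync while the active ones are still inside warp_op, the two groups would release each other on mismatched barriers, and in this model the active threads would then wait forever on the final sync.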

This is the current status of tests using HIP-CPU: (green: pass, red: fail, grey: hang)

All block and warp-level algos need to be revised one after the other and be patched up. This MR does not address any of these issues, as finding out this limitation introduced a whole bunch of work we currently don't have the capacity to implement. However, the HIP-CPU team was approached by a customer who has code directly using rocPRIM (yaaaay!) and wishes to use it with HIP-CPU. They would be willing to put in the extra effort and fix up the algos they use. (It's free work for us to leverage from the community.)

Carrying this set of patches in a fork is a lot of work, so I would like to get this upstreamed as soon as possible.

MathiasMagnus · 2021-06-04T10:37:17Z

On the margin of not being the biggest fan of Dependencies.cmake: I understand the sentiment of wanting to provide a 'clone-build-run' experience to users, but in the case of having to support Linux/Windows with hipcc/Clang/GCC/MSVC and having to handle GTest, GBench, HIP-CPU, TBB, pthreads... some apt installed, some user provided... it starts to get messy and turns into maintaining a poor man's version of Vcpkg and/or Conan. The maintenance cost may be higher than the value provided.

When some users apt install their deps while others build them on their own (usually via Vcpkg or Conan), having to support all combinations becomes a lot. In CI we test going through Dependencies.cmake only, but developers don't want to rebuild the world for every clean CMake configure, so there's a lot of conflict. The DownloadProject.cmake we rely on isn't multi-config generator friendly, so people can't use Ninja Multi-Config or Visual Studio build files at all: it will try to link debug deps to every build. (Debug and Release builds of these libs are link incompatible on Windows.) Once hipcc comes to Windows, the GBench compiler override currently in place will become a mini-project of its own, overriding the compiler to something which may not even be on the PATH or installed at all (Clang + libc++).

The same spirit was followed when trying to auto-detect in HIP-CPU whether someone is compiling with GCC or Clang and using libstdc++ instead of libc++, because the implicit dependence on TBB is STL specific, not compiler specific. The symbol check we introduced in HIP-CPU for GCC 9 is already broken using GCC 10, hence here we moved to just checking for libstdc++ regardless of versions or internals and link to TBB and keep our fingers crossed.

Keeping these up to date is tedious, breaks often and (IMHO) doesn't provide much value to users/customers. Managing dependencies should be done by projects dedicated to doing just that. (Why GCC doesn't default to a command-line that compiles the STL itself is beyond me, but it's what we have to live with.) I'll maintain Dependencies.cmake for as long as requested, but I feel I have to mention that it may have outlived its usefulness or scope, at least in its current form.
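For illustration, the "just check for libstdc++, regardless of compiler or version" rule mentioned above can be expressed in C++ through the libraries' own identification macros. The actual check lives in the CMake scripts; this is only a sketch of the same decision, and standard_library_name and links_tbb_by_this_rule are hypothetical names.

```cpp
#include <cassert>
#include <cstddef> // any standard header pulls in the library's configuration macros
#include <string>

// Reports which standard library implementation this translation unit uses.
inline std::string standard_library_name()
{
#if defined(__GLIBCXX__)
    return "libstdc++";
#elif defined(_LIBCPP_VERSION)
    return "libc++";
#elif defined(_MSVC_STL_VERSION)
    return "MSVC STL";
#else
    return "unknown";
#endif
}

// Hypothetical rule mirroring the text: link TBB whenever libstdc++ is in use,
// independent of the compiler (GCC or Clang) and of the library version.
inline bool links_tbb_by_this_rule()
{
    return standard_library_name() == "libstdc++";
}
```

The point of keying on the library macro rather than the compiler is that Clang with libstdc++ and GCC with libstdc++ land in the same branch, while Clang with libc++ does not.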

AlexVlx · 2021-06-04T12:03:39Z

uninitialized __shared__ memory on GPUs are scratchpads and by default zero initialized (something we actively build on) whereas this isn't true for uninitialized CPU memory.

First, many thanks for doing this, I think it's extremely nice (then again, I'm biased so...). In what regards the above, that's not actually guaranteed, it just so happens that our current HW does that... maybe... sometimes. In practice, and per the CUDA spec that HIP tries to stay in tune with, the only guarantee you get for shared is that it's uninitialised, and thus it's UB to rely on it having any particular value that you did not explicitly write in there yourself. Thus, rocPRIM's reliance on this behaviour should be corrected, as it's a latent bug waiting to manifest strangely.

MathiasMagnus · 2021-06-05T15:15:31Z

Thus, rocPRIM's reliance on this behaviour should be corrected, as it's a latent bug waiting to manifest strangely.

My statement was slightly strong; the compiler warns about potential use of uninitialized storage. Having taken a closer look, I couldn't 100% rule out that some elements of the array remain uncopied to. It may be a false positive, given how we declare the shared array and invoke a copy on the next line which takes a non-const pointer to said storage as an argument, and this is what the compiler warns about. For the time being I added a memset operation until I can confidently rule out that any storage remains uncopied to.
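A host-side analogue of that defensive memset, as a minimal sketch: a fixed-size scratch array that a copy may fill only partially is zeroed first, so a later read of the whole array never touches indeterminate values. scratch_size and sum_via_scratch are hypothetical names, and the real code operates on __shared__ storage rather than a local array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <numeric>

constexpr std::size_t scratch_size = 8; // hypothetical capacity of the scratch array

// Copies `count` inputs into the scratch array, then sums the entire array.
inline unsigned int sum_via_scratch(const unsigned int* input, std::size_t count)
{
    assert(count <= scratch_size);
    unsigned int scratch[scratch_size];       // indeterminate contents at this point
    std::memset(scratch, 0, sizeof(scratch)); // defensive zeroing before the copy
    std::copy(input, input + count, scratch); // may fill only part of the array
    return std::accumulate(scratch, scratch + scratch_size, 0u);
}
```

Without the memset, the result for count < scratch_size would depend on whatever happened to be in the unwritten tail, which is the situation the compiler warning hints at.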

The API surface is large and so are the internals. Once it became apparent that the changeset cannot be dragged until the port matures, I stopped updating all uses of shared and deferred investigation until the algos are revisited on a case-by-case basis to make the unit tests, the compilers (and us devs) happy.

neon60 · 2021-06-13T12:50:30Z

Rebased the branch and changed the hip-cpu url.
I will create a separate issue for this internally:

Zero the uninitialized shared memory on GPUs when it's necessary and do not build on the default zero initialization.

stanleytsang-amd · 2021-06-18T18:36:25Z

I noticed the gfx1030 job was failing. Two things I fixed (hopefully): fully adding the gfx1030 target to CMakeLists.txt, and also I noticed there was compilation failure with device_segmented_radix_sort, namely there was still a hardcoded reference to warp size 64U, so I switched it to device_warp_size(). Feel free to change my fix if it is not appropriate. Hopefully this gets CI passing and then I will merge this in.
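As a small illustration of why a hardcoded 64U matters, here is a sketch assuming a target whose warp is 32 lanes wide (which, as far as I understand, is the default on gfx1030). warps_per_block and warps_per_block_hardcoded are hypothetical helpers; the warp_size argument stands in for what device_warp_size() would report.

```cpp
#include <cassert>

// Number of warps needed to cover a block, for a given warp size.
constexpr unsigned int warps_per_block(unsigned int block_size, unsigned int warp_size)
{
    return (block_size + warp_size - 1) / warp_size;
}

// The problematic variant: silently assumes 64-wide warps everywhere.
constexpr unsigned int warps_per_block_hardcoded(unsigned int block_size)
{
    return warps_per_block(block_size, 64U);
}
```

For a block of 256 threads the hardcoded variant reports 4 warps, while a 32-wide target actually runs 8, so anything sized or indexed per warp ends up off by a factor of two.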

neon60 · 2021-06-20T15:01:46Z

I noticed the gfx1030 job was failing. Two things I fixed (hopefully): fully adding the gfx1030 target to CMakeLists.txt, and also I noticed there was compilation failure with device_segmented_radix_sort, namely there was still a hardcoded reference to warp size 64U, so I switched it to device_warp_size(). Feel free to change my fix if it is not appropriate. Hopefully this gets CI passing and then I will merge this in.

Thanks for the fix.

neon60 requested review from stanleytsang-amd, saadrahim, doctorcolinsmith and AlexVlx June 4, 2021 07:25

AlexVlx requested a review from bensander June 4, 2021 11:48

MathiasMagnus mentioned this pull request Jun 10, 2021

Provide FindHIP.cmake, HIPConfig.cmake, hip-config.cmake to aid building other HIP Libraries ROCm/HIP-CPU#7

Open

neon60 and others added 21 commits June 12, 2021 20:56

Update benchmark config tuning code

2cd1dd9

Move options up front

1fcef19

Remove DISABLE_WERROR option

a24a928

Update summary

e86c152

Add to README as experimental back-end

c599a5c

IMPORTED target based dependency definition

d502907

[HIP-CPU] MSVC: error C1128 (/bigobj)

72f5077

[HIP-CPU] CUDA-style indexing and kernel launch

58d9078

[HIP-CPU] ambiguous call to overloaded function

d5b1b9c

cannot convert argument 1 from 'T **' to 'void **'

5299380

[HIP-CPU] unknown pragma 'unroll'

4cc7f4f

[HIP-CPU] half has explicit CTOR

3200581

[HIP-CPU] MSVC: 'name' hides previous declaration

40e8714

[HIP-CPU] MSVC: error C2975

03cf8dd

[HIP-CPU] MSVC: C2365 (redefinition of symbol)

7deb788

[HIP-CPU] rocprim::half_native for plats w/o one

0b9767b

[HIP-CPU] ISO C++ alignment specification

c81c744

TYPED_TEST_CASE is deprecated

4a38c84

Silence conversion warning

0fe9f89

[HIP-CPU] Note on optimization opportunity

1e69e2e

uniform_int_distribution is undefined for (u)int8

edee0e0

MathiasMagnus and others added 21 commits June 12, 2021 21:06

Temporarily reroute to HIP-CPU fork

a89f92f

error C3861: 'uint': identifier not found

2b795af

error C3861: 'min': identifier not found

5b06718

disambiguate function call

3cd966f

ISO C++ alignment specification

f3220d8

Prevent shared memory from slipping to size == 0

2564480

Bump GTest version dep

939d32a

Inherit build and library type

cc30136

Issue error when using multi-conf gen with dep mgt

619ca18

Downgrade TBB dependence to match libstc++

b79b9b1

[HIP-CPU] GCC: Circumvent load/store_volatile

7424a5e

[HIP-CPU] Remove static_assert type-check

65fe582

[HIP-CPU] MSVC: Guard against inline asm

1e87e42

[HIP-CPU] Clang/hipcc: use diff half impls

7660c18

Reinstantiate -Werror via CI script

779334e

Removed unused local type aliases

7731c2b

[HIP-CPU] Revert to original comparator with fix

82f9fba

Remove implicit cast (truncation)

7e0f0b7

[HIP-CPU] Limit CPU builds to fit runner RAM

e1bb2e3

Disable hip-cpu build step from gitlab CI

d7ea5cd

Handle size_t in cmdparser.hpp

9ea86f8

neon60 force-pushed the hip-cpu branch from 4b332fa to 9ea86f8 Compare June 13, 2021 12:38

Change hip-cpu url

ce912b1

Added experimental HIP-CPU support to changelog

30f6daa

stanleytsang-amd approved these changes Jun 14, 2021

Compilation fix for gfx1030

7658e89

stanleytsang-amd merged commit 7f2bba5 into develop Jun 19, 2021
