
Consider ROCm releases with upstream llvm compatibility #1904

Open
littlewu2508 opened this issue Feb 10, 2023 · 9 comments

@littlewu2508

One of the major issues for ROCm distribution packaging is compatibility with upstream llvm, due to the asynchronous development cycles.

For example, suppose feature A is pushed to ROCm's llvm fork and then to upstream llvm. In the unlucky case, A has to wait about half a year to enter the next stable llvm release, while the next ROCm release comes sooner, shipping A along with components that depend on A.

What's more, A is implemented on top of llvm version X, but when upstreamed it only lands in the X+1 release. However, the new ROCm release Y ships A with llvm X (the ROCm fork). Distributions then cannot ship llvm-X together with ROCm-Y, because ROCm-Y depends on A; and llvm-X+1 is also incompatible with ROCm-Y, because ROCm-Y is based on llvm X.

The situation nowadays is much better; only a few patches are needed to bridge the gap. But it still makes maintaining ROCm distribution packages difficult. It would be better if the latest ROCm source releases were compatible with an upstream llvm release.

@emollier @cgmb @Mystro256 @Madouura @tpkessler Distribution packagers can also reference this issue when reporting and fixing such incompatibilities, and share information here to reduce the workload.

@Mystro256

I was speaking with some of the compiler guys, and I believe the intention is to move towards using stable LLVM, or at least to allow compilation against stable LLVM. Unfortunately, there's no timeline, and it seems like a bit of an involved task due to how their code flow processes work.

For now, I've just been maintaining my own branches of compiler support:
https://github.com/Mystro256/ROCm-CompilerSupport/tree/rocm-4.5.2-llvm13
https://github.com/Mystro256/ROCm-CompilerSupport/tree/rocm-roc-5.0.x-llvm13
https://github.com/Mystro256/ROCm-CompilerSupport/tree/roc-5.1.x-llvm14
https://github.com/Mystro256/ROCm-CompilerSupport/tree/rocm-5.2.x-llvm14
https://github.com/Mystro256/ROCm-CompilerSupport/tree/rocm-5.2.x-llvm15
(I believe ROCm 5.3 and 5.4 just worked as-is with LLVM 15 and 16 respectively)

It's mostly reverts or cherry-picks to get it to work. It's not perfect, but it gets the Fedora packages moving.

Some downstream ROCm components might expect a newer LLVM, but this component is definitely the most sensitive. I suspect when I eventually get around to packaging HIP in Fedora, the package set will be more sensitive to the LLVM version.

@littlewu2508

> I was speaking with some of the compiler guys, and I believe the intention is to move towards using stable LLVM, or at least to allow compilation against stable LLVM. Unfortunately, there's no timeline, and it seems like a bit of an involved task due to how their code flow processes work.

I hope stable releases of the downstream projects can be synced with LLVM. It would be even better if they were upstreamed (especially rocm-device-libs).

> Some downstream ROCm components might expect a newer LLVM, but this component is definitely the most sensitive. I suspect when I eventually get around to packaging HIP in Fedora, the package set will be more sensitive to the LLVM version.

From my experience, rocm-device-libs and rocm-compilersupport have the closest relationship with LLVM. The first provides bitcode libraries, and the second currently contains comgr, which as I understand it is a library that calls the clang driver (using its C++ API) to compile and inspect GPU code objects. HIP is a runtime API, plus hipcc, the Perl wrapper (for AOT compilation), plus hiprtc (the wrapper around comgr? for JIT); its relationship with llvm/clang is somewhat weaker than that of the previous two.
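
To make that relationship concrete, here is a minimal, hedged sketch of the JIT path from the user's side, using the public hiprtc API (error checking elided; the `axpy` kernel and file name are made up for illustration). The runtime compilation step is, per the description above, ultimately carried out through comgr and clang:

```cpp
// Minimal hiprtc JIT sketch: compile HIP source at runtime into a code
// object, then load it with the HIP module API. Error checking elided.
#include <hip/hiprtc.h>
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

static const char* kSource = R"(
extern "C" __global__ void axpy(float a, const float* x, float* y) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    y[i] += a * x[i];
})";

int main() {
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, kSource, "axpy.hip", 0, nullptr, nullptr);
    hiprtcCompileProgram(prog, 0, nullptr);  // runtime compilation (via comgr)

    size_t size = 0;
    hiprtcGetCodeSize(prog, &size);
    std::vector<char> code(size);
    hiprtcGetCode(prog, code.data());        // the finished GPU code object
    hiprtcDestroyProgram(&prog);

    hipModule_t module;
    hipFunction_t kernel;
    hipModuleLoadData(&module, code.data()); // load the code object
    hipModuleGetFunction(&kernel, module, "axpy");
    std::printf("JIT-compiled code object: %zu bytes\n", size);
    return 0;
}
```

Because this whole pipeline runs through comgr and the device-libs bitcode, a version mismatch between any of those pieces and the system LLVM shows up exactly here.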

@littlewu2508

llvm/llvm-project#60313 & https://github.com/RadeonOpenCompute/ROCm-CompilerSupport/issues/45

Four of the comgr test cases use a feature that calls an amdgpu kernel from within another amdgpu kernel, but the feature is not well implemented and not upstreamed. Last time I checked, Fedora had patched out these four tests.

In Gentoo I patched the tests to avoid calling other kernels inside a kernel, so they compile fine with upstream clang. This removes the test failures while keeping the purpose of testing comgr's basic compilation capabilities.
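
For illustration, the shape of that change is roughly the following (a made-up HIP-flavored sketch, not the actual comgr test sources, which are OpenCL; the function names are invented):

```cpp
// Before (the pattern the tests relied on): one kernel directly calls
// another kernel. Upstream clang does not implement this properly for
// amdgpu, so it fails:
//
//   __global__ void callee() { /* ... */ }
//   __global__ void caller() { callee(); }  // kernel-calls-kernel
//
// After (the workaround): move the body into an ordinary device function,
// which upstream clang handles fine, while the test still exercises
// comgr's basic compilation path:
__device__ void callee_body() { /* same work as before */ }

__global__ void caller() {
    callee_body();  // plain device-side call, supported by upstream clang
}
```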

@littlewu2508

And in the development branch, a potential breakage has been found between llvm/clang and rocm-device-libs:

https://reviews.llvm.org/D142507 & ROCm/ROCm-Device-Libs@8dc779e

@Mystro256

@searlmc1 as FYI

> Four of the comgr test cases use a feature that calls an amdgpu kernel from within another amdgpu kernel, but the feature is not well implemented and not upstreamed. Last time I checked, Fedora had patched out these four tests.

Yes, that was me, and I just adopted your patch. Thank you!
See:
https://src.fedoraproject.org/rpms/rocm-compilersupport/c/cd84e9a05e34cedd76a1e9e8eeb586cf7bf50791?branch=rawhide

> And in the development branch, a potential breakage has been found between llvm/clang and rocm-device-libs:

Thanks for pointing this out. I guess it means we should keep device libs at 5.4 for LLVM 16? I'm a bit uncomfortable with this response:

> [from https://reviews.llvm.org/D142507] In general the idea is that compiler and device-libs should match. I guess the correct answer then users of clang-16 shall use rocm-5.4.x branch of the device libs?

@littlewu2508

littlewu2508 commented Feb 28, 2023

> See: https://src.fedoraproject.org/rpms/rocm-compilersupport/c/cd84e9a05e34cedd76a1e9e8eeb586cf7bf50791?branch=rawhide

And in the amd-stg-open branch they added a mangled_names test, which needs to be adjusted after the workaround for the amdgpu kernel call:
https://github.com/littlewu2508/gentoo/blob/rocm-9999/dev-libs/rocm-comgr/files/rocm-comgr-9999-fix-tests.patch

> Thanks for pointing this out. I guess it means we should keep device libs at 5.4 for LLVM 16? I'm a bit uncomfortable with this response:

Yes, although there are uncertainties. If they (ROCm) revert ROCm/ROCm-Device-Libs@8dc779e in the ROCm-5.5.x releases, then things will be fine. Otherwise we have to revert it ourselves for llvm-16 compatibility.

@Mystro256

Mystro256 commented Feb 28, 2023

OK, after some thought, for the time being I'm going to try to maintain some branches per LLVM version:

https://github.com/Mystro256/ROCm-Device-Libs/tree/release/16.x
https://github.com/Mystro256/ROCm-CompilerSupport/tree/release/16.x

Feel free to contribute if you are interested. I'm going to try evaluating it when Fedora bumps to LLVM 16.
I'm going to skim through the patches as they land in the release branches and pull in the critical fixes as I go. For now it's just a snapshot of amd-stg-open.

@aaronmondal

I haven't cut a release yet, but for the HIP portion of ROCm I now have Bazel build files in rules_ll.

This project makes it possible to write a target like:

```
ll_binary(
    name = "hip_example",
    srcs = ["hello.hip"],
    compilation_mode = "hip_amdgpu",
)
```

Which can be built and run with:

```
bazel run hip_example
```

rules_ll will fetch LLVM and ROCm from close to upstream, build a C++ toolchain from the fetched LLVM, use that toolchain to build the ROCm stack, and then construct a hip_amdgpu toolchain that builds the hip_example target (and then Bazel runs it). This all happens in wrapped Nix environments, making the build fully hermetic and reproducible down to the C library (modulo bugs).

With a single command invocation, this all happens automatically 🥳

This is of course highly experimental. At the moment it only works with an out-of-tree rules_ll, not with the regular installation instructions. I'll create a release in the next few days and then open an issue that explains this in more detail.

@ppanchad-amd

@littlewu2508 Is this ticket still relevant? Thanks!

lamb-j self-assigned this May 28, 2024