-
-
Notifications
You must be signed in to change notification settings - Fork 14.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rocmPackages.composable_kernel: compress output #299589
Conversation
Result of 1 package marked as broken and skipped:
2 packages built:
|
I'm going to invoke @wegank and @SuperSandro2000 to give their view of this |
Do we even need the static library? Is it stripped? |
It is intended to be consumed as a static library. Building it as a shared library needs patching. I've tried that in the past, and every time MIOpen would fail to find composable_kernel properly if I do that. I will try this again and see if the situation now improves. |
An option would be to cherry-pick llvm/llvm-project@7e28234 to ROCm's LLVM. Or maybe backport ROCm/composable_kernel#1044 and ROCm/MIOpen#2526. I've tried this a few months ago (GZGavinZhao/composable_kernel@c8d9ac8) and I don't recall getting too much space improvement, but I could've mis-remembered since my original intention wasn't to save space. |
Yes, we need it. It is a fat binary file with a a lot of of GPU-specific code for each target GPU, which is used by other parts of ROCm. I checked the log and yes, it is stripped:
I would prefer if we first found an acceptable solution to get caching to work for users of ROCm, because especially Another alternative to this PR would probably be setting a long enough In case the bandwidth and storage costs are actually proportional to compressed size I think it would make sense to tie Hydra's output size limit to compressed size to avoid these kinds of workarounds as much as possible. |
Thanks for linking this, I was aware of the
To me it looks like this just splits up that huge file in a few smaller ones that add up to roughly the same total size again, so it does not help us with the self-imposed limit we have here. Going by that first PR it also looks like it would need corresponding changes in other places. I cannot invest the time required to try this. |
We could also try dropping the gfx940 and gfx941 targets as suggested here #298388 (comment) to see if that gets us below the size limit. |
Here is a bit of an overview of what's in that file:
|
The workaround here looks good to me. Thanks for putting so much time and effort into this. |
Turns out this would not even work because |
I tried this as well now and it does not work either. It only gets us down to 3.1 GB. |
I have backported this commit here mschwaig/llvm-project@527b6ea and it looks like that will work. The size of What I'm still having trouble with:
You can find what I have here: mschwaig@fe795d4 To conserve reviewer's time I am moving this back into the draft stage. |
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: https://discourse.nixos.org/t/trouble-patching-llvm-source-code-for-rocm/42772/1 |
f199cf1
to
b9f41cf
Compare
7fd6313
to
f7b817c
Compare
This patches the clang-offload-bundler tool to add a compression option from a more recent version of clang. This compression option reduces the size of ROCm's fat binaries. Those binaries contain .hip_fatbin sections with GPU-specific code, for each target. Compression is automatically turned on for all produced outputs via a wrapper, because it's difficult to identify all the places where the -compression argument would be needed. Once upsteam introduces handeling for this argument, we should drop the wrapper again. This transistion will create inconsistsency, but I do not think that it will impact any actual users and it's what's practical to implement.
f7b817c
to
772dbad
Compare
I have a working version with the Result of 11 packages marked as broken and skipped:
88 packages built:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good now. Landing this would be huge since it takes quite some time to build composable_kernel
As the author of this PR I think its ready too. 👍 I ran (This PR is also a prerequisite for merging #298388 as is, because that PR simplifies the |
This comment is interesting, maybe we can drop those versions that are mentioned? This can be done in a separate PR however so that we can merge this without anything that could be considered controversial sneaking in. |
I agree that dropping those ISAs is worth looking into. |
Unfortunately I made a mistake when testing this PR and this segfaults when properly tested with llama.cpp:
|
@GZGavinZhao do you happen to have a working version of those the patch for llvm/clang? |
Should we roll back this @mschwaig ? Or does someone have time to fix it before 24.05? |
I couldn't get LLVM to build with my patches. IIRC it would fail during the test phase, which was sort of expected since I removed the test file changes in the patches to prevent conflicts, but I didn't have time to dig deep into it. From the stack trace, it seems like the problem is not with LLVM's side, but that the HIP clr runtime and/or ROCm-CompilerSupport doesn't support compressed device binaries yet. I was able to reproduce this segmentation fault just with the
This is weird because according to llvm/llvm-project#67162, device binary compression should be off by default, but it seems like |
Thanks for the minimal test case and all of the additional info. Compression always being on is due to the wrapper I added here:
|
I don't quite understand the timelines involved, as far as I know this would have to be fixed until tomorrow, to get into the release. Until tomorrow (or within the next week even) I don't think this change can be salvaged. We could
I would assume that getting around the Hydra size limit with either solution is something we could backport into the release later on as well (maybe after we have something working on unstable for some time)? |
I personally suggest a revert as well. I just dug into clr and comgr code to check and found that even AMD themselves currently have no support for compressed device binaries. This means that we would essentially have to implement support for compressed device binaries in comgr by ourselves. It shouldn't be hard, but I don't think it's something doable in one or two days. A week is probably enough, but if the deadline for 24.05 is tomorrow, then I suggest we should just revert the PR. |
The deadline is certainly not tomorrow, as it only applies to breaking changes to Release Critical Packages. EDIT: reverted in #304672. |
Not sure how ROCm works, but if we're not getting ROCm in Hydra in 24.05 then would it be possible instead to make an override to build a single target? (if that's even possible) |
Thanks to a very new PR (4 days ago) llvm/llvm-project#88827, I managed to make compressed device binaries work for |
@GZGavinZhao please ping me for testing etc |
Description of changes
I made a mistake in my ROCm 6.0 PR #287846, not realizing that
rocmPackages.composable_kernel
would be larger than Hydra's size limit. See the comments here for details: #287846 (comment)This PR compresses the hugelib/libdevice_operations.a
file that's in the output of that derivation, and then decompresses it again in another derivation based onrunCommandLocal
.Is this kind of fix acceptable?Is how I implemented this configurable enough and the derivations carry the correct metadata?How else could this issue be addressed?Not having that derivation cached is a huge pain, since the build takes a few hours at least.I am runningnixpkgs-review
on this right now.Things done
nix.conf
? (See Nix manual)sandbox = relaxed
sandbox = true
nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)Add a 👍 reaction to pull requests you find important.