Build failure: rocmPackages_5.rocblas #302412

Closed
ony opened this issue Apr 7, 2024 · 7 comments
Labels
0.kind: build failure A package fails to build

Comments

Contributor

ony commented Apr 7, 2024

Steps To Reproduce

Steps to reproduce the behavior:

  1. Override the defaults of tensileLazyLib and tensileSepArch, setting both to false, to follow the RDNA 1 work-around used in Debian
  2. Observe the failed attempt to copy into /build/source/build/Tensile/library/Kernels.so-000-gfx1010.hsaco

Build log

  File "/build/source/tensile/lib/python3.11/site-packages/Tensile/Parallel.py", line 53, in pcallWithGlobalParamsMultiArg
    return f(*args)
           ^^^^^^^^
  File "/build/source/tensile/lib/python3.11/site-packages/Tensile/TensileCreateLibrary.py", line 322, in buildSourceCodeObjectFile
    shutil.copyfile(src, dst)
  File "/nix/store/yvhwsfbh4bc99vfvwpaa70m4yng4pvpz-python3-3.11.8/lib/python3.11/shutil.py", line 258, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
PermissionError: [Errno 13] Permission denied: '/build/source/build/Tensile/library/Kernels.so-000-gfx1010.hsaco'

Additional context

This is likely caused by symlinking into the Nix store in:

for path in ${gfx80} ${gfx90} ${gfx94} ${gfx10} ${gfx11} ${fallbacks}; do
  ln -s $path/lib/rocblas/library/* build/Tensile/library
done

This might also make overriding gpuTargets a bit less effective, because it is overridden again in intermediate packages such as rocblas-tensile-gfx90, regardless of what the original rocblas package has.

Notify maintainers

From teams.rocm.members:

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

zsh%  nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.6.22, NixOS, 23.11 (Tapir)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.1`
 - channels(root): `"nixos-21.11.334684.1158f346391"`
 - channels(mykola): `"nixpkgs-unstable-21.11pre310022.14b0f20fa1f"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Add a 👍 reaction to issues you find important.

ony added the "0.kind: build failure" label Apr 7, 2024
Contributor Author

ony commented Apr 8, 2024

As a work-around:

  rocmPackages_5 = super.rocmPackages_5 // {
    rocblas = (super.rocmPackages_5.rocblas.override {
      # Work-around for https://github.com/ROCm/Tensile/issues/1757
      # https://www.reddit.com/r/ROCm/comments/1bd8vde/psa_rdna1_gfx1010gfx101_gpus_should_start_working/
      # for ROCm 5.2+ till 6.1 released
      tensileLazyLib = false;
      tensileSepArch = false; # https://github.com/ROCm/rocBLAS/issues/1339#issuecomment-1682846493
      gpuTargets = ["gfx1010"];
    }).overrideDerivation (oldAttrs: {
      # work-around for https://github.com/NixOS/nixpkgs/issues/302412
      postPatch = ''
        ${oldAttrs.postPatch}
        rm -v /build/source/build/Tensile/library/Kernels.so-000-gfx1010.hsaco
      '';
    });
  };

But it took me ~5h to rebuild all downstream dependencies, because I used this in an overlay.

So maybe it is also worth reconsidering the default values until ROCm 6.1 is released, as mentioned in https://www.reddit.com/r/ROCm/comments/1bd8vde/psa_rdna1_gfx1010gfx101_gpus_should_start_working/

> Note to distribution maintainers: just porting that single fix is not enough because it depends on a previous bug fix. It's recommended for now to continue building rocBLAS with -DTensile_LAZY_LIBRARY_LOADING=OFF until a release containing both patches comes out.

Or maybe there is a way to split the package into build-time and run-time dependencies.

Member

mschwaig commented Apr 8, 2024

We have an open PR which I think should address this for ROCm 6.0:
#298388

Could you also use ROCm 6.0 if we had that fix?
If not, it would be helpful if you could explain a bit more about your use case and why you are stuck on 5.7.

EDIT: thanks for opening this issue and sharing your workaround

Contributor Author

ony commented Apr 8, 2024

Yes! That PR looks like exactly what I need. It fixes the issue of the options not being usable by dropping those per-GPU buckets for Tensile, and it also switches lazy loading off by default.
I should have spent more effort looking through both issues and PRs.

> Could you also use ROCm 6.0 if we had that fix?

I could, once it is in a stable release. Right now I'm on NixOS 23.11, which still has only 5.x. But in 1-2 months we are going to have 24.05, as I understand.
Once the PR is merged into unstable, I might be able to pull it in parallel with stable nixpkgs. It will likely be better to trade disk space for duplicated packages than to have my home build take ~5h 😄
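
One way to pull a package from unstable while the rest of the system stays on stable is to import a second nixpkgs next to the stable one. The following is a minimal sketch of a NixOS module under stated assumptions: the tarball URL tracks the nixos-unstable branch without pinning, the name `unstable` is only illustrative, and `ollama` stands in for whichever package needs the newer ROCm.

```nix
{ config, pkgs, ... }:

let
  # Assumption: unpinned tarball of the nixos-unstable branch. For a
  # reproducible build, replace it with a fixed revision and a sha256.
  unstable = import
    (builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/nixos-unstable.tar.gz")
    {
      # Reuse the platform and the nixpkgs config of the stable package set.
      inherit (pkgs.stdenv.hostPlatform) system;
      config = config.nixpkgs.config;
    };
in
{
  # Only this package and its closure come from unstable;
  # everything else keeps coming from the stable channel.
  environment.systemPackages = [ unstable.ollama ];
}
```

Because the stable package set is left untouched, nothing downstream in stable is rebuilt; the cost is a second copy of shared dependencies in the store.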

P.S. Thank you for maintaining these packages.

Contributor Author

ony commented Jun 16, 2024

No luck after the upgrade to 24.05 😢. Without HSA_OVERRIDE_GFX_VERSION=10.1.0 I get "not supported", and with it set:

rocBLAS error: Cannot read /nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010
 List of available TensileLibrary Files : 
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
time=2024-06-16T09:31:08.050+02:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server not responding"
time=2024-06-16T09:31:09.203+02:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error:Cannot read /nix/store/y74cnhncj1zbnsg04dmi87rk4q0ybm6n-rocm-path/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010"

(and it is ROCm 6.0.2 judging by symlinks like /nix/store/r7mpx3da0rk1l0n6a2nfv2bh8f7b0m0a-hipblas-6.0.2/hipblas leading out of rocm-path)

And that's despite the fact that rocmPackages.rocblas currently has both of those flags set to false by default:

# https://github.com/ROCm/Tensile/issues/1757
# Allows gfx101* users to use rocBLAS normally.
# Turn the below two values to `true` after the fix has been cherry-picked
# into a release. Just backporting that single fix is not enough because it
# depends on some previous commits.
, tensileSepArch ? false
, tensileLazyLib ? false

I also double-checked that those are false by adding an overlay and observing no rebuild:

  rocmPackages_6 = prev.rocmPackages_6 // {
    rocblas = (prev.rocmPackages_6.rocblas.override {
      # Work-around for https://github.com/ROCm/Tensile/issues/1757
      # https://www.reddit.com/r/ROCm/comments/1bd8vde/psa_rdna1_gfx1010gfx101_gpus_should_start_working/
      # for ROCm 5.2+ till 6.1 released
      tensileLazyLib = false;
      tensileSepArch = false; # https://github.com/ROCm/rocBLAS/issues/1339#issuecomment-1682846493
      # gpuTargets = ["gfx1010"];
    });
  };

It looks as if those flags have no effect with ROCm 6.

P.S. I think that rocmPackages.rocblas is not in the cache, as I had to build it from source.
P.P.S. And ollama 0.1.38 now requires ROCm v6 😢. So I can't just override it to use v5.7 without downgrading ollama.

Contributor

caffineehacker commented Jun 18, 2024

Rocblas is in the cache (https://hydra.nixos.org/search?query=rocblas), but Ollama was doing an override, which resulted in a cache miss. Does it work if you use nixpkgs after this commit: fbb5b1b?

I can confirm on my system I was getting a cache miss, but now it is able to use rocblas from Hydra after syncing past that point for nixpkgs.

Edit: Another note is that overriding rocblas the way you are doesn't entirely work. It will override a direct reference, but I find that other rocmPackages (e.g. rocsolver and hipblas) still refer to a non-overridden version of rocblas. I'm still trying to figure out the proper way to do that override.
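
One possible way to keep the dependents consistent, sketched under assumptions rather than confirmed against nixpkgs, is to build the customized rocblas once and pass it explicitly to each package that consumes it. The overlay below assumes that `rocsolver` accepts `rocblas` as an argument and that `hipblas` accepts both `rocblas` and `rocsolver`; check the respective `default.nix` files before relying on it. The names `rocblasGfx1010` and `rocsolverGfx1010` are illustrative.

```nix
final: prev:

let
  rocm = prev.rocmPackages;

  # Customized rocblas, built once and shared by the packages below.
  rocblasGfx1010 = rocm.rocblas.override {
    tensileLazyLib = false;
    tensileSepArch = false;
    gpuTargets = [ "gfx1010" ];
  };

  # Assumption: rocsolver takes `rocblas` as an argument.
  rocsolverGfx1010 = rocm.rocsolver.override {
    rocblas = rocblasGfx1010;
  };
in
{
  rocmPackages = rocm // {
    rocblas = rocblasGfx1010;
    rocsolver = rocsolverGfx1010;

    # Assumption: hipblas takes both `rocblas` and `rocsolver`.
    hipblas = rocm.hipblas.override {
      rocblas = rocblasGfx1010;
      rocsolver = rocsolverGfx1010;
    };
  };
}
```

Any other package that takes rocblas directly would need the same treatment, and every package overridden this way is built locally instead of being fetched from the cache.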

Contributor Author

ony commented Jun 19, 2024

Nice find! I picked it from master and it works for me. Thank you.

It would be nice to backport it to nixos-24.05.
(Writing an override seems complicated because the relevant code is in a let .. in section; it might be easier to override .override so that it does nothing.)

P.S. The following override, which replicates that change, works too:

  ollama = pkgs.ollama.override {
    rocmPackages = pkgs.rocmPackages // {
      # Ignore overrides for rocblas
      rocblas = pkgs.rocmPackages.rocblas // { override = (attrs: pkgs.rocmPackages.rocblas); };
    };
  };

Contributor Author

ony commented Sep 1, 2024

As of 2024-09-01 the fix has been backported to 24.05. Closing, as the original issue no longer exists with ROCm 6.

ony closed this as completed Sep 1, 2024