Try to retrigger issue 517 - DO NOT MERGE THIS PR #527

Closed

Conversation

casparvl
Collaborator

@casparvl casparvl commented Apr 2, 2024

This PR is just to reproduce and investigate #517 . It should NOT be merged.


eessi-bot bot commented Apr 2, 2024

Instance eessi-bot-mc-aws is configured to build:

  • arch x86_64/generic for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/generic for repo eessi-hpc.org-2023.06-software
  • arch x86_64/generic for repo eessi.io-2023.06-compat
  • arch x86_64/generic for repo eessi.io-2023.06-software
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-compat
  • arch x86_64/intel/haswell for repo eessi.io-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-compat
  • arch x86_64/intel/skylake_avx512 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen2 for repo eessi.io-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi-hpc.org-2023.06-software
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-compat
  • arch x86_64/amd/zen3 for repo eessi.io-2023.06-software
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/generic for repo eessi-hpc.org-2023.06-software
  • arch aarch64/generic for repo eessi.io-2023.06-compat
  • arch aarch64/generic for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_n1 for repo eessi.io-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi-hpc.org-2023.06-software
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-compat
  • arch aarch64/neoverse_v1 for repo eessi.io-2023.06-software

@casparvl casparvl marked this pull request as ready for review April 2, 2024 11:03
Caspar van Leeuwen added 2 commits April 2, 2024 13:10
@casparvl
Collaborator Author

casparvl commented Apr 2, 2024

Ok, I tried to replicate this by running the code for the CI action from here onward. The 'test load_easybuild_module.sh script' test worked fine, as expected. The 'test install_software_layer.sh script' test also runs fine for me, but I'm pretty sure I know why it was failing in CI: as a first step, it now starts to install CUDA in the host_injections prefix, since CUDA isn't there yet. And CUDA requires a ton of disk space...

...
From directory: /home/casparvl/support/pr527/software-layer/scripts
To directory: /cvmfs/software.eessi.io/versions/2023.06/scripts
Files /home/casparvl/support/pr527/software-layer/scripts/utils.sh and /cvmfs/software.eessi.io/versions/2023.06/scripts/utils.sh differ
cp: error writing '/cvmfs/software.eessi.io/versions/2023.06/scripts/utils.sh': Function not implemented
File /home/casparvl/support/pr527/software-layer/scripts/utils.sh copied to /cvmfs/software.eessi.io/versions/2023.06/scripts/utils.sh
Copying files: install_cuda_host_injections.sh link_nvidia_host_libraries.sh
From directory: /home/casparvl/support/pr527/software-layer/scripts/gpu_support/nvidia
To directory: /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia
Files /home/casparvl/support/pr527/software-layer/scripts/gpu_support/nvidia/install_cuda_host_injections.sh and /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/install_cuda_host_injections.sh are identical. No copy needed.
Files /home/casparvl/support/pr527/software-layer/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh and /cvmfs/software.eessi.io/versions/2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh are identical. No copy needed.
Attempting to load an EasyBuild module to do actual install
== Temporary log file in case of crash /tmp/eb-xn2wut5f/easybuild-g1eaze70.log
== found valid index for /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/EasyBuild/4.9.0/easybuild/easyconfigs, so using it...
== processing EasyBuild easyconfig /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/generic/software/EasyBuild/4.9.0/easybuild/easyconfigs/c/CUDA/CUDA-12.1.1.eb
== building and installing CUDA/12.1.1...
  >> installation prefix: /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/generic/software/CUDA/12.1.1
== fetching files...
  >> download succeeded: https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
  >> sources:
  >> /tmp/casparvl/easybuild/sources/c/CUDA/cuda_12.1.1_530.30.02_linux.run [SHA256: d74022d41d80105319dfa21beea39b77a5b9919539c0487a05caaf2446d6a70e]
== ... (took 31 secs)
== creating build dir, resetting environment...
  >> build dir: /tmp/tmp.vDDwmoxVD0/build/CUDA/12.1.1/system-system
== ... (took < 1 sec)
== unpacking...
  >> running command:
        [started at: 2024-04-02 11:49:05]
        [working dir: /home/casparvl/support/pr527/software-layer]
        [output logged in /tmp/eb-xn2wut5f/easybuild-run_cmd-1tti71s9.log]
        /bin/sh /tmp/casparvl/easybuild/sources/c/CUDA/cuda_12.1.1_530.30.02_linux.run --noexec --nox11 --target /tmp/tmp.vDDwmoxVD0/build/CUDA/12.1.1/system-system
  >> command completed: exit 0, ran in 00h01m20s
... etc

(In my interactive session it completes fine, but I have plenty of disk space.) I seem to remember @bedroge seeing errors regarding disk space; this is probably why. We should make sure we can skip the CUDA installation in host_injections when running this in a CI environment. That should be easy enough to implement; I'll try it in this PR first to see if I can make the CI pass here. Once successful, I'll make a new PR that only changes the required files (and leave this one as 'proof' of the fix).
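
A minimal sketch of what such a skip option could look like, assuming a command-line flag is used. The flag name `--skip-cuda-install` and the function name `cuda_install_action` are hypothetical, chosen for illustration only; they are not taken from the actual software-layer scripts.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: decide whether to install the CUDA SDK in host_injections.
# The flag name --skip-cuda-install and the function name cuda_install_action
# are assumptions for illustration, not the real software-layer interface.

# Prints "skip" if the flag is among the arguments, "install" otherwise.
cuda_install_action() {
    local action="install"
    local arg
    for arg in "$@"; do
        case "${arg}" in
            --skip-cuda-install)
                action="skip"
                ;;
        esac
    done
    echo "${action}"
}

# A CI workflow would pass the flag; an interactive run would not.
if [ "$(cuda_install_action "$@")" = "skip" ]; then
    echo "Skipping installation of CUDA SDK in host_injections"
else
    echo "Would install CUDA SDK in host_injections here"
fi
```

With something like this, the CI action would call the install script with the extra flag, while normal builds keep the default behaviour of installing CUDA when it is missing.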

@casparvl
Collaborator Author

casparvl commented Apr 2, 2024

Ok, that's clear: the fact that 877f364 fails and ba142ca succeeds proves that the installation of the CUDA SDK was the issue, and that being able to skip it is the solution. I'll implement that in a separate PR and close this one, so that we can go back to the 'proof' of the issue if needed.

@casparvl casparvl closed this Apr 2, 2024
@casparvl casparvl deleted the retrigger_issue_517 branch August 15, 2024 15:03