Test if non-CUDA builds are not added to accelerator path with jax #917

Open
wants to merge 27 commits into base: 2023.06-software.eessi.io

laraPPr
Collaborator

@laraPPr laraPPr commented Feb 13, 2025

5 out of 86 required modules missing:



* absl-py/2.1.0-GCCcore-12.3.0 (absl-py-2.1.0-GCCcore-12.3.0.eb)

* pytest/7.4.2-GCCcore-12.3.0 (pytest-7.4.2-GCCcore-12.3.0.eb)

* pytest-xdist/3.3.1-GCCcore-12.3.0 (pytest-xdist-3.3.1-GCCcore-12.3.0.eb)

* ml_dtypes/0.3.2-gfbf-2023a (ml_dtypes-0.3.2-gfbf-2023a.eb)

* jax/0.4.25-gfbf-2023a-CUDA-12.1.1 (jax-0.4.25-gfbf-2023a-CUDA-12.1.1.eb)
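
To make the expectation behind this test explicit, here is a minimal sketch that splits the missing modules listed above into CUDA and non-CUDA builds. It assumes that a CUDA-enabled build can be recognised by `-CUDA-` in its version suffix; that naming heuristic and the helper name `is_cuda_build` are illustrative assumptions, not something the build tooling is known to use.

```python
# Module names copied from the "required modules missing" list above.
missing = [
    "absl-py/2.1.0-GCCcore-12.3.0",
    "pytest/7.4.2-GCCcore-12.3.0",
    "pytest-xdist/3.3.1-GCCcore-12.3.0",
    "ml_dtypes/0.3.2-gfbf-2023a",
    "jax/0.4.25-gfbf-2023a-CUDA-12.1.1",
]


def is_cuda_build(module_name: str) -> bool:
    """Assumption: CUDA-enabled builds carry '-CUDA-' in the version suffix."""
    return "-CUDA-" in module_name


# Only these would be expected in the accelerator path.
cuda_builds = [m for m in missing if is_cuda_build(m)]
# These would be expected in the CPU-only path.
cpu_builds = [m for m in missing if not is_cuda_build(m)]

print("expected under accel/:", cuda_builds)
print("expected under the CPU path:", cpu_builds)
```

Under that assumption only the jax build would belong in the accelerator path, and the four dependencies would not.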

eessi-bot bot commented Feb 13, 2025

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/sapphire_rapids, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

eessi-bot bot commented Feb 13, 2025

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi.io-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-casparvl

Instance eessi-bot-casparvl is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi.io-2023.06-software, eessi-hpc.org-2023.06-compat, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

Instance eessi-bot-vsc-ugent is configured to build for:

  • architectures: x86_64/amd/zen3
  • repositories: eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat

@eessi-bot-trz42

Instance trz42-GH200-jr is configured to build for:

  • architectures: aarch64/nvidia/grace
  • repositories: eessi.io-2023.06-software

@laraPPr
Collaborator Author

laraPPr commented Feb 13, 2025

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80

eessi-bot bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
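
The "expanded format" shown in the bot reply can be reproduced with a small sketch. It only covers the abbreviations visible in this thread (`repo`, `arch`, `accel`); the function name `expand_bot_command` and the alias table are hypothetical and are not taken from the bot's source code.

```python
# Abbreviations seen in this thread and their expanded keys (assumed mapping).
ALIASES = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_bot_command(comment: str) -> str:
    """Expand abbreviated filter keys in a 'bot: <action> key:value ...' comment."""
    prefix = "bot:"
    if not comment.startswith(prefix):
        raise ValueError("not a bot command")
    action, *filters = comment[len(prefix):].split()
    expanded = [action]
    for item in filters:
        # Split on the first colon only; values such as 'nvidia/cc80' stay intact.
        key, sep, value = item.partition(":")
        expanded.append(f"{ALIASES.get(key, key)}{sep}{value}")
    return " ".join(expanded)


print(expand_bot_command(
    "bot: build instance:eessi-bot-vsc-ugent "
    "repo:eessi.io-2023.06-software accel:nvidia/cc80"
))
```

Keys that are not in the alias table, such as `instance`, pass through unchanged.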

eessi-bot bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

Updates by the bot instance eessi-bot-casparvl (click for details)
  • account laraPPr has NO permission to send commands to the bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.02/pr_917/15445297

date | job status | comment
Feb 13 13:25:54 UTC 2025 | submitted | job id 15445297 awaits release by job manager
Feb 13 13:26:58 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 13 13:29:02 UTC 2025 | running | job 15445297 is running
Feb 13 15:05:06 UTC 2025 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-15445297.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1739456837.tar.gz
size: 6 MiB (6667595 bytes)
entries: 1191
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
absl-py/2.1.0-GCCcore-12.3.0.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
pytest/7.4.2-GCCcore-12.3.0.lua
pytest-xdist/3.3.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
absl-py/2.1.0-GCCcore-12.3.0
ml_dtypes/0.3.2-gfbf-2023a
pytest/7.4.2-GCCcore-12.3.0
pytest-xdist/3.3.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
no other files in tarball
Feb 13 15:05:06 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/1) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default
P: perf: 4447.069 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-15445297.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
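
The build verdict in the report above is a list of "found / no message matching" checks against the job output. A rough sketch of that kind of check follows, using the patterns quoted in the build details. The function names, and the rule that every check must pass for a success, are assumptions for illustration, not the bot's actual logic.

```python
import re

# (pattern, must_be_present) pairs, taken from the build check list in the report.
CHECKS = [
    (r"FATAL:", False),
    (r"ERROR:", False),
    (r"FAILED:", False),
    (r"required modules missing:", False),
    (r"No missing installations", True),
    (r"\.tar\.gz created!", True),
]


def evaluate_log(log_text: str):
    """Return (pattern, passed) for each check against the job output text."""
    results = []
    for pattern, must_be_present in CHECKS:
        found = re.search(pattern, log_text) is not None
        results.append((pattern, found == must_be_present))
    return results


def build_succeeded(log_text: str) -> bool:
    """Assumed rule: the build counts as a success only if every check passes."""
    return all(passed for _, passed in evaluate_log(log_text))


# Made-up log excerpt that mirrors the failing checks reported above.
sample = "ERROR: build failed\n5 out of 86 required modules missing:\nfoo.tar.gz created!"
print(build_succeeded(sample))
```

With this sample, the `ERROR:`, `required modules missing:` and `No missing installations` checks fail, so the sketch reports a failed build even though a tarball was created.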

@laraPPr
Collaborator Author

laraPPr commented Feb 13, 2025

@trz42 @ocaisa this does not seem to be doing what we expect: pytest-xdist is being installed in the accelerator path:

/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/Python/3.11.3-GCCcore-12.3.0/bin/python -m pip install --prefix=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/pytest-xdist/3.3.1-GCCcore-12.3.0  --verbose  --no-deps  --ignore-installed  --no-index  --no-build-isolation  .
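
A quick way to check a prefix like the one in the command above is sketched below. It assumes the layout `.../accel/<vendor>/<capability>/software/<name>/<version>` seen in this thread and again treats `-CUDA-` in the version as the marker of a CUDA build; both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def accel_target(prefix: str):
    """Return '<vendor>/<capability>' if the prefix is under an accel/ subtree, else None."""
    parts = PurePosixPath(prefix).parts
    if "accel" in parts:
        i = parts.index("accel")
        if len(parts) > i + 2:  # vendor and capability must both follow 'accel'
            return "/".join(parts[i + 1:i + 3])
    return None


def is_misplaced(prefix: str) -> bool:
    """True for a non-CUDA build (assumed: no '-CUDA-' in version) installed under accel/."""
    version = PurePosixPath(prefix).name
    return accel_target(prefix) is not None and "-CUDA-" not in version


# Prefix copied from the pip command above.
prefix = (
    "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/"
    "accel/nvidia/cc80/software/pytest-xdist/3.3.1-GCCcore-12.3.0"
)
print(accel_target(prefix), is_misplaced(prefix))
```

For this prefix the sketch reports the accelerator target `nvidia/cc80` and flags the installation as misplaced.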

@laraPPr
Collaborator Author

laraPPr commented Feb 13, 2025

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

eessi-bot bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-casparvl

Updates by the bot instance eessi-bot-casparvl (click for details)
  • account laraPPr has NO permission to send commands to the bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from laraPPr

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

eessi-bot bot commented Feb 13, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_917/45927

date | job status | comment
Feb 13 13:43:54 UTC 2025 | submitted | job id 45927 awaits release by job manager
Feb 13 13:44:41 UTC 2025 | released | job awaits launch by Slurm scheduler
Feb 13 13:53:28 UTC 2025 | running | job 45927 is running
Feb 13 14:18:15 UTC 2025 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-45927.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1739455531.tar.gz
size: 6 MiB (6661749 bytes)
entries: 1191
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
absl-py/2.1.0-GCCcore-12.3.0.lua
ml_dtypes/0.3.2-gfbf-2023a.lua
pytest/7.4.2-GCCcore-12.3.0.lua
pytest-xdist/3.3.1-GCCcore-12.3.0.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
absl-py/2.1.0-GCCcore-12.3.0
ml_dtypes/0.3.2-gfbf-2023a
pytest/7.4.2-GCCcore-12.3.0
pytest-xdist/3.3.1-GCCcore-12.3.0
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
Feb 13 14:18:15 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-45927.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
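
The artefact listing in the report above could be screened automatically for module files that sit under the accelerator tree without being CUDA builds. The sketch below works on a plain list of tarball member names; the regular expression encodes the `accel/<vendor>/<capability>/modules/all/<name>/<version>.lua` layout seen in the listing, and the function name is hypothetical.

```python
import re

# Assumed layout: .../accel/<vendor>/<capability>/modules/all/<name>/<version>.lua
ACCEL_MODULE = re.compile(
    r"/accel/[^/]+/[^/]+/modules/all/(?P<name>[^/]+)/(?P<version>[^/]+)\.lua$"
)


def non_cuda_modules_in_accel(member_names):
    """List '<name>/<version>' for accel-path modules without '-CUDA-' in the version."""
    flagged = []
    for member in member_names:
        match = ACCEL_MODULE.search(member)
        if match and "-CUDA-" not in match["version"]:
            flagged.append(f"{match['name']}/{match['version']}")
    return flagged


# Module entries as listed in the artefact summary above.
base = "2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/"
members = [
    base + "absl-py/2.1.0-GCCcore-12.3.0.lua",
    base + "ml_dtypes/0.3.2-gfbf-2023a.lua",
    base + "pytest/7.4.2-GCCcore-12.3.0.lua",
    base + "pytest-xdist/3.3.1-GCCcore-12.3.0.lua",
]
print(non_cuda_modules_in_accel(members))
```

For a real tarball the member names could come from `tarfile.open(path).getnames()` in the standard library.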

@laraPPr
Collaborator Author

laraPPr commented Feb 13, 2025

It failed building jax with this error:

FAILED: Installation ended unsuccessfully (build directory: /tmp/vsc48506/easybuild/build/jax/0.4.25/gfbf-2023a-CUDA-12.1.1): build failed (first 300 chars): Failed to determine installation prefix for binutils (took 39 mins 48 secs

And as you can see in the artifacts, the non-CUDA builds were built in the accelerator path.
