{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780

Open
wants to merge 3 commits into
base: 2023.06-software.eessi.io

Neves-P
Member

@Neves-P Neves-P commented Oct 9, 2024

This PR adds the same version of waLBerla as installed previously, but built with the updated foss/2023a toolchain and with CUDA support.


eessi-bot bot commented Oct 9, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software


eessi-bot bot commented Oct 9, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

@Neves-P Neves-P added the 2023.06-software.eessi.io 2023.06 version of software.eessi.io label Oct 9, 2024
@Neves-P
Member Author

Neves-P commented Oct 9, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80


eessi-bot bot commented Oct 9, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from Neves-P

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Oct 9, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from Neves-P

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Oct 9, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_780/22293

date job status comment
Oct 09 11:35:25 UTC 2024 submitted job id 22293 awaits release by job manager
Oct 09 11:36:06 UTC 2024 released job awaits launch by Slurm scheduler
Oct 09 11:44:08 UTC 2024 running job 22293 is running
Oct 09 12:22:59 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-22293.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1728475687.tar.gz
size: 37 MiB (39133902 bytes)
entries: 8478
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
waLBerla/6.1-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/.lmod/lmodrc.lua
2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua
Oct 09 12:22:59 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case
Details
✅ job output file slurm-22293.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
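The build checklist in the report above amounts to scanning the job output for a handful of patterns. A minimal sketch of that idea, assuming the Slurm output is available as a string; `check_job_output` and the two pattern lists are illustrative names and not the bot's actual implementation:

```python
import re

# Patterns taken from the build checklist in the report above.
MUST_BE_ABSENT = [r"ERROR:", r"FAILED:", r"required modules missing:"]
MUST_BE_PRESENT = [r"No missing installations", r"\.tar\.gz created!"]


def check_job_output(text):
    """Return (ok, report) for a job output string (hypothetical helper)."""
    report = []
    ok = True
    for pattern in MUST_BE_ABSENT:
        found = re.search(pattern, text) is not None
        report.append((pattern, "absent", not found))
        ok = ok and not found
    for pattern in MUST_BE_PRESENT:
        found = re.search(pattern, text) is not None
        report.append((pattern, "present", found))
        ok = ok and found
    return ok, report
```

Each entry in `report` records the pattern, the expectation, and whether it was met, which mirrors the one-line-per-check layout of the bot comment.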

@Neves-P
Member Author

Neves-P commented Oct 9, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80


eessi-bot bot commented Oct 9, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from Neves-P

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Oct 9, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from Neves-P

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Oct 9, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_780/22295

date job status comment
Oct 09 12:33:20 UTC 2024 submitted job id 22295 awaits release by job manager
Oct 09 12:34:14 UTC 2024 released job awaits launch by Slurm scheduler
Oct 09 12:35:16 UTC 2024 running job 22295 is running
Oct 09 13:05:06 UTC 2024 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-22295.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1728478320.tar.gz
size: 37 MiB (39136783 bytes)
entries: 8481
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
waLBerla/6.1-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
2023.06/init/bash
2023.06/init/eessi_archdetect.sh
2023.06/init/eessi_environment_variables
2023.06/software/linux/x86_64/amd/zen3/.lmod/lmodrc.lua
2023.06/software/linux/x86_64/amd/zen3/.lmod/SitePackage.lua
Oct 09 13:05:06 UTC 2024 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 10/10 test case
Details
✅ job output file slurm-22295.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@boegel boegel changed the title from "{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA" to "{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1" Oct 11, 2024
@Neves-P Neves-P requested a review from boegel October 11, 2024 09:25
@boegel
Contributor

boegel commented Oct 11, 2024

@Neves-P easyconfig PR is merged, do we need to check/verify anything here, or is it ready to deploy?

@Neves-P
Member Author

Neves-P commented Oct 14, 2024

@boegel, I think this is good to go. Looking into the tarball I see a directory 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/waLBerla/6.1-foss-2023a-CUDA-12.1.1/build/src/cuda/.

This contains a static library libcuda.a (along with other things, mostly .cmake files and similar). This directory is not present in the non-CUDA waLBerla installation already in software.eessi.io. I take this to mean that the installation with CUDA did work, but I'm not sure how to confirm this in practice...
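One way to repeat this inspection without unpacking the tarball is to list the members under a given prefix. A minimal sketch using Python's standard `tarfile` module; `list_under` is a hypothetical helper name, and the presence of these files only shows that the CUDA parts were built, not that they run correctly on a GPU:

```python
import tarfile


def list_under(tarball_path, prefix):
    """Return the member names in a tarball that start with the given prefix."""
    # "r:*" lets tarfile detect the compression (gzip for the bot's artefacts).
    with tarfile.open(tarball_path, "r:*") as tar:
        return sorted(name for name in tar.getnames() if name.startswith(prefix))
```

For the artefact discussed here, the prefix would be the waLBerla installation directory under `accel/nvidia/cc80/software`, followed by `build/src/cuda/`.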

@laraPPr
Collaborator

laraPPr commented Feb 11, 2025

bot: help


eessi-bot bot commented Feb 11, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot
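The rules in this help text, together with the "expanded format" lines in the bot replies, can be sketched as a small parser. This is an illustration only: `parse_bot_commands` and `ALIASES` are hypothetical names, and only the three abbreviations visible in this thread are assumed.

```python
# Short-to-long argument names, as seen in the "expanded format" lines.
# Whether the bot knows further abbreviations is not known from this thread.
ALIASES = {"repo": "repository", "arch": "architecture", "accel": "accelerator"}


def parse_bot_commands(comment):
    """Extract and expand 'bot: COMMAND [ARGUMENTS]*' lines from a comment."""
    commands = []
    for line in comment.splitlines():
        if not line.startswith("bot:"):
            continue  # a command must begin at the start of a line
        words = line[len("bot:"):].split()
        if not words:
            continue
        expanded = [words[0]]
        for arg in words[1:]:
            key, sep, value = arg.partition(":")
            expanded.append(ALIASES.get(key, key) + sep + value)
        commands.append(" ".join(expanded))
    return commands
```

Because a comment may hold several commands, one per line, the function returns a list rather than a single command.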

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot


eessi-bot bot commented Feb 11, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 11, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command help from laraPPr

    • expanded format: help
  • handling command help resulted in:
    How to send commands to bot instances

    • Commands must be sent with a new comment (edits of existing comments are ignored).
    • A comment may contain multiple commands, one per line.
    • Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
    • Currently supported COMMANDs are: help, build, show_config, status

    For more information, see https://www.eessi.io/docs/bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-casparvl

Updates by the bot instance eessi-bot-casparvl (click for details)
  • account laraPPr has NO permission to send commands to the bot

@laraPPr
Collaborator

laraPPr commented Feb 11, 2025

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80


eessi-bot bot commented Feb 11, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Feb 11, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account laraPPr has NO permission to send commands to the bot

@eessi-bot-trz42

Updates by the bot instance trz42-GH200-jr (click for details)
  • account laraPPr has NO permission to send commands to the bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr

    • expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
  • handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 13, 2025

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.02/pr_780/15445267

date job status comment
Feb 13 11:37:32 UTC 2025 submitted job id 15445267 awaits release by job manager
Feb 13 11:37:37 UTC 2025 released job awaits launch by Slurm scheduler
Feb 13 11:39:41 UTC 2025 running job 15445267 is running
Feb 13 12:34:52 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-15445267.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen3-1739449084.tar.gz
size: 37 MiB (39156492 bytes)
entries: 8476
modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all
waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software
waLBerla/6.1-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80
no other files in tarball
Feb 13 12:34:52 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/1) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default
P: perf: 4373.624 timesteps/s (r:0, l:None, u:None)
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-15445267.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@laraPPr
Collaborator

laraPPr commented Feb 13, 2025

Ah, what I thought was an infinite loop is actually a 30-minute check of whether the ReFrame job has already finished. The output and error files are also completely empty, so it seems like nothing is happening until the timeout. And we did not see this until testing in this PR.

15444897 -> node3905.accelgor.os FAILED
15444939 -> node3901.accelgor.os FAILED
15443502 -> node3908.accelgor.os PASSED

@boegel it seems that some nodes might be having some problems?
For 15445267 on node3904 it now passed.

@Neves-P I think this means the PR should be ready to deploy?

@Neves-P
Member Author

Neves-P commented Feb 14, 2025

> Ah, what I thought was an infinite loop is actually a 30-minute check of whether the ReFrame job has already finished. The output and error files are also completely empty, so it seems like nothing is happening until the timeout. And we did not see this until testing in this PR.
>
> 15444897 -> node3905.accelgor.os FAILED
> 15444939 -> node3901.accelgor.os FAILED
> 15443502 -> node3908.accelgor.os PASSED
>
> @boegel it seems that some nodes might be having some problems? For 15445267 on node3904 it now passed.
>
> @Neves-P I think this means the PR should be ready to deploy?

@laraPPr, just to clarify, the tests worked fine in this run #780 (comment) and were failing due to some (I/O?) problem on the build cluster earlier, is that it?

@boegel
Contributor

boegel commented Feb 15, 2025

>> Ah, what I thought was an infinite loop is actually a 30-minute check of whether the ReFrame job has already finished. The output and error files are also completely empty, so it seems like nothing is happening until the timeout. And we did not see this until testing in this PR.
>>
>> 15444897 -> node3905.accelgor.os FAILED
>> 15444939 -> node3901.accelgor.os FAILED
>> 15443502 -> node3908.accelgor.os PASSED
>>
>> @boegel it seems that some nodes might be having some problems? For 15445267 on node3904 it now passed.
>> @Neves-P I think this means the PR should be ready to deploy?
>
> @laraPPr, just to clarify, the tests worked fine in this run #780 (comment) and were failing due to some (I/O?) problem on the build cluster earlier, is that it?

I think we've seen random hangs with waLBerla before in previous builds, no?

We did experience some intermittent trouble with internet access from our nodes, that may be related (if some stuff is being downloaded during the installation procedure).

@laraPPr
Collaborator

laraPPr commented Feb 15, 2025

The waLBerla build always went fine. It was the LAMMPS ReFrame test that got stuck in limbo for 30 minutes and then failed.

@boegel
Contributor

boegel commented Feb 17, 2025

Since the build for zen2 was done in Oct'24, I'll re-trigger that one:

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80


eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted

@eessi-bot-surf

Updates by the bot instance eessi-bot-surf (click for details)
  • account boegel has NO permission to send commands to the bot

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account boegel has NO permission to send commands to the bot

@riscv-eessi-io-bot

riscv-eessi-io-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-riscv (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Feb 17, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_780/46530

date job status comment
Feb 17 09:01:34 UTC 2025 submitted job id 46530 awaits release by job manager
Feb 17 09:02:18 UTC 2025 released job awaits launch by Slurm scheduler
Feb 17 09:09:22 UTC 2025 running job 46530 is running
Feb 17 09:39:18 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-46530.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1739784568.tar.gz
size: 37 MiB (39142146 bytes)
entries: 8476
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
waLBerla/6.1-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
no other files in tarball
Feb 17 09:39:18 UTC 2025 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-46530.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@Neves-P
Member Author

Neves-P commented Feb 17, 2025

The tests fail with:

ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'

This seems expected, as indeed the ReFrame config file does not have the x86_64_amd_zen2_nvidia_cc80. I'd say this can be deployed then?
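The error above boils down to a failed lookup of a `system:partition` name in the ReFrame site configuration. A minimal sketch of that lookup, assuming the usual ReFrame layout of a `systems` list whose entries carry a `partitions` list; `has_partition` and `example_config` are hypothetical and heavily stripped down compared to the real bot config:

```python
def has_partition(site_configuration, system_partition):
    """Check for a 'system:partition' entry in a ReFrame-style config dict."""
    system_name, _, partition_name = system_partition.partition(":")
    for system in site_configuration.get("systems", []):
        if system.get("name") != system_name:
            continue
        for partition in system.get("partitions", []):
            if partition.get("name") == partition_name:
                return True
    return False


# Hypothetical config with a single GPU partition; real configs have more keys.
example_config = {
    "systems": [
        {
            "name": "BotBuildTests",
            "partitions": [{"name": "x86_64_amd_zen3_nvidia_cc80"}],
        }
    ]
}
```

With this example, `has_partition(example_config, "BotBuildTests:x86_64_amd_zen2_nvidia_cc80")` is false, which corresponds to the situation reported in the error message.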

@boegel
Contributor

boegel commented Feb 17, 2025

> The tests fail with:
>
> ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'
>
> This seems expected, as indeed the ReFrame config file does not have the x86_64_amd_zen2_nvidia_cc80. I'd say this can be deployed then?

@laraPPr What do you think about this?

@laraPPr
Collaborator

laraPPr commented Feb 17, 2025

>> The tests fail with:
>>
>> ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'
>>
>> This seems expected, as indeed the ReFrame config file does not have the x86_64_amd_zen2_nvidia_cc80. I'd say this can be deployed then?
>
> @laraPPr What do you think about this?

We should not hold up this PR on this. @casparvl or I could add a ReFrame config for the GPU partition that uses a CPU config?

@casparvl
Collaborator

Let me see if we can also deploy this for zen4+H100 immediately... (and I'll look at the testing config)

bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90


eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted


eessi-bot bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account casparvl has NO permission to send commands to the bot

@eessi-bot-surf

eessi-bot-surf bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

@eessi-bot-toprichard

Updates by the bot instance rt-Grace-jr (click for details)
  • account casparvl has NO permission to send commands to the bot

@gpu-bot-ugent

gpu-bot-ugent bot commented Feb 17, 2025

Updates by the bot instance eessi-bot-vsc-ugent (click for details)
  • received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl

    • expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
  • handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:

    • no jobs were submitted

@eessi-bot-surf

eessi-bot-surf bot commented Feb 17, 2025

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.02/pr_780/10028769

date job status comment
Feb 17 16:01:14 UTC 2025 submitted job id 10028769 will be eligible to start in about 20 seconds
Feb 17 16:01:37 UTC 2025 received job awaits launch by Slurm scheduler
Feb 17 16:02:07 UTC 2025 running job 10028769 is running
Feb 17 17:02:17 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-10028769.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen4-1739810722.tar.gz
size: 747 MiB (783957070 bytes)
entries: 8601
modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1.lua
cuDNN/8.9.2.26-CUDA-12.1.1.lua
waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua
software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
LightGBM/4.5.0-foss-2023a-CUDA-12.1.1
cuDNN/8.9.2.26-CUDA-12.1.1
waLBerla/6.1-foss-2023a-CUDA-12.1.1
other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Feb 17 17:02:17 UTC 2025 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job10028769.test does not exist in job directory, or parsing it failed.

@casparvl
Collaborator

casparvl commented Feb 17, 2025

>>> The tests fail with:
>>>
>>> ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'
>>>
>>> This seems expected, as indeed the ReFrame config file does not have the x86_64_amd_zen2_nvidia_cc80. I'd say this can be deployed then?
>>
>> @laraPPr What do you think about this?
>
> We should not hold up this PR on this. @casparvl or I could add a ReFrame config for the GPU partition that uses a CPU config?

OK, with regard to this: yes, we can just add a GPU partition to the config. We could give it the CPU feature, but one could argue there is no point in testing at all, as it wouldn't test this build anyway. So I guess we could also give it no features, and it would skip all tests. And since the conclusion is that we'd skip testing anyway, I agree that it shouldn't block this PR.

We should really have EESSI/eessi-bot-software-layer#294. The reason we missed this is that there is no explicit reference in the bot config as to which GPUs these nodes can build for. We will have to add all allowed GPU configs to the ReFrame config.

Question: what's our current policy on GPU builds? Which architectures do we target, before we deploy something? Do we cross compile at all? Do we agree to build for at least one GPU arch natively when deploying anything?

Without wanting to hold back this PR, shouldn't we have these questions answered beforehand? Or do we have that somewhere, and I just don't know about it? :D

@casparvl
Collaborator

Hm, I don't really have time right now to fix it. Feel free to ignore it for now.

@laraPPr If you want to take a stab at it: my idea would be to define a single partition dictionary, i.e. something like https://github.com/EESSI/bot-configs/blob/dad6d7434229ddbb6852a79c6fc19540d76b4b1e/mc-aws-rocky88-202310/reframe_config.py#L15 , change the features, and store that as gpu_cross_compile or something. Then, we can generate dicts from that for each of the GPU archs, only changing the name. I think that can be done with:

gpu_cross_compile = {
    # <whatever_is_in_a_partition_dict>, with the features changed
}
gpu_arch_names = ['x86_64_amd_zen2_nvidia_cc80', 'x86_64_amd_zen3_nvidia_cc80']  # ... etc
# One partition dict per GPU arch, differing only in its name
gpu_cross_compile_partitions = [{**gpu_cross_compile, "name": arch_name} for arch_name in gpu_arch_names]

# And now the actual reframe config...
site_configuration = {
    # ...
    'systems': [
        {
            # ...
            'partitions': [
                # ... existing partitions
            ] + gpu_cross_compile_partitions,
        },
    ],
}

Haven't tested this, but this would make it pretty compact to list a lot of gpu_arch_names, which will be needed...
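A self-contained variant of the same idea that can be run as-is, to check what the dict unpacking produces. The template contents and the second architecture name are assumptions for illustration; a real partition dict needs the scheduler, launcher, and environment entries found in the linked bot config.

```python
# Hypothetical template; only the keys needed for the illustration are set.
gpu_cross_compile = {
    "name": "placeholder",
    "features": [],  # no features, so (per the discussion above) no tests are selected
}

# Illustrative names following the <cpu arch>_<gpu arch> pattern used in this thread.
gpu_arch_names = [
    "x86_64_amd_zen2_nvidia_cc80",
    "x86_64_amd_zen4_nvidia_cc90",
]

# One copy of the template per architecture, differing only in "name".
gpu_cross_compile_partitions = [
    {**gpu_cross_compile, "name": arch_name} for arch_name in gpu_arch_names
]
```

Note that `{**gpu_cross_compile, ...}` makes a shallow copy: nested values such as the `features` list are shared between all generated partitions, which is fine as long as they are only read.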

@casparvl
Collaborator

casparvl commented Feb 17, 2025

For #780 (comment)

All went fine; it was killed due to the time limit in the cleanup stage of the test suite... I guess I should increase the standard walltime for the bot :) Though note that ReFrame didn't run any tests, since none of the software to be tested (e.g. OSU with GPU support, LAMMPS with GPU support) has been built for this CPU+GPU architecture yet.

Since this also will deploy cuDNN, let me check that the installation is stripped correctly...

Edit: that seems to be the case. All headers and static libraries were stripped. Can someone (@trz42 or @ocaisa maybe?) confirm that those are indeed (all) the files we expect to be stripped for cuDNN?

Edit2: ah, yes, from the LICENSE file: "2. Distribution. The following portions of the SDK are distributable under the Agreement: the runtime files .so and .dll." Indeed, I only see the *.so files being included in the tarball.
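The manual check described here (no headers or static libraries left under cuDNN) could be scripted over a list of tarball member names. A rough sketch, assuming that `*.h` and `*.a` are the patterns that must be absent; the authoritative list of what gets stripped is defined elsewhere, and `unexpected_files` is a hypothetical helper:

```python
import fnmatch

# Assumed patterns for files that should have been stripped before distribution.
STRIPPED_PATTERNS = ["*.h", "*.a"]


def unexpected_files(member_names, prefix):
    """Return names under prefix that match a pattern expected to be stripped."""
    return sorted(
        name
        for name in member_names
        if name.startswith(prefix)
        and any(fnmatch.fnmatchcase(name, pat) for pat in STRIPPED_PATTERNS)
    )
```

An empty result for the cuDNN installation prefix would be consistent with the observation that only the `*.so` files remain.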

@laraPPr laraPPr added the ready-to-deploy Mark a PR as ready to deploy label Feb 18, 2025
Labels
2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia ready-to-deploy Mark a PR as ready to deploy
4 participants