{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780

Neves-P · 2024-10-09T11:22:35Z

This PR adds the same version of waLBerla as previously installed, but built with the updated foss/2023a toolchain and with CUDA support.

eessi-bot · 2024-10-09T11:22:38Z

Instance eessi-bot-mc-aws is configured to build for:

architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

eessi-bot · 2024-10-09T11:22:39Z

Instance eessi-bot-mc-azure is configured to build for:

architectures: x86_64/amd/zen4
repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

Neves-P · 2024-10-09T11:35:18Z

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot · 2024-10-09T11:35:21Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from Neves-P
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- submitted job 22293, for details & status see {2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780 (comment)

eessi-bot · 2024-10-09T11:35:21Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from Neves-P
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

eessi-bot · 2024-10-09T11:35:26Z

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_780/22293

date	job status	comment
Oct 09 11:35:25 UTC 2024	submitted	job id `22293` awaits release by job manager
Oct 09 11:36:06 UTC 2024	released	job awaits launch by Slurm scheduler
Oct 09 11:44:08 UTC 2024	running	job `22293` is running
Oct 09 12:22:59 UTC 2024	finished	😁 SUCCESS (click triangle for details) Details ✅ job output file `slurm-22293.out` ✅ no message matching `ERROR:` ✅ no message matching `FAILED:` ✅ no message matching `required modules missing:` ✅ found message(s) matching `No missing installations` ✅ found message matching `.tar.gz created!` Artefacts `eessi-2023.06-software-linux-x86_64-amd-zen2-1728475687.tar.gz` size: 37 MiB (39133902 bytes) entries: 8478 modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all `waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua` software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software `waLBerla/6.1-foss-2023a-CUDA-12.1.1` other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80 `2023.06/software/linux/x86_64/amd/zen2/.lmod/lmodrc.lua` `2023.06/software/linux/x86_64/amd/zen2/.lmod/SitePackage.lua`
Oct 09 12:22:59 UTC 2024	test result	😁 SUCCESS (click triangle for details) ReFrame Summary [ PASSED ] Ran 10/10 test case Details ✅ job output file `slurm-22293.out` ✅ no message matching `ERROR:` ✅ no message matching `\[\s*FAILED\s*\].*Ran .* test case`

Neves-P · 2024-10-09T12:33:13Z

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80

eessi-bot · 2024-10-09T12:33:16Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from Neves-P
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:
- submitted job 22295, for details & status see {2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780 (comment)

eessi-bot · 2024-10-09T12:33:17Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80 from Neves-P
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen3 accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

eessi-bot · 2024-10-09T12:33:21Z

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_780/22295

date	job status	comment
Oct 09 12:33:20 UTC 2024	submitted	job id `22295` awaits release by job manager
Oct 09 12:34:14 UTC 2024	released	job awaits launch by Slurm scheduler
Oct 09 12:35:16 UTC 2024	running	job `22295` is running
Oct 09 13:05:06 UTC 2024	finished	😁 SUCCESS (click triangle for details) Details ✅ job output file `slurm-22295.out` ✅ no message matching `ERROR:` ✅ no message matching `FAILED:` ✅ no message matching `required modules missing:` ✅ found message(s) matching `No missing installations` ✅ found message matching `.tar.gz created!` Artefacts `eessi-2023.06-software-linux-x86_64-amd-zen3-1728478320.tar.gz` size: 37 MiB (39136783 bytes) entries: 8481 modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all `waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua` software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software `waLBerla/6.1-foss-2023a-CUDA-12.1.1` other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80 `2023.06/init/bash` `2023.06/init/eessi_archdetect.sh` `2023.06/init/eessi_environment_variables` `2023.06/software/linux/x86_64/amd/zen3/.lmod/lmodrc.lua` `2023.06/software/linux/x86_64/amd/zen3/.lmod/SitePackage.lua`
Oct 09 13:05:06 UTC 2024	test result	😁 SUCCESS (click triangle for details) ReFrame Summary [ PASSED ] Ran 10/10 test case Details ✅ job output file `slurm-22295.out` ✅ no message matching `ERROR:` ✅ no message matching `\[\s*FAILED\s*\].*Ran .* test case`

easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.4-2023a.yml

boegel · 2024-10-11T18:05:15Z

@Neves-P easyconfig PR is merged, do we need to check/verify anything here, or is it ready to deploy?

Neves-P · 2024-10-14T08:59:47Z

@boegel, I think this is good to go. Looking into the tarball I see a directory 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software/waLBerla/6.1-foss-2023a-CUDA-12.1.1/build/src/cuda/.

This contains a static library libcuda.a (along with other things, mostly .cmake files and similar). This directory is not present on the non-CUDA waLBerla already in software.eessi.io. I take this to mean that the installation with CUDA did work, but I'm not sure how to confirm this in practice...

laraPPr · 2025-02-11T11:27:58Z

bot: help

eessi-bot · 2025-02-11T11:28:00Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot
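The command syntax above, together with the short-to-long argument expansion visible in the bot replies in this thread (`repo` to `repository`, `arch` to `architecture`, `accel` to `accelerator`), can be sketched as a small parser. This is only an illustration and not the bot's actual implementation; the function name `parse_bot_commands` and the `ALIASES` table are hypothetical.

```python
import re

# Short argument names seen in this thread and their expanded forms.
# Hypothetical table for illustration; the real bot may support more.
ALIASES = {"repo": "repository", "arch": "architecture", "accel": "accelerator"}

# A command begins at the start of a line: "bot: COMMAND [ARGUMENTS]*"
COMMAND_RE = re.compile(r"^bot:\s*(\S+)(.*)$")


def parse_bot_commands(comment: str) -> list[str]:
    """Return the expanded form of every bot command found in a comment."""
    expanded = []
    for line in comment.splitlines():
        match = COMMAND_RE.match(line)
        if not match:
            continue  # not a command line
        command, rest = match.group(1), match.group(2).split()
        args = []
        for arg in rest:
            key, sep, value = arg.partition(":")
            # Expand known short keys; leave everything else untouched
            args.append(f"{ALIASES.get(key, key)}{sep}{value}")
        expanded.append(" ".join([command] + args))
    return expanded


print(parse_bot_commands(
    "bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80"
))
```

Lines that do not start with `bot:` are ignored, which mirrors the rule that every command begins at the start of a line.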

riscv-eessi-io-bot · 2025-02-11T11:28:00Z

Updates by the bot instance eessi-bot-riscv (click for details)

account laraPPr has NO permission to send commands to the bot

eessi-bot · 2025-02-11T11:28:00Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot

gpu-bot-ugent · 2025-02-11T11:28:00Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command help from laraPPr
- expanded format: help
handling command help resulted in:
How to send commands to bot instances
- Commands must be sent with a new comment (edits of existing comments are ignored).
- A comment may contain multiple commands, one per line.
- Every command begins at the start of a line and has the syntax bot: COMMAND [ARGUMENTS]*
- Currently supported COMMANDs are: help, build, show_config, status
For more information, see https://www.eessi.io/docs/bot

eessi-bot-trz42 · 2025-02-11T11:28:01Z

Updates by the bot instance trz42-GH200-jr (click for details)

account laraPPr has NO permission to send commands to the bot

eessi-bot-casparvl · 2025-02-11T11:28:01Z

Updates by the bot instance eessi-bot-casparvl (click for details)

account laraPPr has NO permission to send commands to the bot

laraPPr · 2025-02-11T11:30:34Z

bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80

eessi-bot · 2025-02-11T11:30:37Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr
- expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

eessi-bot · 2025-02-11T11:30:37Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr
- expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

riscv-eessi-io-bot · 2025-02-11T11:30:38Z

Updates by the bot instance eessi-bot-riscv (click for details)

account laraPPr has NO permission to send commands to the bot

eessi-bot-trz42 · 2025-02-13T11:37:27Z

Updates by the bot instance trz42-GH200-jr (click for details)

account laraPPr has NO permission to send commands to the bot

gpu-bot-ugent · 2025-02-13T11:37:28Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80 from laraPPr
- expanded format: build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80
handling command build instance:eessi-bot-vsc-ugent repository:eessi.io-2023.06-software accelerator:nvidia/cc80 resulted in:
- submitted job 15445267, for details & status see {2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780 (comment)

gpu-bot-ugent · 2025-02-13T11:37:33Z

New job on instance eessi-bot-vsc-ugent for CPU micro-architecture x86_64-amd-zen3 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /scratch/gent/vo/002/gvo00211/SHARED/jobs/2025.02/pr_780/15445267

date	job status	comment
Feb 13 11:37:32 UTC 2025	submitted	job id `15445267` awaits release by job manager
Feb 13 11:37:37 UTC 2025	released	job awaits launch by Slurm scheduler
Feb 13 11:39:41 UTC 2025	running	job `15445267` is running
Feb 13 12:34:52 UTC 2025	finished	😁 SUCCESS (click triangle for details) Details ✅ job output file `slurm-15445267.out` ✅ no message matching `FATAL:` ✅ no message matching `ERROR:` ✅ no message matching `FAILED:` ✅ no message matching `required modules missing:` ✅ found message(s) matching `No missing installations` ✅ found message matching `.tar.gz created!` Artefacts `eessi-2023.06-software-linux-x86_64-amd-zen3-1739449084.tar.gz` size: 37 MiB (39156492 bytes) entries: 8476 modules under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/modules/all `waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua` software under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80/software `waLBerla/6.1-foss-2023a-CUDA-12.1.1` other under 2023.06/software/linux/x86_64/amd/zen3/accel/nvidia/cc80 no other files in tarball
Feb 13 12:34:52 UTC 2025	test result	😁 SUCCESS (click triangle for details) ReFrame Summary [ OK ] (1/1) EESSI_LAMMPS_lj %device_type=gpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1 %scale=1_4_node /497af4b1 @BotBuildTests:x86_64_amd_zen3_nvidia_cc80+default P: perf: 4373.624 timesteps/s (r:0, l:None, u:None) [ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted) Details ✅ job output file `slurm-15445267.out` ✅ no message matching `ERROR:` ✅ no message matching `\[\s*FAILED\s*\].*Ran .* test case`

laraPPr · 2025-02-13T12:38:06Z

Ah, what I thought was an infinite loop is in fact a 30-minute check of whether the ReFrame job has already finished. The output and error files are also completely empty, so it seems like nothing is happening until the timeout. And we did not see this until testing in this PR.

- 15444897 -> node3905.accelgor.os FAILED
- 15444939 -> node3901.accelgor.os FAILED
- 15443502 -> node3908.accelgor.os PASSED

@boegel it seems that some nodes might be having problems?
For 15445267 on node3904 it has now passed.

@Neves-P I think this means the PR should be ready to deploy?

Neves-P · 2025-02-14T13:02:09Z

@laraPPr, just to clarify: the tests worked fine in this run (#780 (comment)) and were failing earlier due to some (I/O?) problem on the build cluster, is that it?

boegel · 2025-02-15T10:06:10Z

I think we've seen random hangs with waLBerla before in previous builds, no?

We did experience some intermittent trouble with internet access from our nodes, that may be related (if some stuff is being downloaded during the installation procedure).

laraPPr · 2025-02-15T10:34:11Z

The waLBerla build always went fine. It was the LAMMPS ReFrame test that got stuck in limbo for 30 minutes and then failed.

boegel · 2025-02-17T09:01:25Z

Since the build for zen2 was done in Oct'24, I'll re-trigger that one:

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

eessi-bot · 2025-02-17T09:01:30Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- submitted job 46530, for details & status see {2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780 (comment)

eessi-bot · 2025-02-17T09:01:30Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

gpu-bot-ugent · 2025-02-17T09:01:30Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

eessi-bot-surf · 2025-02-17T09:01:31Z

Updates by the bot instance eessi-bot-surf (click for details)

account boegel has NO permission to send commands to the bot

eessi-bot-toprichard · 2025-02-17T09:01:31Z

Updates by the bot instance rt-Grace-jr (click for details)

account boegel has NO permission to send commands to the bot

riscv-eessi-io-bot · 2025-02-17T09:01:31Z

Updates by the bot instance eessi-bot-riscv (click for details)

received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from boegel
- expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:
- no jobs were submitted

eessi-bot · 2025-02-17T09:01:35Z

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.02/pr_780/46530

date	job status	comment
Feb 17 09:01:34 UTC 2025	submitted	job id `46530` awaits release by job manager
Feb 17 09:02:18 UTC 2025	released	job awaits launch by Slurm scheduler
Feb 17 09:09:22 UTC 2025	running	job `46530` is running
Feb 17 09:39:18 UTC 2025	finished	😁 SUCCESS (click triangle for details) Details ✅ job output file `slurm-46530.out` ✅ no message matching `FATAL:` ✅ no message matching `ERROR:` ✅ no message matching `FAILED:` ✅ no message matching `required modules missing:` ✅ found message(s) matching `No missing installations` ✅ found message matching `.tar.gz created!` Artefacts `eessi-2023.06-software-linux-x86_64-amd-zen2-1739784568.tar.gz` size: 37 MiB (39142146 bytes) entries: 8476 modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all `waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua` software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software `waLBerla/6.1-foss-2023a-CUDA-12.1.1` other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80 no other files in tarball
Feb 17 09:39:18 UTC 2025	test result	😢 FAILURE (click triangle for details) Reason EESSI test suite was not run, test step itself failed to execute. Details ✅ job output file `slurm-46530.out` ❌ found message matching `ERROR:` ✅ no message matching `\[\s*FAILED\s*\].*Ran .* test case`

Neves-P · 2025-02-17T12:50:05Z

The tests fail with:

ERROR: failed to load configuration: could not find a configuration entry for the requested system/partition combination: 'BotBuildTests:x86_64_amd_zen2_nvidia_cc80'

This seems expected, as the ReFrame config file indeed does not have an x86_64_amd_zen2_nvidia_cc80 partition. I'd say this can be deployed then?
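To illustrate why ReFrame reports this error, here is a minimal sketch of looking up a `system:partition` name in a ReFrame-style `site_configuration` dictionary. The configuration contents and the helper `has_partition` are hypothetical and only mimic the general layout of a ReFrame config; this is not ReFrame's own lookup code, and it is not the real bot configuration.

```python
# Hypothetical, trimmed-down ReFrame-style configuration for illustration.
site_configuration = {
    "systems": [
        {
            "name": "BotBuildTests",
            "partitions": [
                {"name": "x86_64_amd_zen2"},
                {"name": "x86_64_amd_zen3_nvidia_cc80"},
            ],
        }
    ]
}


def has_partition(config: dict, system_partition: str) -> bool:
    """Check whether 'system:partition' is defined in the config."""
    system_name, _, partition_name = system_partition.partition(":")
    for system in config.get("systems", []):
        if system.get("name") != system_name:
            continue
        names = {p.get("name") for p in system.get("partitions", [])}
        if partition_name in names:
            return True
    return False


for requested in (
    "BotBuildTests:x86_64_amd_zen2_nvidia_cc80",
    "BotBuildTests:x86_64_amd_zen3_nvidia_cc80",
):
    print(requested, has_partition(site_configuration, requested))
```

With the made-up configuration above, the zen2 + cc80 combination is reported as missing, which corresponds to the situation described by the error message.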

boegel · 2025-02-17T13:05:58Z

@laraPPr What do you think about this?

laraPPr · 2025-02-17T15:52:42Z

We should not hold up this PR on this. @casparvl or I could add a ReFrame config entry for the GPU partition that uses a CPU config?

casparvl · 2025-02-17T16:01:06Z

Let me see if we can also deploy this for zen4+H100 immediately... (and I'll look at the testing config)

bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90

eessi-bot · 2025-02-17T16:01:10Z

Updates by the bot instance eessi-bot-mc-aws (click for details)

received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl
- expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:
- no jobs were submitted

eessi-bot · 2025-02-17T16:01:10Z

Updates by the bot instance eessi-bot-mc-azure (click for details)

received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl
- expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:
- no jobs were submitted

riscv-eessi-io-bot · 2025-02-17T16:01:11Z

Updates by the bot instance eessi-bot-riscv (click for details)

account casparvl has NO permission to send commands to the bot

eessi-bot-surf · 2025-02-17T16:01:11Z

Updates by the bot instance eessi-bot-surf (click for details)

received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl
- expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:
- submitted job 10028769, for details & status see {2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780 (comment)

eessi-bot-toprichard · 2025-02-17T16:01:11Z

Updates by the bot instance rt-Grace-jr (click for details)

account casparvl has NO permission to send commands to the bot

gpu-bot-ugent · 2025-02-17T16:01:11Z

Updates by the bot instance eessi-bot-vsc-ugent (click for details)

received bot command build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90 from casparvl
- expanded format: build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90
handling command build instance:eessi-bot-surf repository:eessi.io-2023.06-software architecture:zen4 accelerator:nvidia/cc90 resulted in:
- no jobs were submitted

eessi-bot-surf · 2025-02-17T16:01:16Z

New job on instance eessi-bot-surf for CPU micro-architecture x86_64-amd-zen4 and accelerator nvidia/cc90 for repository eessi.io-2023.06-software in job dir /projects/eessibot/eessi-bot-surf/jobs/2025.02/pr_780/10028769

date	job status	comment
Feb 17 16:01:14 UTC 2025	submitted	job id `10028769` will be eligible to start in about 20 seconds
Feb 17 16:01:37 UTC 2025	received	job awaits launch by Slurm scheduler
Feb 17 16:02:07 UTC 2025	running	job `10028769` is running
Feb 17 17:02:17 UTC 2025	finished	😁 SUCCESS (click triangle for details) Details ✅ job output file `slurm-10028769.out` ✅ no message matching `FATAL:` ✅ no message matching `ERROR:` ✅ no message matching `FAILED:` ✅ no message matching `required modules missing:` ✅ found message(s) matching `No missing installations` ✅ found message matching `.tar.gz created!` Artefacts `eessi-2023.06-software-linux-x86_64-amd-zen4-1739810722.tar.gz` size: 747 MiB (783957070 bytes) entries: 8601 modules under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all `LightGBM/4.5.0-foss-2023a-CUDA-12.1.1.lua` `cuDNN/8.9.2.26-CUDA-12.1.1.lua` `waLBerla/6.1-foss-2023a-CUDA-12.1.1.lua` software under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software `LightGBM/4.5.0-foss-2023a-CUDA-12.1.1` `cuDNN/8.9.2.26-CUDA-12.1.1` `waLBerla/6.1-foss-2023a-CUDA-12.1.1` other under 2023.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90 no other files in tarball
Feb 17 17:02:17 UTC 2025	test result	🤷 UNKNOWN (click triangle for detailed information) Job test file `_bot_job10028769.test` does not exist in job directory, or parsing it failed.

casparvl · 2025-02-17T16:07:25Z

Ok, with regards to the missing ReFrame config entry: yes, we can just add a GPU partition in the config. We could give it the CPU feature, but one could even argue there is no point in testing at all, as it wouldn't test this build anyway. So I guess we could also give it no features, and it would skip all tests. And by the way: since the conclusion is that we'd skip testing anyway, I agree, it shouldn't block this PR.

We should really have EESSI/eessi-bot-software-layer#294 . The reason we missed this is because there is no explicit reference in the bot config as to which GPUs these nodes can build for. We will have to add all allowed GPU configs in the ReFrame config.

Question: what's our current policy on GPU builds? Which architectures do we target, before we deploy something? Do we cross compile at all? Do we agree to build for at least one GPU arch natively when deploying anything?

Without wanting to hold back this PR, shouldn't we have these questions answered beforehand? Or do we have that somewhere, and I just don't know about it? :D

casparvl · 2025-02-17T16:21:07Z

Hm, I don't really have time right now to fix it. Feel free to ignore it for now.

@laraPPr If you want to take a stab at it: my idea would be to define a single partition dictionary, i.e. something like https://github.com/EESSI/bot-configs/blob/dad6d7434229ddbb6852a79c6fc19540d76b4b1e/mc-aws-rocky88-202310/reframe_config.py#L15 , change the features, and store that as gpu_cross_compile or something. Then, we can generate dicts from that for each of the GPU archs, only changing the name. I think that can be done with:

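# Template partition: copy the contents of an existing partition dict and adjust 'features'
gpu_cross_compile = {
    # <whatever is in a partition dict>
}
gpu_arch_names = ['x86_64_amd_zen2_nvidia_cc80', 'x86_64_amd_zen3_nvidia_cc80']  # ... etc
# One partition dict per GPU arch, differing only in 'name'
gpu_cross_compile_partitions = [{**gpu_cross_compile, 'name': arch_name} for arch_name in gpu_arch_names]

# And now the actual ReFrame config...
site_configuration = {
    # ...
    'systems': [
        {
            # ...
            'partitions': [
                # ... existing partitions
            ] + gpu_cross_compile_partitions,
        },
    ],
}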

Haven't tested this, but this would make it pretty compact to list a lot of gpu_arch_names, which will be needed...
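A self-contained version of the idea above might look like the following sketch. The field values in the template partition (`scheduler`, `launcher`, `environs`) and the system name are placeholders chosen for illustration, not values from the real bot configuration, and the second architecture name is an assumed example.

```python
import copy

# Hypothetical template partition; the field values are placeholders and are
# not taken from the real bot configuration.
gpu_cross_compile = {
    "name": "template",
    "scheduler": "local",
    "launcher": "mpirun",
    "environs": ["default"],
    "features": [],  # no features, as discussed above for skipping tests
}

# Example architecture names; extend this list as more GPU targets are added.
gpu_arch_names = [
    "x86_64_amd_zen2_nvidia_cc80",
    "x86_64_amd_zen3_nvidia_cc80",
]

# deepcopy so that the generated partitions do not share nested lists
gpu_cross_compile_partitions = [
    {**copy.deepcopy(gpu_cross_compile), "name": arch_name}
    for arch_name in gpu_arch_names
]

site_configuration = {
    "systems": [
        {
            "name": "BotBuildTests",
            "partitions": [] + gpu_cross_compile_partitions,
        }
    ]
}

for partition in site_configuration["systems"][0]["partitions"]:
    print(partition["name"], partition["features"])
```

The `copy.deepcopy` call is a design choice of this sketch: a plain `{**gpu_cross_compile}` is a shallow copy, so all generated partitions would otherwise share the same `features` and `environs` list objects.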

casparvl · 2025-02-17T17:18:00Z

For #780 (comment)

All went fine, it was killed due to the time limit in the cleanup stage of the test suite... I guess I should increase the standard walltime for the bot :) Though note that ReFrame didn't run any tests, since none of the software to be tested (e.g. OSU with GPU support, LAMMPS with GPU support) has been built for this CPU+GPU architecture yet.

Since this also will deploy cuDNN, let me check that the installation is stripped correctly...

Edit: that seems to be the case. All headers and static libraries were stripped. Can someone (@trz42 or @ocaisa maybe?) confirm that those are indeed (all) the files we expect to be stripped for cuDNN?

Edit2: ah, yes, from the LICENSE file: "2. Distribution. The following portions of the SDK are distributable under the Agreement: the runtime files .so and .dll." Indeed, I only see the *.so files being included in the tarball.

Add waLBerla 6.1 w/ CUDA

053c989

Neves-P added the 2023.06-software.eessi.io label (2023.06 version of software.eessi.io) on Oct 9, 2024

Neves-P added the accel:nvidia label Oct 9, 2024

boegel requested changes Oct 11, 2024

easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.4-2023a.yml (outdated)

boegel changed the title from "{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA" to "{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1" on Oct 11, 2024

Move waLBerla w/ CUDA install to accel/nvidia easystack directory

69e1ba5

Neves-P requested a review from boegel October 11, 2024 09:25

laraPPr added the ready-to-deploy label (Mark a PR as ready to deploy) on Feb 18, 2025
