{2023.06}[foss/2023a] waLBerla 6.1 w/ CUDA 12.1.1 #780
base: 2023.06-software.eessi.io
Conversation
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen3 accel:nvidia/cc80
easystacks/software.eessi.io/2023.06/eessi-2023.06-eb-4.9.4-2023a.yml
@Neves-P The easyconfig PR is merged. Do we need to check or verify anything here, or is it ready to deploy?
@boegel, I think this is good to go. Looking into the tarball I see a directory. This contains a static library.
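The check described above (looking inside the build tarball for a leftover static library) can be sketched as a small script. This is an illustrative sketch, not the actual EESSI tooling; the function name and the `*.a` suffix filter are assumptions.

```python
import tarfile


def list_static_libs(tarball_path):
    """Return members of a tarball whose names look like static libraries (*.a).

    Illustrative helper: real EESSI deployment checks may look at headers and
    other files too; this only demonstrates the idea.
    """
    with tarfile.open(tarball_path) as tf:
        return [m.name for m in tf.getmembers() if m.name.endswith('.a')]
```

For example, running `list_static_libs('walberla-tarball.tar.gz')` on a tarball that still contains `pkg/lib/libfoo.a` would return that path, flagging it for review.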
bot: help
bot: build instance:eessi-bot-vsc-ugent repo:eessi.io-2023.06-software accel:nvidia/cc80
@boegel it seems that some nodes might be having some problems? @Neves-P I think this means the PR should be ready to deploy?
@laraPPr, just to clarify: the tests worked fine in this run (#780 (comment)) and were failing earlier due to some (I/O?) problem on the build cluster, is that right?
I think we've seen random hangs with waLBerla in previous builds, no? We did experience some intermittent trouble with internet access from our nodes; that may be related (if anything is being downloaded during the installation procedure).
The waLBerla build always went fine. It was the LAMMPS ReFrame test that got stuck for 30 minutes and then failed.
Since the build for
bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80
The tests fail with:
This seems expected, as indeed the ReFrame config file does not have the
@laraPPr What do you think about this?
We should not hold up this PR on this. @casparvl or I could add a ReFrame config file for the GPU partition, with a CPU config for ReFrame?
Let me see if we can also deploy this for zen4+H100 immediately... (and I'll look at the testing config)

bot: build instance:eessi-bot-surf repo:eessi.io-2023.06-software arch:zen4 accel:nvidia/cc90
Ok, with regards to this: yes, we can just add a GPU partition in the config. We could give it the CPU feature, but one could even argue there is no point in testing at all, as it wouldn't test this build anyway, so I guess we could also give it no features, and it would skip all tests. And by the way: since the conclusion is that we'd skip testing anyway, I agree, it shouldn't block this PR.

We should really have EESSI/eessi-bot-software-layer#294. The reason we missed this is that there is no explicit reference in the bot config as to which GPUs these nodes can build for. We will have to add all allowed GPU configs in the ReFrame config.

Question: what's our current policy on GPU builds? Which architectures do we target before we deploy something? Do we cross-compile at all? Do we agree to build for at least one GPU arch natively when deploying anything? Without wanting to hold back this PR, shouldn't we have these questions answered beforehand? Or do we have that somewhere, and I just don't know about it? :D
Hm, I don't really have time right now to fix it. Feel free to ignore it for now. @laraPPr If you want to take a stab at it: my idea would be to define a single partition dictionary, i.e. something like https://github.com/EESSI/bot-configs/blob/dad6d7434229ddbb6852a79c6fc19540d76b4b1e/mc-aws-rocky88-202310/reframe_config.py#L15 , change the
Haven't tested this, but this would make it pretty compact to list a lot of
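The idea above (one reusable partition dictionary, stamped out per GPU compute capability) could look roughly like the sketch below. This is a hypothetical, untested fragment in the spirit of a ReFrame site config: the system/partition names and the set of compute capabilities are made up, and real configs need scheduler/launcher/environment settings that match the cluster. Note that a partition with an empty `features` list would, as discussed, match no tests, so ReFrame would skip testing on it.

```python
# Illustrative sketch only: names ('eessi_bot_build', 'gpu_cc80', ...) and the
# list of compute capabilities are assumptions, not the real bot config.

def make_gpu_partition(name, accel_arch):
    """Build one partition dict; an empty 'features' list means no tests match."""
    return {
        'name': name,
        'scheduler': 'local',
        'launcher': 'mpirun',
        'environs': ['default'],
        'features': [],  # empty: ReFrame selects no tests for this partition
        'extras': {'accel': accel_arch},
    }


site_configuration = {
    'systems': [
        {
            'name': 'eessi_bot_build',
            'descr': 'EESSI bot build node (illustrative)',
            'hostnames': ['.*'],
            # Stamp out one partition per allowed GPU compute capability.
            'partitions': [
                make_gpu_partition('gpu_' + cc, 'nvidia/' + cc)
                for cc in ('cc80', 'cc90')
            ],
        },
    ],
}
```

This keeps the config compact: adding another allowed GPU arch is a one-element change to the tuple rather than a copy-pasted partition block.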
For #780 (comment): all went fine, it was killed due to the time limit in the cleanup stage of the test suite... I guess I should increase the standard walltime for the bot :) Though note that ReFrame didn't run any tests, since none of the software to be tested (e.g. OSU with GPU support, LAMMPS with GPU support) has been built for this CPU+GPU architecture yet.

Since this also will deploy cuDNN, let me check that the installation is stripped correctly...

Edit: that seems to be the case. All headers and static libraries were stripped. Can someone (@trz42 or @ocaisa maybe?) confirm that those are indeed (all) the files we expect to be stripped for cuDNN?

Edit2: ah, yes, from the
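The "is the installation stripped correctly" check above can be approximated with a short script that walks the install prefix and reports any headers or static libraries that survived stripping. This is a hedged sketch, not the actual EESSI check: the function name and the `*.h`/`*.a` suffix heuristic are assumptions.

```python
import os


def find_unstripped(prefix):
    """Walk an install prefix and collect files that look like headers (*.h)
    or static libraries (*.a), i.e. files we'd expect a stripped cuDNN
    installation not to contain. Illustrative heuristic only."""
    leftovers = []
    for root, _dirs, files in os.walk(prefix):
        for fname in files:
            if fname.endswith(('.a', '.h')):
                leftovers.append(os.path.join(root, fname))
    return sorted(leftovers)
```

An empty result would support the "all headers and static libraries were stripped" observation; anything returned would be a candidate for the confirmation asked for above.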
This PR adds the same version of waLBerla as installed previously, but with the updated foss/2023a toolchain with CUDA support.