
Great Lakes Cluster (UMich) #4869

Merged: 6 commits, Apr 25, 2024

Conversation

@ax3l (Member) commented Apr 17, 2024

Start documenting how to use the Great Lakes Cluster at University of Michigan.

User guide: https://arc.umich.edu/greatlakes/user-guide

Action Items

  • complete profile
  • add install dependency script for missing modules: C-Blosc2, ADIOS2, BLAS++, LAPACK++
  • complete and test job script template
  • document small A100 partition and larger CPU partition, too?

@ax3l ax3l added component: documentation Docs, readme and manual machine / system Machine or system-specific issue labels Apr 17, 2024
@ax3l ax3l force-pushed the doc-greatlakes-umich branch from 3a14f3c to bd7d7f7 Compare April 17, 2024 22:00
@archermarx (Contributor):
Oh great, I can help with this as I have a working install on this cluster.

@archermarx (Contributor):
Here's my profile file, which we can use as a start

# please set your project account
export proj=#####
# remember the location of this script
export MY_PROFILE=$(cd $(dirname $BASH_SOURCE) && pwd)"/"$(basename $BASH_SOURCE)
if [ -z "${proj-}" ]; then echo "WARNING: The 'proj' variable is not yet set in your $MY_PROFILE file! Please edit its line 2 to continue!"; return; fi

# required dependencies
module load cmake
module load gcc
module load openmpi
module load phdf5
module load git
module load cuda
module load python/3.10.4

# compiler environment hints
export CC="$(which gcc)"
export CXX="$(which g++)"
export CUDACXX="$(which nvcc)"
export CUDAHOSTCXX="$(which g++)"
export FC="$(which gfortran)"
export SRC_DIR=${HOME}/src

and here's my install script

# Load required modules
source ~/warpx.profile
module load python/3.10.4

# uninstall old versions
rm -rf build
rm -rf *.whl

# Build warpx
cmake -S . -B build \
        -DWarpX_LIB=ON \
        -DWarpX_APP=ON \
        -DWarpX_MPI=ON \
        -DWarpX_COMPUTE=CUDA \
        -DWarpX_DIMS="1;2;RZ;3" \
        -DWarpX_PYTHON=ON \
        -DWarpX_PRECISION=DOUBLE \
        -DWarpX_PARTICLE_PRECISION=SINGLE \
        -DGPUS_PER_SOCKET=4 \
        -DGPUS_PER_NODE=8

cmake --build build -j 8
cmake --build build --target pip_install -j 8

@ax3l (Member Author) commented Apr 18, 2024

Thank you @archermarx, that is great! I am working to get this documented mainline with Brendan Stassel ✨

Awesome, there is a parallel HDF5 module, phdf5 - I must have overlooked that today :) I will update with your recipe included.

@ax3l ax3l force-pushed the doc-greatlakes-umich branch from bd7d7f7 to 5498e2b Compare April 18, 2024 04:56
@ax3l ax3l changed the title [WIP] Great Lakes Cluster (UMich) Great Lakes Cluster (UMich) Apr 18, 2024
@archermarx (Contributor):

> Thank you @archermarx, that is great! I am working to get this documented mainline with Brendan Stassel ✨

Oh nice! I know Brendan

@bstassel:

Hey @archermarx! Thanks for sharing your profile and install script.

@bstassel commented Apr 18, 2024

Draft of the docs for testing :) https://warpx--4869.org.readthedocs.build/en/4869/install/hpc/greatlakes.html

@ax3l following the doc you linked, git clone https://github.com/ECP-WarpX/WarpX.git doesn't load a great-lakes folder in the machines dir.

@bstassel left a comment:

c-blosc currently fails because the script looks for a c-blosc2 dir, then only uses a c-blosc dir. Either works, but it needs to be consistent.

@ax3l (Member Author) commented Apr 18, 2024

> @ax3l following the doc you linked, git clone https://github.com/ECP-WarpX/WarpX.git doesn't load a great-lakes folder in the machines dir.

Oh yes, because this PR is not yet merged. You could do this in ~/src/warpx/:

cd ~/src/warpx

git remote add ax3l https://github.com/ax3l/WarpX.git
git fetch --all
git checkout -b doc-greatlakes-umich ax3l/doc-greatlakes-umich

@bstassel:

> > @ax3l following the doc you linked, git clone https://github.com/ECP-WarpX/WarpX.git doesn't load a great-lakes folder in the machines dir.
>
> Oh yes, because this PR is not yet merged. You could do this in ~/src/warpx/:
>
> cd ~/src/warpx
>
> git remote add ax3l https://github.com/ECP-WarpX/WarpX.git
> git checkout -b doc-greatlakes-umich ax3l/doc-greatlakes-umich

I can add the remote, but when I go to check out the branch it fails:

fatal: 'ax3l/doc-greatlakes-umich' is not a commit and a branch 'doc-greatlakes-umich' cannot be created from it

@ax3l (Member Author) commented Apr 20, 2024

Sorry, I also forgot to write:

git fetch --all

and I posted the wrong git URL. Editing the message above now.

@ax3l ax3l marked this pull request as ready for review April 22, 2024 22:05
@ax3l ax3l force-pushed the doc-greatlakes-umich branch from 40b071b to b80a6e0 Compare April 22, 2024 22:14
.. code-block:: bash

bash $HOME/src/warpx/Tools/machines/greatlakes-umich/install_v100_dependencies.sh
source ${HOME}/sw/greatlakes/v100/venvs/warpx-v100/bin/activate

Above, the guide informs the user to always source $HOME/greatlakes_v100_warpx.profile.

Is the activate line copied into greatlakes_v100_warpx.profile, or are we loading a different source here? If so, why?

@ax3l (Member Author) Apr 24, 2024:

This extra line is only needed once, when we set up the dependencies, so we can continue in the same terminal.

The reason for the extra line in this step is that we had already sourced the profile before the install script created the venv, so the venv was not yet activated.
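For reference, the one-time sequence that explanation describes can be sketched as follows (paths taken from the guide excerpt above):

```shell
# one-time dependency setup: source the profile first, run the installer,
# then activate the venv the installer just created in the same terminal
source $HOME/greatlakes_v100_warpx.profile
bash $HOME/src/warpx/Tools/machines/greatlakes-umich/install_v100_dependencies.sh
source ${HOME}/sw/greatlakes/v100/venvs/warpx-v100/bin/activate
```

On later logins, per the explanation above, the extra activation line is not needed again.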

@ax3l (Member Author) commented Apr 24, 2024

Status: We are still working on the job script, to ensure the GPU visibility is set correctly (one unique GPU per MPI rank, pinned to the closest CPU). @bstassel opened a support ticket for this.

@archermarx what job script template are you using for the V100 GPUs? Did you solve this already?

@archermarx (Contributor) commented Apr 24, 2024

@ax3l This job script appeared to work for me for running a 2-GPU job after some discussion with ARC-TS

#!/bin/bash
#SBATCH --job-name=#####
#SBATCH --account=#####
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-gpu=1
#SBATCH --gpus-per-task=v100:1
#SBATCH --gpu-bind=single:1
#SBATCH --mem=32000m
#SBATCH --time=00:05:00
#SBATCH --output=output.log
#SBATCH --mail-type=END,FAIL

# Load required modules
module load gcc hdf5 openmpi/4.1.6-cuda cmake git cuda python/3.10.4

srun python PICMI_inputs_2d.py

@ax3l (Member Author) commented Apr 24, 2024

Oh,

#SBATCH --gpu-bind=single:1

could be what I missed.

@bstassel commented Apr 24, 2024

> Oh,
>
> #SBATCH --gpu-bind=single:1
>
> could be what I missed.

I will give this a try and report the result.

EDIT: this resolved the error thrown by WarpX about the MPI mapping.

@bstassel commented Apr 24, 2024

For future reference, the complete documentation for SLURM is here: https://slurm.schedmd.com/srun.html#OPT_gres-flags

There is an interesting combination between --gpu-bind=single:<numtasks> and --gres-flags=allow-task-sharing that allows each task to see each GPU within the job allocation that is on the same node as the task, which allows for inter-GPU communication.
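A hedged sketch of what that combination could look like in a batch header (the node, task, and GPU counts here are placeholders, not the final greatlakes_v100.sbatch):

```shell
#!/bin/bash
# Hypothetical fragment: each task is bound to one GPU for compute,
# while allow-task-sharing keeps all node-local GPUs visible to each task,
# e.g. for peer-to-peer access within the node.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gpus-per-task=v100:1
#SBATCH --gpu-bind=single:1
#SBATCH --gres-flags=allow-task-sharing
```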

@bstassel commented Apr 25, 2024

I want to make sure I comprehend the SBATCH commands used in greatlakes_v100.sbatch.

#SBATCH -N 1                   -> request only 1 node
#SBATCH --exclusive            -> do not allow SLURM to share the node it gives me with other users
#SBATCH --ntasks-per-node=2    -> run only two processes on the 1 node we requested (I assume because there are 2x 2.4 GHz Intel Xeon Gold 6148 per node)
#SBATCH --cpus-per-task=20     -> allocate 20 CPUs per process (for a total of 40 CPUs on the node, the max for a node on the Great Lakes gpu partition)
#SBATCH --gpus-per-task=v100:1 -> give each process its own V100 GPU, for a total of 2 GPUs
#SBATCH --gpu-bind=single:1    -> 1 process is bound to 1 GPU

Do I have that right?

@bstassel:

I was comparing the output.txt files between the job that had --gpu-bind=single:1 and the one that didn't. It appears that forcing 1 GPU per 1 MPI rank gives worse performance? Maybe it is only because the simulation is so short that I don't see the benefits.

Without --gpu-bind:

STEP 100 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file diags/diag1000100
--- INFO    : Writing openPMD file diags/openPMDfw000100
--- INFO    : Writing openPMD file diags/openPMDbw000100
STEP 100 ends. TIME = 1.083064693e-14 DT = 1.083064693e-16
Evolve time = 2.07016534 s; This step = 0.346136823 s; Avg. per step = 0.0207016534 s


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ THE END ]
*
* No recorded warnings.
********************************************************************************

Total Time                     : 4.078974041


TinyProfiler total time across processes [min...avg...max]: 4.08 ... 4.088 ... 4.096

With --gpu-bind=single:1:

STEP 100 starts ...
--- INFO    : re-sorting particles
--- INFO    : Writing openPMD file diags/diag1000100
--- INFO    : Writing openPMD file diags/openPMDfw000100
--- INFO    : Writing openPMD file diags/openPMDbw000100
STEP 100 ends. TIME = 1.083064693e-14 DT = 1.083064693e-16
Evolve time = 8.21666101 s; This step = 1.493229092 s; Avg. per step = 0.0821666101 s


**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ THE END ]
*
* No recorded warnings.
********************************************************************************

Total Time                     : 16.89200922


TinyProfiler total time across processes [min...avg...max]: 16.89 ... 16.9 ... 16.91

@archermarx (Contributor) commented Apr 25, 2024 via email

@ax3l (Member Author) commented Apr 25, 2024

Awesome progress!

Starting to answer individual Qs from above.

#4869 (comment)

> There is an interesting combination between --gpu-bind=single:<numtasks> and --gres-flags=allow-task-sharing that allows each task to see each GPU within the job allocation that is on the same node as the task, which allows for inter-GPU communication.

The last part, not exactly: we can do inter-GPU communication with direct MPI.
We indeed want only one GPU visible per task (aka MPI rank), so they have a 1:1 relation. We use GPU-aware MPI to do direct GPU-to-GPU communication.

#4869 (comment)

> #SBATCH --ntasks-per-node=2 -> run only two processes on the 1 node we requested (I assume because there are 2x 2.4 GHz Intel Xeon Gold 6148 per node)

All correct. This in particular just says: we want two MPI processes per node. The reason is that we have 2 GPUs per node and want a 1:1 mapping. (We would do the same even if there were only one Intel Xeon Gold per node, because they are multi-core CPUs anyway.)

#4869 (comment)

> I was comparing the output.txt files between the job that had --gpu-bind=single:1 and the one that didn't. It appears that forcing 1 GPU per 1 MPI rank gives worse performance? Maybe it is only because the simulation is so short that I don't see the benefits.

That is totally ok.
The reason is likely that your simulation is too small (not too short). The domain decomposition probably cuts your simulation into such small pieces that every GPU is barely busy and mostly spends time talking to other GPUs.

Solution: use fewer GPUs or solve a bigger problem :)

General guidance for 16 GB V100 GPUs: try to have about 128^3 to 256^3 cells per GPU, as fits with your number of particles per cell.
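That rule of thumb can be turned into a quick back-of-the-envelope check (a hypothetical helper, not part of WarpX):

```python
# Quick check of the ~128^3 to 256^3 cells-per-GPU rule of thumb
# for 16 GB V100 GPUs (hypothetical helper, not part of WarpX).

def cells_per_gpu(nx, ny, nz, n_gpus):
    """Average number of cells each GPU works on."""
    return nx * ny * nz / n_gpus

def gpu_load_hint(nx, ny, nz, n_gpus, lo=128**3, hi=256**3):
    """Classify the per-GPU load against the rule of thumb."""
    c = cells_per_gpu(nx, ny, nz, n_gpus)
    if c < lo:
        return "underloaded: use fewer GPUs or a bigger problem"
    if c > hi:
        return "may exceed 16 GB: use more GPUs"
    return "in the sweet spot"

# Example: a 256^3 grid on 2 GPUs gives 8,388,608 cells per GPU,
# which falls inside [128^3, 256^3].
print(gpu_load_hint(256, 256, 256, 2))  # -> in the sweet spot
```

The exact upper bound also depends on the number of particles per cell, so treat the thresholds as rough defaults.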

Looks like the GPU bindings work! 🎉

@bstassel:

> Do you have AMReX GPU-aware MPI enabled?

Yeah, I think so.

# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"
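As context, a sketch of how such a variable is typically passed along on the run line (the executable and inputs file names here are hypothetical):

```shell
# GPU-aware MPI optimizations
GPU_AWARE_MPI="amrex.use_gpu_aware_mpi=1"

# hypothetical run line: the AMReX flag is appended to the inputs arguments
srun ./warpx.3d inputs_3d ${GPU_AWARE_MPI}
```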

ax3l added 6 commits April 24, 2024 20:16
  • Document how to use the Great Lakes Cluster at the University of Michigan.
  • New module added on the system.
  • Ensure one MPI rank sees exactly one, unique GPU.
@ax3l ax3l force-pushed the doc-greatlakes-umich branch from 3997ba7 to 3b0fbd6 Compare April 25, 2024 03:16
@ax3l ax3l requested review from archermarx and roelof-groenewald and removed request for archermarx April 25, 2024 03:27
@archermarx (Contributor) commented Apr 25, 2024 via email

@roelof-groenewald (Member) left a comment:

Looks good to me!

@ax3l (Member Author) commented Apr 25, 2024

@archermarx thanks a lot!!

I will preliminarily merge this to development, so the docs render and refresh and you can use the RTD page without having to switch branches:
https://warpx.readthedocs.io/en/latest/install/hpc/greatlakes.html

Please report back here if this worked - otherwise we can do a follow-up PR.

@ax3l ax3l merged commit ed7e824 into ECP-WarpX:development Apr 25, 2024
43 of 45 checks passed
@ax3l ax3l deleted the doc-greatlakes-umich branch April 25, 2024 06:48
#SBATCH --cpus-per-task=20
#SBATCH --gpus-per-task=v100:1
#SBATCH --gpu-bind=single:1
#SBATCH -o WarpX.o%j
@bstassel Apr 25, 2024:

Recommend explicitly putting

#SBATCH --mem=0

to signify that this request allocates all the memory on the node. This should happen dynamically, since --exclusive is set, but I think it is a good idea for users to include it so they have some reference to what is implicitly happening with the job request.

@ax3l (Member Author) Apr 29, 2024:
Wait, --mem=0 means all? o.0
I think --exclusive is a bit clearer for now / avoids duplication, unless it does not work to reserve all host memory.

Haavaan pushed a commit to Haavaan/WarpX that referenced this pull request Jun 5, 2024
* Great Lakes Cluster (UMich)

Document how to use the Great Lakes Cluster at the
University of Michigan.

* Fix c-blosc2 typos

* Parallel HDF5 for CUDA-aware MPI

New module added on the system.

* Fix GPU Visibility

Ensure one MPI rank sees exactly one, unique GPU.

* Add `#SBATCH --gpu-bind=single:1`

* Add clean-up message.