Fredrik Jansson edited this page Apr 18, 2024 · 21 revisions

DALES setup on Fugaku

2024

Compiling

dev branch

git clone https://github.com/dalesteam/dales
cd dales
git checkout dev
git submodule init
git submodule update


. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load [email protected]%fj/mmdtg52
spack load fftw%fj
spack load [email protected]%gcc/wyds2me  # load cmake to avoid the fj cmake loaded by spack

# for some reason this is needed for netcdf-c to find hdf5 libraries
export LDFLAGS="-lhdf5_hl -lhdf5"

export SYST=FX-Fujitsu
mkdir build
cd build
cmake .. -DUSE_FFTW=True
make -j 8

For single precision, substitute the cmake command:

cmake .. -DUSE_FFTW=True -DFIELD_PRECISION=32 -DPOIS_PRECISION=32
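
The two configurations differ only in the precision flags, so a build script can select them with one switch. A minimal sketch: the function name dales_cmake_flags is hypothetical, and only the cmake flags shown above are used.

```shell
# Hypothetical helper: print the cmake flags for the default build
# or, when called with 32, for the single precision build.
dales_cmake_flags() {
    precision=${1:-64}
    flags="-DUSE_FFTW=True"
    if [ "$precision" = "32" ]; then
        flags="$flags -DFIELD_PRECISION=32 -DPOIS_PRECISION=32"
    fi
    printf '%s\n' "$flags"
}

dales_cmake_flags 32
```

In the build directory this could be used as cmake .. $(dales_cmake_flags 32).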

Job script

#!/bin/sh
#PJM -L "node=1"
#PJM -L "rscgrp=small"
#PJM -L "elapse=6:00:00"
#PJM --mpi max-proc-per-node=48
#PJM -x PJM_LLIO_GFSCACHE=/vol0004:/vol0005
#PJM -g hp240116
#PJM -s

# other PJM flags
# tuning the compute node file system caching behavior
# --llio localtmp-size=500Mi
# --llio cn-cache-size=1Gi # default = 128Mb
# --llio sio-read-cache=on

# do not create empty stdout/stderr files
export PLE_MPI_STD_EMPTYFILE=off

#load spack modules
. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load [email protected]%fj/mmdtg52
spack load fftw%fj

DALES=/path/to/dales
NAMOPTIONS=namoptions.001

# make sure every compute node has the executable in file-system cache
# especially recommended for large jobs
llio_transfer ${DALES}

# 1 node x 48 processes per node; must match nprocx*nprocy in the namelist
mpiexec -n 48 $DALES $NAMOPTIONS

7.6.2021, by Fredrik Jansson

Branch to4.4_fredrik contains a few fixes needed on Fugaku, in particular the compiler settings for cmake.

Quirks

The amount of RAM per core is rather small, ~600 MB. NetCDF4 seems to require a lot of memory; the namelist option lclassic = .true. switches to NetCDF3. It is probably also better to leave netCDF synchronization off to reduce disk I/O, i.e. don't set lsync = .true.
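
The two namelist settings can be applied with sed, in the same style as the run script further down. This is a sketch on a temporary demo file: the group name &NAMNETCDFSTATS and the lnetcdf line are assumptions about where these options live, and the sed commands only change keys that are already present in the file.

```shell
# Demo namelist in a temporary file; on Fugaku this would be the real namoptions file.
NAMOPTIONS=$(mktemp)
cat > "$NAMOPTIONS" <<'EOF'
&NAMNETCDFSTATS
lnetcdf = .true.
lsync = .true.
lclassic = .false.
/
EOF

# Switch to NetCDF3 output and turn off netCDF synchronization.
sed -i -e 's/lclassic.*=.*/lclassic = .true./' \
       -e 's/lsync.*=.*/lsync = .false./' "$NAMOPTIONS"

cat "$NAMOPTIONS"
```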

December 2021 Update

New spack, new compiler environment. Volumes must now be requested explicitly to be mounted on the compute nodes. We request /vol0004 because Spack is located there. The compiler is frtpx (FRT) 4.7.0 20211110.

Add this to the job script header to request mounting /vol0004 :

#PJM -x PJM_LLIO_GFSCACHE=/vol0004 

Before compiling DALES, and in the run script before launching DALES:

. /vol0004/apps/oss/spack-v0.17.0/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj  

September 2021 Update

With this update there is a working system-wide spack again. It uses the new tcsds-1.2.33 toolchain. The issues with MPI functions not accepting (..) arguments should now be solved, as should the MPI problems caused by mixing different tcsds versions.

. /vol0004/apps/oss/spack-v0.16.2/share/spack/setup-env.sh
spack load netcdf-fortran%fj /bubmb4i
spack load fftw%fj

Spack setup (one-time) (not necessary after Sept 2021 update)

On Fugaku the spack package system is used to manage modules and libraries. We need it for fftw, netcdf and perhaps HYPRE. Currently the system-wide spack installation points to an older language environment tcsds-1.2.29, which doesn't work well with DALES due to MPI problems. (/vol0004/apps/oss/spack/etc/spack/packages.yaml refers to lang/tcsds-1.2.29).

As a work-around, we can set up a private spack environment. See Fugaku manual. Follow the steps, with these modifications:

  • In step 3.2, replace tcsds-1.2.29 by tcsds-1.2.31 in $HOME/.spack/linux/compilers.yaml
  • In step 3.3, don't link to the public instance
  • In step 3.4, replace tcsds-1.2.29 by tcsds-1.2.31 in $HOME/.spack/linux/packages.yaml

Submit an interactive job and install the spack modules

pjsub --interact -L "node=1" -L "rscgrp=int" -L "elapse=4:00:00" --sparam wait-time=900 --mpi max-proc-per-node=48
. ~/spack/share/spack/setup-env.sh
spack install fftw openmp=True
spack install netcdf-fortran

Compiling DALES

There are two Fujitsu compilers: mpifrtpx (a cross compiler usable on the login node) and mpifrt (usable on the compute nodes). Currently the DALES CMakeLists.txt specifies the cross compiler, so these steps work on the login node.

. ~/spack/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj

# workaround for library errors in git after loading spack                                   
export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH

export SYST=FX-Fujitsu
export LDFLAGS="-lhdf5_hl -lhdf5"

mkdir build
cd build
cmake ../dales -DUSE_FFTW=True
make -j 4 2>&1 | tee compilation-log.txt

There is a compilation script for a local spack environment in the dales-tester repository: https://github.com/fjansson/dales-tester/blob/master/compile-fugaku-localspack.sh

Running

For better file system performance, one should access files through the layered file system LLIO. In practice, for any path that starts with /vol0004/... use /vol0004_cache/.... This only works on the compute nodes. DALES writes output files in the current directory, so the job script should cd to /vol0004_cache/... before launching DALES.
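
The path rewrite can be wrapped in a small function. A minimal sketch, assuming paths of the form /volNNNN/...; the name to_cache_path is hypothetical. Unlike a plain substitution on vol[0-9]*, it leaves a path that already points at the cache unchanged.

```shell
# Hypothetical helper: map /volNNNN/... to /volNNNN_cache/...
# A path that already starts with /volNNNN_cache/ is printed unchanged.
to_cache_path() {
    printf '%s\n' "$1" | sed -E 's#^/(vol[0-9]+)/#/\1_cache/#'
}

to_cache_path /vol0004/hp120279/u00892/runs
to_cache_path /vol0004_cache/hp120279/u00892/runs
```

Both calls print /vol0004_cache/hp120279/u00892/runs.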

Sample run script, where NX, NY, the number of nodes, and the git commit tag of the DALES binary can be specified on submission. If restart files exist in the job directory, they are used for a restart.

#!/bin/sh
#PJM -L "node=2"
#PJM -L "rscunit=rscunit_ft01"
#PJM -L "rscgrp=small"
#PJM -L "elapse=72:00:00"
#PJM --mpi max-proc-per-node=48
#PJM --llio cn-cache-size=1Gi # default = 128Mb
#PJM --llio sio-read-cache=on
#PJM -s

# submit as
# pjsub -x "TAG=c8cf1,NX=6,NY=8"  -L "node=1"  dales-fftw-fugaku.job

# defaults:                                                                                  
#NX=8  NY=12   TAG=78364  2 nodes

# do not create empty stdout/stderr files
export PLE_MPI_STD_EMPTYFILE=off

# load local spack environment
. ~/spack/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj

if [ -z "$TAG" ]
then
    TAG=78364
fi

if [ -z "$NX" ]
then
    NX=8
fi

if [ -z "$NY" ]
then
    NY=12
fi


SYST=FX-Fujitsu
NAMOPTIONS=namoptions.001

NTOT=$((NX*NY))

DALES=/vol0004_cache/your-home-directory/dales-tester/build-$TAG-$SYST/src/dales4
# note use full path here, starting with /vol0004_cache/

llio_transfer ${DALES} 
# distribute the binary to the compute nodes
# not required but might help performance when using many nodes
# the spack shared libraries are still accessed through /vol0004/...


WORK=`pwd -P | sed 's/vol[0-9]*/&_cache/'`
# get current directory, resolving symlinks. E.g. /vol0004/hp120279/u00892/runs/             
# use sed to edit volXXXX to volXXXX_cache                            
cd $WORK

# make symlinks for RRTMG
ln -s ../../rrtmg_lw.nc ./
ln -s ../../rrtmg_sw.nc ./
ln -s ../../backrad.inp.001.nc ./

# edit nprocx, nprocy in namelist                                                            
sed -i -r "s/nprocx.*=.*/nprocx = $NX/;s/nprocy.*=.*/nprocy = $NY/" $NAMOPTIONS

# do a restart if files for that exist                                                     
if [ -f "initdlatestmx000y000.001" ]
then
    # edit lwarmstart to true                                                                
    sed -i -r "s/lwarmstart.*=.*/lwarmstart =  .true./" $NAMOPTIONS
    # edit startfile                                                                         
    sed -i -r "s/startfile.*=.*/startfile = \"initdlatestmx000y000.001\"/" $NAMOPTIONS
fi


echo SYST $SYST
echo DALES $DALES
echo WORK $WORK
echo NTOT $NTOT
echo NX,NY $NX,$NY

mpiexec -n $NTOT $DALES $NAMOPTIONS
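
The node count given at submission has to cover NX*NY processes at 48 processes per node (the max-proc-per-node value in the header). A small sketch of that arithmetic; the function name nodes_needed is hypothetical.

```shell
# Hypothetical helper: number of nodes needed for an NX x NY decomposition,
# rounding up, at 48 processes per node unless a third argument is given.
nodes_needed() {
    nx=$1
    ny=$2
    per_node=${3:-48}
    echo $(( (nx * ny + per_node - 1) / per_node ))
}

nodes_needed 8 12   # defaults in the script: 96 processes
nodes_needed 6 8    # values from the pjsub example: 48 processes
```

This prints 2 and 1, matching the "node=2" default and the -L "node=1" override in the submit example.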

Postprocessing and merging NetCDF tiles

Fugaku has some post-processing nodes with x86 CPUs and more memory than the compute nodes, see https://www.fugaku.r-ccs.riken.jp/doc_root/en/user_guides/pps-slurm-1.1/ These nodes use the Slurm queue system.

merge_grids.py from https://github.com/CloudResolvingClimateModeling/dalesview can be used on the post-processing nodes.

Sample job script.

#!/bin/bash                                                                                  
#SBATCH -p ppmq     # Specify a queue
#SBATCH -N 1        # Specify the number of allocated nodes
#SBATCH -J merge    # Specify job name

# merge script for the post-processing queue

. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load py-pip%gcc

# Setup, on login node:
# spack load py-pip%gcc
# pip install --user netCDF4

for d in Run_76 Run_77 Run_81 ; do
    pushd runs/$d
    /usr/bin/time --format='(%e s   %M kB)' ~/dalesview/merge_grids.py -j 1 --cross &
    /usr/bin/time --format='(%e s   %M kB)' ~/dalesview/merge_grids.py -j 1 --fielddump &
    wait
    popd
done

Postprocessing (computing cloudmetrics) using python and conda

It is easiest to bypass spack entirely here, since it doesn't provide conda versions matching all the systems you'll encounter. On login nodes and postprocessing nodes we need the x86_64 version, on regular compute nodes the aarch64 version. One way to have everything is to install both:

# On login node
mkdir ~/miniconda3
cd ~/miniconda3/
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh
bash Miniconda3-py37_4.9.2-Linux-x86_64.sh -u
source ~/.bashrc
conda create -n cloudmetenv python=3.7
conda activate cloudmetenv
conda install numpy scipy matplotlib netcdf4 pandas pytables pywavelets scikit-image scikit-learn seaborn spyder tqdm

# For compute node
mkdir ~/miniconda3-aarch64
cd ~/miniconda3-aarch64/
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh

To install on a compute node, run an interactive job:

pjsub --interact -L "node=1" -L "rscgrp=int" -L "elapse=4:00:00" --sparam wait-time=900 --mpi max-proc-per-node=48
bash Miniconda3-latest-Linux-aarch64.sh -u
source ~/.bashrc

conda create -n cloudmetenv python=3.7
conda activate cloudmetenv
conda install numpy scipy matplotlib netcdf4 pandas pytables pywavelets scikit-image scikit-learn seaborn tqdm

It is possible that conda install will report (phantom) merge conflicts for some of these packages. While I (Martin) haven't figured out exactly what causes this, I've managed to get around it by installing the offending packages from conda-forge, i.e. using:

conda install -c conda-forge netcdf4

Once the environment is set up, this now requires turning off the automatic activation of the base conda environment, as it will fail on a login node that's not aarch64:

conda config --set auto_activate_base false

The most recently installed version is the one that conda commands point to by default. In this case, if you're on an x86_64 node and would like to run conda, you have to manually source it from the right location:

source ~/miniconda3/bin/activate

From here on, conda should work normally.
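
Choosing the installation by hand can be replaced by a check of the machine architecture. A sketch under assumptions: the x86_64 prefix ~/miniconda3 is the one used above, while ~/miniconda3-aarch64 as the aarch64 install prefix is an assumption (adjust it to wherever the installer actually put it). The function name conda_prefix_for is hypothetical.

```shell
# Hypothetical helper: print the conda prefix for a given architecture name.
conda_prefix_for() {
    case "$1" in
        x86_64)  printf '%s\n' "$HOME/miniconda3" ;;
        aarch64) printf '%s\n' "$HOME/miniconda3-aarch64" ;;  # assumed install prefix
        *)       echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Show which activate script would be sourced on this machine.
if prefix=$(conda_prefix_for "$(uname -m)"); then
    echo "would source $prefix/bin/activate"
fi
```

In a login script, the echo would be replaced by sourcing $prefix/bin/activate.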

Analysis with jupyter notebook

The graphical forwarding you can get with ssh -Y is too slow to be practical, which rules out a graphical IDE such as Spyder. Instead, you can run interactive Python in Jupyter notebooks, which can be forwarded to your local computer. To do so, activate your conda environment and run:

login2$ source <path_where_conda_is_installed>/miniconda3/bin/activate # Activate the correct conda, depending on if you are on login/compute node
(base) login2$ conda activate cloudmetenv # activate the right environment
(cloudmetenv) login2$ conda install -c anaconda jupyter # installs jupyter and dependencies

The last line is obviously only necessary the first time you run this. Now fire up a notebook (with port argument indicating which port you'll connect to it with from your local machine):

(cloudmetenv) login2$ jupyter notebook --no-browser --port=8080

And, to get to this, run, on your local machine in a new terminal:

(base) MacBook-Pro-van-Martin:~ martinjanssens$ ssh -N -L localhost:8080:localhost:8080 <fugaku_user>@login2.fugaku.r-ccs.riken.jp

Here, make sure you connect to the same login node that you ran the notebook server in, and that the second port in the above command is the same as that you've opened the server on. Finally, you can access the notebook (make sure you have jupyter installed locally) by navigating to

localhost:8080

in a browser on the local machine. You may have to enter an access token to do so, which is printed to your Fugaku terminal when the notebook server starts.
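
Since the login node and the port have to match on both sides, it can help to generate the tunnel command from them. A minimal sketch that only prints the command; the function name tunnel_cmd and the user name myuser are placeholders.

```shell
# Hypothetical helper: print the ssh tunnel command for a user, login node and port.
# The same port is used locally and on the login node, as in the example above.
tunnel_cmd() {
    user=$1
    node=$2
    port=${3:-8080}
    printf 'ssh -N -L localhost:%s:localhost:%s %s@%s.fugaku.r-ccs.riken.jp\n' \
        "$port" "$port" "$user" "$node"
}

tunnel_cmd myuser login2 8080
```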

Merging with CDO

Merging with cdo is much faster than with the Python script. From version 2.0.4 (released 14 Feb 2022) onward, cdo has improved support for non-geographic grids like the ones we have in DALES. Improvements: no longer need to specify NX or list the tiles in the right order, and horizontal coordinate variables (xm, xt, ym, yt) are preserved. Previously these variables were lost.

Installing cdo with spack on compute node

(todo: how to get version 2.0.4? wait for new spack or install from source?)

Set up a local spack as above.

. ~/spack/share/spack/setup-env.sh
spack install cdo

Installation was tested on a compute node, where it took 3.5 h, so make the interactive job long enough for this to finish. (Jan 2022: installing cdo with spack on the login node does not work, see below for installing from source.)

Install cdo from source on login node

. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load /eh42puo # [email protected]%[email protected] arch=linux-rhel8-cascadelake
spack load /mutkzkd # [email protected]%[email protected] arch=linux-rhel8-skylake_avx512

mkdir src
cd src
wget https://code.mpimet.mpg.de/attachments/download/26761/cdo-2.0.4.tar.gz
tar -xzf cdo-2.0.4.tar.gz
cd cdo-2.0.4

./configure --prefix=${HOME}/OSS/cdo-x86 --with-netcdf=yes CC="gcc" CXX="g++" F77=gfortran
make
# make check
# FAIL: tsformat.test 8 - chaining set 1 with netCDF4
make install

The one test failure is probably due to HDF5 library lacking thread support.

Use cdo

. ~/spack/share/spack/setup-env.sh
spack load cdo
cdo

CDO requires that the time dimension has units in the form seconds since 2020-01-02T00:00:00. DALES by default only gives the unit s, and then CDO outputs nonsense time values. A simple fix is to add xyear = 2020 in the &DOMAIN namelist - if both xyear and xday are present a proper time unit is written in the netCDF. (There seems to be an off-by-one bug in the date output. Day-of-year starts at 1 for the first day of the year.)

NX=`ls surf_xy.x*y000.001.nc | wc -l`  # find number of tiles in X direction.
cdo -f nc4 -z zip_6 -r -O collgrid,$NX `ls fielddump.*.001.nc | sort -t . -k 3` 3d.nc

cdo -f nc4 -z zip_6 -r -O collgrid,$NX `ls cape.*.nc | sort -t y -k 2` merged-cape.nc

# can specify a single variable to merge:
cdo -f nc4 -z zip_6 -r -O collgrid,$NX,thlxy `ls crossxy.0001.*.nc | sort -t y -k 3` merged-crossxy-thl.nc

The files should be ordered so that consecutive tiles are adjacent in X; sort takes care of this. cdo doesn't like (output) files where the variables use different grids, e.g. mixing velocity and temperature. A work-around is to output these variables to different files. -z zip_6 controls compression.
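
The effect of the sort key can be checked without any netCDF files. A sketch on made-up tile names, assuming the naming pattern fielddump.XXX.YYY.001.nc with XXX and YYY the tile indices in X and Y; LC_ALL=C is set here only to make the ordering independent of the locale.

```shell
export LC_ALL=C   # make the sort order independent of the locale

# Four hypothetical tile names of a 2 x 2 decomposition, in scrambled order.
tiles="fielddump.001.001.001.nc
fielddump.000.001.001.nc
fielddump.001.000.001.nc
fielddump.000.000.001.nc"

# Sort on the third dot-separated field (the Y index) onward;
# ties are broken by comparing the whole line, which orders by X.
sorted=$(printf '%s\n' "$tiles" | sort -t . -k 3)
printf '%s\n' "$sorted"
```

The output lists both X tiles of the row y = 000 first, then both X tiles of the row y = 001, which is the order collgrid expects.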

To do

  • copy the files to the node's local storage before merging. Use /tmp/ ?
  • how many merge jobs can be run at once?
  • avoid saving fields we don't need. CAPE contains a lot, we mainly want LWP, TWP, RWP, cloud-top-height.

Profiling & Tuning

There are two profilers, FIPP and FAPP. FAPP is more advanced. See the profiling section in the manual and the profiler manual (PDF).

FIPP example

# run the program to profile with fipp:
fipp -m 128000 -C -d profiling_data -Icall,cpupa,mpi mpiexec -n $NTOT $DALES $NAMOPTIONS
# -m sets the amount of memory to reserve for the measurements

# analyse
fipppx -A -pall -d profiling_data/ > fipp-output.txt

One can isolate a region of the program for measurement by inserting calls to start/stop functions. If that is not done, the whole program is profiled. No special compiler flags are needed for profiling. I have not managed to get MPI measurements from FIPP.

Compiler optimization messages

Add the flag -Koptmsg=2. Output is quite verbose.

Automatic OpenMP parallelization

Add the flag -Kfast,parallel to the compiler. Reduce the number of processes per node, e.g. #PJM --mpi max-proc-per-node=8 in the job script. Set the number of threads per process in the job script:

export PARALLEL=6                          # Specify the number of threads
export OMP_NUM_THREADS=${PARALLEL}         # Specify the number of threads

This seems to work, but is not faster than flat MPI.
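
The thread count follows from the processes per node: 8 processes with 6 threads each fill 48 cores. A sketch of that calculation, assuming 48 cores per node (the max-proc-per-node value used for flat MPI above); the function name threads_per_process is hypothetical.

```shell
# Hypothetical helper: threads per MPI process so that processes x threads
# fills the node, assuming 48 cores per node unless a second argument is given.
threads_per_process() {
    procs_per_node=$1
    cores_per_node=${2:-48}
    echo $(( cores_per_node / procs_per_node ))
}

PARALLEL=$(threads_per_process 8)
export PARALLEL
export OMP_NUM_THREADS=$PARALLEL
echo "threads per process: $PARALLEL"
```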

Debugging in parallel

GDB seems to work. On an interactive node, the --gdbx option of mpirun can be used with a gdb commands file:

mpirun -n 1 --gdbx gdb_cmds ../../../build_debug/src/dales4.3 ../namoptions.002

where the file gdb_cmds contains:

run
bt
quit
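
The commands file can also be written from the job script itself. A minimal sketch that creates it in a temporary directory (the location is an arbitrary choice for illustration) with the three commands listed above.

```shell
# Write the gdb commands file with a here-document.
workdir=$(mktemp -d)
cat > "$workdir/gdb_cmds" <<'EOF'
run
bt
quit
EOF

cat "$workdir/gdb_cmds"
```

The path $workdir/gdb_cmds is then passed to --gdbx in place of gdb_cmds.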