Building madgraph4gpu and measuring throughput

Andrea Valassi edited this page Apr 8, 2022 · 11 revisions

Building madgraph4gpu and measuring throughput (epochX/ggttgg)

This page describes how to build the madgraph4gpu code and measure its throughput in terms of matrix elements (MEs) per second. It is meant to address issue #249.

For the moment, this is a single page focusing on the latest implementation for CUDA and vectorized C++, for the ggttgg physics process (epochX/cudacpp/ggttgg). Other physics processes are also available and can be built and tested in an analogous way (eemumu and ggtt, in epochX/cudacpp/ee_mumu and epochX/cudacpp/gg_tt.auto).

This page replaces a previous version of this wiki describing the older epoch1 code for eemumu (epoch1/cuda/eemumu), which is now obsolete but is still available here.

Eventually, alternative implementations to CUDA/C++, based on Kokkos, Alpaka and SYCL, may also be described.

For the impatient: jump to the example which puts it all together.

For the very impatient: jump to the section explaining how to use the throughputX.sh script to get detailed performance comparisons.

Download the code

In a chosen directory (here: /data/valassi) and using your preferred authentication mechanism (here: https) download the latest master

  cd /data/valassi
  git clone https://github.com/madgraph5/madgraph4gpu.git
  cd madgraph4gpu
  git checkout master

If you want to be sure that you are using the latest stable version of the epochX/cudacpp code, use the following commit:

   git reset --hard 26d40755be840a55ef2e357392492546375ee34a

For convenience, the download directory will be referred to as MADGRAPH4GPU_HOME in the following (but this environment variable is not used anywhere inside the code or Makefiles).

  export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu

Understand the code structure

In production releases of the MadGraph5_aMC@NLO software, the physics code (Fortran, C++, CUDA...) to calculate a given physics process is automatically generated by a Python code generator.

Our developments in the madgraph4gpu project are based on an iterative process, where fixes and new features are added by modifying an existing auto-generated CUDA/C++ code base, and must then be back-ported to the Python code generator.

Previous versions of our code (namely in epoch1 and epoch2) were based on a one-off code generation, followed by many additions to the existing CUDA/C++. A new epoch was started when the features and fixes were back-ported to the Python code generator, maintained in a repository external to the project.

The new epochX developments follow a different approach, where the Python code generator is also included in the madgraph4gpu repository, and any new fixes and features in CUDA/C++ may be back-ported immediately to the code generator. For the two main physics processes we currently use for development (eemumu and ggttgg), both the manually developed and the auto-generated code are included in the repository. The iterative development process is the following:

  • start from an auto-generated code and from an identical copy in the manually developed directory
  • add fixes and features to the manually developed directory
  • back-port them to the code generator and regenerate the auto-generated directory
  • modify the code generator until the auto and manual directories are identical again
  • iterate by adding new fixes and features to the manual directory

More details are available in issue #244, which describes how the current epochX structure was achieved from the previous epoch1 and epoch2.

The latest version of the code is in the epochX/cudacpp directory, which has the following contents:

> \ls -1F $MADGRAPH4GPU_HOME/epochX/cudacpp 
CODEGEN/
ee_mumu/
ee_mumu.auto/
gg_tt.auto/
gg_ttgg/
gg_ttgg.auto/
tput/

In particular:

  • CODEGEN contains the Python code generator (as a "plugin" for an official MadGraph5_aMC@NLO software release)
  • ee_mumu and ee_mumu.auto contain the manually developed and auto-generated code for the eemumu physics process
  • gg_ttgg and gg_ttgg.auto contain the manually developed and auto-generated code for the ggttgg physics process
  • gg_tt.auto contains the auto-generated code for the ggtt physics process (where we do no manual developments)
  • tput contains a collection of scripts and logfiles for performance measurements

In the "steady state", typically after a major pull request:

  • the auto-generated code is the one produced by the generator in the repository
  • the manual code is identical to the auto-generated code
  • the throughput logs are those obtained with the latest auto and manual codes in the repository

Code generation itself is not described in detail in this wiki page. All *.auto directories have been created using the CODEGEN/generateAndCompare.sh script. Internally, the MG5aMC command that is used to build the standalone CUDA/C++ code is

output standalone_cudacpp <directory>

Set the runtime environment for the build (compilers, ccache etc)

To build and run the code you need

  • O/S installation using the tsc clocksource (baseline is CentOS8), see issue #116 for details
  • C++ compiler and runtime libraries (baseline is gcc10.2): the CXX environment variable must be set
  • optionally, CUDA compiler and runtime libraries (baseline is nvcc 11.1): CUDA_HOME must be set (or nvcc must be in PATH); if CUDA_HOME points to an invalid path, a C++-only build is performed (using C++ random numbers instead of curand)
  • optionally, set up ccache

In addition (if you use the custom profiling scripts such as throughputX.sh):

  • optionally, set up the perf profiling tool
  • optionally, set up the Nvidia nsight profiling tools
  • optionally, set up python 3.8 or later
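The requirements above can be checked before starting a build. The following is only a minimal sketch: the helper names `report_clocksource` and `report_cxx` are hypothetical (they are not part of madgraph4gpu), and the sysfs file used to read the clocksource is the standard Linux location, assumed to be present on your system.

```shell
# Hypothetical pre-flight helpers (not part of madgraph4gpu).

# Report the current clocksource. The sysfs file can be overridden
# through the first argument, which is mainly useful for testing.
report_clocksource() {
  local f="${1:-/sys/devices/system/clocksource/clocksource0/current_clocksource}"
  if [ ! -r "$f" ]; then
    echo "clocksource: unknown"
  elif [ "$(cat "$f")" = "tsc" ]; then
    echo "clocksource: tsc (ok)"
  else
    echo "clocksource: $(cat "$f") (not tsc)"
  fi
}

# Report whether CXX is set and whether the compiler it names can be found.
report_cxx() {
  if [ -z "${CXX:-}" ]; then
    echo "CXX: not set"
  elif command -v "$CXX" >/dev/null 2>&1; then
    echo "CXX: $CXX"
  else
    echo "CXX: $CXX (not found)"
  fi
}

report_clocksource
report_cxx
```

Both helpers only print a one-line report and never abort, so they can be sourced safely at the top of a setup script.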

C++ compiler

The following C++ compilers are supported

  • gcc9 or later
  • clang10 or later (see issue #172)
  • icx 202110 or later (icc is no longer supported because it has no support for compiler vector extensions, see issue #220)

At CERN, the baseline configuration with gcc10.2 is set up using

  . /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/setup.sh

The line above sets up all relevant runtime libraries and also sets the CXX environment variable:

  echo $CXX
  /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++

CUDA

To enable CUDA builds, you must set CUDA_HOME, or alternatively have nvcc in your PATH. For instance:

  export PATH=/usr/local/cuda-11.1/bin:${PATH}

Or (better):

  export CUDA_HOME=/usr/local/cuda-11.1
  export PATH=${CUDA_HOME}/bin:${PATH}

Note that a CUDA runtime library (the CURAND random number library) is used not only in the GPU/CUDA application, but also in the CPU/C++ application (in the former case, the device version is used and random numbers are generated on the GPU, while in the latter case the host version is used and random numbers are generated on the CPU). This is meant to ensure that the same physics results (average matrix elements) are obtained in both cases, as the same random number seed is always used.

If nvcc is not in PATH, or if it is but CUDA_HOME is set to an invalid path, then no CUDA runtime libraries are used and the C++ application uses an alternative implementation of random numbers based on the C++ standard library. This changes physics results slightly but has no impact on performance studies of the throughput of the matrix element calculation alone.
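The rule described above can be sketched as a small shell function that predicts which kind of build to expect. This is an illustration of the rule as stated on this page, not the logic of the actual Makefiles: the helper name `cuda_build_mode` is hypothetical, and it assumes that a CUDA_HOME is "valid" when it contains an executable bin/nvcc.

```shell
# Hypothetical helper (not part of madgraph4gpu): prints "cuda" or "cpp-only".
# Assumption: CUDA_HOME is considered valid if ${CUDA_HOME}/bin/nvcc is executable.
cuda_build_mode() {
  if [ -n "${CUDA_HOME:-}" ]; then
    # CUDA_HOME is set: it takes precedence over whatever is in PATH
    if [ -x "${CUDA_HOME}/bin/nvcc" ]; then
      echo "cuda"
    else
      echo "cpp-only"
    fi
  elif command -v nvcc >/dev/null 2>&1; then
    # CUDA_HOME is not set, but nvcc is found in PATH
    echo "cuda"
  else
    echo "cpp-only"
  fi
}

echo "Expected build mode: $(cuda_build_mode)"
```

Setting CUDA_HOME to a non-existing directory is therefore a simple way to force a C++-only build on a machine where nvcc is installed.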

ccache

To use ccache

  • you must have ccache in your PATH
  • you must set USECCACHE=1 to tell madgraph4gpu to use ccache
  • optionally, you may set CCACHE_DIR to your preferred ccache directory

At CERN, you may use

  export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH

Build the code

Go to the appropriate P1_Sigma subdirectory for the chosen epoch and process. The build is done here.

  cd $MADGRAPH4GPU_HOME
  cd epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/

The following make variables (which can be set also via environment variables) control how the build is performed

  • AVX=[none|sse4|avx2|512y|512z]
    • AVX=none: disable C++ vectorization
    • AVX=sse4: enable C++ vectorization with SSE4.2 (128 bit registers, i.e. 2 doubles or 4 floats per vector)
    • AVX=avx2: enable C++ vectorization with AVX2 (256 bit registers, i.e. 4 doubles or 8 floats per vector)
    • AVX=512y (default): enable C++ vectorization with AVX512, limited to 256 bit ymm vector instructions (i.e. 4 doubles or 8 floats per vector)
    • AVX=512z: enable C++ vectorization with AVX512, including 512 bit zmm vector instructions (i.e. 8 doubles or 16 floats per vector)
  • FPTYPE=[d|f]
    • FPTYPE=d (default): use double precision floating-point variables (double)
    • FPTYPE=f: use single precision floating-point variables (float)
  • HELINL=[0|1]
    • HELINL=0 (default): do not use aggressive inlining
    • HELINL=1: use aggressive inlining (emulate LTO optimizations)
  • USEBUILDDIR=[0|1]
    • USEBUILDDIR=0 (default): place binaries (.o, .exe etc) in the P1_Sigma directory itself; if you attempt to recompile using different AVX, FPTYPE or HELINL settings, you will get an error
    • USEBUILDDIR=1 (recommended): place binaries (.o, .exe etc) in a subdirectory of P1_Sigma directory specific to the chosen AVX, FPTYPE or HELINL settings; you may perform several builds in parallel for different AVX, FPTYPE or HELINL settings using different build directories

For detailed performance comparisons, USEBUILDDIR=1 is recommended to allow simultaneous builds with different FPTYPE's (see PR #213). You can use make cleanall to remove all build subdirectories.
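As a sketch of how USEBUILDDIR=1 keeps different configurations apart, the snippet below prints one make command per AVX mode together with the build directory it is expected to fill. The helper name `builddir_name` is hypothetical, and the directory naming pattern is inferred from the directory names that appear on this page (for example build.512y_d_inl0), not taken from the Makefile.

```shell
# Hypothetical helper: build directory name for a given configuration.
# The pattern build.<AVX>_<FPTYPE>_inl<HELINL> is inferred from the
# directory names shown on this page.
builddir_name() {
  local avx="$1" fptype="${2:-d}" helinl="${3:-0}"
  echo "build.${avx}_${fptype}_inl${helinl}"
}

# Dry run: print (without executing) one make command per AVX mode.
# Run the printed commands by hand, in the P1_Sigma directory, to build.
for avx in none sse4 avx2 512y 512z; do
  echo "USEBUILDDIR=1 make AVX=${avx} FPTYPE=d HELINL=0  ->  $(builddir_name "$avx")"
done
```

Because each configuration ends up in its own directory, the five builds do not overwrite each other and can be compared afterwards.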

The AVX settings refer to Intel CPUs, but the code builds and runs with C++ vectorizations on AMD CPUs too (see PR #238).
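The vector sizes quoted in the AVX list above follow from simple arithmetic: the register width in bits divided by the size of one floating-point value (64 bits for FPTYPE=d, 32 bits for FPTYPE=f). A minimal sketch, where `simd_lanes` is a hypothetical helper name:

```shell
# Hypothetical helper: number of floating-point values per SIMD vector.
# Usage: simd_lanes <register width in bits> <d|f>
simd_lanes() {
  local bits="$1" fptype="$2" size
  case "$fptype" in
    d) size=64 ;;   # double precision
    f) size=32 ;;   # single precision
    *) echo "unknown FPTYPE: $fptype" >&2; return 1 ;;
  esac
  echo $(( bits / size ))
}

simd_lanes 128 d   # AVX=sse4, doubles
simd_lanes 256 d   # AVX=avx2 and AVX=512y, doubles
simd_lanes 512 f   # AVX=512z, floats
```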

Aggressive inlining was found to give large speedups for the eemumu process (almost x4 with no vectorization, and almost x2 with the best vectorization, see issue #229), for reasons that are not yet well understood. However, for the more complex and more relevant ggttgg process, it does not seem useful, and it is currently disabled by default.

Running the standalone executable

Two standalone executables are presently built in parallel in each build:

  • the C++ executable check.exe (where the matrix element calculation is performed using vectorized C++ on the CPU)
  • the CUDA executable gcheck.exe (where the matrix element calculation is performed using CUDA on the GPU)

Both executables accept the same command line arguments, which were originally designed for CUDA but were kept for C++ too. The baseline performance tests for the ggttgg process use the following arguments:

  • For GPU/CUDA tests:
    • gcheck.exe -p 2048 256 1
  • For CPU/C++ tests:
    • check.exe -p 64 256 1
    • gcheck.exe -p 64 256 1 (as a cross-check in GPU/CUDA)

The first choice of parameters computes 524k matrix elements, using a GPU grid of 2048 blocks and 256 threads per block, over a single iteration of a full grid. This essentially achieves the top throughput we have observed on a V100. Going to even larger grids gains at most a few percent in performance.

The second choice of parameters only computes 16k events, in a single iteration. This is enough to test the C++ application while keeping the test short, at only a few seconds. Using the same choice of parameters on the GPU also computes 16k events, using a GPU grid of 64 blocks and 256 threads per block. On a GPU, this achieves a throughput that is around 20% lower than that observed with the first choice of parameters. Using the same choice of parameters on the CPU and the GPU reproduces in C++ the random number generation mechanism of the CUDA application (same random number seeds and same mapping of the random number arrays to the different matrix element calculations, using the same CURAND library). This yields exactly the same physics results in the end, which is quite useful for cross-checking the validity of the two calculations.
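The event counts quoted above are simply the product of the three numbers passed after -p (blocks per grid, threads per block, iterations). A minimal sketch, where `n_events` is a hypothetical helper name:

```shell
# Hypothetical helper: number of matrix elements computed by
# "check.exe -p <blocks> <threads> <iterations>".
n_events() {
  local blocks="$1" threads="$2" iterations="$3"
  echo $(( blocks * threads * iterations ))
}

n_events 2048 256 1   # GPU baseline: 524288, i.e. "524k"
n_events 64 256 1     # CPU baseline: 16384, i.e. "16k"
```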

Note that the C++ executable includes OpenMP multithreading, but this is disabled by default (see PR #84). You may enable it by setting OMP_NUM_THREADS explicitly. However, this implementation is presently found to be suboptimal and may soon be replaced by a custom MT implementation (see issue #196).

The relevant lines describing the throughput of the matrix element calculation are those including EvtsPerSec[MECalcOnly] (3a). The previous lines including EvtsPerSec[MatrixElems] (3) show lower throughputs on the GPU, because they also include data copies between the host and device memory.
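To pick out only the matrix element throughput from the output of the executables, a small filter such as the following may be used. This is a sketch that assumes the output line layout shown on this page, where the number is the fifth whitespace-separated field; `me_throughput` is a hypothetical helper name.

```shell
# Hypothetical helper: read the output of check.exe or gcheck.exe on stdin
# and print the EvtsPerSec[MECalcOnly] value (fifth field of that line).
me_throughput() {
  grep -F 'EvtsPerSec[MECalcOnly]' | awk '{ print $5 }'
}

# Example on an output line in the format described above
echo 'EvtsPerSec[MECalcOnly] (3a) = ( 4.334234e+05                 )  sec^-1' | me_throughput
```

With a real build, the function would be used as `./check.exe -p 64 256 1 | me_throughput`.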

Putting it all together - an example:

Build the code using USEBUILDDIR=1 in the baseline configuration based on gcc10.2 and cuda 11.1, using the default AVX=512y, FPTYPE=d and HELINL=0. Then run the C++ and the CUDA application.

  export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu

  # Set up gcc10.2
  . /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/setup.sh

  # Set up cuda11.1
  export CUDA_HOME=/usr/local/cuda-11.1
  export PATH=${CUDA_HOME}/bin:${PATH}

  # Optionally set up ccache
  export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
  export USECCACHE=1
  export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR

  # Use separate build directories for different floating point precision and SIMD choices
  export USEBUILDDIR=1

  # Make a clean build of the ggttgg code
  cd $MADGRAPH4GPU_HOME
  cd epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg
  make cleanall
  make AVX=512y FPTYPE=d HELINL=0

  # Run the GPU application in the optimal CUDA configuration
  ./build.512y_d_inl0/gcheck.exe -p 2048 256 1

  # Run the CPU application in the optimal C++ configuration (and cross check the results with the GPU application)
  ./build.512y_d_inl0/check.exe -p 64 256 1
  ./build.512y_d_inl0/gcheck.exe -p 64 256 1

Using the custom throughputX.sh script

For more detailed comparisons of performances in different vectorization scenarios, you may use the throughputX.sh script. This builds the code in all relevant configurations, then runs the application selecting only some relevant lines of output, and adding additional information from perf and an objdump-based script.

You only need to set the runtime environment to the compilers and tools, prior to running this script. The script internally sets USEBUILDDIR=1 and uses the appropriate AVX, FPTYPE and HELINL settings.

The script is in the top-level epochX/cudacpp/tput directory because it can be used for all physics processes (by selecting one or more of -eemumu, -ggttgg or -ggtt), and even for their auto versions (by selecting -auto or -autoonly).

To compare the five AVX scenarios for manual ggttgg, for the default FPTYPE=d and HELINL=0 settings, just type

  ./throughputX.sh -avxall -ggttgg

To compare the five AVX scenarios for manual ggttgg, using both FPTYPE=d and FPTYPE=f, and using both HELINL=0 and HELINL=1, just type

  ./throughputX.sh -avxall -flt -inl

To make a clean build, just add the -makeclean option to the command.

For instance, this is a typical output:

  export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu

  . /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/setup.sh
  export CUDA_HOME=/usr/local/cuda-11.1
  export PATH=${CUDA_HOME}/bin:${PATH}
  export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
  export USECCACHE=1
  export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR

  cd $MADGRAPH4GPU_HOME
  cd epochX/cudacpp
  ./tput/throughputX.sh -ggttgg -makeclean -avxall

DATE: 2021-10-31_11:03:31

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.none_d_inl0/gcheck.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 4.331189e+05                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 4.334234e+05                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     0.599367 sec
       193,835,905      cycles:u                  #    0.236 GHz                    
       286,061,820      instructions:u            #    1.48  insn per cycle         
       0.886496479 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
.........................................................................
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.none_d_inl0/gcheck.exe -p 2048 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 4.967187e+05                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 4.968849e+05                 )  sec^-1
MeanMatrixElemValue         = ( 6.665112e+00 +- 5.002651e+00 )  GeV^-4
TOTAL       :     2.713913 sec
     2,103,817,275      cycles:u                  #    0.711 GHz                    
     4,400,438,944      instructions:u            #    2.09  insn per cycle         
       3.017892050 seconds time elapsed
=========================================================================
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.none_d_inl0/check.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.834610e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.834610e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.959483 sec
    23,916,730,877      cycles:u                  #    2.668 GHz                    
    74,429,930,429      instructions:u            #    3.11  insn per cycle         
       8.966037747 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1470) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.sse4_d_inl0/check.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
EvtsPerSec[MatrixElems] (3) = ( 3.362536e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.362536e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     4.897404 sec
    13,070,735,721      cycles:u                  #    2.667 GHz                    
    39,630,609,437      instructions:u            #    3.03  insn per cycle         
       4.904254608 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 9242) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.avx2_d_inl0/check.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
EvtsPerSec[MatrixElems] (3) = ( 6.872697e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 6.872697e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     2.405306 sec
     5,459,154,319      cycles:u                  #    2.266 GHz                    
    13,651,534,406      instructions:u            #    2.50  insn per cycle         
       2.411863815 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 7704) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.512y_d_inl0/check.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
EvtsPerSec[MatrixElems] (3) = ( 7.678023e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 7.678023e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     2.155357 sec
     4,888,594,497      cycles:u                  #    2.264 GHz                    
    12,417,337,692      instructions:u            #    2.54  insn per cycle         
       2.161830878 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 7358) (512y:   61) (512z:    0)
-------------------------------------------------------------------------
runExe /data/valassi/madgraph4gpu/epochX/cudacpp/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/build.512z_d_inl0/check.exe -p 64 256 1 OMP=
Process                     = SIGMA_SM_GG_TTXGG_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
EvtsPerSec[MatrixElems] (3) = ( 6.521298e+03                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 6.521298e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     2.535186 sec
     3,999,026,966      cycles:u                  #    1.574 GHz                    
     6,319,967,910      instructions:u            #    1.58  insn per cycle         
       2.543318054 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1836) (512y:   74) (512z: 6010)
=========================================================================
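As an illustration of how to read the log above, the C++ speedups from vectorization can be computed as the ratio of each EvtsPerSec[MECalcOnly] value to that of the AVX=none build. The snippet below is only a sketch: `speedup` is a hypothetical helper name, and the numbers are copied by hand from the log above.

```shell
# Hypothetical helper: ratio of two throughputs, printed with two decimals.
speedup() {
  awk -v new="$1" -v ref="$2" 'BEGIN { printf "%.2f\n", new / ref }'
}

# EvtsPerSec[MECalcOnly] values for check.exe, copied from the log above
ref=1.834610e+03   # AVX=none
for entry in sse4:3.362536e+03 avx2:6.872697e+03 512y:7.678023e+03 512z:6.521298e+03; do
  # Text before the colon is the AVX mode, text after it is the throughput
  echo "${entry%%:*} $(speedup "${entry#*:}" "$ref")"
done
```

On this particular node, the 512y build is the fastest of the five C++ builds (roughly a factor 4 over no vectorization), while the 512z build is slower than 512y despite its wider vectors.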

If you want to save these logfiles, you may also use the teeThroughputX.sh script. This runs the -avxall version of the performance tests, and in addition it runs some functional tests. It dumps the results to stdout and copies them to a logfile below the tput directory. For instance:

  export MADGRAPH4GPU_HOME=/data/valassi/madgraph4gpu

  . /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/setup.sh
  export CUDA_HOME=/usr/local/cuda-11.1
  export PATH=${CUDA_HOME}/bin:${PATH}
  export PATH=/cvmfs/sft.cern.ch/lcg/releases/ccache/4.3-ed8d3/x86_64-centos7-gcc8-opt/bin:$PATH
  export USECCACHE=1
  export CCACHE_DIR=$MADGRAPH4GPU_HOME/CCACHE_DIR

  cd $MADGRAPH4GPU_HOME
  cd epochX/cudacpp
  ./tput/teeThroughputX.sh -ggttgg -makeclean