
Retry OMP multithreading in cudacpp (and prototype custom multithreading, and compare to MP) - suboptimal results in ggttgg (Dec 2022) #575

Open
valassi opened this issue Dec 16, 2022 · 6 comments

Comments

@valassi
Member

valassi commented Dec 16, 2022

With the changes for the random choice of helicity (#403, MR #570 and especially #415), the OMP multithreading loop has moved inside cudacpp. It is now in a place where it may work better out of the box.

Note also that Fortran OMP is now quite good (see #561), so I would expect something similar in cudacpp.

While doing the code move I disabled (commented out) the OMP pragmas. They should be re-enabled and tested:

    /*
#ifdef _OPENMP
    // (NB gcc9 or higher, or clang, is required)
    // - default(none): no variables are shared by default
    // - shared: as the name says
    // - private: give each thread its own copy, without initialising
    // - firstprivate: give each thread its own copy, and initialise with value from outside
#pragma omp parallel for default( none ) shared( allmomenta, allcouplings, allMEs, channelId, allNumerators, allDenominators )
#endif // _OPENMP
    */
    for( int ipagV2 = 0; ipagV2 < npagV2; ++ipagV2 )
    {

@valassi
Member Author

valassi commented Dec 16, 2022

The idea is essentially the following:

  • a C++ "grid" of VECSIZE_USED events is processed in parallel
  • this grid ideally contains NPROC (or NPROC times some factor) SIMD vectors, and each SIMD vector, or group of SIMD vectors, goes to one thread
  • within each thread, every vector is processed using SIMD instructions

For instance, on a 4-core machine with AVX2 (see the sketch after this list):

  • in double mode, launch grids of 16 events: each of the 4 threads processes one SIMD vector of 4 events
  • in float mode, launch grids of 32 events: each of the 4 threads processes one SIMD vector of 8 events
  • in mixed mode, launch grids of 32 events: each of the 4 threads processes two SIMD vectors of 4 events (Feynman) or one SIMD vector of 8 events (color matrix)
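As an illustration, here is a minimal, self-contained sketch of this splitting (not the actual cudacpp code; the names only mimic the cudacpp conventions and the per-event work is a dummy placeholder), assuming the AVX2/double case with 4-event SIMD vectors:

#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
  constexpr int neppV = 4;            // events per SIMD vector (e.g. AVX2 in double precision)
  constexpr int nevt = 16;            // VECSIZE_USED: events in one C++ "grid"
  constexpr int npagV = nevt / neppV; // number of SIMD vectors ("pages") in the grid
  std::vector<double> allMEs( nevt, 0. );

  // each OMP thread is given one or more whole SIMD pages
#pragma omp parallel for
  for( int ipagV = 0; ipagV < npagV; ++ipagV )
  {
    // within a page, the neppV events would be processed with SIMD instructions
    for( int ieppV = 0; ieppV < neppV; ++ieppV )
      allMEs[ipagV * neppV + ieppV] = 1.; // dummy per-event "matrix element"
  }

  printf( "processed %d events (%d pages of %d) with up to %d OMP threads\n",
          nevt, npagV, neppV, omp_get_max_threads() );
  return 0;
}

(Build with -fopenmp, e.g. g++ -fopenmp -O2: the point is only that OMP parallelises over whole SIMD pages, while SIMD acts within each page.)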

@valassi
Member Author

valassi commented Dec 17, 2022

In particular, this should be tested on pmpe04 or another node with 30+ cores. See the previous suboptimal results in #196.

@valassi valassi changed the title Retry OMP multithreading in cudacpp Retry OMP multithreading in cudacpp (and prototype custom multithreading) Dec 17, 2022
@valassi
Member Author

valassi commented Dec 17, 2022

Rather than opening a new issue, I am adding a few ideas here.

OMP is one solution for MT in cudacpp, but custom multithreading is another possibility. What I am thinking of is the following:

  • Custom multithreading may be more efficient and more tunable, if we put constraints on the allowed numbers of threads given the available cores and the NUMA domains.
  • The idea would be to have a custom policy for splitting the work across the threads, and in particular for splitting the large arrays across threads, possibly making the memory more easily accessible to each thread.
  • We need some sort of thread pool manager. One option is to assign work to each thread more or less where we now have OMP (the ipagV2 loop). What would be really nice, however, is to do the explicit C++ threading in the same place where we launch CUDA kernels, the MatrixElementKernel: instead of launching <<<...>>> CUDA threads, the C++ code would assign the work to the different threads. Essentially this is a more interesting, multithreaded, solution to the "kernel launchers and SIMD, with multithreading" ideas in #71 (kernel launchers and SIMD vectorization) and in all "klas" branches; it would be a "klasmt" branch. The point here is that the MemoryAccess infrastructure used for CUDA could be used in the same way for C++: just as in CUDA each thread picks up a specific data item from the large array, the same could be done in C++ (we just need a way for each thread to know its ID, just like CUDA does, and then the implementation would be very similar). The only difference between CUDA and C++ is that, in addition, each C++ thread would handle not a single event but rather a SIMD vector of events (or even more than one SIMD vector of events). A minimal sketch of this idea is shown after this list.

@valassi valassi changed the title Retry OMP multithreading in cudacpp (and prototype custom multithreading) Retry OMP multithreading in cudacpp (and prototype custom multithreading, and compare to MP) - suboptimal results in ggttgg (Dec 2022) Dec 19, 2022
@valassi
Member Author

valassi commented Dec 19, 2022

I re-enabled OMP MT and did a few tests.

It works, but I still get suboptimal results. I will follow up here with ggttgg on the previous results in #196 for eemumu (and I will close that ticket).

My observations

Things to do

  • try some systematic tests and also plots for this OMP MT
  • possibly add this in a container
  • and in a container compare this also to many-copies MP
  • if MP is better than OMP MT, maybe try to do a custom MT...

Anyway, below are the numbers, on pmpe04 (16 physical cores with AVX2, 2xHT so 32 maximum threads). There is no CUDA, so the build is essentially with CUDA_HOME=none. These are not systematic tests; they are more or less the first numbers I got...

Without SIMD, 16k events

  • 1.85E3 with 1 thread (16k events)
  • 6.72E3 with 4 threads (16k events) i.e. x3.6
  • 2.16E4 with 16 threads (16k events) i.e. x11.7
  • 1.94E4 to 2.26E4 with 32 threads (16k events) i.e. x10.6 to x12.2, note these instabilities...
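(For clarity, the speedup factors in these bullets are simply throughput ratios against the corresponding 1-thread run of the same build, e.g. for 4 threads 6.72E3 / 1.85E3 ≈ 3.6; the same convention is used in the other summaries below.)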
[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.856625e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 6.716730e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.159144e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.938153e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.257169e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.888137e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

Without SIMD, more events

  • 1.88E3 with 1 thread (64k events), so similar to the 1.86, stable without MT
  • 6.89E3 with 4 threads (64k events) i.e. x3.7
  • 2.37E4 with 16 threads (64k events) i.e. x12.6
  • 2.48E4 with 32 threads (64k events) i.e. x13.2, much better
  • 2.52E4 with 32 threads (256k events, with larger grids or more grid cycles) i.e. x13.4, even better
  • 2.51E4 with 32 threads (1M events) i.e. x13.3, seems to be the limit
[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.857226e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.888137e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 6.885676e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.356782e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.474947e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.487260e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.515134e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 64 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.536720e+04                 )  sec^-1
MeanMatrixElemValue         = ( 8.334117e+00 +- 6.373555e+00 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.none_d_inl0_hrd0/check.exe -p 256 1024 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.508096e+04                 )  sec^-1
MeanMatrixElemValue         = ( 6.551217e+01 +- 6.174046e+01 )  GeV^-4

With AVX2 SIMD, 16k events

  • 7.11E3 with 1 thread (16k events) which is x3.84 over no-SIMD
  • 2.23E4 with 4 threads (16k events) i.e. x3.2
  • 5.59E4 to 7.54E4 with 16 threads (16k events) i.e. x7.9 to 10.6, lower than no-SIMD, and see the instabilities
  • 8.36E4 to 8.46E4 with 32 threads (16k events) i.e. x11.8 to x11.9
  • [7.32E3 with 1 thread (16k events) in a later test... which is x3.89 over no-SIMD, but I use 7.11 as reference above]
[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.108860e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.226741e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.540322e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 5.587768e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.366525e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.459457e+04                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 64 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.319292e+03                 )  sec^-1
MeanMatrixElemValue         = ( 4.197467e-01 +- 3.250467e-01 )  GeV^-4

With AVX2 SIMD, more events

  • 7.28E3 to 7.57E3 with 1 thread (64k events) which is x4.06 over no-SIMD 1.85... use 7.57 as reference
  • 2.76E4 with 4 threads (64k events), much better, i.e. x3.6
  • 7.97E4 to 8.78E4 with 16 threads (64k events) i.e. x10.6 to 11.6
  • 9.85E4 with 32 threads (64k events) i.e. x13.0
  • 9.51E4 to 9.86E4 with 32 threads (256k events) i.e. up to x13.0
  • 9.95E4 to 1.08E5 with 32 threads (1M events) i.e. x13.1 to x14.2
  • 1.10E5 with 32 threads (4M events) i.e. x14.5
[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.289656e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=1 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.566718e+03                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=4 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 2.755945e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 7.972108e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=16 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 8.783962e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.849974e+04                 )  sec^-1
MeanMatrixElemValue         = ( 9.878420e+02 +- 9.874419e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.510871e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.558300e+02 +- 2.469487e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 4 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.855484e+04                 )  sec^-1
MeanMatrixElemValue         = ( 2.558300e+02 +- 2.469487e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.013572e+05                 )  sec^-1
MeanMatrixElemValue         = ( 6.863526e+01 +- 6.177879e+01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 256 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 9.954735e+04                 )  sec^-1
MeanMatrixElemValue         = ( 6.863526e+01 +- 6.177879e+01 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.019225e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 1 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.080913e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.475533e+02 +- 2.468621e+02 )  GeV^-4

[avalassi@pmpe04 gcc11.2/cvmfs] /data/avalassi/gpu2021/madgraph4gpu/epochX/cudacpp/gg_ttgg.mad/SubProcesses/P1_gg_ttxgg> OMP_NUM_THREADS=32 ./build.avx2_d_inl0_hrd0/check.exe -p 256 1024 16 | egrep '(OMP th|EvtsPerSec\[MECalcOnly|MeanMatrixElemValue)'
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MECalcOnly] (3a) = ( 1.103730e+05                 )  sec^-1
MeanMatrixElemValue         = ( 2.476884e+06 +- 2.476607e+06 )  GeV^-4

Note also that 'top' shows a varying load on the system: in some of the fastest tests it was 100% (3200 load) at points, but then fell temporarily to 70%; in other tests it showed a constant 92%... So in summary,

  • OMP MT here is already not bad, but more studies and tests are needed
  • the more events are processed, the greater the benefits of OMP MT

Again, all this should be compared to several independent single-threaded processes (and/or eventually to home-made MT).

@valassi
Member Author

valassi commented Dec 19, 2022

I will create and merge an MR.

NB: One thing that I have not done is to re-enable the OMP tests in the tmad/tput scripts.
You need a very large number of events and long tests to get meaningful results.

Maybe something for @Jooorgen to test in your infrastructure?

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
I made a few tests manually, see logs in madgraph5#575

NB One thing that I have not done is to reenable OMP tests in tmad/tput scripts.
You need very large number of events and long tests to get meaningful results
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
…6 sa

I made a few tests manually, see logs in madgraph5#575

NB One thing that I have not done is to reenable OMP tests in tmad/tput scripts.
You need very large number of events and long tests to get meaningful results
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
@valassi
Member Author

valassi commented Dec 19, 2022

I have re-enabled this in gcc, but it failed in icpx and clang, see #578.

Anyway, this issue stays open for more performance studies.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 19, 2022
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023