OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 (eemumu, May 2021 software) #196

Closed
valassi opened this issue May 3, 2021 · 3 comments

valassi commented May 3, 2021

So far I have mainly been checking OMP performance on my GPU-aware VM with 4 virtual cores. The throughput always seemed to scale more or less as expected, by almost a factor 4.

See 6f4916e, from itscrd70:

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------

and

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------

In other words, going from 1 to 4 threads increases the throughput:

  • with no SIMD, from 1.29E6 to 5.11E6 (x3.96)
  • with SIMD 512y, from 4.75E6 to 1.83E7 (x3.85... already a bit less, actually)

So one issue is already visible above: OMP scaling becomes somewhat suboptimal when SIMD is also enabled. This should be cross-checked, but I would not be surprised if part of it is a processor slowdown (check the clock...).
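
(As a possible way to check the clock hypothesis, something like the following standalone monitor could be run in a second shell while check.exe is executing. This is only a sketch I am jotting down here, not code in the repository: it samples the per-core frequencies that Linux reports in /proc/cpuinfo, which should drop visibly if the SIMD builds trigger a frequency reduction.)

// clockwatch.cpp - rough sketch, not in the repository: sample the per-core
// clock from /proc/cpuinfo once per second. Build with e.g. "g++ -O2 clockwatch.cpp".
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main()
{
  for( int sample = 0; sample < 30; ++sample ) // monitor for ~30 seconds
  {
    std::ifstream cpuinfo( "/proc/cpuinfo" );
    std::string line;
    double minMHz = 1e12, maxMHz = 0;
    while( std::getline( cpuinfo, line ) )
    {
      if( line.rfind( "cpu MHz", 0 ) == 0 ) // lines like "cpu MHz : 2600.000"
      {
        const double mhz = std::stod( line.substr( line.find( ':' ) + 1 ) );
        if( mhz < minMHz ) minMHz = mhz;
        if( mhz > maxMHz ) maxMHz = mhz;
      }
    }
    std::cout << "cpu MHz min/max: " << minMHz << " / " << maxMHz << std::endl;
    std::this_thread::sleep_for( std::chrono::seconds( 1 ) );
  }
  return 0;
}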

HOWEVER. The situation looks quite different on a machine with many more cores. I did the following tests on pmpe04 (16 physical cores with 2x hyperthreading), and @lfield also did some tests on a machine with 64 logical cores, getting similar results.

FIRST issue: the throughput is very unstable. With no SIMD, you can see that the result with 32 threads fluctuates wildly between 3.0E6 and 1.3E7 across the three runs below (more than a factor 4 fluctuation!). Something to do with NUMA?... (a small affinity check is sketched after these runs)

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.485243e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.422320e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.223990e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.133416e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.900442e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.266690e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.786840e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.195912e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.220566e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.128356e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.927956e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.979217e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 256 256 1 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.948936e+05                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.202562e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.184464e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.095331e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.776658e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.585580e+06                 )  sec^-1
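
(To check the NUMA/affinity hypothesis, one thing to try is pinning the threads explicitly, e.g. with the standard OpenMP environment variables OMP_PLACES=cores and OMP_PROC_BIND=close, and seeing whether the fluctuations disappear. A tiny standalone diagnostic like the one below, which is only a sketch and not code in the repository, would also show on which logical CPU each OMP thread actually lands; the corresponding NUMA node can then be looked up with lscpu.)

// ompwhere.cpp - diagnostic sketch (not in the repository): print which logical
// CPU each OMP thread is running on. Build with e.g. "g++ -O2 -fopenmp ompwhere.cpp".
#include <omp.h>
#include <sched.h> // sched_getcpu (glibc)
#include <cstdio>

int main()
{
#pragma omp parallel
  {
    // This respects OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND, so it can be
    // used to check whether a given pinning recipe spreads the threads as intended.
    std::printf( "OMP thread %2d of %2d is on CPU %3d\n",
                 omp_get_thread_num(), omp_get_num_threads(), sched_getcpu() );
  }
  return 0;
}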

SECOND issue: the actual scaling is suboptimal. In the runs above, the best result at 16 threads (the number of physical cores) is 7.9E6, versus 8.9E5 for one thread: less than a factor 10 (about a factor 9), while I would expect a solid factor 16 here...

The numbers get better with a larger number of events, even without OMP (why?!...), but on pmpe04 they are still suboptimal and fluctuating:

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.310010e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.703815e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.086431e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.266153e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 6.276633e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 8.829229e+06                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.305356e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.521695e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.100592e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.261600e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.154927e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.007438e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.293470e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.466667e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 3.104600e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 5.308291e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 9.185191e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.001564e+07                 )  sec^-1

Compare to the same on itscrd70:

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.none/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.317429e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.621517e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.183589e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.730554e+06                 )  sec^-1
OMP threads / `nproc --all` = 16 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.070596e+06                 )  sec^-1
OMP threads / `nproc --all` = 32 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.027287e+06                 )  sec^-1

This is even more obvious with SIMD: 32 threads gain at most a factor of about 6 (2.5E7 over 4.0E6)...

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.052396e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.547816e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.733957e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.381696e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.946232e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.525760e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.034425e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.668128e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.894043e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.383712e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.351971e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.520908e+07                 )  sec^-1

for n in 1 2 4 8 16 32; do export OMP_NUM_THREADS=$n; ./build.avx2/check.exe -p 2048 256 12 | egrep '(OMP threads|EvtsPerSec\[MatrixElems)'; done
OMP threads / `nproc --all` = 1 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.035978e+06                 )  sec^-1
OMP threads / `nproc --all` = 2 / 32
EvtsPerSec[MatrixElems] (3) = ( 4.721879e+06                 )  sec^-1
OMP threads / `nproc --all` = 4 / 32
EvtsPerSec[MatrixElems] (3) = ( 7.800313e+06                 )  sec^-1
OMP threads / `nproc --all` = 8 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.382836e+07                 )  sec^-1
OMP threads / `nproc --all` = 16 / 32
EvtsPerSec[MatrixElems] (3) = ( 1.917745e+07                 )  sec^-1
OMP threads / `nproc --all` = 32 / 32
EvtsPerSec[MatrixElems] (3) = ( 2.157618e+07                 )  sec^-1

TODO:

@valassi valassi added the performance How fast is it? Make it go faster! label May 3, 2021
valassi commented May 4, 2021

I made a few quick plots, because I want them for the talk and because I could reuse the infrastructure I already had for benchmarking/HPC from last year. Eventually this stuff could be cleaned up and integrated by @Andykyu12 into his infrastructure.

https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-nosimd.png
https://github.com/madgraph5/madgraph4gpu/blob/51d7f52bf34d7ce35533cdb9a1bc67daa6ef4ee7/tools/benchmarking/plots/pmpe-simd.png

These are VERY PRELIMINARY - I am somewhat confident that the scaling with multiple copies is correct and not just an artefact of how these short scripts are run, but this should be thoroughly cross-checked...

The OMP multithreading, instead, is indeed suboptimal. I had a quick look at numactl, but it does not seem to improve things. I would maybe reconsider an alternative to OMP: this is such a basic single loop that we could do it by hand with custom threads (see the sketch below). Maybe we also need to split up the memory ourselves in advance across threads.
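
(To give an idea of what I mean by "custom threads", a very rough sketch follows. The function and variable names are placeholders, not the actual CPPProcess internals. Each std::thread gets one contiguous chunk of events and writes only its own slice of the output buffer, so the pages it first touches land on its own NUMA node. The same chunking could of course be expressed with a plain "#pragma omp parallel for schedule(static)", but doing it by hand makes the memory partitioning explicit.)

// customthreads.cpp - very rough sketch (placeholder names, not the actual
// CPPProcess code): split the event loop by hand across std::thread workers,
// one contiguous chunk of events per thread. Build with "g++ -O2 -pthread customthreads.cpp".
#include <algorithm>
#include <thread>
#include <vector>

// Placeholder for the real per-event matrix element computation
void processChunk( const double* momenta, double* matrixElements, int ievtFirst, int ievtLast )
{
  (void)momenta; // the real code would read each event's momenta here
  for( int ievt = ievtFirst; ievt < ievtLast; ++ievt )
    matrixElements[ievt] = 0; // first touch: this thread's output pages land on its NUMA node
}

void processAllEvents( const double* momenta, double* matrixElements, int nevt, int nthreads )
{
  std::vector<std::thread> workers;
  const int chunk = ( nevt + nthreads - 1 ) / nthreads; // ceil( nevt / nthreads )
  for( int it = 0; it < nthreads; ++it )
  {
    const int ievtFirst = it * chunk;
    const int ievtLast = std::min( ievtFirst + chunk, nevt );
    if( ievtFirst >= ievtLast ) break;
    workers.emplace_back( processChunk, momenta, matrixElements, ievtFirst, ievtLast );
  }
  for( auto& w : workers ) w.join(); // wait for all chunks to complete
}

int main()
{
  const int nevt = 2048 * 256; // e.g. the grid size used in "check.exe -p 2048 256 ..."
  std::vector<double> momenta( (size_t)nevt * 16 ); // 4 particles x 4-momenta per event (placeholder layout)
  std::vector<double> matrixElements( nevt );
  processAllEvents( momenta.data(), matrixElements.data(), nevt, 16 );
  return 0;
}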

Anyway, the good news is:

  • with multiple copies, on pmpe04 too you easily reach 16x the single-thread throughput... and even a bit beyond 16x when going to 2x HT and into overcommit
  • the performance gain from multiple copies also remains in the SIMD modes
  • in particular, as expected, the SIMD gain and the multicore gain are independent and they combine (they multiply rather than add): we get almost 4x from AVX2 and almost 16x from the physical cores on this machine, for a total of almost 64x with respect to single-core no-SIMD

valassi commented May 4, 2021

PS: the memory plots are just a sanity check here, not the main point - it is not clear where each application spends most of its memory, but it may well be in the allocation of the large pages.

PPS: I forgot to mention it on the plots, but these are all 'check.exe -p 2048 256 40'... around 20-30 seconds per test.

valassi commented Dec 19, 2022

I am closing this old ticket about eemumu OMP multithreading with an old software version.

I now have a newer software version and have re-enabled OMP MT there. I still get suboptimal results, also for ggttgg, but the results get better as I process more and more events. Anyway, I will follow up in this other ticket: #575

Closing.

@valassi valassi closed this as completed Dec 19, 2022
@valassi valassi self-assigned this Dec 19, 2022
@valassi valassi changed the title OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 OMP multithreading gives very unstable and suboptimal throughputs on pmpe04 (eemumu, May 2021 software) Dec 19, 2022