vectorization/SIMD : klas2ep12bis [klas2 (SIMD CPU) + epoch1/epoch2] #171

Merged
merged 365 commits into from
Apr 23, 2021

Conversation

valassi
Member

@valassi valassi commented Apr 23, 2021

This is an (identical!) replacement of #152. For some reason that I do not understand, #152 now says there are conflicts, so I just recreated the PR from an identical branch.

As mentioned for #152, this originally merged together klas2 #132 (replacing klas #72) and epoch12 #151.
Thus it was replacing #72 and #132. Now it is also replacing #152.

I am finally about to merge. I will document the changes later, here and in the original issue #71.

…outs.

No impact on CPU or GPU performance.

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 1
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[8]
Internal loops fptype_sv    = VECTOR[8] (AVX512F)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
-----------------------------------------------------------------------
NumberOfEntries             = 1
TotalTime[Rnd+Rmb+ME] (123) = ( 3.124263e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 2.845789e-01                 )  sec
TotalTime[RndNumGen]    (1) = ( 2.784739e-02                 )  sec
TotalTime[Rambo]        (2) = ( 9.678157e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.877973e-01                 )  sec
MeanTimeInMatrixElems       = ( 1.877973e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.877973e-01 ,  1.877973e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 524288
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.678117e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.842329e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 2.791776e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 524288
MeanMatrixElemValue         = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 1
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
-----------------------------------------------------------------------
NumberOfEntries             = 1
TotalTime[Rnd+Rmb+ME] (123) = ( 1.142844e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.114948e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 2.789544e-02                 )  sec
TotalTime[Rambo]        (2) = ( 9.857677e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.016372e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.016372e+00                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.016372e+00 ,  1.016372e+00 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 524288
EvtsPerSec[Rnd+Rmb+ME](123) = ( 4.587573e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 4.702352e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 5.158428e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 524288
MeanMatrixElemValue         = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0

./check.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 1
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Internal loops fptype_sv    = VECTOR[4] (AVX2)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
-----------------------------------------------------------------------
NumberOfEntries             = 1
TotalTime[Rnd+Rmb+ME] (123) = ( 3.013752e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 2.735730e-01                 )  sec
TotalTime[RndNumGen]    (1) = ( 2.780214e-02                 )  sec
TotalTime[Rambo]        (2) = ( 9.698159e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.765914e-01                 )  sec
MeanTimeInMatrixElems       = ( 1.765914e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.765914e-01 ,  1.765914e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 524288
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.739652e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.916446e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 2.968932e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 524288
MeanMatrixElemValue         = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0

./gcheck.exe -p 16384 32 1
***********************************************************************
NumBlocksPerGrid            = 16384
NumThreadsPerBlock          = 32
NumIterations               = 1
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries             = 1
TotalTime[Rnd+Rmb+ME] (123) = ( 9.136356e-03                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.904635e-03                 )  sec
TotalTime[RndNumGen]    (1) = ( 1.231721e-03                 )  sec
TotalTime[Rambo]        (2) = ( 7.034133e-03                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.705020e-04                 )  sec
MeanTimeInMatrixElems       = ( 8.705020e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.705020e-04 ,  8.705020e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 524288
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.738480e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.632665e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.022824e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 524288
MeanMatrixElemValue         = ( 1.372469e-02 +- 1.132952e-05 )  GeV^0
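
The throughput lines in the logs above are consistent with a simple relation: the number of events is the product of the three `-p` arguments, and each `EvtsPerSec` figure is that count divided by the matching `TotalTime`. A minimal sketch of the arithmetic, using numbers copied from the AVX512F run (the function names are illustrative and not part of check.exe):

```python
def total_events(blocks_per_grid: int, threads_per_block: int, iterations: int) -> int:
    """Events processed by one run with '-p <blocks> <threads> <iterations>'."""
    return blocks_per_grid * threads_per_block * iterations


def events_per_second(n_events: int, seconds: float) -> float:
    """Throughput for one timed stage of the run."""
    return n_events / seconds


# Values copied from the AVX512F log above: ./check.exe -p 16384 32 1
n_events = total_events(16384, 32, 1)
# TotalTime[MatrixElems] (3) from the same log
me_throughput = events_per_second(n_events, 1.877973e-01)

print(f"TotalEventsComputed = {n_events}")
print(f"EvtsPerSec[MatrixElems] = {me_throughput:.6e}")
```

The same relation holds for the `-p 2048 256 12` runs, where the event count is 2048 * 256 * 12 = 6291456.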
Fix conflicts in Makefiles and CPPProcess.cc (the #pragma omp loop is very different)
…ally.

time ./build.avx2/gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.218126e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.142967e-01                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.515897e-03                 )  sec
TotalTime[Rambo]        (2) = ( 1.026943e-01                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.160241e-02                 )  sec
MeanTimeInMatrixElems       = ( 9.668675e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 9.625210e-04 ,  9.738580e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.164867e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 5.504497e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 5.422542e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.173272 sec
0a ProcInit :     0.000576 sec
0b MemAlloc :     0.037912 sec
0c GenCreat :     0.011352 sec
0d SGoodHel :     0.002065 sec
1a GenSeed  :     0.000019 sec
1b GenRnGen :     0.007497 sec
2a RamboIni :     0.000092 sec
2b RamboFin :     0.000047 sec
2c CpDTHwgt :     0.008357 sec
2d CpDTHmom :     0.094198 sec
3a SigmaKin :     0.000094 sec
3b CpDTHmes :     0.011508 sec
4a DumpLoop :     0.084650 sec
8a CompStat :     0.044464 sec
9a GenDestr :     0.000067 sec
9b DumpScrn :     0.000229 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.476400 sec
TOTAL (123) :     0.121813 sec
TOTAL  (23) :     0.114297 sec
TOTAL   (1) :     0.007516 sec
TOTAL   (2) :     0.102694 sec
TOTAL   (3) :     0.011602 sec
***********************************************************************
real    0m1.797s
user    0m0.482s
sys     0m0.801s

time ./build.avx2/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Internal loops fptype_sv    = VECTOR[4] (AVX2)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 4.156547e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 3.828234e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.283136e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.941495e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.886739e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.572282e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.569412e-01 ,  1.577457e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.513625e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.643436e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 3.334566e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000370 sec
0b MemAlloc :     0.074970 sec
0c GenCreat :     0.000950 sec
0d SGoodHel :     0.000099 sec
1a GenSeed  :     0.000027 sec
1b GenRnGen :     0.328287 sec
2a RamboIni :     0.122387 sec
2b RamboFin :     1.819108 sec
3a SigmaKin :     1.886739 sec
4a DumpLoop :     0.080112 sec
8a CompStat :     0.035178 sec
9a GenDestr :     0.000103 sec
9b DumpScrn :     0.012321 sec
9c DumpJson :     0.000002 sec
TOTAL       :     4.360653 sec
TOTAL (123) :     4.156547 sec
TOTAL  (23) :     3.828234 sec
TOTAL   (1) :     0.328314 sec
TOTAL   (2) :     1.941495 sec
TOTAL   (3) :     1.886739 sec
***********************************************************************
real    0m4.390s
user    0m4.245s
sys     0m0.143s
Fix merge conflicts:
-	epoch1/cuda/ee_mumu/SubProcesses/Makefile
-	epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc
-	epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/runTest.cc
-	epoch1/cuda/ee_mumu/src/Makefile

This is the first commit/merge on the klas vectorization branch since Dec 2020.
Note that master/upstream before klas had 1.15E6 MEs/s in C++ and 6.3E8 in CUDA.
After this merge, throughput is 4.38E6 in C++ (almost 4x) and 6.1E8 in CUDA (slightly lower?)

Note also that, before merging fast math from upstream master,
klas had 3.3E6 in C++ and 5.4E8 in CUDA. Fast math did improve performance.

Note also that the CUDA test builds but does not seem to do anything?
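
The ratios behind the notes above ("almost 4x", "slightly lower?") can be checked with a few lines of arithmetic. This is only a sketch using the rounded throughputs quoted in the notes; the variable names are illustrative.

```python
# MatrixElems throughputs in events/s, as quoted in the notes above.
cpp_master = 1.15e6         # C++ on master/upstream, before klas
cpp_klas_no_fastmath = 3.3e6  # C++ on klas, before merging fast math
cpp_after_merge = 4.38e6    # C++ after this merge (AVX2 build)

cuda_master = 6.3e8         # CUDA on master/upstream, before klas
cuda_klas_no_fastmath = 5.4e8  # CUDA on klas, before merging fast math
cuda_after_merge = 6.1e8    # CUDA after this merge

# Overall C++ gain from vectorization plus fast math.
cpp_total_gain = cpp_after_merge / cpp_master
# Gain attributable to fast math alone on the klas branch.
cpp_fastmath_gain = cpp_after_merge / cpp_klas_no_fastmath
cuda_fastmath_gain = cuda_after_merge / cuda_klas_no_fastmath
# CUDA after the merge relative to master (below 1 means slower).
cuda_vs_master = cuda_after_merge / cuda_master

print(f"C++ total gain:        {cpp_total_gain:.2f}x")
print(f"C++ fast-math gain:    {cpp_fastmath_gain:.2f}x")
print(f"CUDA fast-math gain:   {cuda_fastmath_gain:.2f}x")
print(f"CUDA vs master:        {cuda_vs_master:.3f}")
```

With these inputs the C++ gain is a little above 3.8x and CUDA ends up roughly 3% below master.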

time ./build.avx2/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Internal loops fptype_sv    = VECTOR[4] (AVX2)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 3.682679e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 3.354335e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.283440e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.918015e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.436320e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.196933e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.193237e-01 ,  1.204967e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.708391e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.875619e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 4.380262e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000369 sec
0b MemAlloc :     0.073457 sec
0c GenCreat :     0.000980 sec
0d SGoodHel :     0.000097 sec
1a GenSeed  :     0.000028 sec
1b GenRnGen :     0.328316 sec
2a RamboIni :     0.111770 sec
2b RamboFin :     1.806245 sec
3a SigmaKin :     1.436320 sec
4a DumpLoop :     0.076820 sec
8a CompStat :     0.024451 sec
9a GenDestr :     0.000102 sec
9b DumpScrn :     0.011718 sec
9c DumpJson :     0.000015 sec
TOTAL       :     3.870689 sec
TOTAL (123) :     3.682679 sec
TOTAL  (23) :     3.354335 sec
TOTAL   (1) :     0.328344 sec
TOTAL   (2) :     1.918015 sec
TOTAL   (3) :     1.436320 sec
***********************************************************************
real    0m3.897s
user    0m3.760s
sys     0m0.134s

time ./build.avx2/gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.064685e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.877649e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.691984e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.849376e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.028273e-02                 )  sec
MeanTimeInMatrixElems       = ( 8.568944e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.510760e-04 ,  8.680110e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.909220e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.369386e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.118467e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.997294 sec
0a ProcInit :     0.000502 sec
0b MemAlloc :     0.035057 sec
0c GenCreat :     0.011721 sec
0d SGoodHel :     0.002068 sec
1a GenSeed  :     0.000019 sec
1b GenRnGen :     0.007673 sec
2a RamboIni :     0.000097 sec
2b RamboFin :     0.000049 sec
2c CpDTHwgt :     0.007027 sec
2d CpDTHmom :     0.081322 sec
3a SigmaKin :     0.000081 sec
3b CpDTHmes :     0.010202 sec
4a DumpLoop :     0.083121 sec
8a CompStat :     0.044625 sec
9a GenDestr :     0.000067 sec
9b DumpScrn :     0.000257 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.281183 sec
TOTAL (123) :     0.106468 sec
TOTAL  (23) :     0.098776 sec
TOTAL   (1) :     0.007692 sec
TOTAL   (2) :     0.088494 sec
TOTAL   (3) :     0.010283 sec
***********************************************************************
real    0m1.586s
user    0m0.507s
sys     0m0.875s
Checked that ./build.avx2/gcheck.exe -v 1 8 1 and ./build.avx2/check.exe -v 1 8 1
now return the same printouts of momenta, as expected.
This is what I did:
- Changed "constexpr bool dumpEvents = false;" to "true"
- Rebuilt, launched runTest.exe
- Copied the dump: "cp dump_eemumu_0.txt ../../../../../test/eemumu/dump_CPUTest.eemumu.txt"

This may break epoch2? If so, fix it by hardcoding neppR=8 there too
(the two must be the same because the same reference file is used for tests)
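
To illustrate why the layout parameter matters for a shared reference file, here is a minimal sketch of AOSOA indexing. It assumes a `[page][particle][component][event-in-page]` ordering, which is an assumption on my part: the logs above only name the layouts as `AOSOA[8]`, `AOSOA[4]` and `AOSOA[1] == AOS`. The names `aosoa_index`, `npar`, `np4` and `nepp` are hypothetical.

```python
def aosoa_index(ievt: int, ipar: int, ip4: int, npar: int, np4: int, nepp: int) -> int:
    """Flat index of (event, particle, component) in an AOSOA buffer.

    Assumed ordering: [page][particle][component][event-in-page],
    with nepp events per page. With nepp == 1 this reduces to plain AOS.
    """
    ipag, iepp = divmod(ievt, nepp)  # page number, position inside the page
    return ((ipag * npar + ipar) * np4 + ip4) * nepp + iepp


# The same (event, particle, component) lands at a different flat position
# depending on nepp, so the same flat stream of numbers is assigned to
# different events when nepp changes.
npar, np4 = 4, 4  # illustrative sizes only
positions = {nepp: aosoa_index(5, 2, 1, npar, np4, nepp) for nepp in (1, 4, 8)}
print(positions)
```

Under this assumed layout, filling the buffer from one fixed random-number sequence gives each event different inputs for different values of `nepp`, which is one plausible reading of why neppR is hardcoded for reproducibility.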
…ang++)

Fix cherry-pick conflict: epoch1/cuda/ee_mumu/SubProcesses/Makefile
Performance is 1.25E6, slightly better than gcc9 1.15E6 but lower than Fortran 1.50E6

time ./build.none/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = clang 10.0.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.234199e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 6.911213e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.229851e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.849719e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.061495e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.217912e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.214358e-01 ,  4.223094e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.696825e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 9.103258e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.243004e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000383 sec
0b MemAlloc :     0.070821 sec
0c GenCreat :     0.000904 sec
0d SGoodHel :     0.000438 sec
1a GenSeed  :     0.000030 sec
1b GenRnGen :     0.322956 sec
2a RamboIni :     0.081141 sec
2b RamboFin :     1.768578 sec
3a SigmaKin :     5.061495 sec
4a DumpLoop :     0.074358 sec
8a CompStat :     0.084354 sec
9a GenDestr :     0.000020 sec
9b DumpScrn :     0.009514 sec
9c DumpJson :     0.000002 sec
TOTAL       :     7.474991 sec
TOTAL (123) :     7.234199 sec
TOTAL  (23) :     6.911214 sec
TOTAL   (1) :     0.322985 sec
TOTAL   (2) :     1.849719 sec
TOTAL   (3) :     5.061495 sec
***********************************************************************
real    0m7.499s
user    0m7.376s
sys     0m0.121s

time ./build.none/gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 9.123791e-02                 )  sec
TotalTime[Rambo+ME]    (23) = ( 8.373227e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.505641e-03                 )  sec
TotalTime[Rambo]        (2) = ( 7.402575e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 9.706521e-03                 )  sec
MeanTimeInMatrixElems       = ( 8.088767e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.009510e-04 ,  8.176020e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.895660e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 7.513777e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.481680e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.802752 sec
0a ProcInit :     0.000472 sec
0b MemAlloc :     0.032316 sec
0c GenCreat :     0.009958 sec
0d SGoodHel :     0.002051 sec
1a GenSeed  :     0.000017 sec
1b GenRnGen :     0.007489 sec
2a RamboIni :     0.000106 sec
2b RamboFin :     0.000051 sec
2c CpDTHwgt :     0.006522 sec
2d CpDTHmom :     0.067347 sec
3a SigmaKin :     0.000081 sec
3b CpDTHmes :     0.009625 sec
4a DumpLoop :     0.079669 sec
8a CompStat :     0.046016 sec
9a GenDestr :     0.000063 sec
9b DumpScrn :     0.000268 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.064805 sec
TOTAL (123) :     0.091238 sec
TOTAL  (23) :     0.083732 sec
TOTAL   (1) :     0.007506 sec
TOTAL   (2) :     0.074026 sec
TOTAL   (3) :     0.009707 sec
***********************************************************************
real    0m1.365s
user    0m0.447s
sys     0m0.478s
…hout SIMD!

With AVX=none, gcc9 now reaches 1.28E6; it was 1.15E6 (remember Fortran is 1.50E6).
This means that with AVX=none gcc9 (1.28E6) and clang10 (1.25E6) are essentially comparable.

Note however that the speedup between AVX=none and AVX=avx2 is lower than 4:
4.40E6 / 1.28E6 is only 3.4, we can do better...
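
One rough way to read the gap between 3.4 and 4 is Amdahl's law: if a fraction f of the scalar time is sped up by the vector width and the rest is not, the observed speedup fixes f. This model is my own simplification, not an analysis from this PR, and it ignores effects such as clock-frequency changes under AVX and memory bandwidth. The throughputs are copied from the two gcc logs below.

```python
def vectorized_fraction(speedup: float, width: int) -> float:
    """Invert Amdahl's law S = 1 / ((1 - f) + f / width) for f."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / width)


# EvtsPerSec[MatrixElems] from the gcc 9.2.0 logs (AVX=none and AVX=avx2).
scalar_throughput = 1.284828e+06
avx2_throughput = 4.400809e+06

observed_speedup = avx2_throughput / scalar_throughput
# AVX2 holds 4 doubles per vector, so the ideal speedup is 4.
fraction = vectorized_fraction(observed_speedup, 4)

print(f"observed speedup:              {observed_speedup:.2f}")
print(f"implied vectorized fraction:   {fraction:.3f}")
```

Under this model, roughly 94% of the scalar matrix-element time behaves as perfectly vectorized, so the remaining few percent account for most of the missing speedup.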

time ./build.none/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
Internal loops fptype_sv    = VECTOR[1] == SCALAR (no SIMD)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.160223e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 6.836318e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.239050e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.939587e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 4.896731e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.080609e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.074413e-01 ,  4.092229e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.786676e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 9.202989e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.284828e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000369 sec
0b MemAlloc :     0.070329 sec
0c GenCreat :     0.000909 sec
0d SGoodHel :     0.000105 sec
1a GenSeed  :     0.000026 sec
1b GenRnGen :     0.323879 sec
2a RamboIni :     0.077785 sec
2b RamboFin :     1.861802 sec
3a SigmaKin :     4.896730 sec
4a DumpLoop :     0.073871 sec
8a CompStat :     0.025105 sec
9a GenDestr :     0.000082 sec
9b DumpScrn :     0.008952 sec
9c DumpJson :     0.000006 sec
TOTAL       :     7.339950 sec
TOTAL (123) :     7.160223 sec
TOTAL  (23) :     6.836318 sec
TOTAL   (1) :     0.323905 sec
TOTAL   (2) :     1.939587 sec
TOTAL   (3) :     4.896730 sec
***********************************************************************
real    0m7.362s
user    0m7.236s
sys     0m0.123s

time ./build.avx2/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
Internal loops fptype_sv    = VECTOR[4] (AVX2)
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 3.598255e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 3.275359e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.228953e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.845746e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.429614e+00                 )  sec
MeanTimeInMatrixElems       = ( 1.191345e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.187156e-01 ,  1.201074e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.748474e+06                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.920844e+06                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 4.400809e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202858e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000379 sec
0b MemAlloc :     0.070129 sec
0c GenCreat :     0.000908 sec
0d SGoodHel :     0.000100 sec
1a GenSeed  :     0.000025 sec
1b GenRnGen :     0.322871 sec
2a RamboIni :     0.110108 sec
2b RamboFin :     1.735638 sec
3a SigmaKin :     1.429614 sec
4a DumpLoop :     0.075421 sec
8a CompStat :     0.024105 sec
9a GenDestr :     0.000091 sec
9b DumpScrn :     0.008895 sec
9c DumpJson :     0.000002 sec
TOTAL       :     3.778286 sec
TOTAL (123) :     3.598255 sec
TOTAL  (23) :     3.275360 sec
TOTAL   (1) :     0.322895 sec
TOTAL   (2) :     1.845746 sec
TOTAL   (3) :     1.429614 sec
***********************************************************************
real    0m3.799s
user    0m3.677s
sys     0m0.120s
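The throughput lines in the log above can be cross-checked against the timers. This is a minimal sketch, assuming that the event count is blocks × threads × iterations and that each `EvtsPerSec` value is that count divided by the matching total time. The assumption reproduces the printed numbers but is not taken from the `check.exe` source.

```python
# Values copied from the log above.
blocks_per_grid = 2048
threads_per_block = 256
iterations = 12

# Assumption: one event per "thread" per iteration.
total_events = blocks_per_grid * threads_per_block * iterations

# Total times in seconds, keyed by the labels used in the log.
total_time = {
    "Rnd+Rmb+ME (123)": 3.598255e+00,
    "Rmb+ME (23)": 3.275359e+00,
    "MatrixElems (3)": 1.429614e+00,
}

# Throughput in events per second for each timer.
evts_per_sec = {label: total_events / t for label, t in total_time.items()}

for label, rate in evts_per_sec.items():
    print(f"EvtsPerSec[{label}] = {rate:.6e} sec^-1")
```

The results agree with the logged values to within the rounding of the printed times.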
Fix conflicts in epoch1/cuda/ee_mumu/src/Makefile
valassi added 14 commits April 22, 2021 21:09
Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/testxxx.cc

The test for sxxxxx is fixed but the other testxxx tests still fail without cxtype_ref
[NB: the test does succeed in gcc after this merge for 512y with cxtype_ref]
…_ref

All tests pass for gcc/double

Keep this enabled for the moment for gcc too - eventually use this as default?!

Performance is as follows:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.003754 sec
real    0m7.011s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.148541 sec
real    0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.705370 sec
real    0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.437350e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.526841 sec
real    0m3.534s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2640) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.752122e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.504417 sec
real    0m3.512s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2525) (512y:   37) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.723845e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.822494 sec
real    0m3.830s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  997) (512y:   84) (512z: 2135)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.145092e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.682285 sec
real    0m7.690s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.328923e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.092988 sec
real    0m1.386s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
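To read the gcc double-precision table above as SIMD speedups, one can divide each throughput by the scalar (`none`) one. This is a small sketch using only the `EvtsPerSec[MatrixElems]` values printed above. The dictionary names and the helper `simd_speedup` are illustrative and not part of the repository.

```python
# EvtsPerSec[MatrixElems] for EPOCH1_EEMUMU_CPP, gcc 9.2.0, DOUBLE,
# 1 OMP thread, cxtype_ref=NO (values copied from the table above).
throughput = {
    "none": 1.306067e+06,
    "sse4": 2.489082e+06,
    "avx2": 4.437350e+06,
    "512y": 4.752122e+06,
    "512z": 3.723845e+06,
}
# Doubles per SIMD vector, as printed in the "fptype_sv" lines.
vector_width = {"none": 1, "sse4": 2, "avx2": 4, "512y": 4, "512z": 8}


def simd_speedup(mode):
    """Throughput of a SIMD build relative to the scalar ('none') build."""
    return throughput[mode] / throughput["none"]


for mode in throughput:
    s = simd_speedup(mode)
    print(f"{mode}: speedup {s:.2f}x for vector width {vector_width[mode]}")
```

In these numbers the 512y build is the fastest, and the 512z build is slower than avx2 despite its wider vectors.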
…xing the issues

-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.207457e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.152017 sec
real    0m7.162s
=Symbols in CPPProcess.o= (~sse4:  577) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.458041e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.061856 sec
real    0m1.349s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 48
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.454632e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     3.386881 sec
real    0m3.396s
=Symbols in CPPProcess.o= (~sse4: 3926) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.986212e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.690546 sec
real    0m2.700s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3004) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 8.563670e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.649917 sec
real    0m2.660s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2889) (512y:   19) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.139157e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.786044 sec
real    0m2.795s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1540) (512y:   53) (512z: 2241)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.075523e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.819291 sec
real    0m7.829s
=Symbols in CPPProcess.o= (~sse4:  542) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.526834e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.991750 sec
real    0m1.282s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 72
-------------------------------------------------------------------------
Note that the physics results are the same for all AVX modes with clang,
but they differ from those on gcc...?

Baseline double/clang:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.263836e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.165586 sec
real    0m7.172s
=Symbols in CPPProcess.o= (~sse4: 1241) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 2.615041e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.634355 sec
real    0m4.642s
=Symbols in CPPProcess.o= (~sse4: 3601) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.206133e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.425824 sec
real    0m3.433s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3004) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.140326e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.435411 sec
real    0m3.443s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 3.714051e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.919521 sec
real    0m3.927s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3524) (512y:    0) (512z: 1193)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 1.216910e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.425224 sec
real    0m7.432s
=Symbols in CPPProcess.o= (~sse4: 1165) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Baseline float performance
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.210226e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.137704 sec
real    0m7.147s
=Symbols in CPPProcess.o= (~sse4:  577) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.451026e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.807972 sec
real    0m1.108s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 48
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.444181e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     3.375671 sec
real    0m3.386s
=Symbols in CPPProcess.o= (~sse4: 3736) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.949196e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.702257 sec
real    0m2.712s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3147) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 8.533158e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.654457 sec
real    0m2.664s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2987) (512y:   81) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.227639e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.776870 sec
real    0m2.786s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1738) (512y:  179) (512z: 2150)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.075260e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.817157 sec
real    0m7.826s
=Symbols in CPPProcess.o= (~sse4:  542) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.507709e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.669016 sec
real    0m0.967s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 72
-------------------------------------------------------------------------
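Summary blocks like the ones above are easier to compare once they are parsed into records. The following is a hypothetical helper, not a script from the repository. It assumes the line layout shown in these logs: dashed separators, a `Process` line, an optional `fptype_sv` line and an `EvtsPerSec[MatrixElems]` line.

```python
import re

# Matches e.g. "EvtsPerSec[MatrixElems] (3) = ( 1.210226e+06        )  sec^-1"
_RATE = re.compile(r"EvtsPerSec\[MatrixElems\] \(3\) = \(\s*([0-9.eE+-]+)\s*\)")
# Matches e.g. "Process          = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]"
_PROC = re.compile(r"Process\s+= (\S+) \[(.+)\]")
# Matches the quoted SIMD mode, e.g. "('avx2': AVX2, 256bit)"
_SIMD = re.compile(r"Internal loops fptype_sv\s+= .*\('(\w+)'")


def parse_blocks(text):
    """Split a log on dashed separator lines; return one dict per block."""
    records = []
    for block in re.split(r"^-{20,}\s*$", text, flags=re.MULTILINE):
        proc = _PROC.search(block)
        rate = _RATE.search(block)
        if not (proc and rate):
            continue  # not a performance block
        simd = _SIMD.search(block)
        records.append({
            "process": proc.group(1),
            "compiler": proc.group(2),
            "simd": simd.group(1) if simd else None,  # CUDA blocks have none
            "evts_per_sec": float(rate.group(1)),
        })
    return records


sample = """\
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.949196e+06                 )  sec^-1
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.507709e+09                 )  sec^-1
-------------------------------------------------------------------------
"""

records = parse_blocks(sample)
for rec in records:
    print(rec)
```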
…aph4gpu into klas2ep12

(A merge is necessary as I have two directories in parallel for gcc9 and clang11)

Test clang float performance
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.289271e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268978e-06 )  GeV^0
TOTAL       :     6.978230 sec
real    0m6.985s
=Symbols in CPPProcess.o= (~sse4: 1625) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.087213e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268977e-06 )  GeV^0
TOTAL       :     3.423323 sec
real    0m3.430s
=Symbols in CPPProcess.o= (~sse4: 4258) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.048762e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734676 sec
real    0m2.742s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.051075e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734637 sec
real    0m2.741s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3204) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 7.302535e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.984331 sec
real    0m2.991s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3928) (512y:    0) (512z: 1872)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.239338e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268978e-06 )  GeV^0
TOTAL       :     7.287751 sec
real    0m7.295s
=Symbols in CPPProcess.o= (~sse4: 1544) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
This should be the last performance test before the merge

** NB: there seems to be a numerical difference in the average MEs
between the cxtype_ref and no-cxtype_ref implementations (even for none???).
This should be investigated; then we can move to the no-cxtype_ref implementation.
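
To put a number on that difference, the two `MeanMatrixElemValue` results reported in this thread for gcc/double can be compared directly. This is a rough sketch, not a statistical test. It assumes both builds processed the same events, so the quoted uncertainty is used only as a yardstick.

```python
import sys

# MeanMatrixElemValue (value, quoted uncertainty) in GeV^0, copied from the
# gcc 9.2.0 DOUBLE tables in this thread.
mean_no_cxtype_ref = (1.372113e-02, 3.270608e-06)  # [cxtype_ref=NO] builds
mean_cxtype_ref = (1.371706e-02, 3.270315e-06)     # [cxtype_ref=YES] builds


def compare(a, b):
    """Return (absolute, relative, in-units-of-uncertainty) difference."""
    diff = a[0] - b[0]
    rel = diff / b[0]
    # Assumption: use the larger quoted uncertainty as the yardstick.
    # The two runs are not independent samples, so this is not a z-score.
    n_sigma = diff / max(a[1], b[1])
    return diff, rel, n_sigma


diff, rel, n_sigma = compare(mean_no_cxtype_ref, mean_cxtype_ref)
print(f"diff = {diff:.3e}, relative = {rel:.3e}, in units of error = {n_sigma:.2f}")

# For scale: double-precision machine epsilon.
print(f"machine epsilon (double) = {sys.float_info.epsilon:.3e}")
```

The relative difference is about 3e-4. Rounding effects alone would typically be many orders of magnitude smaller, which is consistent with the remark that this needs investigating.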

*** Final baseline performance before the merge ***
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431103e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.717100 sec
real    0m3.727s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.757142e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.631310 sec
real    0m3.641s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.698749e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.014051 sec
real    0m4.024s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.143650e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.865311 sec
real    0m7.875s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.063732e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.103881 sec
real    0m1.404s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
…aph4gpu into klas2ep12

(A merge is necessary as I have two directories in parallel for gcc9 and clang11)
…och1/2)

This is one of the last changes before merging the vectorization PR

*** FINAL OMP/AVXALL DOUBLE GCC PERFORMANCE BEFORE MERGING ***
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.766821e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.919429 sec
real    0m1.231s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.535637e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.831976 sec
real    0m4.842s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 9.907296e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.987761 sec
real    0m2.997s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431989e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.712352 sec
real    0m3.722s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.712174e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.662205 sec
real    0m2.672s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.707833e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.012550 sec
real    0m4.023s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.388584e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.768061 sec
real    0m2.778s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.148017e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.843966 sec
real    0m7.854s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.548427e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.746793 sec
real    0m3.756s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.870215e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.920510 sec
real    0m1.227s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
@valassi
Member Author

valassi commented Apr 23, 2021

@roiser @oliviermattelaer @hageboeck I am finally self merging the vectorization PR - hope it all goes well!

I will document what I have done here and on issue #71.

Still a few things I'd like to do, but at least this should allow Stefan to branch off for cuda graphs.

NB: up until this PR, the code was essentially identical (for eemumu) in epoch1 and epoch2. As of now, epoch1 is the most advanced branch, including vectorization. Epoch2 remains as an older control branch.

My present baseline performance (in double precision, with gcc, and with my old pre-clang vector implementation that I still use as default) is as follows, copied from the last commit 6f4916e:

(For completeness I am also giving the OMP=4 numbers, but generally they are just about 4x the others, so you can focus on OMP=1.)
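
As a quick check of the "about 4x" remark, the OMP=4 to OMP=1 throughput ratio can be computed directly from the `EvtsPerSec[MatrixElems]` values in the log. This is only a small sketch: the dictionary keys and the helper name `omp_scaling` are made up for illustration and are not part of the repository.

```python
# Throughputs (events per second) copied from the EvtsPerSec[MatrixElems]
# lines of the log: (OMP=1 value, OMP=4 value) for each C++ build.
# The key names are illustrative labels, not identifiers from the code base.
throughput = {
    "epoch1 none": (1.290318e+06, 5.112854e+06),
    "epoch1 sse4": (2.535637e+06, 9.907296e+06),
    "epoch1 avx2": (4.431989e+06, 1.712174e+07),
    "epoch1 512y": (4.753192e+06, 1.830711e+07),
    "epoch1 512z": (3.707833e+06, 1.388584e+07),
    "epoch2 none": (1.148017e+06, 4.548427e+06),
}


def omp_scaling(one_thread, four_threads):
    """Return the throughput ratio between the 4-thread and 1-thread runs."""
    return four_threads / one_thread


scaling = {name: omp_scaling(*pair) for name, pair in throughput.items()}

for name, ratio in scaling.items():
    print(f"{name:12s} OMP=4 / OMP=1 = {ratio:.2f}")
```

For the numbers in this log, every ratio falls between 3.7 and 4.0 on the 4-core test machine, with the 512z build scaling the least.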

*** FINAL OMP/AVXALL DOUBLE GCC PERFORMANCE BEFORE MERGING ***
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.766821e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.919429 sec
real    0m1.231s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.535637e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.831976 sec
real    0m4.842s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 9.907296e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.987761 sec
real    0m2.997s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431989e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.712352 sec
real    0m3.722s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.712174e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.662205 sec
real    0m2.672s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.707833e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.012550 sec
real    0m4.023s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.388584e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.768061 sec
real    0m2.778s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.148017e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.843966 sec
real    0m7.854s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.548427e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.746793 sec
real    0m3.756s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.870215e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.920510 sec
real    0m1.227s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
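
To read the single-thread (OMP=1) epoch1 results at a glance, the sketch below derives the SIMD speedup over the scalar build and a per-lane efficiency (speedup divided by the vector width reported in the `fptype_sv` line). The variable and function names are illustrative assumptions; only the throughputs and vector widths come from the log above.

```python
# OMP=1 throughput of the scalar ('none') epoch1 build, in events per second.
SCALAR_EVTS_PER_SEC = 1.290318e+06

# For each SIMD build: (number of doubles per vector, OMP=1 events per second),
# both taken from the log. The labels match the build tags in the log.
simd_builds = {
    "sse4": (2, 2.535637e+06),
    "avx2": (4, 4.431989e+06),
    "512y": (4, 4.753192e+06),
    "512z": (8, 3.707833e+06),
}


def simd_speedup(evts_per_sec):
    """Throughput of a SIMD build relative to the scalar build."""
    return evts_per_sec / SCALAR_EVTS_PER_SEC


speedup = {tag: simd_speedup(evts) for tag, (_, evts) in simd_builds.items()}
# Efficiency of 1.0 would mean the speedup equals the vector width.
efficiency = {tag: speedup[tag] / width for tag, (width, _) in simd_builds.items()}

for tag in simd_builds:
    print(f"{tag}: speedup x{speedup[tag]:.2f}, per-lane efficiency {efficiency[tag]:.2f}")
```

With these figures, the 512y build has the highest single-thread throughput (about 3.7x the scalar build), while the 512z build, despite its 8-wide vectors, is slower than both 256-bit builds.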

@valassi
Member Author

valassi commented Apr 23, 2021

Pushing the merge button.

Labels
enhancement A feature we want to develop performance How fast is it? Make it go faster!