vectorization/SIMD : klas2ep12bis [klas2 (SIMD CPU) + epoch1/epoch2] #171

valassi · 2021-04-23T15:37:33Z

This is an (identical!) replacement of #152. For some reason that I do not understand, #152 now says there are conflicts, so I just recreated the PR from an identical branch.

As mentioned for #152, this originally merged together klas2 #132 (replacing klas #72) and epoch12 #151.
Thus it was replacing #72 and #132. Now it is also replacing #152.

I am finally about to merge. I will document the changes later, here and in the original issue #71.

…outs. No impact on CPU or GPU performance. ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[8] Internal loops fptype_sv = VECTOR[8] (AVX512F) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123) = ( 3.124263e-01 ) sec TotalTime[Rambo+ME] (23) = ( 2.845789e-01 ) sec TotalTime[RndNumGen] (1) = ( 2.784739e-02 ) sec TotalTime[Rambo] (2) = ( 9.678157e-02 ) sec TotalTime[MatrixElems] (3) = ( 1.877973e-01 ) sec MeanTimeInMatrixElems = ( 1.877973e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.877973e-01 , 1.877973e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.678117e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 1.842329e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 2.791776e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 
----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123) = ( 1.142844e+00 ) sec TotalTime[Rambo+ME] (23) = ( 1.114948e+00 ) sec TotalTime[RndNumGen] (1) = ( 2.789544e-02 ) sec TotalTime[Rambo] (2) = ( 9.857677e-02 ) sec TotalTime[MatrixElems] (3) = ( 1.016372e+00 ) sec MeanTimeInMatrixElems = ( 1.016372e+00 ) sec [Min,Max]TimeInMatrixElems = [ 1.016372e+00 , 1.016372e+00 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123) = ( 4.587573e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 4.702352e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 5.158428e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0 ./check.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123) = ( 3.013752e-01 ) sec TotalTime[Rambo+ME] (23) = ( 2.735730e-01 ) sec TotalTime[RndNumGen] (1) = ( 2.780214e-02 ) sec TotalTime[Rambo] (2) = ( 9.698159e-02 ) sec TotalTime[MatrixElems] (3) = ( 1.765914e-01 ) sec MeanTimeInMatrixElems = ( 1.765914e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.765914e-01 , 1.765914e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.739652e+06 ) sec^-1 
EvtsPerSec[Rmb+ME] (23) = ( 1.916446e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 2.968932e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0 ./gcheck.exe -p 16384 32 1 *********************************************************************** NumBlocksPerGrid = 16384 NumThreadsPerBlock = 32 NumIterations = 1 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 1 TotalTime[Rnd+Rmb+ME] (123) = ( 9.136356e-03 ) sec TotalTime[Rambo+ME] (23) = ( 7.904635e-03 ) sec TotalTime[RndNumGen] (1) = ( 1.231721e-03 ) sec TotalTime[Rambo] (2) = ( 7.034133e-03 ) sec TotalTime[MatrixElems] (3) = ( 8.705020e-04 ) sec MeanTimeInMatrixElems = ( 8.705020e-04 ) sec [Min,Max]TimeInMatrixElems = [ 8.705020e-04 , 8.705020e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 524288 EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.738480e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.632665e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 6.022824e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 524288 MeanMatrixElemValue = ( 1.372469e-02 +- 1.132952e-05 ) GeV^0

Fix conflicts in Makefiles and CPPProcess.cc (the #pragma omp loop is very different)

…uild directories

…ally. time ./build.avx2/gcheck.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Random number generation = CURAND DEVICE (CUDA code) ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.218126e-01 ) sec TotalTime[Rambo+ME] (23) = ( 1.142967e-01 ) sec TotalTime[RndNumGen] (1) = ( 7.515897e-03 ) sec TotalTime[Rambo] (2) = ( 1.026943e-01 ) sec TotalTime[MatrixElems] (3) = ( 1.160241e-02 ) sec MeanTimeInMatrixElems = ( 9.668675e-04 ) sec [Min,Max]TimeInMatrixElems = [ 9.625210e-04 , 9.738580e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.164867e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 5.504497e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 5.422542e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 1.173272 sec 0a ProcInit : 0.000576 sec 0b MemAlloc : 0.037912 sec 0c GenCreat : 0.011352 sec 0d SGoodHel : 0.002065 sec 1a GenSeed : 0.000019 sec 1b GenRnGen : 0.007497 sec 2a RamboIni : 0.000092 sec 2b RamboFin : 0.000047 sec 2c CpDTHwgt : 0.008357 sec 2d CpDTHmom : 0.094198 sec 3a SigmaKin : 0.000094 sec 3b CpDTHmes : 
0.011508 sec 4a DumpLoop : 0.084650 sec 8a CompStat : 0.044464 sec 9a GenDestr : 0.000067 sec 9b DumpScrn : 0.000229 sec 9c DumpJson : 0.000002 sec TOTAL : 1.476400 sec TOTAL (123) : 0.121813 sec TOTAL (23) : 0.114297 sec TOTAL (1) : 0.007516 sec TOTAL (2) : 0.102694 sec TOTAL (3) : 0.011602 sec *********************************************************************** real 0m1.797s user 0m0.482s sys 0m0.801s time ./build.avx2/check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 4.156547e+00 ) sec TotalTime[Rambo+ME] (23) = ( 3.828234e+00 ) sec TotalTime[RndNumGen] (1) = ( 3.283136e-01 ) sec TotalTime[Rambo] (2) = ( 1.941495e+00 ) sec TotalTime[MatrixElems] (3) = ( 1.886739e+00 ) sec MeanTimeInMatrixElems = ( 1.572282e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.569412e-01 , 1.577457e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.513625e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 1.643436e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 3.334566e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 
4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000370 sec 0b MemAlloc : 0.074970 sec 0c GenCreat : 0.000950 sec 0d SGoodHel : 0.000099 sec 1a GenSeed : 0.000027 sec 1b GenRnGen : 0.328287 sec 2a RamboIni : 0.122387 sec 2b RamboFin : 1.819108 sec 3a SigmaKin : 1.886739 sec 4a DumpLoop : 0.080112 sec 8a CompStat : 0.035178 sec 9a GenDestr : 0.000103 sec 9b DumpScrn : 0.012321 sec 9c DumpJson : 0.000002 sec TOTAL : 4.360653 sec TOTAL (123) : 4.156547 sec TOTAL (23) : 3.828234 sec TOTAL (1) : 0.328314 sec TOTAL (2) : 1.941495 sec TOTAL (3) : 1.886739 sec *********************************************************************** real 0m4.390s user 0m4.245s sys 0m0.143s

Fix merge conflicts:
- epoch1/cuda/ee_mumu/SubProcesses/Makefile
- epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc
- epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/runTest.cc
- epoch1/cuda/ee_mumu/src/Makefile

This is the first commit/merge on the klas vectorization branch since Dec 2020.

Note that master/upstream before klas had 1.15E6 MEs/s in C++ and 6.3E8 in CUDA.
After this merge, throughput is 4.38E6 in C++ (almost 4x) and 6.1E8 in CUDA (slightly lower?)

Note also that, before merging fast math from upstream master, klas had 3.3E6 in C++ and 5.4E8 in CUDA.
Fast math did improve performance.

Note also that the CUDA test builds but does not seem to do anything?

time ./build.avx2/check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
Internal loops fptype_sv = VECTOR[4] (AVX2)
Random number generation = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 3.682679e+00 ) sec
TotalTime[Rambo+ME] (23) = ( 3.354335e+00 ) sec
TotalTime[RndNumGen] (1) = ( 3.283440e-01 ) sec
TotalTime[Rambo] (2) = ( 1.918015e+00 ) sec
TotalTime[MatrixElems] (3) = ( 1.436320e+00 ) sec
MeanTimeInMatrixElems = ( 1.196933e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 1.193237e-01 , 1.204967e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.708391e+06 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 1.875619e+06 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 4.380262e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
0a ProcInit : 0.000369 sec
0b MemAlloc : 0.073457 sec
0c GenCreat : 0.000980 sec
0d SGoodHel : 0.000097 sec
1a GenSeed : 0.000028 sec
1b GenRnGen : 0.328316 sec
2a RamboIni : 0.111770 sec
2b RamboFin : 1.806245 sec
3a SigmaKin : 1.436320 sec
4a DumpLoop : 0.076820 sec
8a CompStat : 0.024451 sec
9a GenDestr : 0.000102 sec
9b DumpScrn : 0.011718 sec
9c DumpJson : 0.000015 sec
TOTAL : 3.870689 sec
TOTAL (123) : 3.682679 sec
TOTAL (23) : 3.354335 sec
TOTAL (1) : 0.328344 sec
TOTAL (2) : 1.918015 sec
TOTAL (3) : 1.436320 sec
***********************************************************************
real 0m3.897s
user 0m3.760s
sys 0m0.134s

time ./build.avx2/gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND DEVICE (CUDA code)
MatrixElements compiler = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.064685e-01 ) sec
TotalTime[Rambo+ME] (23) = ( 9.877649e-02 ) sec
TotalTime[RndNumGen] (1) = ( 7.691984e-03 ) sec
TotalTime[Rambo] (2) = ( 8.849376e-02 ) sec
TotalTime[MatrixElems] (3) = ( 1.028273e-02 ) sec
MeanTimeInMatrixElems = ( 8.568944e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 8.510760e-04 , 8.680110e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.909220e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 6.369386e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.118467e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
00 CudaFree : 0.997294 sec
0a ProcInit : 0.000502 sec
0b MemAlloc : 0.035057 sec
0c GenCreat : 0.011721 sec
0d SGoodHel : 0.002068 sec
1a GenSeed : 0.000019 sec
1b GenRnGen : 0.007673 sec
2a RamboIni : 0.000097 sec
2b RamboFin : 0.000049 sec
2c CpDTHwgt : 0.007027 sec
2d CpDTHmom : 0.081322 sec
3a SigmaKin : 0.000081 sec
3b CpDTHmes : 0.010202 sec
4a DumpLoop : 0.083121 sec
8a CompStat : 0.044625 sec
9a GenDestr : 0.000067 sec
9b DumpScrn : 0.000257 sec
9c DumpJson : 0.000002 sec
TOTAL : 1.281183 sec
TOTAL (123) : 0.106468 sec
TOTAL (23) : 0.098776 sec
TOTAL (1) : 0.007692 sec
TOTAL (2) : 0.088494 sec
TOTAL (3) : 0.010283 sec
***********************************************************************
real 0m1.586s
user 0m0.507s
sys 0m0.875s

https://www.gnu.org/software/make/manual/html_node/Multiple-Targets.html

…ilds)

Checked that ./build.avx2/gcheck.exe -v 1 8 1 and ./build.avx2/check.exe -v 1 8 1 now return the same printouts of momenta, as expected.

This is what I did:
- Changed "constexpr bool dumpEvents = false;" to "true"
- Rebuilt, launched runTest.exe
- Copied the dump: cp dump_eemumu_0.txt ../../../../../test/eemumu/dump_CPUTest.eemumu.txt

This may break epoch2? If so, fix it by hardcoding neppR=8 there too.
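To illustrate why the reference dump is tied to a fixed neppR, here is a minimal sketch of AOSOA indexing. It assumes a `[page][particle][component][event-in-page]` ordering, which is not confirmed in this thread; the function name and the sizes `npar=2`, `np4=4` are hypothetical and chosen only for illustration.

```python
def aosoa_index(ievt, ipar, ip4, npar, np4, nepp):
    """Flat offset of (event, particle, component) in an AOSOA buffer.

    Assumed layout (hypothetical): buffer[ipag][ipar][ip4][iepp], where
    ipag is the page number and iepp the event's position inside the page.
    """
    ipag, iepp = divmod(ievt, nepp)
    return ((ipag * npar + ipar) * np4 + ip4) * nepp + iepp


# Same logical element, different physical offsets depending on nepp.
NPAR, NP4 = 2, 4  # illustrative sizes only
offset_nepp8 = aosoa_index(9, 1, 2, NPAR, NP4, nepp=8)
offset_nepp4 = aosoa_index(9, 1, 2, NPAR, NP4, nepp=4)
offset_aos = aosoa_index(9, 1, 2, NPAR, NP4, nepp=1)  # AOSOA[1] == AOS
print(offset_nepp8, offset_nepp4, offset_aos)
```

Under this assumption, a buffer of random numbers filled sequentially is assigned to events differently for each nepp, so two builds only see the same events (and can share one reference file) if they use the same neppR.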

…C++-only builds

(they must be the same because the same reference file is used for tests)

… 8.E-11

…ang++) Fix cherry-pick conflict: epoch1/cuda/ee_mumu/SubProcesses/Makefile

Performance is 1.25E6, slightly better than gcc9 1.15E6 but lower than Fortran 1.50E6 time ./build.none/check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = clang 10.0.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.234199e+00 ) sec TotalTime[Rambo+ME] (23) = ( 6.911213e+00 ) sec TotalTime[RndNumGen] (1) = ( 3.229851e-01 ) sec TotalTime[Rambo] (2) = ( 1.849719e+00 ) sec TotalTime[MatrixElems] (3) = ( 5.061495e+00 ) sec MeanTimeInMatrixElems = ( 4.217912e-01 ) sec [Min,Max]TimeInMatrixElems = [ 4.214358e-01 , 4.223094e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.696825e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 9.103258e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.243004e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000383 sec 0b MemAlloc : 0.070821 sec 0c GenCreat : 0.000904 sec 0d SGoodHel : 0.000438 sec 1a GenSeed : 0.000030 sec 
1b GenRnGen : 0.322956 sec 2a RamboIni : 0.081141 sec 2b RamboFin : 1.768578 sec 3a SigmaKin : 5.061495 sec 4a DumpLoop : 0.074358 sec 8a CompStat : 0.084354 sec 9a GenDestr : 0.000020 sec 9b DumpScrn : 0.009514 sec 9c DumpJson : 0.000002 sec TOTAL : 7.474991 sec TOTAL (123) : 7.234199 sec TOTAL (23) : 6.911214 sec TOTAL (1) : 0.322985 sec TOTAL (2) : 1.849719 sec TOTAL (3) : 5.061495 sec *********************************************************************** real 0m7.499s user 0m7.376s sys 0m0.121s time ./build.none/gcheck.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Random number generation = CURAND DEVICE (CUDA code) MatrixElements compiler = nvcc 11.0.221 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 9.123791e-02 ) sec TotalTime[Rambo+ME] (23) = ( 8.373227e-02 ) sec TotalTime[RndNumGen] (1) = ( 7.505641e-03 ) sec TotalTime[Rambo] (2) = ( 7.402575e-02 ) sec TotalTime[MatrixElems] (3) = ( 9.706521e-03 ) sec MeanTimeInMatrixElems = ( 8.088767e-04 ) sec [Min,Max]TimeInMatrixElems = [ 8.009510e-04 , 8.176020e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.895660e+07 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 7.513777e+07 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 6.481680e+08 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.802752 sec 0a ProcInit : 0.000472 sec 0b MemAlloc : 0.032316 sec 0c GenCreat : 0.009958 sec 0d SGoodHel : 0.002051 sec 1a GenSeed : 0.000017 sec 1b GenRnGen : 0.007489 sec 2a RamboIni : 0.000106 sec 2b RamboFin : 0.000051 sec 2c CpDTHwgt : 0.006522 sec 2d CpDTHmom : 0.067347 sec 3a SigmaKin : 0.000081 sec 3b CpDTHmes : 0.009625 sec 4a DumpLoop : 0.079669 sec 8a CompStat : 0.046016 sec 9a GenDestr : 0.000063 sec 9b DumpScrn : 0.000268 sec 9c DumpJson : 0.000002 sec TOTAL : 1.064805 sec TOTAL (123) : 0.091238 sec TOTAL (23) : 0.083732 sec TOTAL (1) : 0.007506 sec TOTAL (2) : 0.074026 sec TOTAL (3) : 0.009707 sec *********************************************************************** real 0m1.365s user 0m0.447s sys 0m0.478s

…hout SIMD! Now AVX=none with gcc9 is 1.28E6, it was 1.15E6 (remember fortran is 1.50E6). It means that with AVX=none gcc9 and clang10 are completely comparable. Note however that the speedup between AVX=none and AVX=avx2 is lower than 4: 4.40E6 / 1.28E6 is only 3.4, we can do better... time ./build.none/check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[1] == AOS Internal loops fptype_sv = VECTOR[1] == SCALAR (no SIMD) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.160223e+00 ) sec TotalTime[Rambo+ME] (23) = ( 6.836318e+00 ) sec TotalTime[RndNumGen] (1) = ( 3.239050e-01 ) sec TotalTime[Rambo] (2) = ( 1.939587e+00 ) sec TotalTime[MatrixElems] (3) = ( 4.896731e+00 ) sec MeanTimeInMatrixElems = ( 4.080609e-01 ) sec [Min,Max]TimeInMatrixElems = [ 4.074413e-01 , 4.092229e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.786676e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 9.202989e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.284828e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) 
*********************************************************************** 0a ProcInit : 0.000369 sec 0b MemAlloc : 0.070329 sec 0c GenCreat : 0.000909 sec 0d SGoodHel : 0.000105 sec 1a GenSeed : 0.000026 sec 1b GenRnGen : 0.323879 sec 2a RamboIni : 0.077785 sec 2b RamboFin : 1.861802 sec 3a SigmaKin : 4.896730 sec 4a DumpLoop : 0.073871 sec 8a CompStat : 0.025105 sec 9a GenDestr : 0.000082 sec 9b DumpScrn : 0.008952 sec 9c DumpJson : 0.000006 sec TOTAL : 7.339950 sec TOTAL (123) : 7.160223 sec TOTAL (23) : 6.836318 sec TOTAL (1) : 0.323905 sec TOTAL (2) : 1.939587 sec TOTAL (3) : 4.896730 sec *********************************************************************** real 0m7.362s user 0m7.236s sys 0m0.123s time ./build.avx2/check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = DOUBLE (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY] Momenta memory layout = AOSOA[4] Internal loops fptype_sv = VECTOR[4] (AVX2) Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 3.598255e+00 ) sec TotalTime[Rambo+ME] (23) = ( 3.275359e+00 ) sec TotalTime[RndNumGen] (1) = ( 3.228953e-01 ) sec TotalTime[Rambo] (2) = ( 1.845746e+00 ) sec TotalTime[MatrixElems] (3) = ( 1.429614e+00 ) sec MeanTimeInMatrixElems = ( 1.191345e-01 ) sec [Min,Max]TimeInMatrixElems = [ 1.187156e-01 , 1.201074e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.748474e+06 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 1.920844e+06 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 4.400809e+06 ) sec^-1 
*********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202858e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000379 sec 0b MemAlloc : 0.070129 sec 0c GenCreat : 0.000908 sec 0d SGoodHel : 0.000100 sec 1a GenSeed : 0.000025 sec 1b GenRnGen : 0.322871 sec 2a RamboIni : 0.110108 sec 2b RamboFin : 1.735638 sec 3a SigmaKin : 1.429614 sec 4a DumpLoop : 0.075421 sec 8a CompStat : 0.024105 sec 9a GenDestr : 0.000091 sec 9b DumpScrn : 0.008895 sec 9c DumpJson : 0.000002 sec TOTAL : 3.778286 sec TOTAL (123) : 3.598255 sec TOTAL (23) : 3.275360 sec TOTAL (1) : 0.322895 sec TOTAL (2) : 1.845746 sec TOTAL (3) : 1.429614 sec *********************************************************************** real 0m3.799s user 0m3.677s sys 0m0.120s
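The AVX=none versus AVX=avx2 comparison above can be written out as a small calculation. This is only a sketch: the function name is made up here, and "efficiency" is simply the measured speedup divided by the vector width (4 doubles for AVX2), not a metric defined in this PR.

```python
def simd_efficiency(scalar_rate, vector_rate, vector_width):
    """Return (speedup, speedup / vector_width) for two throughputs in MEs/s."""
    speedup = vector_rate / scalar_rate
    return speedup, speedup / vector_width


# EvtsPerSec[MatrixElems] from the two gcc 9.2.0 runs above.
speedup, efficiency = simd_efficiency(
    scalar_rate=1.284828e+06,  # build.none
    vector_rate=4.400809e+06,  # build.avx2
    vector_width=4,            # VECTOR[4] (AVX2), doubles
)
print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.2f}")
```

An efficiency below 1 is another way of stating that the speedup is lower than 4.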

Fix conflicts in epoch1/cuda/ee_mumu/src/Makefile

into klas2

Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/testxxx.cc The test for sxxxxx is fixed but other testxxx still fail with no cxtype_ref [NB: the test does succeed in gcc after this merge for 512y with cxtype_ref]

…ests

…_ref

All tests pass for gcc/double.
Keep this enabled for the moment for gcc too; eventually use this as default?!

Performance is the following:
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.003754 sec
real 0m7.011s
=Symbols in CPPProcess.o= (~sse4: 620) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.188863e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.148541 sec
real 0m1.448s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.489082e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 4.705370 sec
real 0m4.713s
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.437350e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.526841 sec
real 0m3.534s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2640) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.752122e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.504417 sec
real 0m3.512s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2525) (512y: 37) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.723845e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 3.822494 sec
real 0m3.830s
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 997) (512y: 84) (512z: 2135)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.145092e+06 ) sec^-1
MeanMatrixElemValue = ( 1.372113e-02 +- 3.270608e-06 ) GeV^0
TOTAL : 7.682285 sec
real 0m7.690s
=Symbols in CPPProcess.o= (~sse4: 567) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.328923e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.092988 sec
real 0m1.386s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
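For comparing the builds listed above, a small throwaway parser may help. The sketch below is not part of the repository: the function names and regular expressions are assumptions based only on the field names visible in this summary, and the sample contains just two of the blocks. It pulls the SIMD tag and EvtsPerSec[MatrixElems] out of each block and computes the speedup relative to the 'none' build.

```python
import re

# Two blocks copied from the gcc/double summary (other fields omitted).
SAMPLE = """
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.306067e+06 ) sec^-1
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 4.437350e+06 ) sec^-1
"""

# Tag in quotes after fptype_sv, e.g. 'none', 'sse4', 'avx2', '512y', '512z'.
SIMD_TAG = re.compile(r"fptype_sv\s*=\s*\S+\s*\('(\w+)'")
# Matrix-element throughput in events per second.
THROUGHPUT = re.compile(
    r"EvtsPerSec\[MatrixElems\]\s*\(3\)\s*=\s*\(\s*([0-9.eE+-]+)\s*\)"
)


def parse_throughputs(text):
    """Return {simd_tag: MEs/s} for each block that names a SIMD build.

    Blocks without an fptype_sv line (CUDA, epoch2) are skipped.
    """
    results = {}
    for block in re.split(r"-{10,}", text):
        tag = SIMD_TAG.search(block)
        rate = THROUGHPUT.search(block)
        if tag and rate:
            results[tag.group(1)] = float(rate.group(1))
    return results


def speedups(rates, baseline="none"):
    """Throughput of each build divided by the baseline build's throughput."""
    return {tag: rate / rates[baseline] for tag, rate in rates.items()}


rates = parse_throughputs(SAMPLE)
for tag, factor in speedups(rates).items():
    print(f"{tag}: {factor:.2f}x")
```

Because the patterns use `\s*` rather than line anchors, the same functions should also work on output where the line breaks have been lost.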

…xing the issues
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.207457e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.152017 sec
real    0m7.162s
=Symbols in CPPProcess.o= (~sse4:  577) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.458041e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.061856 sec
real    0m1.349s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 48
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.454632e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     3.386881 sec
real    0m3.396s
=Symbols in CPPProcess.o= (~sse4: 3926) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.986212e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.690546 sec
real    0m2.700s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3004) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 8.563670e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.649917 sec
real    0m2.660s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2889) (512y:   19) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.139157e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.786044 sec
real    0m2.795s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1540) (512y:   53) (512z: 2241)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.075523e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.819291 sec
real    0m7.829s
=Symbols in CPPProcess.o= (~sse4:  542) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.526834e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.991750 sec
real    0m1.282s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 72
-------------------------------------------------------------------------

Note that the physics results are the same in all AVXs of clang, but they differ from those on gcc...?

Baseline double/clang:
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.263836e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.165586 sec
real    0m7.172s
=Symbols in CPPProcess.o= (~sse4: 1241) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 2.615041e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     4.634355 sec
real    0m4.642s
=Symbols in CPPProcess.o= (~sse4: 3601) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.206133e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.425824 sec
real    0m3.433s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3004) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.140326e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.435411 sec
real    0m3.443s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 3.714051e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.919521 sec
real    0m3.927s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3524) (512y:    0) (512z: 1193)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.0.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 1.216910e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.425224 sec
real    0m7.432s
=Symbols in CPPProcess.o= (~sse4: 1165) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

Baseline float performance
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.210226e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.137704 sec
real    0m7.147s
=Symbols in CPPProcess.o= (~sse4:  577) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.451026e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.807972 sec
real    0m1.108s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 48
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.444181e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     3.375671 sec
real    0m3.386s
=Symbols in CPPProcess.o= (~sse4: 3736) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.949196e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.702257 sec
real    0m2.712s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3147) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 8.533158e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.654457 sec
real    0m2.664s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2987) (512y:   81) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 7.227639e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.776870 sec
real    0m2.786s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1738) (512y:  179) (512z: 2150)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.075260e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.817157 sec
real    0m7.826s
=Symbols in CPPProcess.o= (~sse4:  542) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.507709e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.669016 sec
real    0m0.967s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKfPf": launch__registers_per_thread 72
-------------------------------------------------------------------------

…aph4gpu into klas2ep12 (A merge is necessary as I have two directories in parallel for gcc9 and clang11)

Test clang float performance
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
EvtsPerSec[MatrixElems] (3) = ( 1.289271e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268978e-06 )  GeV^0
TOTAL       :     6.978230 sec
real    0m6.985s
=Symbols in CPPProcess.o= (~sse4: 1625) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 5.087213e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268977e-06 )  GeV^0
TOTAL       :     3.423323 sec
real    0m3.430s
=Symbols in CPPProcess.o= (~sse4: 4258) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.048762e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734676 sec
real    0m2.742s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3727) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 1.051075e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.734637 sec
real    0m2.741s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3204) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=NO]
EvtsPerSec[MatrixElems] (3) = ( 7.302535e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371786e-02 +- 3.269407e-06 )  GeV^0
TOTAL       :     2.984331 sec
real    0m2.991s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3928) (512y:    0) (512z: 1872)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [clang 11.0.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.239338e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268978e-06 )  GeV^0
TOTAL       :     7.287751 sec
real    0m7.295s
=Symbols in CPPProcess.o= (~sse4: 1544) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

This should be the last performance test before the merge.

** NB: there seems to be a numerical difference in the average MEs between the cxtype_ref and no-cxtype_ref implementations (even for 'none'???). This should be investigated; then we can move to the no-cxtype_ref implementation.

*** Final baseline performance before the merge ***
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.305527e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.191895 sec
real    0m7.202s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.118856e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.908404 sec
real    0m1.201s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.531723e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.845304 sec
real    0m4.855s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431103e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.717100 sec
real    0m3.727s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.757142e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.631310 sec
real    0m3.641s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.698749e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.014051 sec
real    0m4.024s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.143650e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.865311 sec
real    0m7.875s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
EvtsPerSec[MatrixElems] (3) = ( 7.063732e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.103881 sec
real    0m1.404s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------

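The NB in the commit message above flags a numerical difference in the mean matrix element. As a rough way to size it, here is a small sketch using two values that appear in the gcc double-precision logs of this PR: 1.372113e-02 +- 3.270608e-06 (builds printed with cxtype_ref=NO) and 1.371706e-02 +- 3.270315e-06 (builds printed with cxtype_ref=YES). The helper name `compare_means` is hypothetical, and treating the two runs as statistically independent is an assumption that probably does not hold if both use the same random numbers.

```python
import math


def compare_means(a, sigma_a, b, sigma_b):
    """Return (relative difference, pull) between two means with uncertainties.

    The pull divides the difference by the combined uncertainty, assuming
    the two samples are independent (an assumption, see the text above).
    """
    diff = a - b
    rel = diff / b
    pull = diff / math.hypot(sigma_a, sigma_b)
    return rel, pull


# Values copied from the logs in this PR (gcc, double precision).
rel, pull = compare_means(1.372113e-02, 3.270608e-06,
                          1.371706e-02, 3.270315e-06)
print(f"relative difference = {rel:.3e}, pull = {pull:.2f}")
```

If the two builds really process identical events, the relevant yardstick is the relative difference rather than the pull: it should then be compared with the test tolerance (8.E-11 after commit 5d558e6), not with the statistical error on the mean.
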
…aph4gpu into klas2ep12 (A merge is necessary as I have two directories in parallel for gcc9 and clang11)

…e comments

…och1/2) This is one of the last changes before merging the vectorization PR

valassi · 2021-04-23T15:47:27Z

@roiser @oliviermattelaer @hageboeck I am finally self merging the vectorization PR - hope it all goes well!

I will document what I have done here and on issue #71.

Still a few things I'd like to do, but at least this should allow Stefan to branch off for cuda graphs.

NB: up until this PR, the code was essentially identical (for eemumu) in epoch1 and epoch2. As of now, epoch1 is the most advanced branch, including vectorization. Epoch2 remains as an older control branch.

My present baseline performance (in double, in gcc, and in my old pre-clang vector implementation that I still use as default) is as follows, copied from the last commit 6f4916e:

(For completeness I am also giving the OMP=4 numbers, but in general they are just about 4 times the others, so you can focus on OMP=1.)

*** FINAL OMP/AVXALL DOUBLE GCC PERFORMANCE BEFORE MERGING ***
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.290318e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.248269 sec
real    0m7.258s
=Symbols in CPPProcess.o= (~sse4:  620) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 5.112854e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.630541 sec
real    0m3.640s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.766821e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.919429 sec
real    0m1.231s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 2.535637e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.831976 sec
real    0m4.842s
=Symbols in CPPProcess.o= (~sse4: 3277) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 9.907296e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.987761 sec
real    0m2.997s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.431989e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.712352 sec
real    0m3.722s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2780) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.712174e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.662205 sec
real    0m2.672s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.753192e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.764206 sec
real    0m3.774s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2604) (512y:   97) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.830711e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.645810 sec
real    0m2.655s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 3.707833e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.012550 sec
real    0m4.023s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1205) (512y:  209) (512z: 2044)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.388584e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.768061 sec
real    0m2.778s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.148017e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.843966 sec
real    0m7.854s
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 4 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.548427e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.746793 sec
real    0m3.756s
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.870215e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.920510 sec
real    0m1.227s
==PROF== Profiling "_ZN5gProc8sigmaKinEPKdPd": launch__registers_per_thread 164
-------------------------------------------------------------------------
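As a quick cross-check of the remark that the OMP=4 numbers are about 4 times the OMP=1 ones, the ratios can be computed directly from the EvtsPerSec[MatrixElems] values in the log above. This is only an illustrative sketch: the throughput values are copied from the log, while the dictionary layout and the `speedups` helper are hypothetical names and not part of the repository.

```python
# Throughput in events/sec (matrix elements only), copied from the
# gcc double-precision EPOCH1 log above, keyed by SIMD tag and OMP threads.
throughput = {
    "none": {1: 1.290318e+06, 4: 5.112854e+06},
    "sse4": {1: 2.535637e+06, 4: 9.907296e+06},
    "avx2": {1: 4.431989e+06, 4: 1.712174e+07},
    "512y": {1: 4.753192e+06, 4: 1.830711e+07},
    "512z": {1: 3.707833e+06, 4: 1.388584e+07},
}


def speedups(table):
    """Return {tag: (simd_gain, omp_gain)} for each SIMD tag.

    simd_gain: single-thread throughput relative to the scalar 'none' build.
    omp_gain:  4-thread throughput relative to 1 thread for the same build.
    """
    base = table["none"][1]
    return {
        tag: (by_threads[1] / base, by_threads[4] / by_threads[1])
        for tag, by_threads in table.items()
    }


for tag, (simd, omp) in speedups(throughput).items():
    print(f"{tag:5s} SIMD x{simd:.2f}   OMP4/OMP1 x{omp:.2f}")
```

With these numbers the OMP=4/OMP=1 ratio comes out slightly below 4 for every build, and the largest single-thread SIMD gain over the scalar build is for '512y'.
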

valassi · 2021-04-23T15:48:39Z

Pushing the merge button.

valassi added 30 commits December 11, 2020 14:49

Merge remote-tracking branch 'upstream/master' into klas

8c0c1c5

Fix conflicts in Makefiles and CPPProcess.cc (the #pragma omp loop is very different)

Support simultaneous builds with avx2, avx512, no AVX, in different b…

ce451e2

…uild directories

Fix a build error in tests - go back to Stephan's grouped targets

5465bd3

https://www.gnu.org/software/make/manual/html_node/Multiple-Targets.html

Build only the C++ if CUDA_HOME is invalid (hack to allow C++ only bu…

b7a228a

…ilds)

Fix debug builds

a577c0a

Better fix to build only C++ if CUDA_HOME is invalid

1a410e3

Fix C++-only build (the GTESTLIBS dependency must be brought forward)

0bff57d

BUG FIX in C++ printout of momenta.

ff31789

Checked that ./build.avx2/gcheck.exe -v 1 8 1 and ./build.avx2/check.exe -v 1 8 1 now return the same printouts of momenta, as expected.

bug fix in c++ test in runTest.cc: add getGoodHel and setGoodHel in c++

978526b

Allow epoch2 (just like epoch1) to set an invalid CUDA_HOME to force …

5223fa5

…C++-only builds

Hardcode neppR=8 also in epoch2, as in epoch1

a192241

(they must be the same because the same reference file is used for tests)

Final (?) fix for the klas/SIMD PR: increase tolerance from 5.E-12 to…

5d558e6

… 8.E-11

Minimal changes to add clang support (assume CXX points to .../bin/cl…

58d092e

…ang++) Fix cherry-pick conflict: epoch1/cuda/ee_mumu/SubProcesses/Makefile

Comment unused variable (clang complains)

3df764d

Issues with [] not being a ref for clang

fe2da96

Exact same issues with opencl in clang (probably uses these by default)

82249e7

Encapsulate clang-only changes in an #ifdef

a8e4ad5

Set AVX=none for clang builds (no support yet for SIMD in this code)

72e472b

Fix clang build warning

7b09d76

Add 'make cleanall' to clean all AVX tags

2d3d370

Ensure that 'make distclean' cleans all AVX tags for this compiler

874f198

Merge branch 'nocuda' into klas - this is a NOOP.

1b26634

Fix conflicts in epoch1/cuda/ee_mumu/src/Makefile

Merge branch 'klas' into klas2

39a2ca4

Merge branch 'klas2' of https://gitlab.cern.ch:8443/valassi/madgraph4gpu

aac18af

into klas2
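The commit above that raises the test tolerance from 5.E-12 to 8.E-11 can be illustrated with a minimal sketch. It assumes the tolerance is applied as a relative difference against a reference value, which the commit message does not state. The helpers `relativeDifference`, `agreesWithin` and `toleranceExample` are hypothetical names, not taken from runTest.cc.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the repository): relative difference between
// a computed value and a reference value, guarded against division by zero.
inline double relativeDifference(double value, double reference)
{
  const double scale = std::max(std::fabs(value), std::fabs(reference));
  if (scale == 0.) return 0.; // both values are exactly zero
  return std::fabs(value - reference) / scale;
}

// Returns true if value agrees with reference within the relative tolerance.
inline bool agreesWithin(double value, double reference, double tolerance)
{
  return relativeDifference(value, reference) <= tolerance;
}

// A relative deviation of about 5e-11 fails the old 5.E-12 threshold
// but passes the new 8.E-11 threshold.
inline void toleranceExample()
{
  const double reference = 1.0;
  const double value = reference + 5e-11;
  assert(!agreesWithin(value, reference, 5.e-12));
  assert(agreesWithin(value, reference, 8.e-11));
}
```

A looser threshold of this kind is a common way to let scalar and SIMD builds share one reference file when their results differ only at the level of floating-point rounding. Whether that is the exact motivation here is an assumption.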

valassi added 14 commits April 22, 2021 21:09

Merge branch 'testxxx' into klas2ep12

68feb00

Fix conflicts: epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/testxxx.cc. The test for sxxxxx is fixed, but the other testxxx tests still fail with no cxtype_ref [NB: the test does succeed in gcc after this merge for 512y with cxtype_ref].

Merge remote-tracking branch 'upstream/master' into klas2ep12

12d9f12

[klas2ep12] testxxx add indentation to ease debugging of individual tests

0c75ac8

Merge branch 'klas2ep12' of https://gitlab.cern.ch:8443/valassi/madgraph4gpu into klas2ep12

97c5d99

(A merge is necessary as I have two directories in parallel for gcc9 and clang11)

[klas2ep12] cosmetics CPPProcess.cc ep1, fit within 130 char, harmonise comments

3369905

[klas2ep12] HelAmps_sm.h fix interfaces using _sv types

088cc2a

[klas2ep12] minor fix in printout of nan/abnormal

2925a07

valassi mentioned this pull request Apr 23, 2021

klas2 (SIMD CPU) + epoch1/epoch2 #152

Closed

valassi merged commit dfcc0f9 into madgraph5:master Apr 23, 2021

valassi self-assigned this Apr 23, 2021

valassi added the enhancement and performance labels Apr 23, 2021
