
Decouple neppM and neppV in vectorised C++ code? #176

Closed
valassi opened this issue Apr 25, 2021 · 5 comments
Labels
refactoring Restructure classes and interfaces, improve modularity and encapsulation

Comments


valassi commented Apr 25, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

It is also somewhat related to issue #175 about the xxx function interface.

Presently, there are at least three types of (AO)SOA structures in the code, determined by three parameters (a rough sketch of how they enter the code follows the list below):

  • neppR determines the number of events per page in the AOSOA random number array. This is hardcoded for all configurations (cuda, cpp, double, float, all SIMD, no SIMD...), so that the same physics results are obtained. It is essentially an internal parameter of the phase space sampler (rambo), and it does not even need to be known to the random engine itself: the random engine creates a flat array of random numbers, then rambo assigns these to different events according to neppR.

  • neppM is the number of events per page in the AOSOA momenta data structure. It is known to the xxx function interface (as allmomenta, see issue "xxx" function interface: further separation of data access and calculations? #175) and to rambo (the getMomentaInitial and getMomentaFinal functions). It determines how momenta are encoded/decoded in memory, in both C++ and CUDA. For CUDA, this is tuned to a reasonable value to have coalesced memory access, fitting in 32-byte cache lines (see the studies in issue AOS/SOA for input particle 4-momenta (and random numbers) #16). For C++, this is presently coupled to neppV for vectorisation, see the next bullet.

  • neppV is the size of the vectors used in SIMD vectorized operations. This is dictated by which type of vectorization one wants to use (none: 1 double, SSE4.2 i.e. 128-bit: 2 doubles, AVX2 i.e. 256-bit: 4 doubles, 512y i.e. AVX512 with 256-bit: 4 doubles, 512z i.e. AVX512 with 512-bit: 8 doubles).
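As a rough illustration of how these parameters enter the code, here is a minimal sketch only, with assumed names and an assumed [page][particle][component][event-in-page] layout (not the actual madgraph4gpu definitions):

```cpp
// Minimal sketch (assumed names and layout, not the actual madgraph4gpu code).
typedef double fptype;

// neppV: SIMD vector size; fptype_v is a gcc/clang vector extension type (here e.g. AVX2: 4 doubles).
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

// neppM: events per page in the momenta AOSOA (assumed layout [ipagM][ipar][ip4][ieppM]).
constexpr int neppM = 4;
constexpr int npar = 4; // e.g. e+ e- -> mu+ mu-: 4 particles
constexpr int np4 = 4;  // 4-momentum components (E, px, py, pz)

// Access one momentum component of one particle for one event in the flattened AOSOA buffer.
inline fptype& momentumAOSOA( fptype* allmomenta, int ievt, int ipar, int ip4 )
{
  const int ipagM = ievt / neppM; // which AOSOA page
  const int ieppM = ievt % neppM; // which event within the page
  return allmomenta[ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM];
}
```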

Presently, the implementation of vectorization couples the AOSOA momenta structure and the wavefunction vectorization by assuming neppM == neppV. This is worse than being hardcoded: it is simply an implicit assumption spread throughout the code.

However, there is in principle no strict need to have neppM == neppV:

  • one could have an AOSOA where each momenta event page fits 2 or 4 or more SIMD vectors, i.e. neppM=2*neppV or neppM=4*neppV and so on
  • and one could even test the performance of the code without an AOSOA, with an AOS instead, i.e. with neppM=1

About the latter point, the critical observation is that there are two separate contributions to the speedup from the vectorized implementation, one from data access and one from calculations:

  • the SIMD calculation is most likely the more important one: when doing a sum for 4 events, an AVX2 implementation executes a single processor instruction instead of 4
  • however, there is also a contribution (to be assessed) from data access: if the momenta are already in an AOSOA with at least 4 events contiguous to one another, then copying these to the fptype_v that is fed to the xxx functions is faster than if the data have to be gathered from scattered memory locations (see the sketch after this list)
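The following sketch illustrates the data-access point. It reuses the assumed fptype/fptype_v/neppV definitions from the sketch above (repeated here to keep it self-contained); the function names are illustrative only:

```cpp
// Sketch only: filling one SIMD vector with the 'px' values of neppV consecutive events.
typedef double fptype;
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

// (a) AOSOA with neppM >= neppV: the neppV values are already contiguous in memory,
//     so one contiguous copy (typically a single vector load) fills the SIMD vector.
fptype_v loadContiguous( const fptype* pagePx ) // px of the first event in the page
{
  fptype_v v;
  for( int i = 0; i < neppV; ++i ) v[i] = pagePx[i];
  return v;
}

// (b) AOS (neppM=1): the same neppV values are separated by a stride of np4*npar doubles,
//     so they must be gathered element by element from scattered memory locations.
fptype_v loadStrided( const fptype* firstPx, int strideInDoubles )
{
  fptype_v v;
  for( int i = 0; i < neppV; ++i ) v[i] = firstPx[i * strideInDoubles];
  return v;
}
```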

The main benefit of decoupling neppM and neppV would be to improve the modularity and encapsulation of the code, avoiding these dangerous hidden assumptions between independent components (at the moment, rambo and the ME calculation are in a way tightly coupled by the assumption neppV=neppM). Eventually, this may also be useful, or even needed, to make it easier to interface to Fortran code.

In addition, having these as separate parameters would also give some flexibility in testing different optimisation options. For instance, it would make it possible to separately measure the speedup from SIMD arithmetic operations and from SIMD data loading.

Low priority for the moment. The code works as it is...

@valassi valassi added the refactoring Restructure classes and interfaces, improve modularity and encapsulation label Apr 25, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 31, 2021
The code builds and works successfully for cuda and no-simd c++.
Same performance as before. I need to fix also the simd case now.

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146758e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362327e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.729534 sec
     2,546,643,114      cycles                    #    2.649 GHz
     3,492,106,442      instructions              #    1.37  insn per cycle
       1.022481339 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.307705e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.156784 sec
    19,162,022,514      cycles                    #    2.675 GHz
    48,584,903,549      instructions              #    2.54  insn per cycle
       7.167157825 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 31, 2021
…adgraph5#176)

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.038242e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362365e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.728157 sec
     2,554,522,227      cycles                    #    2.659 GHz
     3,496,344,265      instructions              #    1.37  insn per cycle
       1.019990013 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.317495e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.115284 sec
    19,044,157,995      cycles                    #    2.674 GHz
    48,580,763,905      instructions              #    2.55  insn per cycle
       7.125600434 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  612) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.526924e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.850482 sec
    12,981,086,298      cycles                    #    2.672 GHz
    29,922,576,131      instructions              #    2.31  insn per cycle
       4.860605296 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.598681e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.650489 sec
     9,223,991,412      cycles                    #    2.522 GHz
    16,543,527,703      instructions              #    1.79  insn per cycle
       3.660490491 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.919065e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.577370 sec
     9,071,445,005      cycles                    #    2.530 GHz
    16,481,431,021      instructions              #    1.82  insn per cycle
       3.587748072 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.730098e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.010023 sec
     8,859,728,050      cycles                    #    2.205 GHz
    13,342,306,281      instructions              #    1.51  insn per cycle
       4.020793330 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 1, 2021
…ance!

This essentially completes issue madgraph5#176: it is POSSIBLE to set neppM and neppV independently.
However, for optimal performance, it is FASTER to set neppM=neppV.
This makes a big difference for eemumu, but probably much less for ggttgg and more complex processes

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146330e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351010e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.100235 sec
     3,341,971,699      cycles                    #    2.650 GHz
     4,801,041,143      instructions              #    1.44  insn per cycle
       1.396121507 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318711e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.114590 sec
    19,048,064,399      cycles                    #    2.675 GHz
    48,580,634,235      instructions              #    2.55  insn per cycle
       7.125142173 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  612) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.546985e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.826285 sec
    12,912,485,025      cycles                    #    2.672 GHz
    29,897,572,102      instructions              #    2.32  insn per cycle
       4.836678642 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.602856e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.656571 sec
     9,234,334,212      cycles                    #    2.520 GHz
    16,554,612,106      instructions              #    1.79  insn per cycle
       3.666847112 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2741) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.929594e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.574374 sec
     9,046,115,905      cycles                    #    2.528 GHz
    16,493,566,277      instructions              #    1.82  insn per cycle
       3.584645103 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2571) (512y:   93) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.718931e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.006165 sec
     8,858,689,858      cycles                    #    2.207 GHz
    13,350,658,796      instructions              #    1.51  insn per cycle
       4.016687969 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1120) (512y:  209) (512z: 2041)
=========================================================================

valassi commented Jun 1, 2021

Very interesting - especially for @roiser and @oliviermattelaer (and @hageboeck) and the Fortran integration.

I have now essentially completed the decoupling of neppM (the momentum array striding) and neppV (the SIMD vector size). Previously they were set equal, because intuitively this seemed to be the most efficient choice.

Using PR #208, neppV is still set based on the build type (as it was in the past), but it is now POSSIBLE to set neppM independently, to a value different from neppV.

HOWEVER, it turns out that it is indeed faster to use neppM=neppV. For eemumu, decoupling them gives a large performance hit (roughly a factor of 1.5); for ggttgg it is probably much less so. In practice, it is best to pass from Fortran to C++ an array that is laid out exactly the way we want it.

Paraphrasing the discussion I had with @roiser last week, there are three important points to keep in mind:

  1. Within the SIMD calculations, I use gcc/clang vectors. Before PR "Remove neppV and fptype_sv from API and decouple neppV from neppM (epoch1)" #208, in check.cc the momentum AOSOA was an array of vectors (fptype_v*). I have now transformed it into a standard C array (fptype*), with the idea that we may need to get it from Fortran, and in Fortran I do not know how to create C++ compiler vector extension vectors. This is done for the momenta (not yet for the MEs); functionally it works and the performance is also good. The trick is that I use a reinterpret_cast to manipulate the C array as if it were an array of vectors (see the sketch after this list). It seems that the compiler does a good job of using vector instructions for accessing/copying the data and doing arithmetic on them. So this problem is solved.
  2. With Stefan I had mentioned alignment. It turns out that the previous trick of using reinterpret_cast indeed only works if the data are properly aligned. I have modified the hstMakeUnique functions to return a pointer to an aligned block of data. Without this, I get some segfaults (see the PR logs). With alignment, both functionality and performance are recovered.
  3. The third point to keep in mind is (what I believe must be) data pipelining. We want to access data in the order in which it is stored, otherwise there is a lot of random jumping back and forth in memory. I have tested that I am now able to set neppM to a value different from neppV. However, in all cases (whether no SIMD with neppV=1, or SIMD AVX2 with neppV=4, for instance), performance is best when the momentum AOSOA has a striding neppM equal to neppV. So I have set this back to what it was.
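A minimal sketch of points 1 and 2 above (the names hstMakeAlignedSketch and scaleBuffer are hypothetical, not the actual hstMakeUnique or ME code, and the 64-byte alignment value is an assumption):

```cpp
// Sketch only: aligned allocation of a plain C array, then a SIMD view of it via reinterpret_cast.
#include <cassert>
#include <cstdint>
#include <cstdlib>

typedef double fptype;
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );
constexpr size_t cppAlign = 64; // assumption: 64-byte alignment (one cache line, enough for 512-bit vectors)

// Point 2: allocate a plain fptype* buffer whose start address is suitably aligned for SIMD access.
fptype* hstMakeAlignedSketch( size_t nDoubles ) // hypothetical helper; caller frees with std::free
{
  const size_t bytes = ( ( nDoubles * sizeof( fptype ) + cppAlign - 1 ) / cppAlign ) * cppAlign;
  return static_cast<fptype*>( std::aligned_alloc( cppAlign, bytes ) );
}

// Point 1: inside the SIMD code, view the plain fptype* array as an array of vectors.
void scaleBuffer( fptype* buffer, size_t nDoubles, fptype factor )
{
  assert( reinterpret_cast<std::uintptr_t>( buffer ) % ( neppV * sizeof( fptype ) ) == 0 ); // needs point 2!
  fptype_v* vbuffer = reinterpret_cast<fptype_v*>( buffer );
  for( size_t i = 0; i < nDoubles / neppV; ++i ) vbuffer[i] *= factor; // one SIMD operation per neppV doubles
}
```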

In summary, it may seem that I made a lot of changes for nothing? Not quite:

  • I think that some points are better understood
  • It is easier to make performance tests for specific choices (e.g. setting neppM different from neppV for ggttgg, eventually)
  • And especially, the Fortran code will FUNCTIONALLY work even if initially we cannot match exactly the optimal C++ striding/alignment

In practice:

  • we should try to have the same AOSOA for momenta in Fortran as in C++
  • but if we don't get that, it's not a problem (note that doing this would imply having a different Fortran layout for CUDA and for each of the SIMD backends, as they all use different optimal neppM stridings... so initially we may try with a single Fortran layout, and do all the relevant conversions on the C++ side)

Before I commit PR #208, a few points that I should keep in mind

  • test clang (I have only tested the reinterpret_cast with gcc)
  • move the MEs array also to a simple C array (fptype*) rather than an array of vectors (fptype_v*)


valassi commented Jun 1, 2021

PS: note that I naively assumed that setting neppM>neppV would be enough to get good performance. It is indeed enough to be able to use the reinterpret_cast and use SIMD; however, the performance is poor because of pipelining. For example, if neppM=8 (this was my hypothetical default, as it corresponds to the maximum 512-bit register width), then I can reinterpret this striding to fit 4 2-vectors in SSE or 2 4-vectors in AVX2 (and also 8 scalars with no SIMD), as sketched below. However, all of these incur some performance overhead, which I guess is memory access cost.
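A sketch of the kind of reinterpretation described here (assumed names and values as in the earlier sketches; the inner loop over neppV-sized chunks is where the extra memory traffic would show up):

```cpp
// Sketch only: one neppM=8 momenta page component processed as neppM/neppV SIMD chunks.
typedef double fptype;
constexpr int neppV = 4; // e.g. AVX2: the 8-event page is processed as 2 vectors of 4 doubles
constexpr int neppM = 8;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

fptype sumPx( const fptype* pagePx ) // 8 contiguous px values, one per event in the page
{
  static_assert( neppM % neppV == 0, "the page must split into whole SIMD vectors" );
  const fptype_v* vpx = reinterpret_cast<const fptype_v*>( pagePx ); // requires alignment, see above
  fptype_v vsum = vpx[0];
  for( int iv = 1; iv < neppM / neppV; ++iv ) vsum += vpx[iv]; // extra iterations = extra memory traffic
  fptype sum = 0;
  for( int i = 0; i < neppV; ++i ) sum += vsum[i];
  return sum;
}
```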

For comparison, note that in CUDA this is exactly the strategy I use: I set neppM to be large enough to fit a 32-byte cacheline, but then I know that I need several cachelines anyway. On the other hand, CUDA looks like ~32 operations in parallel, so the dimensions are quite different.

I am not completely sure which CPU dimensions are relevant here (the cache line? the size of a memory page?)... maybe to be understood.

Anyway, again: for ggttgg this will most likely be COMPLETELY irrelevant. Just like the device-to-host copies in CUDA, C++ memory access also has a cost, but it becomes completely negligible when the flop-intensive part is much, much longer. So this is more of a philosophical problem than a real problem, I think (to be checked when we have ggttgg vectorized...)


valassi commented Nov 30, 2021

I would say that the modern interpretation of this issue is:
"remove neppV and fptye_sv from the public APIs".

It is possible to hide vectorization inside the implementation, and only pass around double* arrays, provided that these arrays are aligned (and that neppM is a multiple of neppV).

I am presently supporting a whole lot of options: aligned or not, arbitrary neppM including values lower than neppV, etc. This is most probably useless. We do not want to run production code that is so slow. I just need to add a few sanity checks in the appropriate places, see the sketch below (and possibly make sure that these sanity checks do not slow down the calculation, e.g. if the alignment of every single momenta buffer had to be checked...)
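A minimal sketch of the kind of sanity checks meant here (assumed names and example values, not the actual code):

```cpp
// Sketch only: one-time sanity checks, so the hot loops can assume alignment and compatible striding.
#include <cassert>
#include <cstdint>

constexpr int neppV = 4; // example: AVX2, 4 doubles per SIMD vector
constexpr int neppM = 8; // example: momenta AOSOA striding

// Check the striding constraint at compile time...
static_assert( neppM % neppV == 0, "neppM must be a multiple of neppV" );

// ...and check buffer alignment once per buffer, not once per event, so the calculation is unaffected.
inline void checkMomentaBuffer( const double* allmomenta )
{
  assert( reinterpret_cast<std::uintptr_t>( allmomenta ) % ( neppV * sizeof( double ) ) == 0
          && "momenta buffer is not aligned for SIMD access" );
}
```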

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 30, 2021
…as5X in epochX/eemumu

Ported everything as-is (in one single commit), including stuff that eventually needs to be cleaned up
(e.g. it will probably be better to only support some given choices of neppM and neppV and fail otherwise)

This essentially completes (except for backports) issue madgraph5#176 about removing neppV from the public API...

Performance (see next commit) is almost the same.
Probably 512y eemumu loses 1% because I have removed a few vectorizations on allMEs.
Maybe I can get back a tiny bit even there.

valassi commented Dec 2, 2021

I have merged PR #208 for epoch1 and PR #300 for epochX. This issue can be closed as done.
