Decouple neppM and neppV in vectorised C++ code? #176
Comments
The code builds and works successfully for cuda and no-simd c++. Same performance as before. I need to fix also the simd case now.
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146758e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362327e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.729534 sec
2,546,643,114 cycles # 2.649 GHz
3,492,106,442 instructions # 1.37 insn per cycle
1.022481339 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.307705e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.156784 sec
19,162,022,514 cycles # 2.675 GHz
48,584,903,549 instructions # 2.54 insn per cycle
7.167157825 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
…adgraph5#176)
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.038242e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362365e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.728157 sec
2,554,522,227 cycles # 2.659 GHz
3,496,344,265 instructions # 1.37 insn per cycle
1.019990013 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.317495e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.115284 sec
19,044,157,995 cycles # 2.674 GHz
48,580,763,905 instructions # 2.55 insn per cycle
7.125600434 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 612) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.526924e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.850482 sec
12,981,086,298 cycles # 2.672 GHz
29,922,576,131 instructions # 2.31 insn per cycle
4.860605296 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.598681e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.650489 sec
9,223,991,412 cycles # 2.522 GHz
16,543,527,703 instructions # 1.79 insn per cycle
3.660490491 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.919065e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.577370 sec
9,071,445,005 cycles # 2.530 GHz
16,481,431,021 instructions # 1.82 insn per cycle
3.587748072 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.730098e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.010023 sec
8,859,728,050 cycles # 2.205 GHz
13,342,306,281 instructions # 1.51 insn per cycle
4.020793330 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
=========================================================================
…ance! This essentially completes issue madgraph5#176: it is POSSIBLE to set neppM and neppV independently. However, for optimal performance, it is FASTER to set neppM=neppV. This makes a big difference for eemumu, but probably much less for ggttgg and more complex processes.
On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146330e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351010e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.100235 sec
3,341,971,699 cycles # 2.650 GHz
4,801,041,143 instructions # 1.44 insn per cycle
1.396121507 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318711e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.114590 sec
19,048,064,399 cycles # 2.675 GHz
48,580,634,235 instructions # 2.55 insn per cycle
7.125142173 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 612) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.546985e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.826285 sec
12,912,485,025 cycles # 2.672 GHz
29,897,572,102 instructions # 2.32 insn per cycle
4.836678642 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.602856e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.656571 sec
9,234,334,212 cycles # 2.520 GHz
16,554,612,106 instructions # 1.79 insn per cycle
3.666847112 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2741) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.929594e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.574374 sec
9,046,115,905 cycles # 2.528 GHz
16,493,566,277 instructions # 1.82 insn per cycle
3.584645103 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2571) (512y: 93) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.718931e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.006165 sec
8,858,689,858 cycles # 2.207 GHz
13,350,658,796 instructions # 1.51 insn per cycle
4.016687969 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1120) (512y: 209) (512z: 2041)
=========================================================================
Very interesting - especially for @roiser and @oliviermattelaer (and @hageboeck) and the Fortran integration. I have now essentially completed the decoupling of neppM (the momentum array striding) and neppV (the SIMD vector size). Previously they were set equal, because intuitively this seemed to be the most efficient choice. With PR #208, neppV is still set based on the build type (as it was in the past), but it is now POSSIBLE to set neppM independently, to a value different from neppV. HOWEVER, it turns out that it is indeed faster to use neppM=neppV. For eemumu, decoupling them is a large performance hit (a factor of roughly 1.5); for ggttgg it is probably much less so. In practice, it is best to pass from Fortran to C++ an array that is laid out exactly the way we want it. Paraphrasing the discussion I had with @roiser last week, there are three important points to keep in mind:
In summary, it seems that I did a lot of changes for nothing? Not quite.
In practice:
Before I commit PR #208, a few points that I should keep in mind:
PS note that naively I assumed that setting neppM>neppV would be enough to have good performance. This is indeed enough to be able to use reinterpret_cast and use SIMD, but the performance is bad because of pipelining. Example: if neppM=8 (this was my hypothetical default, as it corresponds to the maximum 512-bit register width), then I can reinterpret this striding to fit 4 two-vectors in SSE or 2 four-vectors in AVX2 (and also 8 scalars with no SIMD). However, all of these cause some performance overhead, which I guess is memory access cost. For comparison, note that in CUDA this is exactly the strategy I use: I set neppM to be large enough to fit a 32-byte cacheline, but then I know that I need several cachelines anyway. On the other hand, CUDA executes ~32 operations in parallel (one warp), so the dimensions are quite different. I am not completely sure which CPU dimensions are relevant here (the cacheline? the size of a memory page?)... maybe to be understood. Anyway, again: for ggttgg most likely this will be COMPLETELY irrelevant. Just like the device-to-host copies in CUDA, C++ memory access also has a cost, but it becomes completely negligible when the flops-intensive part is much, much longer. So more of a philosophical problem than a real problem I think (to be checked when we have ggttgg vectorized...)
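For illustration only (not the actual PR code), here is a minimal sketch with gcc vector extensions of what "reinterpreting the striding" means when a hypothetical neppM=8 page is consumed by 2-wide (SSE-like) vectors; the function and constant names are invented for the example:

```cpp
#include <cassert>
#include <cstdint>

typedef double fptype;
typedef fptype fptype_v2 __attribute__( ( vector_size( 2 * sizeof( fptype ) ) ) ); // 128-bit vector of 2 doubles

constexpr int neppM8 = 8; // hypothetical momenta stride: one page of 8 doubles (64 bytes)

// Sum one momentum component over a page of neppM8 events using 2-wide vectors:
// the reinterpret works, but each page now needs 4 separate vector loads and adds,
// which is where the extra overhead discussed above comes in.
inline fptype sumPage_v2( const fptype* page ) // page points to neppM8 contiguous doubles
{
  assert( reinterpret_cast<std::uintptr_t>( page ) % ( 2 * sizeof( fptype ) ) == 0 ); // alignment
  const fptype_v2* v = reinterpret_cast<const fptype_v2*>( page );
  fptype_v2 acc = v[0];
  for( int i = 1; i < neppM8 / 2; ++i ) acc += v[i]; // 4 two-vectors per page
  return acc[0] + acc[1];
}
```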
I would say that the modern interpretation of this issue is: it is possible to hide vectorization inside the implementation, and only pass around double* arrays, provided that these arrays are aligned (and that neppM is a multiple of neppV). I am presently supporting a whole lot of options: aligned or not, arbitrary neppM including values lower than neppV, etc. This is most probably useless; we do not want to run production code that is so slow. I just need to add a few sanity checks in the appropriate places (and possibly make sure that these sanity checks do not slow down the calculation, e.g. checking the alignment of every single momenta buffer...)
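As a concrete illustration of the kind of sanity checks mentioned above (the names and values are hypothetical assumptions, not the actual PR #208 code): a compile-time check that neppM is a multiple of neppV, plus a one-off alignment check on the buffer pointer rather than on every momentum access:

```cpp
#include <cassert>
#include <cstdint>

constexpr int neppM = 8; // momenta AOSOA stride (hypothetical value)
constexpr int neppV = 4; // SIMD vector width in doubles (hypothetical value)

// Compile-time sanity check: costs nothing at runtime.
static_assert( neppM % neppV == 0, "neppM must be a multiple of neppV" );

// One-off runtime check on the buffer pointer (not inside the event loop),
// so it cannot slow down the calculation itself.
inline void checkMomentaAlignment( const double* allmomenta )
{
  constexpr std::uintptr_t align = neppV * sizeof( double );
  assert( reinterpret_cast<std::uintptr_t>( allmomenta ) % align == 0 );
}
```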
…as5X in epochX/eemumu
Ported everything as-is (in one single commit), including stuff that eventually needs to be cleaned up (e.g. it will probably be better to only support some given choices of neppM and neppV and fail otherwise). This essentially completes (except for backports) issue madgraph5#176 about removing neppV from the public API... Performance (see next commit) is almost the same. Probably 512y eemumu loses 1% because I have removed a few vectorizations on allMEs. Maybe I can get back a tiny bit even there.
This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.
It is also somewhat related to issue #175 about the xxx function interface.
Presently, there are at least three types of (AO)SOA structures in the code, determined by three parameters:
neppR determines the number of events per page in the AOSOA random number array. This is hardcoded to the same value for all configurations (cuda, cpp, double, float, all SIMD, no SIMD...), so that the same physics results are obtained. It is essentially an internal parameter of the phase space sampler (rambo), and it does not even need to be known to the random engine itself. The random engine creates a flat array of random numbers, then rambo assigns these to different events according to neppR.
neppM is the number of events per page in the AOSOA momenta data structure. It is known to the xxx function interface (as allmomenta, see issue #175) and to rambo (the getMomentaInitial and getMomentaFinal functions). This determines how the encoding/decoding of momenta in memory is done, in both c++ and cuda (a minimal indexing sketch is given after this list). For cuda, this is tuned to a reasonable value to have coalesced memory access, fitting in 32-byte cache lines (see the studies in issue #16). For c++, this is presently coupled to neppV for vectorisation, see the next bullet.
neppV is the size of the vectors used in SIMD vectorized operations. This is dictated by which type of vectorization one wants to have (none: 1 double, SSE4.2 i.e. 128-bit: 2 doubles, AVX2 i.e. 256-bit: 4 doubles, 512y i.e. AVX512 with 256-bit: 4 doubles, 512z i.e. AVX512 with 512-bit: 8 doubles).
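For concreteness, here is a minimal sketch of the AOSOA indexing implied by neppM (the helper name, the constants and the exact index order are illustrative assumptions, not a quote of the actual code):

```cpp
// Hypothetical AOSOA layout: momenta stored as [npagM][npar][np4][neppM], i.e. events
// are grouped into pages of neppM, and within a page the neppM values of a given
// particle and 4-momentum component are contiguous in memory.
constexpr int npar = 4;  // number of external particles (e.g. 4 for e+ e- -> mu+ mu-)
constexpr int np4 = 4;   // 4-momentum components: E, px, py, pz
constexpr int neppM = 4; // events per page in the momenta AOSOA (hypothetical value)
// For reference, neppV would be 1 ('none'), 2 ('sse4'), 4 ('avx2'/'512y') or 8 ('512z') doubles.

inline const double& momentum( const double* allmomenta, int ievt, int ipar, int ip4 )
{
  const int ipagM = ievt / neppM; // page index
  const int ieppM = ievt % neppM; // event index within the page
  return allmomenta[ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM];
}
```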
Presently, the implementation of vectorization couples the AOSOA momenta structure and the wavefunction vectorization by assuming neppM == neppV. This is worse than a hardcoded setting: it is an implicit assumption spread throughout the code.
However, there is in principle no strict need to have neppM == neppV:
About the latter point, the critical issue is that there are two separate contributions to the speedup from the vectorized implementation: one from data access and one from calculations.
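A minimal sketch of that distinction, using gcc vector extensions (the function names and widths are illustrative assumptions, not the actual code): when neppM == neppV and the buffer is aligned, one AOSOA page can be loaded as a single SIMD vector, while for neppM != neppV the vector must be filled element by element before the (still vectorized) arithmetic:

```cpp
typedef double fptype;
constexpr int neppV = 4; // hypothetical 256-bit (avx2-like) width: 4 doubles
typedef fptype fptype_sv __attribute__( ( vector_size( 4 * sizeof( fptype ) ) ) );

// neppM == neppV and aligned buffer: one AOSOA page IS one SIMD vector,
// so the data-access speedup comes for free with a single reinterpreting load.
inline fptype_sv loadCoupled( const fptype* page ) // page: neppV contiguous, aligned doubles
{
  return *reinterpret_cast<const fptype_sv*>( page );
}

// neppM != neppV: the vector must be filled element by element from wherever the
// events actually live (extra data-access cost), even though the arithmetic on the
// filled vector is still fully SIMD (the calculation speedup is preserved).
inline fptype_sv loadDecoupled( const fptype* buffer, const int indices[/*neppV*/] )
{
  fptype_sv out;
  for( int i = 0; i < neppV; ++i ) out[i] = buffer[indices[i]]; // gather one value per event
  return out;
}
```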
The main benefit of decoupling neppM and neppV would be to improve the modularity and encapsulation of the code, avoiding these dangerous hidden assumptions shared across independent components (at the moment rambo and the ME calculations are tightly coupled by the assumption neppV=neppM). Eventually, this may also be useful/needed to make it easier to interface to Fortran code.
In addition, having these as separate parameters would also give some flexibility in testing different optimisation options. For instance, it would make it possible to separately measure the speedup from SIMD arithmetic operations and from SIMD data loading.
Low priority for the moment. The code works as it is...