
Decouple neppM and neppV in vectorised C++ code? #176

Closed
valassi opened this issue Apr 25, 2021 · 5 comments
Labels
refactoring Restructure classes and interfaces, improve modularity and encapsulation

Comments


valassi commented Apr 25, 2021

This is a spinoff of vectorisation issue #71 and a followup to the big PR #171.

It is also somewhat related to issue #175 about the xxx function interface.

Presently, there are at least three types of (AO)SOA structures in the code, determined by three parameters (a rough sketch of how they enter the code follows the list below):

  • neppR determines the number of events per page in the AOSOA random number array. This is hardcoded for all configurations (cuda, cpp, double, float, all SIMD, no SIMD...), so that the same physics results are obtained. It is essentially an internal parameter of the phase space sampler (rambo), and it does not even need to be known to the random engine itself: the random engine creates a flat array of random numbers, then rambo assigns these to different events according to neppR.

  • neppM is the number of events per page in the AOSOA momenta data structure. It is known to the xxx function interface (as allmomenta, see issue "xxx" function interface: further separation of data access and calculations? #175) and to rambo (the getMomentaInitial and getMomentaFinal functions). It determines how momenta are encoded/decoded in memory, in both C++ and CUDA. For CUDA, this is tuned to a reasonable value to have coalesced memory access, fitting in 32-byte cache lines (see the studies in issue AOS/SOA for input particle 4-momenta (and random numbers) #16). For C++, this is presently coupled to neppV for vectorisation, see the next bullet.

  • neppV is the size of the vectors used in SIMD vectorized operations. This is dictated by which type of vectorization one wants to use (none: 1 double, SSE4.2 i.e. 128-bit: 2 doubles, AVX2 i.e. 256-bit: 4 doubles, 512y i.e. AVX512 with 256-bit: 4 doubles, 512z i.e. AVX512 with 512-bit: 8 doubles).
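As a rough illustration of how these parameters enter the code, here is a minimal sketch only, with assumed names and an assumed [page][particle][component][event-in-page] layout (not the actual madgraph4gpu definitions):

```cpp
// Minimal sketch (assumed names and layout, not the actual madgraph4gpu code).
typedef double fptype;

// neppV: SIMD vector size; fptype_v is a gcc/clang vector extension type (here e.g. AVX2: 4 doubles).
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

// neppM: events per page in the momenta AOSOA (assumed layout [ipagM][ipar][ip4][ieppM]).
constexpr int neppM = 4;
constexpr int npar = 4; // e.g. e+ e- -> mu+ mu-: 4 particles
constexpr int np4 = 4;  // 4-momentum components (E, px, py, pz)

// Access one momentum component of one particle for one event in the flattened AOSOA buffer.
inline fptype& momentumAOSOA( fptype* allmomenta, int ievt, int ipar, int ip4 )
{
  const int ipagM = ievt / neppM; // which AOSOA page
  const int ieppM = ievt % neppM; // which event within the page
  return allmomenta[ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM];
}
```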

Presently, the implementation of vectorization couples the AOSOA momenta structure and the wavefunction vectorization by assuming neppM == neppV. This is worse than being hardcoded: it is simply an implicit assumption spread throughout the code.

However, there is in principle no strict need to have neppM == neppV:

  • one could have an AOSOA where each momenta event page fits 2 or 4 or more SIMD vectors, i.e. neppM=2*neppV or neppM=4*neppV and so on
  • and one could even test the performance of the code without an AOSOA, with an AOS instead, i.e. with neppM=1

About the latter point, the critical observation is that there are two separate contributions to the speedup from the vectorized implementation, one from data access and one from calculations:

  • the SIMD calculation is most likely the more important one: when doing a sum for 4 events, an AVX2 implementation executes a single processor instruction instead of 4
  • however, there is also a contribution (to be assessed) from data access: if the momenta are already in an AOSOA with at least 4 events contiguous to one another, then copying these to the fptype_v that is fed to the xxx functions is faster than if the data have to be gathered from scattered memory locations (see the sketch after this list)
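The following sketch illustrates the data-access point. It reuses the assumed fptype/fptype_v/neppV definitions from the sketch above (repeated here to keep it self-contained); the function names are illustrative only:

```cpp
// Sketch only: filling one SIMD vector with the 'px' values of neppV consecutive events.
typedef double fptype;
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

// (a) AOSOA with neppM >= neppV: the neppV values are already contiguous in memory,
//     so one contiguous copy (typically a single vector load) fills the SIMD vector.
fptype_v loadContiguous( const fptype* pagePx ) // px of the first event in the page
{
  fptype_v v;
  for( int i = 0; i < neppV; ++i ) v[i] = pagePx[i];
  return v;
}

// (b) AOS (neppM=1): the same neppV values are separated by a stride of np4*npar doubles,
//     so they must be gathered element by element from scattered memory locations.
fptype_v loadStrided( const fptype* firstPx, int strideInDoubles )
{
  fptype_v v;
  for( int i = 0; i < neppV; ++i ) v[i] = firstPx[i * strideInDoubles];
  return v;
}
```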

The main benefit of decoupling neppM and neppV would be to improve the modularity and encapsulation of the code, avoiding these dangerous hidden assumptions between independent components (at the moment, rambo and the ME calculation are in a way tightly coupled by the assumption neppV=neppM). Eventually, this may also be useful, or even needed, to make it easier to interface to Fortran code.

In addition, having these as separate parameters would also give some flexibility in testing different optimisation options. For instance, it would make it possible to separately measure the speedup from SIMD arithmetic operations and from SIMD data loading.

Low priority for the moment. The code works as it is...

@valassi valassi added the refactoring Restructure classes and interfaces, improve modularity and encapsulation label Apr 25, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 31, 2021
The code builds and works successfully for cuda and no-simd c++.
Same performance as before. I need to fix also the simd case now.

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146758e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362327e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.729534 sec
     2,546,643,114      cycles                    #    2.649 GHz
     3,492,106,442      instructions              #    1.37  insn per cycle
       1.022481339 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.307705e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.156784 sec
    19,162,022,514      cycles                    #    2.675 GHz
    48,584,903,549      instructions              #    2.54  insn per cycle
       7.167157825 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 31, 2021
…adgraph5#176)

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.038242e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.362365e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.728157 sec
     2,554,522,227      cycles                    #    2.659 GHz
     3,496,344,265      instructions              #    1.37  insn per cycle
       1.019990013 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.317495e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.115284 sec
    19,044,157,995      cycles                    #    2.674 GHz
    48,580,763,905      instructions              #    2.55  insn per cycle
       7.125600434 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  612) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.526924e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.850482 sec
    12,981,086,298      cycles                    #    2.672 GHz
    29,922,576,131      instructions              #    2.31  insn per cycle
       4.860605296 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.598681e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.650489 sec
     9,223,991,412      cycles                    #    2.522 GHz
    16,543,527,703      instructions              #    1.79  insn per cycle
       3.660490491 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.919065e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.577370 sec
     9,071,445,005      cycles                    #    2.530 GHz
    16,481,431,021      instructions              #    1.82  insn per cycle
       3.587748072 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.730098e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.010023 sec
     8,859,728,050      cycles                    #    2.205 GHz
    13,342,306,281      instructions              #    1.51  insn per cycle
       4.020793330 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
=========================================================================
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 1, 2021
…ance!

This essentially completes issue madgraph5#176: it is POSSIBLE to set neppM and neppV independently.
However, for optimal performance, it is FASTER to set neppM=neppV.
This makes a big difference for eemumu, but probably much less for ggttgg and more complex processes

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.146330e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351010e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.100235 sec
     3,341,971,699      cycles                    #    2.650 GHz
     4,801,041,143      instructions              #    1.44  insn per cycle
       1.396121507 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318711e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.114590 sec
    19,048,064,399      cycles                    #    2.675 GHz
    48,580,634,235      instructions              #    2.55  insn per cycle
       7.125142173 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  612) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.546985e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.826285 sec
    12,912,485,025      cycles                    #    2.672 GHz
    29,897,572,102      instructions              #    2.32  insn per cycle
       4.836678642 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.602856e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.656571 sec
     9,234,334,212      cycles                    #    2.520 GHz
    16,554,612,106      instructions              #    1.79  insn per cycle
       3.666847112 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2741) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.929594e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.574374 sec
     9,046,115,905      cycles                    #    2.528 GHz
    16,493,566,277      instructions              #    1.82  insn per cycle
       3.584645103 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2571) (512y:   93) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.718931e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.006165 sec
     8,858,689,858      cycles                    #    2.207 GHz
    13,350,658,796      instructions              #    1.51  insn per cycle
       4.016687969 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1120) (512y:  209) (512z: 2041)
=========================================================================

valassi commented Jun 1, 2021

Very interesting - especially for @roiser and @oliviermattelaer (and @hageboeck) and the Fortran integration.

I have now essentially completed the decoupling of neppM (the momentum array striding) and neppV (the SIMD vector size). Previously they were set equal, because intuitively this seemed to be the most efficient choice.

Using PR #208, neppV is still set based on the build type (as it was in the past), but it is now POSSIBLE to set neppM independently, to a value different from neppV.

HOWEVER, it turns out that it is indeed faster to use neppM=neppV. For eemumu, decoupling them gives a large performance hit (roughly a factor of 1.5); for ggttgg it is probably much less so. In practice, it is best to pass from Fortran to C++ an array that is laid out exactly the way we want it.

Paraphrasing the discussion I had with @roiser last week, there are three important points to keep in mind:

  1. Within the SIMD calculations, I use gcc/clang vectors. Before PR "Remove neppV and fptype_sv from API and decouple neppV from neppM (epoch1)" #208, in check.cc the momentum AOSOA was an array of vectors (fptype_v*). I have now transformed it into a standard C array (fptype*), with the idea that we may need to get it from Fortran, and in Fortran I do not know how to create C++ compiler vector extension vectors. This is done for the momenta (not yet for the MEs); functionally it works and the performance is also good. The trick is that I use a reinterpret_cast to manipulate the C array as if it were an array of vectors (see the sketch after this list). It seems that the compiler does a good job of using vector instructions for accessing/copying the data and doing arithmetic on them. So this problem is solved.
  2. With Stefan I had mentioned alignment. It turns out that the previous trick of using reinterpret_cast indeed only works if the data are properly aligned. I have modified the hstMakeUnique functions to return a pointer to an aligned block of data. Without this, I get some segfaults (see the PR logs). With alignment, both functionality and performance are recovered.
  3. The third point to keep in mind is (what I believe must be) data pipelining. We want to access data in the order in which it is stored, otherwise there is a lot of random jumping back and forth in memory. I have tested that I am now able to set neppM to a value different from neppV. However, in all cases (whether no SIMD with neppV=1, or SIMD AVX2 with neppV=4, for instance), performance is best when the momentum AOSOA has a striding neppM equal to neppV. So I have set this back to what it was.
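A minimal sketch of points 1 and 2 above (the names hstMakeAlignedSketch and scaleBuffer are hypothetical, not the actual hstMakeUnique or ME code, and the 64-byte alignment value is an assumption):

```cpp
// Sketch only: aligned allocation of a plain C array, then a SIMD view of it via reinterpret_cast.
#include <cassert>
#include <cstdint>
#include <cstdlib>

typedef double fptype;
constexpr int neppV = 4;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );
constexpr size_t cppAlign = 64; // assumption: 64-byte alignment (one cache line, enough for 512-bit vectors)

// Point 2: allocate a plain fptype* buffer whose start address is suitably aligned for SIMD access.
fptype* hstMakeAlignedSketch( size_t nDoubles ) // hypothetical helper; caller frees with std::free
{
  const size_t bytes = ( ( nDoubles * sizeof( fptype ) + cppAlign - 1 ) / cppAlign ) * cppAlign;
  return static_cast<fptype*>( std::aligned_alloc( cppAlign, bytes ) );
}

// Point 1: inside the SIMD code, view the plain fptype* array as an array of vectors.
void scaleBuffer( fptype* buffer, size_t nDoubles, fptype factor )
{
  assert( reinterpret_cast<std::uintptr_t>( buffer ) % ( neppV * sizeof( fptype ) ) == 0 ); // needs point 2!
  fptype_v* vbuffer = reinterpret_cast<fptype_v*>( buffer );
  for( size_t i = 0; i < nDoubles / neppV; ++i ) vbuffer[i] *= factor; // one SIMD operation per neppV doubles
}
```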

In summary, it may seem that I made a lot of changes for nothing? Not quite:

  • I think that some points are better understood
  • It is easier to make performance tests for specific choices (e.g. setting neppM different from neppV for ggttgg, eventually)
  • And especially, the Fortran code will FUNCTIONALLY work even if initially we cannot match exactly the optimal C++ striding/alignment

In practice:

  • we should try to have the same AOSOA for momenta in Fortran as in C++
  • but if we don't get that, it's not a problem (note that doing this would imply having a different Fortran layout for CUDA and for each of the SIMD backends, as they all use different optimal neppM stridings... so initially we may try with a single Fortran layout, and do all the relevant conversions on the C++ side)

Before I commit PR #208, a few points that I should keep in mind

  • test clang (I have only tested the reinterpret_cast with gcc)
  • move the MEs array also to a simple C array (fptype*) rather than an array of vectors (fptype_v*)


valassi commented Jun 1, 2021

PS: note that I naively assumed that setting neppM>neppV would be enough to get good performance. It is indeed enough to be able to use the reinterpret_cast and use SIMD; however, the performance is poor because of pipelining. For example, if neppM=8 (this was my hypothetical default, as it corresponds to the maximum 512-bit register width), then I can reinterpret this striding to fit 4 2-vectors in SSE or 2 4-vectors in AVX2 (and also 8 scalars with no SIMD), as sketched below. However, all of these incur some performance overhead, which I guess is memory access cost.
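A sketch of the kind of reinterpretation described here (assumed names and values as in the earlier sketches; the inner loop over neppV-sized chunks is where the extra memory traffic would show up):

```cpp
// Sketch only: one neppM=8 momenta page component processed as neppM/neppV SIMD chunks.
typedef double fptype;
constexpr int neppV = 4; // e.g. AVX2: the 8-event page is processed as 2 vectors of 4 doubles
constexpr int neppM = 8;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );

fptype sumPx( const fptype* pagePx ) // 8 contiguous px values, one per event in the page
{
  static_assert( neppM % neppV == 0, "the page must split into whole SIMD vectors" );
  const fptype_v* vpx = reinterpret_cast<const fptype_v*>( pagePx ); // requires alignment, see above
  fptype_v vsum = vpx[0];
  for( int iv = 1; iv < neppM / neppV; ++iv ) vsum += vpx[iv]; // extra iterations = extra memory traffic
  fptype sum = 0;
  for( int i = 0; i < neppV; ++i ) sum += vsum[i];
  return sum;
}
```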

For comparison, note that in CUDA this is exactly the strategy I use: I set neppM to be large enough to fit a 32-byte cacheline, but then I know that I need several cachelines anyway. On the other hand, CUDA looks like ~32 operations in parallel, so the dimensions are quite different.

I am not completely sure which CPU dimensions are relevant here (the cache line? the size of a memory page?)... maybe to be understood.

Anyway, again: for ggttgg this will most likely be COMPLETELY irrelevant. Just like the device-to-host copies in CUDA, C++ memory access also has a cost, but it becomes completely negligible when the flop-intensive part is much, much longer. So this is more of a philosophical problem than a real problem, I think (to be checked when we have ggttgg vectorized...)


valassi commented Nov 30, 2021

I would say that the modern interpretation of this issue is:
"remove neppV and fptye_sv from the public APIs".

It is possible to hide vectorization inside the implementation, and only pass around double* arrays, provided that these arrays are aligned (and that neppM is a multiple of neppV).

I am presently supporting a whole lot of options: aligned or not, arbitrary neppM including values lower than neppV, etc. This is most probably useless. We do not want to run production code that is so slow. I just need to add a few sanity checks in the appropriate places, see the sketch below (and possibly make sure that these sanity checks do not slow down the calculation, e.g. if the alignment of every single momenta buffer had to be checked...)
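A minimal sketch of the kind of sanity checks meant here (assumed names and example values, not the actual code):

```cpp
// Sketch only: one-time sanity checks, so the hot loops can assume alignment and compatible striding.
#include <cassert>
#include <cstdint>

constexpr int neppV = 4; // example: AVX2, 4 doubles per SIMD vector
constexpr int neppM = 8; // example: momenta AOSOA striding

// Check the striding constraint at compile time...
static_assert( neppM % neppV == 0, "neppM must be a multiple of neppV" );

// ...and check buffer alignment once per buffer, not once per event, so the calculation is unaffected.
inline void checkMomentaBuffer( const double* allmomenta )
{
  assert( reinterpret_cast<std::uintptr_t>( allmomenta ) % ( neppV * sizeof( double ) ) == 0
          && "momenta buffer is not aligned for SIMD access" );
}
```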

valassi added a commit to valassi/madgraph4gpu that referenced this issue Nov 30, 2021
…as5X in epochX/eemumu

Ported everything as-is (in one single commit), including stuff that eventually needs to be cleaned up
(e.g. it will probably be better to only support some given choices of neppM and neppV and fail otherwise)

This essentially completes (except for backports) issue madgraph5#176 about removing neppV from the public API...

Performance (see next commit) is almost the same.
Probably 512y eemumu loses 1% because I have removed a few vectorizations on allMEs.
Maybe I can get back a tiny bit even there.

valassi commented Dec 2, 2021

I have merged PR #208 for epoch1 and PR #300 for epochX. This issue can be closed as done.
