
Option to hardcode physics parameters (with hacks to remove) #306

Merged · 34 commits · Dec 9, 2021

Conversation

@valassi (Member) commented Dec 9, 2021

This is a follow-up to #23 and #39.

Yesterday, while working on unrelated things (older WIP in klas3/klas3base in eemumu epoch1), I realised there was a performance regression of around 20% in CUDA for eemumu between epoch1 and epochX: epoch1 was using hardcoded parameters, while epochX was using parameters read from file and then set in constant memory.

I also moved epoch1 to the (slower, but default) reading of parameters from files. However, I added the option to use hardcoded physics parameters, both in C++ and CUDA, enabled by an ifdef (a sketch of the pattern is shown below). This PR is the result of that.
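For context, a minimal sketch of the ifdef pattern, assuming a hypothetical macro name MGONGPU_HARDCODE_CIPC inferred from the [hardcodeCIPC=1] tag in the logs below (names and values are illustrative, not the actual PR code):

```cpp
// Sketch only (for a .cu file): illustrative macro, names and values.
#include <cuda_runtime.h>
typedef double fptype; // assumed: the usual fptype typedef of this repo

#ifdef MGONGPU_HARDCODE_CIPC
// Compile-time path: independent parameters are constexpr constants,
// baked into the code at build time (dummy values for illustration).
constexpr fptype cIPD[2] = { 173., 1.5 }; // e.g. a mass and a width
#else
// Run-time path: parameters are read from the param_card at startup and
// then copied into CUDA constant memory before launching the kernels.
__device__ __constant__ fptype cIPD[2];
void setIPD( const fptype* ipd ) // host-side setter, hypothetical name
{
  cudaMemcpyToSymbol( cIPD, ipd, 2 * sizeof( fptype ) );
}
#endif
```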

The implementation is fully functional, but it includes a few hacks which should be improved. The complication comes from the fact that one should use constexpr in the calculation of derived parameters, and there are two issues:

  • first, sqrt is not constexpr: I worked around this by using a nice solution based on Newton-Raphson that I found on Stack Overflow (see the sketch after this list), and this is code that can remain
  • second, a lot of complex arithmetic is also not constexpr, and even something as simple as multiplying two complex numbers is not constexpr; I worked around this by manually massaging a few formulas, and by only including the constants that are relevant to eemumu, ggtt and ggttgg
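For reference, a minimal constexpr Newton-Raphson sqrt along the lines of the well-known Stack Overflow solution (a sketch, not necessarily the exact code in this PR):

```cpp
#include <limits>

// Newton-Raphson iteration: refine curr until it stops changing, entirely
// at compile time. Recursion keeps the function C++11-constexpr-friendly.
constexpr double sqrtNR( double x, double curr, double prev )
{
  return curr == prev ? curr : sqrtNR( x, 0.5 * ( curr + x / curr ), curr );
}

// Returns sqrt(x) for finite non-negative x, NaN otherwise.
constexpr double constexprSqrt( double x )
{
  return x >= 0 && x < std::numeric_limits<double>::infinity()
    ? sqrtNR( x, x, 0 )
    : std::numeric_limits<double>::quiet_NaN();
}

// Example: a derived parameter can now be computed at compile time.
static_assert( constexprSqrt( 4. ) == 2., "compile-time sqrt check" );
```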

Eventually, the hack for the second issue above should be removed:

  • the easiest way would be to use a custom complex implementation (Simple custom complex class (cxsmpl) #307, sketched after this list), which is something that I had in mind anyway to go beyond std, thrust and all the others (we really only need basic + - * /)
  • once that is fixed, it becomes possible to derive automatically generated code, so one would have to backport it in python (essentially a third set of methods, merging the two presently used for the declaration in .h and the assignment in .cc)
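A minimal sketch of such a custom complex class, assuming only the four basic operations are needed (illustrative code in the spirit of #307, not the actual cxsmpl implementation):

```cpp
// Sketch only: a constexpr-friendly complex type (multiplication shown).
template<typename FP>
class cxsmpl
{
public:
  constexpr cxsmpl( FP r = FP{}, FP i = FP{} ) : m_real( r ), m_imag( i ) {}
  constexpr FP real() const { return m_real; }
  constexpr FP imag() const { return m_imag; }
private:
  FP m_real;
  FP m_imag;
};

template<typename FP>
constexpr cxsmpl<FP> operator*( const cxsmpl<FP>& a, const cxsmpl<FP>& b )
{
  // (a+bi)(c+di) = (ac-bd) + (ad+bc)i
  return cxsmpl<FP>( a.real() * b.real() - a.imag() * b.imag(),
                     a.real() * b.imag() + a.imag() * b.real() );
}

// Unlike std::complex before C++20, this multiplication is usable in
// constexpr derived-parameter calculations:
constexpr cxsmpl<double> ii = cxsmpl<double>( 0., 1. ) * cxsmpl<double>( 0., 1. );
static_assert( ii.real() == -1. && ii.imag() == 0., "i*i == -1" );
```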

In any case, this PR is now fully functional (not WIP) and can be merged.

…epochX eemumu - add static, no change

Adding static constexpr was very important for ggttgg (issue madgraph5#283); here it seems irrelevant (a sketch of the pattern is below).
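A minimal sketch of what "add static constexpr" means here, with a dummy 2x2 color matrix (illustrative names and values, not the actual generated code):

```cpp
// Sketch only: declaring the matrix static constexpr makes it a true
// compile-time constant, instead of an array rebuilt on the stack at
// every call (which is what mattered for register pressure in #283).
inline double colorSum( const double jamp[2] )
{
  static constexpr double cf[2][2] = { { 16., -2. }, { -2., 16. } }; // dummy values
  double me = 0.;
  for( int i = 0; i < 2; i++ )
    for( int j = 0; j < 2; j++ )
      me += jamp[i] * cf[i][j] * jamp[j];
  return me;
}
```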

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.799312e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.365768e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.711972 sec
       378,500,268      cycles:u                  #    0.402 GHz
       714,043,921      instructions:u            #    1.89  insn per cycle
       1.001718451 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295117e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.205833 sec
    19,098,292,611      cycles:u                  #    2.648 GHz
    48,696,208,914      instructions:u            #    2.55  insn per cycle
       7.215278528 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.916163e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.584543 sec
     8,920,192,544      cycles:u                  #    2.485 GHz
    16,446,670,786      instructions:u            #    1.84  insn per cycle
       3.593956146 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…f change

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.045236e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351345e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.891413 sec
       716,988,854      cycles:u                  #    0.657 GHz
     1,417,713,891      instructions:u            #    1.98  insn per cycle
       1.187788844 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295190e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.202504 sec
    19,086,100,992      cycles:u                  #    2.648 GHz
    48,696,209,031      instructions:u            #    2.55  insn per cycle
       7.211954767 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.915871e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.579208 sec
     8,912,890,911      cycles:u                  #    2.487 GHz
    16,446,669,684      instructions:u            #    1.85  insn per cycle
       3.588096744 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…ochX - no hardcoded cIPC/cIPD parameters

This immediately gives a large 20% performance hit, down from 1.36E9 to 1.11E9 (issue madgraph5#39)

Note that I have only removed the cIPC/cIPD re-definition.
This should not even have built, but it was working because of silent shadowing (issue madgraph5#263); a sketch of the pitfall is below.
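A minimal sketch of the silent-shadowing pitfall referenced above (illustrative names and values; see #263):

```cpp
// Sketch only: a local array inside the kernel silently shadows the global
// __constant__ symbol of the same name, so the kernel compiles and runs
// using the local values even if the global one is never filled.
__device__ __constant__ double cIPD[2]; // meant to be set via cudaMemcpyToSymbol

__global__ void sigmaKinLike( double* out )
{
  const double cIPD[2] = { 173., 1.5 }; // shadows the global cIPD!
  out[0] = cIPD[0] * cIPD[1];           // uses the local copy, not constant memory
}
```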

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.339914e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.114434e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.919174 sec
       736,027,835      cycles:u                  #    0.672 GHz
     1,455,130,982      instructions:u            #    1.98  insn per cycle
       1.212221096 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.293167e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.207740 sec
    19,104,919,438      cycles:u                  #    2.649 GHz
    48,696,208,569      instructions:u            #    2.55  insn per cycle
       7.216573883 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.896673e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.615998 sec
     8,999,428,944      cycles:u                  #    2.485 GHz
    16,446,670,749      instructions:u            #    1.83  insn per cycle
       3.625251370 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
… cIPC

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.084043e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368415e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.899709 sec
       654,810,337      cycles:u                  #    0.615 GHz
     1,260,261,618      instructions:u            #    1.92  insn per cycle
       1.194686180 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.407555e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.815422 sec
    18,040,748,446      cycles:u                  #    2.646 GHz
    45,198,123,974      instructions:u            #    2.51  insn per cycle
       6.824563562 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.887657e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.582503 sec
     8,907,581,347      cycles:u                  #    2.483 GHz
    16,503,293,420      instructions:u            #    1.85  insn per cycle
       3.591946422 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
…device__ constexpr), not better

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.067573e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364358e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.709938 sec
       377,016,803      cycles:u                  #    0.402 GHz
       705,677,659      instructions:u            #    1.87  insn per cycle
       0.999220153 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.298532e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.194256 sec
    19,032,415,777      cycles:u                  #    2.644 GHz
    49,124,031,797      instructions:u            #    2.58  insn per cycle
       7.203532181 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  650) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.859426e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.605767 sec
     8,946,079,220      cycles:u                  #    2.479 GHz
    16,534,751,901      instructions:u            #    1.85  insn per cycle
       3.614755261 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2671) (512y:   52) (512z:    0)
=========================================================================
…lar for both cuda and c++?

(A speedup had been noticed for CUDA in issue madgraph5#283)

Note: the C++ 1.40E6 seems real, it is not a fluctuation - the number of symbols changes significantly.
But this is the same performance as two commits earlier, so I will go back to fd2ed7cccfd1ae9860a94b5f3b106a5ea5926814

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.980252e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.366042e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.712567 sec
       377,011,388      cycles:u                  #    0.401 GHz
       695,726,819      instructions:u            #    1.85  insn per cycle
       1.001320452 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.406692e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.828146 sec
    18,064,378,966      cycles:u                  #    2.644 GHz
    45,198,124,436      instructions:u            #    2.50  insn per cycle
       6.836933557 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.900702e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.580288 sec
     8,900,139,517      cycles:u                  #    2.482 GHz
    16,503,293,690      instructions:u            #    1.85  insn per cycle
       3.589493433 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
…ems similar for both cuda and c++?"

Revert "[hrdcod] try to use constexpr cIPC (but must move it as cannot use __device__ constexpr), not better"

This reverts commit e5dbcbf.
This reverts commit ddb6ad2.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.092406e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368487e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.755564 sec
       379,213,444      cycles:u                  #    0.387 GHz
       704,182,459      instructions:u            #    1.86  insn per cycle
       1.045784696 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.402354e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.827715 sec
    18,100,940,623      cycles:u                  #    2.649 GHz
    45,198,123,843      instructions:u            #    2.50  insn per cycle
       6.836828051 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.881582e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.582493 sec
     8,913,059,897      cycles:u                  #    2.484 GHz
    16,503,293,863      instructions:u            #    1.85  insn per cycle
       3.591563196 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.332524e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.112207e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.832664 sec
     1,008,402,735      cycles:u                  #    1.136 GHz
     1,991,015,340      instructions:u            #    1.97  insn per cycle
       1.123411276 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.302263e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.174307 sec
    19,018,111,844      cycles:u                  #    2.649 GHz
    48,696,210,444      instructions:u            #    2.56  insn per cycle
       7.183149713 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.910747e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.587155 sec
     8,923,946,197      cycles:u                  #    2.484 GHz
    16,446,671,076      instructions:u            #    1.84  insn per cycle
       3.596330694 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…use this in production (madgraph5#229)

In particular, ggttgg is slower with inlining, so it is not useful to focus on eemumu
…logs for tests

(using the copyLogs script every time is not really needed)
…n - generate eemumu, a few diffs remain

The difficult part now is to automatically generate the "correct" hardcoded parameters...
…ed physics parameters.

Note two delicate technicalities:
- there is no constexpr sqrt; I included a version copied from Stack Overflow
- std::complex arithmetic is not constexpr; I had to redefine a multiplication in an easier way (a sketch is below)
Note: I renounced doing a proper automatic generation; I am hardcoding what is enough for eemumu and ggttgg
The trick will be to handle those std::complex constexpr: probably a custom complex class will help...

This hacky version is almost complete - just need to assign the process-dependent hardcoded parameters to cIPC and cIPD
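A minimal sketch of the kind of manual massaging meant above, with a hypothetical coupling name and value (not the actual generated code):

```cpp
#include <complex>

// Hypothetical input value, for illustration only.
constexpr double ee = 0.30795376724436879;

// Not constexpr before C++20, because std::complex operator* is not constexpr:
//   constexpr std::complex<double> GC_3 = -( std::complex<double>( 0., 1. ) * ee ) / 2.;

// Manually massaged equivalent: the trivial complex algebra is written out
// by hand into real and imaginary parts, which are plain constexpr doubles.
constexpr double GC_3_real = 0.;
constexpr double GC_3_imag = -ee / 2.;
```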
…hout hardcoding - all ok

Note that the performance is not obviously better with hardcoding, neither in CUDA nor in C++
…ithout hardcoding - all ok

Note that the performance is not obviously better with hardcoding, neither in CUDA nor in C++
…d parameter printout, disable irrelevant code

All ok in manual eemumu
…arameters class with hardcoded parameters etc
…hout hardcoding - all ok

Performance with/without hardcoding is similar, but hardcoding does decrease registers from 172 to 166 for ggtt
…and rerun tests with/without hardcoding - all ok

Essentially:
 ./CODEGEN/generateAndCompare.sh gg_ttgg
 ./CODEGEN/syncManu.sh -ggttgg
 ./tput/teeThroughputX.sh -makej -ggttgg -flt
 ./tput/teeThroughputX.sh -makej -ggttgg -hrdonly
@valassi (Member, Author) commented Dec 9, 2021

This is only adding an optional feature - I am self-merging to sync more easily with other developments

@valassi valassi merged commit c93bd14 into madgraph5:master Dec 9, 2021
@valassi valassi self-assigned this Dec 9, 2021
@valassi (Member, Author) commented Dec 9, 2021

Note that this PR implements separate alternative portions of the code with/without hardcoding of parameters, depending on an ifdef. This should reduce the risk of the silent shadowing issue described in #263
