
Option to hardcode physics parameters (with hacks to remove) #306

Merged · 34 commits · Dec 9, 2021

Conversation

@valassi (Member) commented Dec 9, 2021

This is a follow-up to #23 and #39.

Yesterday, while working on unrelated things (older WIP in klas3/klas3base in eemumu epoch1), I realised there was a performance regression of around 20% in CUDA for eemumu between epoch1 and epochX: epoch1 was using hardcoded parameters, while epochX was using parameters read from file and then set in constant memory.

I also moved epoch1 to the (slower, but default) reading of parameters from files. However, I added the option to use hardcoded physics parameters, both in C++ and CUDA, enabled by an ifdef (a sketch of the pattern is shown below). This PR is the result of that.
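For context, a minimal sketch of the ifdef pattern, assuming a hypothetical macro name MGONGPU_HARDCODE_CIPC inferred from the [hardcodeCIPC=1] tag in the logs below (names and values are illustrative, not the actual PR code):

```cpp
// Sketch only (for a .cu file): illustrative macro, names and values.
#include <cuda_runtime.h>
typedef double fptype; // assumed: the usual fptype typedef of this repo

#ifdef MGONGPU_HARDCODE_CIPC
// Compile-time path: independent parameters are constexpr constants,
// baked into the code at build time (dummy values for illustration).
constexpr fptype cIPD[2] = { 173., 1.5 }; // e.g. a mass and a width
#else
// Run-time path: parameters are read from the param_card at startup and
// then copied into CUDA constant memory before launching the kernels.
__device__ __constant__ fptype cIPD[2];
void setIPD( const fptype* ipd ) // host-side setter, hypothetical name
{
  cudaMemcpyToSymbol( cIPD, ipd, 2 * sizeof( fptype ) );
}
#endif
```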

The implementation is fully functional, but it includes a few hacks which should be improved. The complication comes from the fact that one should use constexpr in the calculation of derived parameters, and there are two issues:

  • first, sqrt is not constexpr: I worked around this by using a nice solution based on Newton-Raphson that I found on Stack Overflow (see the sketch after this list), and this is code that can remain
  • second, a lot of complex arithmetic is also not constexpr, and even something as simple as multiplying two complex numbers is not constexpr; I worked around this by manually massaging a few formulas, and by only including the constants that are relevant to eemumu, ggtt and ggttgg
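For reference, a minimal constexpr Newton-Raphson sqrt along the lines of the well-known Stack Overflow solution (a sketch, not necessarily the exact code in this PR):

```cpp
#include <limits>

// Newton-Raphson iteration: refine curr until it stops changing, entirely
// at compile time. Recursion keeps the function C++11-constexpr-friendly.
constexpr double sqrtNR( double x, double curr, double prev )
{
  return curr == prev ? curr : sqrtNR( x, 0.5 * ( curr + x / curr ), curr );
}

// Returns sqrt(x) for finite non-negative x, NaN otherwise.
constexpr double constexprSqrt( double x )
{
  return x >= 0 && x < std::numeric_limits<double>::infinity()
    ? sqrtNR( x, x, 0 )
    : std::numeric_limits<double>::quiet_NaN();
}

// Example: a derived parameter can now be computed at compile time.
static_assert( constexprSqrt( 4. ) == 2., "compile-time sqrt check" );
```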

Eventually, the hack for the second issue above should be removed:

  • the easiest way would be to use a custom complex implementation (Simple custom complex class (cxsmpl) #307, sketched after this list), which is something that I had in mind anyway to go beyond std, thrust and all the others (we really only need basic + - * /)
  • once that is fixed, it becomes possible to derive automatically generated code, so one would have to backport it in python (essentially a third set of methods, merging the two presently used for the declaration in .h and the assignment in .cc)
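A minimal sketch of such a custom complex class, assuming only the four basic operations are needed (illustrative code in the spirit of #307, not the actual cxsmpl implementation):

```cpp
// Sketch only: a constexpr-friendly complex type (multiplication shown).
template<typename FP>
class cxsmpl
{
public:
  constexpr cxsmpl( FP r = FP{}, FP i = FP{} ) : m_real( r ), m_imag( i ) {}
  constexpr FP real() const { return m_real; }
  constexpr FP imag() const { return m_imag; }
private:
  FP m_real;
  FP m_imag;
};

template<typename FP>
constexpr cxsmpl<FP> operator*( const cxsmpl<FP>& a, const cxsmpl<FP>& b )
{
  // (a+bi)(c+di) = (ac-bd) + (ad+bc)i
  return cxsmpl<FP>( a.real() * b.real() - a.imag() * b.imag(),
                     a.real() * b.imag() + a.imag() * b.real() );
}

// Unlike std::complex before C++20, this multiplication is usable in
// constexpr derived-parameter calculations:
constexpr cxsmpl<double> ii = cxsmpl<double>( 0., 1. ) * cxsmpl<double>( 0., 1. );
static_assert( ii.real() == -1. && ii.imag() == 0., "i*i == -1" );
```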

In any case, this PR is now fully functional (not WIP) and can be merged.

…epochX eemumu - add static, no change

Adding static constexpr was very important for ggttgg (issue madgraph5#283); here it seems irrelevant (a sketch of the pattern is below).
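A minimal sketch of what "add static constexpr" means here, with a dummy 2x2 color matrix (illustrative names and values, not the actual generated code):

```cpp
// Sketch only: declaring the matrix static constexpr makes it a true
// compile-time constant, instead of an array rebuilt on the stack at
// every call (which is what mattered for register pressure in #283).
inline double colorSum( const double jamp[2] )
{
  static constexpr double cf[2][2] = { { 16., -2. }, { -2., 16. } }; // dummy values
  double me = 0.;
  for( int i = 0; i < 2; i++ )
    for( int j = 0; j < 2; j++ )
      me += jamp[i] * cf[i][j] * jamp[j];
  return me;
}
```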

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.799312e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.365768e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.711972 sec
       378,500,268      cycles:u                  #    0.402 GHz
       714,043,921      instructions:u            #    1.89  insn per cycle
       1.001718451 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295117e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.205833 sec
    19,098,292,611      cycles:u                  #    2.648 GHz
    48,696,208,914      instructions:u            #    2.55  insn per cycle
       7.215278528 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.916163e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.584543 sec
     8,920,192,544      cycles:u                  #    2.485 GHz
    16,446,670,786      instructions:u            #    1.84  insn per cycle
       3.593956146 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…f change

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.045236e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351345e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.891413 sec
       716,988,854      cycles:u                  #    0.657 GHz
     1,417,713,891      instructions:u            #    1.98  insn per cycle
       1.187788844 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295190e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.202504 sec
    19,086,100,992      cycles:u                  #    2.648 GHz
    48,696,209,031      instructions:u            #    2.55  insn per cycle
       7.211954767 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.915871e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.579208 sec
     8,912,890,911      cycles:u                  #    2.487 GHz
    16,446,669,684      instructions:u            #    1.85  insn per cycle
       3.588096744 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…ochX - no hardcoded cIPC/cIPD parameters

This immediately gives a large 20% performance hit, down from 1.36E9 to 1.11E9 (issue madgraph5#39)

Note that I have only removed the cIPC/cIPD re-definition.
This should not even have built, but it was working because of silent shadowing (issue madgraph5#263); a sketch of the pitfall is below.
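A minimal sketch of the silent-shadowing pitfall referenced above (illustrative names and values; see #263):

```cpp
// Sketch only: a local array inside the kernel silently shadows the global
// __constant__ symbol of the same name, so the kernel compiles and runs
// using the local values even if the global one is never filled.
__device__ __constant__ double cIPD[2]; // meant to be set via cudaMemcpyToSymbol

__global__ void sigmaKinLike( double* out )
{
  const double cIPD[2] = { 173., 1.5 }; // shadows the global cIPD!
  out[0] = cIPD[0] * cIPD[1];           // uses the local copy, not constant memory
}
```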

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.339914e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.114434e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.919174 sec
       736,027,835      cycles:u                  #    0.672 GHz
     1,455,130,982      instructions:u            #    1.98  insn per cycle
       1.212221096 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.293167e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.207740 sec
    19,104,919,438      cycles:u                  #    2.649 GHz
    48,696,208,569      instructions:u            #    2.55  insn per cycle
       7.216573883 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.896673e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.615998 sec
     8,999,428,944      cycles:u                  #    2.485 GHz
    16,446,670,749      instructions:u            #    1.83  insn per cycle
       3.625251370 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
… cIPC

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.084043e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368415e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.899709 sec
       654,810,337      cycles:u                  #    0.615 GHz
     1,260,261,618      instructions:u            #    1.92  insn per cycle
       1.194686180 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.407555e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.815422 sec
    18,040,748,446      cycles:u                  #    2.646 GHz
    45,198,123,974      instructions:u            #    2.51  insn per cycle
       6.824563562 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.887657e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.582503 sec
     8,907,581,347      cycles:u                  #    2.483 GHz
    16,503,293,420      instructions:u            #    1.85  insn per cycle
       3.591946422 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
…device__ constexpr), not better

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.067573e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364358e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.709938 sec
       377,016,803      cycles:u                  #    0.402 GHz
       705,677,659      instructions:u            #    1.87  insn per cycle
       0.999220153 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.298532e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.194256 sec
    19,032,415,777      cycles:u                  #    2.644 GHz
    49,124,031,797      instructions:u            #    2.58  insn per cycle
       7.203532181 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  650) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.859426e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.605767 sec
     8,946,079,220      cycles:u                  #    2.479 GHz
    16,534,751,901      instructions:u            #    1.85  insn per cycle
       3.614755261 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2671) (512y:   52) (512z:    0)
=========================================================================
…lar for both cuda and c++?

(A speedup had been noticed for CUDA in issue madgraph5#283)

Note: the C++ 1.40E6 seems real, it is not a fluctuation - the number of symbols changes significantly.
But this is the same performance as two commits earlier, so I will go back to fd2ed7cccfd1ae9860a94b5f3b106a5ea5926814

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.980252e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.366042e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.712567 sec
       377,011,388      cycles:u                  #    0.401 GHz
       695,726,819      instructions:u            #    1.85  insn per cycle
       1.001320452 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.406692e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.828146 sec
    18,064,378,966      cycles:u                  #    2.644 GHz
    45,198,124,436      instructions:u            #    2.50  insn per cycle
       6.836933557 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.900702e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.580288 sec
     8,900,139,517      cycles:u                  #    2.482 GHz
    16,503,293,690      instructions:u            #    1.85  insn per cycle
       3.589493433 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
…ems similar for both cuda and c++?"

Revert "[hrdcod] try to use constexpr cIPC (but must move it as cannot use __device__ constexpr), not better"

This reverts commit e5dbcbf.
This reverts commit ddb6ad2.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.092406e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368487e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.755564 sec
       379,213,444      cycles:u                  #    0.387 GHz
       704,182,459      instructions:u            #    1.86  insn per cycle
       1.045784696 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.402354e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     6.827715 sec
    18,100,940,623      cycles:u                  #    2.649 GHz
    45,198,123,843      instructions:u            #    2.50  insn per cycle
       6.836828051 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  709) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.881582e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.582493 sec
     8,913,059,897      cycles:u                  #    2.484 GHz
    16,503,293,863      instructions:u            #    1.85  insn per cycle
       3.591563196 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2668) (512y:   52) (512z:    0)
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.332524e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.112207e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.832664 sec
     1,008,402,735      cycles:u                  #    1.136 GHz
     1,991,015,340      instructions:u            #    1.97  insn per cycle
       1.123411276 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.302263e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.174307 sec
    19,018,111,844      cycles:u                  #    2.649 GHz
    48,696,210,444      instructions:u            #    2.56  insn per cycle
       7.183149713 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  636) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.910747e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.587155 sec
     8,923,946,197      cycles:u                  #    2.484 GHz
    16,446,671,076      instructions:u            #    1.84  insn per cycle
       3.596330694 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2704) (512y:   52) (512z:    0)
=========================================================================
…use this in production (madgraph5#229)

In particular, ggttgg is slower with inlining, so it is not useful to focus on eemumu
…logs for tests

(using the copyLogs script every time is not really needed)
…n - generate eemumu, a few diffs remain

The difficult part now is to automatically generate the "correct" hardcoded parameters...
…ed physics parameters.

Note two delicate technicalities:
- there is no constexpr sqrt; I included a version copied from Stack Overflow
- std::complex arithmetic is not constexpr; I had to redefine a multiplication in an easier way (a sketch is below)
Note: I renounced doing a proper automatic generation; I am hardcoding what is enough for eemumu and ggttgg
The trick will be to handle those std::complex constexpr: probably a custom complex class will help...

This hacky version is almost complete - just need to assign the process-dependent hardcoded parameters to cIPC and cIPD
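A minimal sketch of the kind of manual massaging meant above, with a hypothetical coupling name and value (not the actual generated code):

```cpp
#include <complex>

// Hypothetical input value, for illustration only.
constexpr double ee = 0.30795376724436879;

// Not constexpr before C++20, because std::complex operator* is not constexpr:
//   constexpr std::complex<double> GC_3 = -( std::complex<double>( 0., 1. ) * ee ) / 2.;

// Manually massaged equivalent: the trivial complex algebra is written out
// by hand into real and imaginary parts, which are plain constexpr doubles.
constexpr double GC_3_real = 0.;
constexpr double GC_3_imag = -ee / 2.;
```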
…hout hardcoding - all ok

Note that the performance is not obviously better with hardcoding, neither in CUDA nor in C++
…ithout hardcoding - all ok

Note that the performance is not obviously better with hardcoding, neither in CUDA nor in C++
…d parameter printout, disable irrelevant code

All ok in manual eemumu
…arameters class with hardcoded parameters etc
…hout hardcoding - all ok

Performance with/without hardcoding is similar, but hardcoding does decrease registers from 172 to 166 for ggtt
…and rerun tests with/without hardcoding - all ok

Essentially:
 ./CODEGEN/generateAndCompare.sh gg_ttgg
 ./CODEGEN/syncManu.sh -ggttgg
 ./tput/teeThroughputX.sh -makej -ggttgg -flt
 ./tput/teeThroughputX.sh -makej -ggttgg -hrdonly
@valassi (Member, Author) commented Dec 9, 2021

This is only adding an optional feature - I am self-merging to sync more easily with other developments

@valassi valassi merged commit c93bd14 into madgraph5:master Dec 9, 2021
@valassi valassi self-assigned this Dec 9, 2021
@valassi (Member, Author) commented Dec 9, 2021

Note that this PR implements separate alternative portions of the code with/without hardcoding of parameters, depending on an ifdef. This should reduce the risk of the silent shadowing issue described in #263
