Option to hardcode physics parameters (with hacks to remove) #306
Merged
Conversation
…ter for eemumu but not ggttgg
…epochX eemumu - add static, no change

Adding static constexpr was very important for ggttgg (issue madgraph5#283); here it seems irrelevant.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.799312e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.365768e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.711972 sec
378,500,268 cycles:u # 0.402 GHz
714,043,921 instructions:u # 1.89 insn per cycle
1.001718451 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295117e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.205833 sec
19,098,292,611 cycles:u # 2.648 GHz
48,696,208,914 instructions:u # 2.55 insn per cycle
7.215278528 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 636) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.916163e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.584543 sec
8,920,192,544 cycles:u # 2.485 GHz
16,446,670,786 instructions:u # 1.84 insn per cycle
3.593956146 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2704) (512y: 52) (512z: 0)
=========================================================================
…f change

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.045236e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351345e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.891413 sec
716,988,854 cycles:u # 0.657 GHz
1,417,713,891 instructions:u # 1.98 insn per cycle
1.187788844 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.295190e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.202504 sec
19,086,100,992 cycles:u # 2.648 GHz
48,696,209,031 instructions:u # 2.55 insn per cycle
7.211954767 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 636) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.915871e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.579208 sec
8,912,890,911 cycles:u # 2.487 GHz
16,446,669,684 instructions:u # 1.85 insn per cycle
3.588096744 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2704) (512y: 52) (512z: 0)
=========================================================================
…ochX - no hardcoded cIPC/cIPD parameters

This immediately gives a large 20% performance hit, down from 1.36E9 to 1.11E9 (issue madgraph5#39). Note that I have only removed the cIPC/cIPD re-definition. This should not even have built, but it was working because of silent shadowing (issue madgraph5#263).

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.339914e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.114434e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.919174 sec
736,027,835 cycles:u # 0.672 GHz
1,455,130,982 instructions:u # 1.98 insn per cycle
1.212221096 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.293167e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.207740 sec
19,104,919,438 cycles:u # 2.649 GHz
48,696,208,569 instructions:u # 2.55 insn per cycle
7.216573883 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 636) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.896673e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.615998 sec
8,999,428,944 cycles:u # 2.485 GHz
16,446,670,749 instructions:u # 1.83 insn per cycle
3.625251370 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2704) (512y: 52) (512z: 0)
=========================================================================
… cIPC

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.084043e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368415e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.899709 sec
654,810,337 cycles:u # 0.615 GHz
1,260,261,618 instructions:u # 1.92 insn per cycle
1.194686180 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.407555e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 6.815422 sec
18,040,748,446 cycles:u # 2.646 GHz
45,198,123,974 instructions:u # 2.51 insn per cycle
6.824563562 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 709) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.887657e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.582503 sec
8,907,581,347 cycles:u # 2.483 GHz
16,503,293,420 instructions:u # 1.85 insn per cycle
3.591946422 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2668) (512y: 52) (512z: 0)
=========================================================================
…device__ constexpr), not better

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.067573e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364358e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.709938 sec
377,016,803 cycles:u # 0.402 GHz
705,677,659 instructions:u # 1.87 insn per cycle
0.999220153 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.298532e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.194256 sec
19,032,415,777 cycles:u # 2.644 GHz
49,124,031,797 instructions:u # 2.58 insn per cycle
7.203532181 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 650) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.859426e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.605767 sec
8,946,079,220 cycles:u # 2.479 GHz
16,534,751,901 instructions:u # 1.85 insn per cycle
3.614755261 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2671) (512y: 52) (512z: 0)
=========================================================================
…lar for both cuda and c++?

(A speedup had been noticed for cuda in issue madgraph5#283.) Note: the c++ 1.40 seems real, it is not a fluctuation - the number of symbols changes significantly. But this is the same performance as two commits before; I will go back there: fd2ed7cccfd1ae9860a94b5f3b106a5ea5926814

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.980252e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.366042e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.712567 sec
377,011,388 cycles:u # 0.401 GHz
695,726,819 instructions:u # 1.85 insn per cycle
1.001320452 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.406692e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 6.828146 sec
18,064,378,966 cycles:u # 2.644 GHz
45,198,124,436 instructions:u # 2.50 insn per cycle
6.836933557 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 709) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.900702e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.580288 sec
8,900,139,517 cycles:u # 2.482 GHz
16,503,293,690 instructions:u # 1.85 insn per cycle
3.589493433 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2668) (512y: 52) (512z: 0)
=========================================================================
…ems similar for both cuda and c++?"

Revert "[hrdcod] try to use constexpr cIPC (but must move it as cannot use __device__ constexpr), not better"
This reverts commit e5dbcbf.
This reverts commit ddb6ad2.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.092406e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.368487e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.755564 sec
379,213,444 cycles:u # 0.387 GHz
704,182,459 instructions:u # 1.86 insn per cycle
1.045784696 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.402354e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 6.827715 sec
18,100,940,623 cycles:u # 2.649 GHz
45,198,123,843 instructions:u # 2.50 insn per cycle
6.836828051 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 709) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=1]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.881582e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.582493 sec
8,913,059,897 cycles:u # 2.484 GHz
16,503,293,863 instructions:u # 1.85 insn per cycle
3.591563196 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2668) (512y: 52) (512z: 0)
=========================================================================
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.1.105 (gcc 10.2.0)] [inlineHel=0] [hardcodeCIPC=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.332524e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.112207e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.832664 sec
1,008,402,735 cycles:u # 1.136 GHz
1,991,015,340 instructions:u # 1.97 insn per cycle
1.123411276 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 130
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.302263e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.174307 sec
19,018,111,844 cycles:u # 2.649 GHz
48,696,210,444 instructions:u # 2.56 insn per cycle
7.183149713 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 636) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc 10.2.0] [inlineHel=0] [hardcodeCIPC=0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.910747e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.587155 sec
8,923,946,197 cycles:u # 2.484 GHz
16,446,671,076 instructions:u # 1.84 insn per cycle
3.596330694 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2704) (512y: 52) (512z: 0)
=========================================================================
…use this in production (madgraph5#229) In particular, ggttgg is slower with inlining; it is not useful to focus on eemumu.
…logs for tests (using the copyLogs script every time is not really needed)
…n - generate eemumu, a few diffs remain The difficult part now is to automatically generate the "correct" hardcoded parameters...
…ed physics parameters.

Note two delicate technicalities:
- there is no constexpr sqrt, so I included a version copied from Stack Overflow
- std::complex arithmetic is not constexpr, so I had to redefine a multiplication in a simpler way

Note: I gave up on doing a proper automatic generation; I am hardcoding what is enough for eemumu and ggttgg. The trick will be to handle those std::complex constexpr: probably a custom complex class will help... This hacky version is almost complete - just need to assign the process-dependent hardcoded parameters to cIPC and cIPD.
…nual and auto are now the same
…hout hardcoding - all ok. Note that the performance is not obviously better with hardcoding, either in cuda or in c++.
…ndling new/removed files yet)
…ithout hardcoding - all ok. Note that the performance is not obviously better with hardcoding, either in cuda or in c++.
…d parameter printout, disable irrelevant code All ok in manual eemumu
…arameters class with hardcoded parameters etc
…hout hardcoding - all ok. Performance with/without hardcoding is similar, but hardcoding does decrease registers from 172 to 166 for ggtt.
…and rerun tests with/without hardcoding - all ok

Essentially:
./CODEGEN/generateAndCompare.sh gg_ttgg
./CODEGEN/syncManu.sh -ggttgg
./tput/teeThroughputX.sh -makej -ggttgg -flt
./tput/teeThroughputX.sh -makej -ggttgg -hrdonly
This is only adding an optional feature - I am self-merging to more easily sync with other developments.
Note that this PR implements separate alternative portions of the code with/without hardcoding of parameters, depending on an ifdef. This should reduce the risk of the silent shadowing issue described in #263.
This is a followup to #23 and #39
Yesterday, while working on unrelated things (older WIP in klas3/klas3base in eemumu epoch1), I realised there was a performance regression of around 20% in cuda for eemumu between epoch1 and epochX, because epoch1 was using hardcoded parameters while epochX was using parameters read from file and then set in constant memory.
I moved epoch1, too, to the (slower, but default) reading of parameters from files. However, I also added the option to use hardcoded physics parameters, both in c++ and cuda, if an ifdef switches them on. This PR is the result of that.
The implementation is fully functional, but it includes a few hacks which should be improved. The complication comes from the fact that one should use constexpr in the calculation of derived parameters, and there are two issues:
- there is no constexpr sqrt in the standard library (a version copied from Stack Overflow is included)
- std::complex arithmetic is not constexpr (a multiplication had to be redefined in a simpler way)
Eventually, the hack for the second issue above should be removed.
In any case, this PR is now fully functional (not WIP) and can be merged.