Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23
For completeness, I did the same study on the other parameter, neppR, for the random number array layout. The relevant function is now the rambo "get final momenta" function, not sigmaKin. I use the same neppR=32, but in one case it is a constant at compile time, while in the other it is passed dynamically (via function signatures). I get the same results as before: there is a small penalty from using dynamic parameters. I do not observe this as a change in rambo throughput (but that is because the copy back to the host is even heavier). I do see, however, again a reduction in SM and memory usage, and an increase in IMAD. The number of registers is unchanged (I am not using constant memory here). Note also (same as in the previous case, but I did not add the plot) that there is a higher number of stalls for "no instructions". The difference is again a single line change here.

Conclusions:
This is possible if constants are defined at compile time (issue #23):

```
time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.181599e-02 sec
MeanTimeInWaveFuncs       = 9.846660e-04 sec
StdDevTimeInWaveFuncs     = 2.134487e-05 sec
MinTimeInWaveFuncs        = 9.725020e-04 sec
MaxTimeInWaveFuncs        = 1.054626e-03 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 6.285191e+07 sec^-1
MatrixElemEventsPerSec    = 5.324526e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.144397 sec
0a ProcInit : 0.000555 sec
0b MemAlloc : 0.073817 sec
0c GenCreat : 0.014790 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.008059 sec
2a RamboIni : 0.000105 sec
2b RamboFin : 0.000076 sec
2c CpDTHwgt : 0.008542 sec
2d CpDTHmom : 0.091377 sec
3a SigmaKin : 0.000084 sec
3b CpDTHmes : 0.011732 sec
4a DumpLoop : 0.022817 sec
9a DumpAll  : 0.023759 sec
9b GenDestr : 0.000278 sec
9c MemFree  : 0.023780 sec
9d CudReset : 0.043683 sec
TOTAL      : 0.467866 sec
TOTAL(n-2) : 0.279785 sec
***************************************

real  0m0.479s
user  0m0.170s
sys   0m0.306s
```
En passant, using compile-time constants allows the use of casts to multidimensional arrays, which may make the code more readable (a minimal sketch of such a cast is below).
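For illustration only, here is a minimal sketch of such a cast, assuming an AOSOA momenta buffer; the dimension and function names (neppM, npar, np4, loadMomentum) are illustrative, not necessarily the repository's:

```cpp
// Minimal sketch (illustrative names, not the actual repository code):
// with a compile-time neppM, the flat AOSOA buffer can be reinterpreted as a
// genuine multidimensional array, so accesses read as array subscripts.
constexpr int neppM = 32; // events per "page" in the momenta AOSOA
constexpr int npar  = 4;  // external particles per event (assumed)
constexpr int np4   = 4;  // components of a 4-momentum

__device__ double loadMomentum( const double* buffer, int ievt, int ipar, int ip4 )
{
  // The cast is only possible because npar, np4 and neppM are constant expressions.
  auto momenta = reinterpret_cast<const double( * )[npar][np4][neppM]>( buffer );
  const int ipagM = ievt / neppM; // page index
  const int ieppM = ievt % neppM; // event index within the page
  return momenta[ipagM][ipar][ip4][ieppM];
}
```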
```
./profile.sh -nogui -p 1 4 1

gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:21:27, Context 1, Stream 7
  Section: Command line profiler metrics
  ---------------------------------------------------------------------- --------------- ------------------------------
  l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                 request                             16
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                   sector                             16
  launch__registers_per_thread                                            register/thread                            178
  ---------------------------------------------------------------------- --------------- ------------------------------
```
```
gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:26:44, Context 1, Stream 7
  Section: Command line profiler metrics
  ---------------------------------------------------------------------- --------------- ------------------------------
  l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                 request                             16
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                   sector                             16
  launch__registers_per_thread                                            register/thread                            174
  ---------------------------------------------------------------------- --------------- ------------------------------
```
…23)

```
gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:29:47, Context 1, Stream 7
  Section: Command line profiler metrics
  ---------------------------------------------------------------------- --------------- ------------------------------
  l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                 request                             16
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                   sector                             16
  launch__registers_per_thread                                            register/thread                            174
  ---------------------------------------------------------------------- --------------- ------------------------------
```
Discussion with Stefan and Olivier. We should check again the effect of having masses/couplings in constant memory instead of compile-time constants. Check again that we save 4 registers. Is this because some are 0? (And are there couplings that are not purely real or purely imaginary?)
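As a purely illustrative toy sketch of the register question (the coupling names and values below are made up, not taken from the physics model): when a coupling component is a compile-time zero, the compiler can delete the corresponding multiply-adds and the registers that would hold them, while a value read from constant memory must always be loaded and used in full.

```cpp
#include <thrust/complex.h>
using cxtype = thrust::complex<double>;

// Compile-time coupling: the real part is a literal zero (an assumed purely
// imaginary coupling), so the terms multiplying it can be optimised away.
__device__ cxtype vertexCompileTime( const cxtype& w )
{
  constexpr double cRe = 0.;    // assumed purely imaginary coupling
  constexpr double cIm = 1.218; // illustrative value only
  return cxtype( cRe, cIm ) * w;
}

// Constant-memory coupling: the compiler cannot assume anything about the
// value, so both components are loaded and multiplied at run time.
__constant__ double cCoupling[2]; // { real, imag }, set from the host
__device__ cxtype vertexConstantMemory( const cxtype& w )
{
  return cxtype( cCoupling[0], cCoupling[1] ) * w;
}
```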
I have just changed the title to mention constexpr. I realised that I am using const in many places in the code where constexpr would be possible. I am not sure if this could also give some speedups, but I dump this idea into this same basket too...
The option to use compile-time constants for physics parameters is now fully functional (HRDCOD=1), for instance through MR #306. I would close the older issue #23, where the initial studies were done, as part of issue cleanup, but I would keep #39 open for further studies. It is still not completely clear to me when/if one option is faster than the other. For reweighting, in any case, we need HRDCOD=0 rather than HRDCOD=1. Closing #23 and leaving #39 open.
I dump here a few observations I made while analysing the AOS/SOA/ASA memory layouts in issue #16. The bottom line is that there is a (small) performance penalty in using constant memory for variables that could instead be taken as compile-time constants.
In that context, I want to compare a few different layouts (e.g. ASA with 1, 2, 4... events per "page"). Initially my code only allowed some of these parameters to be defined at compile time, so every different test needed a rebuild. I then thought of allowing runtime choices of these parameters.
This is implemented here, for instance: 576ba40. In this version, both the random and momenta layouts (neppR and neppM) are ready to be received as command line options (not done fully yet). For neppR, this is passed around in function signatures. For neppM, this would have been too complex, and I pass it around in device constant memory (both mechanisms are sketched below).
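For concreteness, here is a hedged sketch of the two mechanisms, not the commit's actual code (kernel and symbol names like ramboSketch, sigmaKinSketch and cNeppM are illustrative): neppR travels through the kernel signature, while neppM sits in device constant memory and is set once from the host.

```cpp
#include <cuda_runtime.h>

__constant__ int cNeppM; // momenta layout parameter, visible to all kernels

// neppR arrives as an ordinary kernel argument (function-signature route).
__global__ void ramboSketch( const double* rnarray, double* momenta, int neppR )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  const int ipagR = ievt / neppR; // runtime division: the value is unknown to the compiler
  const int ieppR = ievt % neppR;
  // ... fill momenta for event (ipagR, ieppR) from rnarray ...
}

// neppM is read from device constant memory (constant-memory route).
__global__ void sigmaKinSketch( const double* momenta, double* matrixElements )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  const int ipagM = ievt / cNeppM;
  const int ieppM = ievt % cNeppM;
  // ... compute the matrix element for event (ipagM, ieppM) ...
}

// Host side: choose the layout at run time and copy it into constant memory.
void setMomentaLayout( int neppM )
{
  cudaMemcpyToSymbol( cNeppM, &neppM, sizeof( int ) );
}
```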
Anyway, the point is that I then compared, for the same choice neppM=32, the difference between the compile-time and constant-memory versions of the constant. It is clear that there is a performance penalty, even if small. This is at the level of 5%, e.g. throughput goes down from 6.0E8 to 5.0E8 (all these numbers have the usual large spread due to VM load). I compared the profiles with ncu, and this is quite interesting. I would say that the problem is not an extra load on memory (constant memory is fast), but rather the fact that there are extra arithmetic operations, which can be avoided if the constants are known at compile time.
This is the overview
One of the most interesting plots is the instruction statistics. Here it is clear that the constant memory version needs many more IMAD (integer multiply-add) operations. I believe these are the runtime calculations of the momenta indices, which can otherwise be avoided if the constants are known at compile time (see the sketch below).
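As an illustration of where these IMADs may come from (the indexing formula below is the generic AOSOA one, with illustrative dimension names, not code copied from the repository): when neppM is a constexpr power of two, the divisions, modulos and multiplications reduce to shifts, masks and constant scalings at compile time; when neppM comes from constant memory, they must be executed as integer instructions at run time.

```cpp
constexpr int neppM = 32; // events per page (compile-time variant)
constexpr int npar  = 4;  // illustrative AOSOA dimensions
constexpr int np4   = 4;

// Flat offset of component ip4 of particle ipar of event ievt in the AOSOA buffer.
__device__ int momentumIndex( int ievt, int ipar, int ip4 )
{
  const int ipagM = ievt / neppM; // folds to (ievt >> 5) when neppM is a compile-time 32
  const int ieppM = ievt % neppM; // folds to (ievt & 31)
  // With constexpr dimensions the multiplications below become constant scalings;
  // with a constant-memory neppM they are genuine runtime integer multiply-adds.
  return ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM;
}
```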
It is also interesting to note that 2 more registers are used, although this has no impact on occupancy.
Conclusion: where possible, it is best to use compile-time constants rather than constant memory. The difference is small, but it is probably worth avoiding the penalty unless runtime configurability is really needed.