Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

valassi · 2020-08-17T06:54:22Z

I dump here a few observations I made while analysing AOS/SOA/ASA memory layouuts is issue #16. Th ebottom line is that there is a (small) performance penalty in using constant memory for variables that could instead be taken as compile time constants.

In that context, I want to compare a few different layouts (eg ASA with 1, 2, 4.. events per "page"). Initially my code was only allowing some parameters to be defined at compile time, so every different test needed a rebuild. I then thought of allowing runtime choices of these parameters.

This is implemented here for instance: 576ba40. In this version both the random and momenta lyaouts (neppM and neppR) are ready to be received as command line options (not done fully yet). For neppR, this is passed around in function signatures. For neppM, this wwould have been too complex an dI pass it around in device constant memory.

Anyway, the point is that then I compared, for the same choice neppM=32, the difference between the compile time and constant memory constants. It is clear that there is a performance penalty, even if small. This is at the level of 5%, eg throughput goes down from 6.0E8 to 5.0E8 (all these numbers have the usual large spread due to VM load). I compared with ncu the profiles, and this is quite interesting. I would say that the problem is not an extra load on memory (constant memory is fast), but rather the fact that there are extra arithmetic operations which somehow can be avoided if the constants are known at compile time.

This is the overview

One of the most interesting plots is the instruction statistics. Here it is clear that the constant memory version needs many more IMAD (integer multiply and add) operations. I believe these are the runtime calculation of the momenta indices, which comehoe can otherwise be avoided if the constants are know at compile time

Also interesting to know that 2 more registers are used, but this has impact on occupancy

Conclusion: if possible, it is best to use compile-time constants rather than constant memory. The difference is small, but it's probably worth it to avoid the penalty unless necessary.

For my ASA studies, I will consider several options. One, keep compile time constants and rebuild for every test. Two, use constant memory and accept the penalty, only for the purpose of comparing options (but then again, I would not be comparing final codes which have these hardcoded at compile time...). Three, ugly and cumbersome, use template for all possible options I want to study, but then it means having much larger executables (embed all possible values of int in functions), more complex makefiles (all templated functions muct be compiled in the same unit as the main, probably), and a more polluted code with templates everywhere. I will probably stay with option one and clean up around that.
For physics parameters, one may eventually think of hardcoding them rather that reading them from a file into constant memory. All code is auto-generated anyway, so this is entirely feasible. One should study if it makes any difference in performance.

The text was updated successfully, but these errors were encountered:

valassi · 2020-08-17T08:02:02Z

For completeness, I did the same study on the other parameter neppR fo rthe random number array layout. The relevant function is now rambo get final momenta, not sigmakin. I use the same neppR=32, but in one case it is constant at compile time, in another it is passed dynamically (via function signatures). I get he same results as before: there is a sall penalty from using dynamic parameters. I do not observe this as a change in rambo throughput (but thats because the copy backl to host is even heavier). I do see however again a reduction in SM and memory usage, and in increase in IMAD. The number of registers is unchanged (I am not using constant memory here).

Overview

IMAD increase

Note also (same as in the previous case, but I did not add the plot) that there is a higher number of stalled for no instructions.

The difference is again a single line change here
2d0eb1e

Conclusions:

For ASA layouts neppR and neppM, I will move back to compile-time constants. I will remove all other code I had tested to make this more dynamic, and I will investigate options with header changes and rebuilds...

This is possible if constants are defined at compile time (issue #23) time ./gcheck.exe -p 16384 32 12 *************************************** NumIterations = 12 NumThreadsPerBlock = 32 NumBlocksPerGrid = 16384 --------------------------------------- FP precision = DOUBLE (nan=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[32] Momenta memory layout = AOSOA[32] Wavefunction GPU memory = LOCAL Curand generation = DEVICE (CUDA code) --------------------------------------- NumberOfEntries = 12 TotalTimeInWaveFuncs = 1.181599e-02 sec MeanTimeInWaveFuncs = 9.846660e-04 sec StdDevTimeInWaveFuncs = 2.134487e-05 sec MinTimeInWaveFuncs = 9.725020e-04 sec MaxTimeInWaveFuncs = 1.054626e-03 sec --------------------------------------- TotalEventsComputed = 6291456 RamboEventsPerSec = 6.285191e+07 sec^-1 MatrixElemEventsPerSec = 5.324526e+08 sec^-1 *************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = 1.371972e-02 GeV^0 StdErrMatrixElemValue = 3.270361e-06 GeV^0 StdDevMatrixElemValue = 8.202972e-03 GeV^0 MinMatrixElemValue = 6.071582e-03 GeV^0 MaxMatrixElemValue = 3.374925e-02 GeV^0 *************************************** 00 CudaFree : 0.144397 sec 0a ProcInit : 0.000555 sec 0b MemAlloc : 0.073817 sec 0c GenCreat : 0.014790 sec 1a GenSeed : 0.000014 sec 1b GenRnGen : 0.008059 sec 2a RamboIni : 0.000105 sec 2b RamboFin : 0.000076 sec 2c CpDTHwgt : 0.008542 sec 2d CpDTHmom : 0.091377 sec 3a SigmaKin : 0.000084 sec 3b CpDTHmes : 0.011732 sec 4a DumpLoop : 0.022817 sec 9a DumpAll : 0.023759 sec 9b GenDestr : 0.000278 sec 9c MemFree : 0.023780 sec 9d CudReset : 0.043683 sec TOTAL : 0.467866 sec TOTAL(n-2) : 0.279785 sec *************************************** real 0m0.479s user 0m0.170s sys 0m0.306s

valassi · 2020-08-17T10:32:02Z

En passant, using compile time constants allows the use of casts to multidimensional arrays, which may make the code more readable.
1d7e111

The performance is the same

./profile.sh -nogui -p 1 4 1 gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:21:27, Context 1, Stream 7 Section: Command line profiler metrics ---------------------------------------------------------------------- --------------- ------------------------------ l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum request 16 l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum sector 16 launch__registers_per_thread register/thread 178 ---------------------------------------------------------------------- --------------- ------------------------------

gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:26:44, Context 1, Stream 7 Section: Command line profiler metrics ---------------------------------------------------------------------- --------------- ------------------------------ l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum request 16 l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum sector 16 launch__registers_per_thread register/thread 174 ---------------------------------------------------------------------- --------------- ------------------------------

…23) gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:29:47, Context 1, Stream 7 Section: Command line profiler metrics ---------------------------------------------------------------------- --------------- ------------------------------ l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum request 16 l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum sector 16 launch__registers_per_thread register/thread 174 ---------------------------------------------------------------------- --------------- ------------------------------

valassi · 2020-10-29T16:09:59Z

Discussion with Stefan and Olivier. Should check again the effect of having masses/couplings in constant memory instead of compile time constants. Chek again that we save 4 registers. Is this because some are 0? (And are there couplings that are not purely real or purely imaginary?)

valassi · 2021-04-25T16:19:31Z

I have just changed the title to mention constexpr. I realised that I am using const in amny places in the code where constexpr would be possible. Not sure if this could also give some speedups, but I dump this idea also in this same basket...

valassi · 2021-12-09T11:01:36Z

As discussed in PR #306, constexpr is a fundamental ingredient of any strategy to use hardcoded parameters. I am merging a PR with a first implmentation of this, but precisely some constexpr issues in complex will need to be fixed (#307)

valassi · 2023-05-22T14:40:37Z

The option to use compile time constants for physics parameters is now fully functional (HRDCOD=1), for instance thorugh MR #306. I would close the older issue #23 where the initial studies were done, for issue cleanup, but I would keep #39 still open for further studies. It is still not completely clear to me when/if one option is sfatsre than the other. In reweighting in any case we need HRDCOD=0, rather than HRDCOD=1. Closing #23 and leaving #39 open.

valassi added the idea Possible new development (may need further discussion) label Aug 17, 2020

valassi mentioned this issue Aug 17, 2020

AOS/SOA for input particle 4-momenta (and random numbers) #16

Closed

valassi added a commit that referenced this issue Aug 17, 2020

Cleanup for neppR: keep it as a compile-time constant (issue #23)

27b7b72

valassi added a commit that referenced this issue Aug 17, 2020

Cleanup for neppM: keep it a constant at compile time (issue #23)

cd6b20b

valassi mentioned this issue Aug 18, 2020

Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations #7

Closed

valassi mentioned this issue Oct 29, 2020

Masses/couplings as compile time constants instead of constant memory #39

Open

valassi changed the title ~~Compile-time vs constant-memory constants (layouts; physics parameters?)~~ Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) Apr 25, 2021

valassi mentioned this issue Dec 9, 2021

Option to hardcode physics parameters (with hacks to remove) #306

Merged

valassi mentioned this issue Dec 9, 2021

Simple custom complex class (cxsmpl) #307

Closed

valassi mentioned this issue Mar 29, 2023

Support for non-SM UFO models: HRDCOD=1 build fails in model SMEFTsim_topU3l_MwScheme_UFO #614

Closed

valassi closed this as completed May 22, 2023

valassi self-assigned this May 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

valassi commented Aug 17, 2020

valassi commented Aug 17, 2020

valassi commented Aug 17, 2020

valassi commented Oct 29, 2020

valassi commented Apr 25, 2021

valassi commented Dec 9, 2021 •

edited

Loading

valassi commented May 22, 2023

Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

Comments

valassi commented Aug 17, 2020

valassi commented Aug 17, 2020

valassi commented Aug 17, 2020

valassi commented Oct 29, 2020

valassi commented Apr 25, 2021

valassi commented Dec 9, 2021 • edited Loading

valassi commented May 22, 2023

valassi commented Dec 9, 2021 •

edited

Loading