
Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?) #23

Closed
valassi opened this issue Aug 17, 2020 · 6 comments


valassi commented Aug 17, 2020

I dump here a few observations I made while analysing AOS/SOA/ASA memory layouts in issue #16. The bottom line is that there is a (small) performance penalty in using constant memory for variables that could instead be taken as compile-time constants.

In that context, I want to compare a few different layouts (e.g. ASA with 1, 2, 4... events per "page"). Initially my code only allowed these parameters to be defined at compile time, so every different test needed a rebuild. I then thought of allowing runtime choices of these parameters.

This is implemented here for instance: 576ba40. In this version both the random and momenta layouts (neppR and neppM) are ready to be received as command line options (not done fully yet). For neppR, this is passed around in function signatures. For neppM, this would have been too complex and I pass it around in device constant memory.

Anyway, the point is that I then compared, for the same choice neppM=32, the difference between the compile-time and constant-memory constants. It is clear that there is a performance penalty, even if small. This is at the level of 5%, e.g. throughput goes down from 6.0E8 to 5.0E8 (all these numbers have the usual large spread due to VM load). I compared the ncu profiles, and this is quite interesting. I would say that the problem is not an extra load on memory (constant memory is fast), but rather the fact that there are extra arithmetic operations which can be avoided if the constants are known at compile time.

This is the overview:
[screenshot: ncu profiler overview, compile-time vs constant-memory]

One of the most interesting plots is the instruction statistics. Here it is clear that the constant-memory version needs many more IMAD (integer multiply-add) operations. I believe these are the runtime calculations of the momenta indices, which can otherwise be avoided if the constants are known at compile time.
[screenshot: ncu instruction statistics, showing the IMAD increase]
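To make this concrete, the following is a minimal sketch (not the actual code in this repository; names and the AOSOA dimensions are illustrative) of the two ways of exposing neppM to a kernel. In the constant-memory variant the host would fill neppM_cm with cudaMemcpyToSymbol and the index arithmetic stays as runtime integer operations; in the compile-time variant the compiler can strength-reduce the divisions and modulos by neppM=32 into shifts and masks.

// Sketch only: illustrative names, 4 particles x 4 momentum components assumed.

constexpr int neppM_ct = 32; // (a) compile-time constant
__constant__ int neppM_cm;   // (b) constant-memory constant, set via cudaMemcpyToSymbol

__device__ inline double loadMomentum( const double* momenta, int ievt, int ipar, int ip4, int neppM )
{
  const int ipagM = ievt / neppM; // AOSOA page
  const int ieppM = ievt % neppM; // event within the page
  // Layout: AOSOA[ipagM][ipar][ip4][ieppM]
  return momenta[ipagM * 4 * 4 * neppM + ipar * 4 * neppM + ip4 * neppM + ieppM];
}

__global__ void readCompileTime( const double* momenta, double* out )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  out[ievt] = loadMomentum( momenta, ievt, 0, 0, neppM_ct ); // neppM known to the compiler
}

__global__ void readConstantMemory( const double* momenta, double* out )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  out[ievt] = loadMomentum( momenta, ievt, 0, 0, neppM_cm ); // neppM only known at runtime
}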

Also interesting to note that two more registers are used, which has an impact on occupancy.
[screenshot: ncu occupancy and register statistics]

Conclusion: if possible, it is best to use compile-time constants rather than constant memory. The difference is small, but it is probably worth avoiding the penalty unless constant memory is really needed.

  • For my ASA studies, I will consider several options. One, keep compile-time constants and rebuild for every test. Two, use constant memory and accept the penalty, only for the purpose of comparing options (but then again, I would not be comparing the final codes, which would have these hardcoded at compile time...). Three, ugly and cumbersome, use templates for all the options I want to study; this however means much larger executables (embedding all the possible parameter values in the functions), more complex makefiles (all templated functions must probably be compiled in the same unit as the main), and code polluted with templates everywhere (see the sketch after this list). I will probably stay with option one and clean up around that.
  • For physics parameters, one may eventually think of hardcoding them rather than reading them from a file into constant memory. All code is auto-generated anyway, so this is entirely feasible. One should study whether it makes any difference in performance.
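For reference, this is roughly what option three could look like: a hypothetical sketch (placeholder kernel body, illustrative names) where each neppM under study becomes a separate template instantiation selected by a host-side switch.

#include <stdexcept>

// Hypothetical sketch of "option three": neppM as a template parameter, so that
// every value under study is a compile-time constant inside its own kernel.
template<int neppM>
__global__ void sigmaKinT( const double* momenta, double* out )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  const int ipagM = ievt / neppM; // strength-reduced at compile time
  const int ieppM = ievt % neppM;
  out[ievt] = momenta[ipagM * 4 * 4 * neppM + ieppM]; // placeholder body, not the real sigmaKin
}

// Host-side dispatch: each supported value is a separately compiled kernel,
// hence the larger executables and the extra template plumbing.
void launchSigmaKin( int neppM, int grid, int block, const double* momenta, double* out )
{
  switch( neppM )
  {
  case 1: sigmaKinT<1><<<grid, block>>>( momenta, out ); break;
  case 2: sigmaKinT<2><<<grid, block>>>( momenta, out ); break;
  case 4: sigmaKinT<4><<<grid, block>>>( momenta, out ); break;
  case 32: sigmaKinT<32><<<grid, block>>>( momenta, out ); break;
  default: throw std::runtime_error( "unsupported neppM" );
  }
}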
valassi added the "idea" label (Possible new development, may need further discussion) Aug 17, 2020

valassi commented Aug 17, 2020

For completeness, I did the same study on the other parameter, neppR, for the random number array layout. The relevant function is now the rambo "get final momenta", not sigmakin. I use the same neppR=32, but in one case it is a constant at compile time, while in the other it is passed dynamically (via function signatures). I get the same results as before: there is a small penalty from using dynamic parameters. I do not observe this as a change in the rambo throughput (but that's because the copy back to host is even heavier). I do see, however, again a reduction in SM and memory usage, and an increase in IMAD. The number of registers is unchanged (I am not using constant memory here).

Overview
[screenshot: ncu profiler overview for the rambo kernel]

IMAD increase
[screenshot: ncu instruction statistics, showing the IMAD increase]

Note also (same as in the previous case, but I did not add the plot there) that there is a higher number of "No Instruction" warp stalls.
[screenshot: ncu warp stall statistics]

The difference is again a single line change here
2d0eb1e
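For context, an illustrative sketch (not the actual change in 2d0eb1e) of the two variants compared here, with neppR either passed through the function signature or taken from a compile-time constant:

// (a) Dynamic: neppR travels through the call chain as an extra argument and the
//     AOSOA index arithmetic is computed with runtime IMADs.
__device__ void getFinalMomentumDyn( const double* rnarray, double* momenta, int ievt, int neppR )
{
  const int ipagR = ievt / neppR;
  const int ieppR = ievt % neppR;
  const double r = rnarray[ipagR * 4 * neppR + 0 * neppR + ieppR]; // first random number of this event
  momenta[ievt] = r; // placeholder: the real code builds the final-state momenta from several r's
}

// (b) Compile-time: neppR comes from a constexpr in a header; changing the layout
//     means editing one line and rebuilding, but the index arithmetic can be folded.
constexpr int neppR = 32;

__device__ void getFinalMomentumCT( const double* rnarray, double* momenta, int ievt )
{
  const int ipagR = ievt / neppR;
  const int ieppR = ievt % neppR;
  const double r = rnarray[ipagR * 4 * neppR + 0 * neppR + ieppR];
  momenta[ievt] = r; // placeholder
}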

Conclusions:

  • For the ASA layout parameters neppR and neppM, I will move back to compile-time constants. I will remove the other code I had been testing to make these more dynamic, and I will investigate options based on header changes and rebuilds...

valassi added a commit that referenced this issue Aug 17, 2020
This is possible if constants are defined at compile time (issue #23)

time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.181599e-02 sec
MeanTimeInWaveFuncs       = 9.846660e-04 sec
StdDevTimeInWaveFuncs     = 2.134487e-05 sec
MinTimeInWaveFuncs        = 9.725020e-04 sec
MaxTimeInWaveFuncs        = 1.054626e-03 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 6.285191e+07 sec^-1
MatrixElemEventsPerSec    = 5.324526e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.144397 sec
0a ProcInit : 0.000555 sec
0b MemAlloc : 0.073817 sec
0c GenCreat : 0.014790 sec
1a GenSeed  : 0.000014 sec
1b GenRnGen : 0.008059 sec
2a RamboIni : 0.000105 sec
2b RamboFin : 0.000076 sec
2c CpDTHwgt : 0.008542 sec
2d CpDTHmom : 0.091377 sec
3a SigmaKin : 0.000084 sec
3b CpDTHmes : 0.011732 sec
4a DumpLoop : 0.022817 sec
9a DumpAll  : 0.023759 sec
9b GenDestr : 0.000278 sec
9c MemFree  : 0.023780 sec
9d CudReset : 0.043683 sec
TOTAL       : 0.467866 sec
TOTAL(n-2)  : 0.279785 sec
***************************************

real    0m0.479s
user    0m0.170s
sys     0m0.306s

valassi commented Aug 17, 2020

En passant, using compile-time constants allows the use of casts to multidimensional arrays, which may make the code more readable.
1d7e111
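For illustration (hypothetical names, not the code in 1d7e111), this is the kind of cast that becomes possible once all the AOSOA extents are compile-time constants:

constexpr int neppM = 32; // events per AOSOA page
constexpr int npar = 4;   // external particles (illustrative)
constexpr int np4 = 4;    // 4-momentum components

__device__ double loadEnergy0( const double* momenta, int ievt )
{
  const int ipagM = ievt / neppM;
  const int ieppM = ievt % neppM;
  // Reinterpret the flat buffer as an array of [npar][np4][neppM] pages: this is only
  // possible because all the extents are compile-time constants.
  auto pages = reinterpret_cast<const double( * )[npar][np4][neppM]>( momenta );
  return pages[ipagM][0][0][ieppM]; // energy of particle 0 for event ievt
}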

The performance is the same.
[screenshot: ncu profiler overview, unchanged performance]

valassi added a commit that referenced this issue Aug 19, 2020
./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:21:27, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    launch__registers_per_thread                                           register/thread                            178
    ---------------------------------------------------------------------- --------------- ------------------------------
valassi added a commit that referenced this issue Aug 19, 2020
  gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:26:44, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    launch__registers_per_thread                                           register/thread                            174
    ---------------------------------------------------------------------- --------------- ------------------------------
valassi added a commit that referenced this issue Aug 19, 2020
…23)

  gProc::sigmaKin(double const*, double*), 2020-Aug-18 18:29:47, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    launch__registers_per_thread                                           register/thread                            174
    ---------------------------------------------------------------------- --------------- ------------------------------

valassi commented Oct 29, 2020

Discussion with Stefan and Olivier. We should check again the effect of having masses/couplings in constant memory instead of as compile-time constants. Check again that we save 4 registers. Is this because some are zero? (And are there couplings that are not purely real or purely imaginary?)

valassi changed the title from "Compile-time vs constant-memory constants (layouts; physics parameters?)" to "Compile-time const (and constexpr) vs constant-memory constants (layouts; physics parameters?)" Apr 25, 2021

valassi commented Apr 25, 2021

I have just changed the title to mention constexpr. I realised that I am using const in many places in the code where constexpr would be possible. I am not sure whether this could also give some speedups, but I dump this idea into the same basket...
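For reference, a minimal illustration of the distinction (illustrative names and values): a constexpr is guaranteed to be usable in constant expressions, while a plain const may only be a runtime constant.

const double mdl_MZ_run = 9.118800e+01;     // runtime constant: the initialiser could come from a file
constexpr double mdl_MZ_fix = 9.118800e+01; // compile-time constant: the value is baked into the code

constexpr int neppM = 32;
double page[neppM]; // OK: neppM is a constant expression (array bound, template argument, ...)
// A const initialised from a value read at runtime could not be used as an array bound here.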


valassi commented Dec 9, 2021

As discussed in PR #306, constexpr is a fundamental ingredient of any strategy to use hardcoded parameters. I am merging a PR with a first implementation of this, but precisely some constexpr issues in the complex class will need to be fixed (#307).
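For context, a minimal sketch (not the actual complex class in the code, with an illustrative coupling value) of why constexpr support in the complex type matters: hardcoded couplings can only be evaluated at compile time if the constructor and operators used in their initialisers are constexpr.

struct cxsimple // sketch of a constexpr-friendly complex type
{
  double re, im;
  __host__ __device__ constexpr cxsimple( double r, double i = 0 ) : re( r ), im( i ) {}
  __host__ __device__ constexpr cxsimple operator*( const cxsimple& o ) const
  {
    return cxsimple( re * o.re - im * o.im, re * o.im + im * o.re );
  }
};

// Illustrative hardcoded coupling: this initialiser is evaluated at compile time
// only because every operation in it is constexpr.
constexpr cxsimple GC_3 = cxsimple( 0., 0.30795 ) * cxsimple( 1., 0. );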


valassi commented May 22, 2023

The option to use compile-time constants for physics parameters is now fully functional (HRDCOD=1), for instance through MR #306. I would close this older issue #23, where the initial studies were done, as part of issue cleanup, but I would keep #39 open for further studies. It is still not completely clear to me when, or if, one option is faster than the other. In any case, for reweighting we need HRDCOD=0 rather than HRDCOD=1. Closing #23 and leaving #39 open.

valassi closed this as completed May 22, 2023
valassi self-assigned this May 22, 2023