
Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations #7

Closed
roiser opened this issue Aug 12, 2020 · 6 comments

@roiser
Member

roiser commented Aug 12, 2020

eemumu_AV/master

AV should check for bugs in the global memory implementation (it seems not possible to exhaust it with eemumu…)

@roiser roiser added the enhancement A feature we want to develop label Aug 12, 2020
@valassi
Member

valassi commented Aug 13, 2020

This is ongoing. I fixed the global memory issue in the meantime: roiser@649bdc4

But I still get much worse performance than with local or shared memory. It is even worse than before: now that I can use 16384 blocks, the performance penalty of global vs local memory is much larger at 16384 blocks than it was at 256.

I would say that this remains to be better understood and profiled.

@valassi valassi changed the title Use shared memory for ME calculations Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations Aug 18, 2020
@valassi
Member

valassi commented Aug 18, 2020

I have closed the AOS/SOA issue #16 for the memory layout of momenta (and random numbers). Now the access to momenta in memory seems well coalesced... and even this does not seem to have a big impact.

I have renamed this issue #7 as "Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations". It may still be that optimising those memory structures like w[5][6] will have an impact.

In other words, the memory layout of the input data (momenta) to sigmakin and of the output data (MEs) now seems under control, and maybe that layout is not that relevant. We should rather focus on the memory layout inside the sigmakin calculation, i.e. inside the ixxx, oxxx and FV functions.

@valassi
Member

valassi commented Aug 18, 2020

I repeat here what I wrote in issue #26 about maxregcount.

I think that I got this whole idea of local/global/shared wrong, because I had not understood two things:

  • one, that 'registers' are effectively vector registers, in the sense that warps do "SIMD" on them (so even the 'local' w[5][6] is already vectorized), and there is no need to vectorize manually as in C++ on a CPU;
  • two, that the statement 'local storage is thread-local global storage' is also in part a misconception: stack variables like w[5][6] stay in registers as long as possible, and only spill to "local" memory (i.e. thread-local global memory) if there is no space left in the registers.

It therefore seems much better to concentrate on the 'local' coding with w[5][6] and drop the shared and global implementations, which are complex and have worse performance.

As far as this work on local vs global vs shared storage is concerned, I would say that:

  • The shared implementation, as it is now for w[5][6], has no future. It has a 35% throughput penalty, and it already uses 15k bytes of shared memory with 32 threads, when only 64k (or maybe 48k?) are available. It is already impossible to use 128 threads. With more complex processes like ggttgg, it would simply explode. This is not a useful way for us to use shared memory.
  • The global memory implementation has a huge hit, around 70%. As discussed in issue Use maxrregcount to reduce register usage and improve throughput #26, it seems much better to let the system handle the spill of some memory to "local" (thread-local global) memory, rather than do it ourselves.
  • The only reason why we may need to resurrect some of these global memory studies in the future is if we split sigmakin into smaller chunks. In that case we would need to store intermediate results (like the w[5][6]...) in global memory (NOT shared memory) to move them across kernels. But that would require a much more profound reengineering of the kernels.

In conclusion, I would simply remove the present global and shared implementations. This will clean up the code.

Before doing that, I will still post a few graphs from profiling of shared and global below.

@valassi
Member

valassi commented Aug 18, 2020

First, compare SHARED to "LOCAL".

It uses more memory, but SM utilisation decreases (unlike the maxregcount=128 case, where using more memory also allows better SM usage).
[profiler screenshot]

The roofline is not very interesting. Actually the point moves to the right, not to the left (arithmetic intensity increases). FP64 utilisation decreases, while the LSU (load/store unit) utilisation increases a lot.
[profiler screenshot]

The memory plot shows that shared memory is used, which was not the case before. There are also quite a few bank conflicts (I am not sure whether they are many, or whether that is an issue).
[profiler screenshot]

On the stall plot, it seems that the "useful" stalls like math pipe throttle decrease a lot.
[profiler screenshot]

The mix of instructions changes a lot. But I would say that a lot of new, non-essential operations are added. So perhaps we do more computations, but not on useful things?
[profiler screenshot]

Finally, the occupancy plot. This is probably one of the most useful plots here. All this work was meant to reduce the number of registers, and we do reduce them a bit, from 176 to 164. But, because we are using shared memory, this does NOT translate into higher occupancy; actually the overall occupancy is lower. (I am not sure exactly what is involved, but this is what the bottom plot says.)
[profiler screenshot]

So, in conclusion this usage of shared memory done this way is useless.

@valassi
Member

valassi commented Aug 18, 2020

About global memory: this looks a bit like the plot for maxregcount=64. We have increased memory usage so much that SM utilisation decreases and we have become memory bound.
[profiler screenshot]

And again the roofline says it all: we are now on the diagonal.
[profiler screenshot]

Not surprisingly, memory usage has increased enormously
[profiler screenshot]

And the warp plots are interesting, though I do not understand all of them. We actually have slightly more active warps per scheduler, but many fewer issued warps per scheduler. And again the only "useful" stall, math pipe throttle, has gone down, and we are memory bound.
[profiler screenshot]

As in the SHARED case, the instruction mix is again very different, and we do more instructions, but probably useless ones (i.e. some we could avoid in the ME computation).
[profiler screenshot]

Finally, this time the reduced number of registers (166 vs 176) does increase occupancy a bit, but this is now completely irrelevant as we have become memory bound.
[profiler screenshot]

valassi added a commit that referenced this issue Aug 18, 2020
Keep only the "local" w[5][6] and focus on that.

time ./gcheck.exe -p 65536 128 1
***************************************
NumIterations             = 1
NumThreadsPerBlock        = 128
NumBlocksPerGrid          = 65536
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 1
TotalTimeInWaveFuncs      = 1.147885e-02 sec
MeanTimeInWaveFuncs       = 1.147885e-02 sec
StdDevTimeInWaveFuncs     = 0.000000e+00 sec
MinTimeInWaveFuncs        = 1.147885e-02 sec
MaxTimeInWaveFuncs        = 1.147885e-02 sec
---------------------------------------
TotalEventsComputed       = 8388608
RamboEventsPerSec         = 8.288389e+07 sec^-1
MatrixElemEventsPerSec    = 7.307881e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 8388608
MeanMatrixElemValue       = 1.371734e-02 GeV^0
StdErrMatrixElemValue     = 2.831148e-06 GeV^0
StdDevMatrixElemValue     = 8.199880e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.145482 sec
0a ProcInit : 0.000564 sec
0b MemAlloc : 0.650950 sec
0c GenCreat : 0.014307 sec
1a GenSeed  : 0.000006 sec
1b GenRnGen : 0.000689 sec
2a RamboIni : 0.000024 sec
2b RamboFin : 0.000006 sec
2c CpDTHwgt : 0.008415 sec
2d CpDTHmom : 0.092765 sec
3a SGoodHel : 0.024875 sec
3b SigmaKin : 0.000018 sec
3c CpDTHmes : 0.011461 sec
4a DumpLoop : 0.030057 sec
9a DumpAll  : 0.031446 sec
9b GenDestr : 0.000062 sec
9c MemFree  : 0.274781 sec
9d CudReset : 0.044059 sec
TOTAL       : 1.329966 sec
TOTAL(n-2)  : 1.140425 sec
***************************************
real    0m1.341s
user    0m0.220s
sys     0m1.113s

time ./check.exe -p 65536 128 1
***************************************
NumIterations             = 1
NumThreadsPerBlock        = 128
NumBlocksPerGrid          = 65536
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = STD::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Curand generation         = HOST (C++ code)
---------------------------------------
NumberOfEntries           = 1
TotalTimeInWaveFuncs      = 2.303040e+01 sec
MeanTimeInWaveFuncs       = 2.303040e+01 sec
StdDevTimeInWaveFuncs     = 0.000000e+00 sec
MinTimeInWaveFuncs        = 2.303040e+01 sec
MaxTimeInWaveFuncs        = 2.303040e+01 sec
---------------------------------------
TotalEventsComputed       = 8388608
RamboEventsPerSec         = 2.971090e+06 sec^-1
MatrixElemEventsPerSec    = 3.642407e+05 sec^-1
***************************************
NumMatrixElements(notNan) = 8388608
MeanMatrixElemValue       = 1.371734e-02 GeV^0
StdErrMatrixElemValue     = 2.831148e-06 GeV^0
StdDevMatrixElemValue     = 8.199880e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
0a ProcInit : 0.000398 sec
0b MemAlloc : 1.254668 sec
0c GenCreat : 0.000983 sec
1a GenSeed  : 0.000003 sec
1b GenRnGen : 0.455943 sec
2a RamboIni : 0.128670 sec
2b RamboFin : 2.694741 sec
3b SigmaKin : 23.030397 sec
4a DumpLoop : 0.016648 sec
9a DumpAll  : 0.032200 sec
9b GenDestr : 0.000091 sec
9c MemFree  : 0.143718 sec
TOTAL       : 27.758461 sec
TOTAL(n-2)  : 27.614344 sec
***************************************
real    0m27.765s
user    0m26.679s
sys     0m1.067s
@valassi
Member

valassi commented Aug 18, 2020

Ok I have removed all code for SHARED and GLOBAL.

For reference, this is where it was done: d1a5097

This can be closed now

@valassi valassi closed this as completed Aug 18, 2020
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
…builds.

The build fails on clang10 at compilation time

clang++: /build/gcc/build/contrib/clang-10.0.0/src/clang/10.0.0/tools/clang/lib/CodeGen/CGExpr.cpp:596: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -I/usr/local/cuda-11.0/include/ -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001af5f9a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af5f9a)
 #1 0x0000000001af3d54 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3d54)
 #2 0x0000000001af3fa9 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3fa9)
 #3 0x0000000001a6ed08 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1a6ed08)
 #4 0x00007fd31c178630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fd31ac8c3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fd31ac8dac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fd31ac851a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fd31ac85252 (/lib64/libc.so.6+0x2f252)
 #9 0x000000000203a042 clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(clang::Expr const*) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x203a042)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…ns is different for fcheck

> ./fcheck.exe  2048 64 10
 GPUBLOCKS=          2048
 GPUTHREADS=           64
 NITERATIONS=          10
WARNING! Instantiate host Bridge (nevt=131072)
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
WARNING! Instantiate host Sampler (nevt=131072)
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Iteration #6
Iteration #7
Iteration #8
Iteration #9
WARNING! flagging abnormal ME for ievt=111162
Iteration #10
 Average Matrix Element:   1.3716954486179133E-002
 Abnormal MEs:           1

> ./check.exe -p  2048 64 10 | grep FLOAT
FP precision                = FLOAT (NaN/abnormal=2, zero=0)

I imagine that this is because the momenta in Fortran get translated from float to double and back to float, while in C++ they stay in float?
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 14, 2022
./cmadevent_cudacpp < /tmp/avalassi/input_ggtt_cpp | grep DEBUG_ | sort | uniq -c
  16416  DEBUG_SMATRIX1 #1
 262656  DEBUG_SMATRIX1 #1a
      1  DEBUG_SMATRIX1 #2
  16416  DEBUG_SMATRIX1 #4
     25  DEBUG_SMATRIX1 #4a
      1  DEBUG_SMATRIX1 #4b
  16416  DEBUG_SMATRIX1 #7
  16416  DEBUG_SMATRIX1 #8
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue #845 is the following:
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed