
Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations #7

Closed
roiser opened this issue Aug 12, 2020 · 6 comments

@roiser
Member

roiser commented Aug 12, 2020

eemumu_AV/master

AV should check for bugs in the global memory implementation (it seems not possible to exhaust it with eemumu…)

@roiser roiser added the enhancement A feature we want to develop label Aug 12, 2020
@valassi
Member

valassi commented Aug 13, 2020

This is ongoing. I fixed the global memory issue in the meantime: roiser@649bdc4

But I still get much worse performance than with local or shared memory. It is even worse than before: now that I can use 16384 blocks, the performance penalty of global vs local memory is much larger at 16384 blocks than it was at 256.

I would say that this remains to be better understood and profiled.

@valassi valassi changed the title Use shared memory for ME calculations Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations Aug 18, 2020
@valassi
Member

valassi commented Aug 18, 2020

I have closed the AOS/SOA issue #16 for the memory layout of momenta (and random numbers). Now the access to momenta in memory seems well coalesced... and even this does not seem to have a big impact.

I have renamed this issue #7 as "Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations". It may still be that optimising those memory structures like w[5][6] will have an impact.

In other words, the memory layout of the input data (momenta) to sigmakin and of the output data (MEs) now seems under control, and maybe that layout is not that relevant. We should rather focus on the memory layout inside the sigmakin calculation, i.e. inside the ixxx, oxxx and FV functions.

@valassi
Member

valassi commented Aug 18, 2020

I repeat here what I wrote in issue #26 about maxregcount.

I think that I got this whole idea of local/global/shared wrong, because I had not understood two things:

  • one, that 'registers' are effectively vector registers, in the sense that warps do "SIMD" on them (so even the 'local' w[5][6] is already vectorized), and there is no need to vectorize manually as in C++ on a CPU;
  • two, that the statement 'local storage is thread-local global storage' is also in part a misconception: stack variables like w[5][6] stay in registers as long as possible, and only spill to "local" memory (i.e. thread-local global memory) if there is no space left in the registers.

It therefore seems much better to concentrate on the 'local' coding with w[5][6] and drop the shared and global implementations, which are complex and have worse performance.

As far as this work on local vs global vs shared storage is concerned, I would say that:

  • The shared implementation, as it is now for w[5][6], has no future. It has a 35% throughput penalty, and it already uses 15k bytes of shared memory with 32 threads, when only 64k (or maybe 48k?) are available. It is already impossible to use 128 threads. With more complex processes like ggttgg, it would simply explode. This is not a useful way for us to use shared memory.
  • The global memory implementation has a huge hit, around 70%. As discussed in issue Use maxrregcount to reduce register usage and improve throughput #26, it seems much better to let the system handle the spill of some memory to "local" (thread-local global) memory, rather than do it ourselves.
  • The only reason why we may need to resurrect some of these global memory studies in the future is if we split sigmakin into smaller chunks. In that case we would need to store intermediate results (like the w[5][6]...) in global memory (NOT shared memory) to move them across kernels. But that would require a much more profound reengineering of the kernels.

In conclusion, I would simply remove the present global and shared implementations. This will clean up the code.

Before doing that, I will still post a few graphs from profiling of shared and global below.

@valassi
Member

valassi commented Aug 18, 2020

First, compare SHARED to "LOCAL".

It uses more memory, but SM utilisation decreases (unlike the maxregcount=128 case, where using more memory also allows better SM usage).
[profiler screenshot]

The roofline is not very interesting. Actually the point moves to the right, not to the left (arithmetic intensity increases). FP64 utilisation decreases, while the LSU (load/store unit) utilisation increases a lot.
[profiler screenshot]

The memory plot shows that shared memory is used, which was not the case before. There are also quite a few bank conflicts (I am not sure whether they are many, or whether that is an issue).
[profiler screenshot]

On the stall plot, it seems that the "useful" stalls like math pipe throttle decrease a lot.
[profiler screenshot]

The mix of instructions changes a lot. But I would say that a lot of new, non-essential operations are added. So perhaps we do more computations, but not on useful things?
[profiler screenshot]

Finally, the occupancy plot. This is probably one of the most useful plots here. All this work was meant to reduce the number of registers, and we do reduce them a bit, from 176 to 164. But, because we are using shared memory, this does NOT translate into higher occupancy; actually the overall occupancy is lower. (I am not sure exactly what is involved, but this is what the bottom plot says.)
[profiler screenshot]

So, in conclusion this usage of shared memory done this way is useless.

@valassi
Member

valassi commented Aug 18, 2020

About global memory: this looks a bit like the plot for maxregcount=64. We have increased memory usage so much that SM utilisation decreases and we have become memory bound.
[profiler screenshot]

And again the roofline says it all: we are now on the diagonal.
[profiler screenshot]

Not surprisingly, memory usage has increased enormously
[profiler screenshot]

And the warp plots are interesting, though I do not understand all of them. We actually have slightly more active warps per scheduler, but many fewer issued warps per scheduler. And again the only "useful" stall, math pipe throttle, has gone down, and we are memory bound.
[profiler screenshot]

As in the SHARED case, the instruction mix is again very different, and we do more instructions, but probably useless ones (i.e. some we could avoid in the ME computation).
[profiler screenshot]

Finally, this time the reduced number of registers (166 vs 176) does increase occupancy a bit, but this is now completely irrelevant as we have become memory bound.
[profiler screenshot]

valassi added a commit that referenced this issue Aug 18, 2020
Keep only the "local" w[5][6] and focus on that.

time ./gcheck.exe -p 65536 128 1
***************************************
NumIterations             = 1
NumThreadsPerBlock        = 128
NumBlocksPerGrid          = 65536
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 1
TotalTimeInWaveFuncs      = 1.147885e-02 sec
MeanTimeInWaveFuncs       = 1.147885e-02 sec
StdDevTimeInWaveFuncs     = 0.000000e+00 sec
MinTimeInWaveFuncs        = 1.147885e-02 sec
MaxTimeInWaveFuncs        = 1.147885e-02 sec
---------------------------------------
TotalEventsComputed       = 8388608
RamboEventsPerSec         = 8.288389e+07 sec^-1
MatrixElemEventsPerSec    = 7.307881e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 8388608
MeanMatrixElemValue       = 1.371734e-02 GeV^0
StdErrMatrixElemValue     = 2.831148e-06 GeV^0
StdDevMatrixElemValue     = 8.199880e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.145482 sec
0a ProcInit : 0.000564 sec
0b MemAlloc : 0.650950 sec
0c GenCreat : 0.014307 sec
1a GenSeed  : 0.000006 sec
1b GenRnGen : 0.000689 sec
2a RamboIni : 0.000024 sec
2b RamboFin : 0.000006 sec
2c CpDTHwgt : 0.008415 sec
2d CpDTHmom : 0.092765 sec
3a SGoodHel : 0.024875 sec
3b SigmaKin : 0.000018 sec
3c CpDTHmes : 0.011461 sec
4a DumpLoop : 0.030057 sec
9a DumpAll  : 0.031446 sec
9b GenDestr : 0.000062 sec
9c MemFree  : 0.274781 sec
9d CudReset : 0.044059 sec
TOTAL       : 1.329966 sec
TOTAL(n-2)  : 1.140425 sec
***************************************
real    0m1.341s
user    0m0.220s
sys     0m1.113s

time ./check.exe -p 65536 128 1
***************************************
NumIterations             = 1
NumThreadsPerBlock        = 128
NumBlocksPerGrid          = 65536
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = STD::COMPLEX
RanNumb memory layout     = AOSOA[4]
Momenta memory layout     = AOSOA[4]
Curand generation         = HOST (C++ code)
---------------------------------------
NumberOfEntries           = 1
TotalTimeInWaveFuncs      = 2.303040e+01 sec
MeanTimeInWaveFuncs       = 2.303040e+01 sec
StdDevTimeInWaveFuncs     = 0.000000e+00 sec
MinTimeInWaveFuncs        = 2.303040e+01 sec
MaxTimeInWaveFuncs        = 2.303040e+01 sec
---------------------------------------
TotalEventsComputed       = 8388608
RamboEventsPerSec         = 2.971090e+06 sec^-1
MatrixElemEventsPerSec    = 3.642407e+05 sec^-1
***************************************
NumMatrixElements(notNan) = 8388608
MeanMatrixElemValue       = 1.371734e-02 GeV^0
StdErrMatrixElemValue     = 2.831148e-06 GeV^0
StdDevMatrixElemValue     = 8.199880e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
0a ProcInit : 0.000398 sec
0b MemAlloc : 1.254668 sec
0c GenCreat : 0.000983 sec
1a GenSeed  : 0.000003 sec
1b GenRnGen : 0.455943 sec
2a RamboIni : 0.128670 sec
2b RamboFin : 2.694741 sec
3b SigmaKin : 23.030397 sec
4a DumpLoop : 0.016648 sec
9a DumpAll  : 0.032200 sec
9b GenDestr : 0.000091 sec
9c MemFree  : 0.143718 sec
TOTAL       : 27.758461 sec
TOTAL(n-2)  : 27.614344 sec
***************************************
real    0m27.765s
user    0m26.679s
sys     0m1.067s
@valassi
Member

valassi commented Aug 18, 2020

Ok I have removed all code for SHARED and GLOBAL.

For reference, this is where it was done: d1a5097

This can be closed now

@valassi valassi closed this as completed Aug 18, 2020
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
…builds.

The build fails on clang10 at compilation time

clang++: /build/gcc/build/contrib/clang-10.0.0/src/clang/10.0.0/tools/clang/lib/CodeGen/CGExpr.cpp:596: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -I/usr/local/cuda-11.0/include/ -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001af5f9a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af5f9a)
 #1 0x0000000001af3d54 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3d54)
 #2 0x0000000001af3fa9 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3fa9)
 #3 0x0000000001a6ed08 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1a6ed08)
 #4 0x00007fd31c178630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fd31ac8c3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fd31ac8dac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fd31ac851a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fd31ac85252 (/lib64/libc.so.6+0x2f252)
 #9 0x000000000203a042 clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(clang::Expr const*) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x203a042)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…ns is different for fcheck

> ./fcheck.exe  2048 64 10
 GPUBLOCKS=          2048
 GPUTHREADS=           64
 NITERATIONS=          10
WARNING! Instantiate host Bridge (nevt=131072)
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
WARNING! Instantiate host Sampler (nevt=131072)
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Iteration #6
Iteration #7
Iteration #8
Iteration #9
WARNING! flagging abnormal ME for ievt=111162
Iteration #10
 Average Matrix Element:   1.3716954486179133E-002
 Abnormal MEs:           1

> ./check.exe -p  2048 64 10 | grep FLOAT
FP precision                = FLOAT (NaN/abnormal=2, zero=0)

I imagine that this is because the momenta in Fortran get translated from float to double and back to float, while in C++ they stay in float?
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 14, 2022
./cmadevent_cudacpp < /tmp/avalassi/input_ggtt_cpp | grep DEBUG_ | sort | uniq -c
  16416  DEBUG_SMATRIX1 #1
 262656  DEBUG_SMATRIX1 #1a
      1  DEBUG_SMATRIX1 #2
  16416  DEBUG_SMATRIX1 #4
     25  DEBUG_SMATRIX1 #4a
      1  DEBUG_SMATRIX1 #4b
  16416  DEBUG_SMATRIX1 #7
  16416  DEBUG_SMATRIX1 #8
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue #845 is the following:
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed