
Use single precision or even half precision #5

Open
roiser opened this issue Aug 12, 2020 · 4 comments
Labels
enhancement (A feature we want to develop), upstream (Ready to be included in the MG5 code generator)

Comments

@roiser
Member

roiser commented Aug 12, 2020

Started by OV for the double -> float switch, which reduces register pressure. Now also implemented as a typedef switch (AV).

in eemumu_AV/master

It’s a global switch, in the end one may want to have finer grained usage of single/double precision per calculation.

@roiser roiser added the enhancement (A feature we want to develop) and upstream (Ready to be included in the MG5 code generator) labels on Aug 12, 2020
@roiser
Member Author

roiser commented Aug 12, 2020

see also #6

@valassi
Member

valassi commented Aug 13, 2020

I have integrated this (single precision, not half precision) here: roiser@85201ea

Here I implemented a check to avoid nans from single precision: roiser@6a31ca2

Note that FLOAT is a factor 2.3 faster than DOUBLE. I guess a factor 2 comes from the allowed use of the FP32 capabilities of the GPU; the extra 15% comes from lower memory usage, and possibly from fewer registers and hence better throughput?
See https://docs.google.com/document/d/1g2xwJ2FsSlxHvSUdPZjCyFW7zhsblMQ4g8UHlrkWyVw/edit#

@valassi
Member

valassi commented Aug 19, 2020

Before I move to something else, I dump a few plots from the profiler here. As discussed, moving from DOUBLE to FLOAT increases throughput by more than a naive factor 2 (which one could assume because we now use the FP32 units and not only the FP64 ones). In my test it went from 6.8E8 to 1.64E9, i.e. a factor 2.4. For instance, register usage is much smaller, and this brings a lot of benefits.

This is the overview. Registers are down from 172 to 80. The number of requests is the same, but the number of transactions is halved, as expected.

The roofline shows that FP32 is now well utilized while it was almost unused before. Conversely, FP64 is now AT ZERO.

The workload analysis shows that 32-bit FMA is now the main workhorse. Instead, FP64 has essentially disappeared.

Memory usage mainly shows that the number of MB needed has been halved.

The number of warps per scheduler more than doubles, possibly because of fewer registers used.

Stall math pipe throttle more than doubles, which is good.

The instruction mix sees an explosion of FP32 instructions.

Finally, achieved occupancy almost triples, thanks to the reduced number of registers.

Some conclusions

  • this is just a feasibility study: clearly FLOAT has much better throughput than DOUBLE, but physics validation is needed
  • if we do investigate the FLOAT way, maybe something can be done to still keep some operations on the FP64 pipelines
  • about validation, note that the number of distinct random numbers in [0,1] is O(E7-E8), say at least 10M, as there are ~7 significant digits; however, each particle has 4 random numbers, so there are at least E28 distinct states of one particle, and there are at least four particles, so at least E104 distinct events; so this is not a problem (and along these lines one may indeed consider half precision too, at least for this specific issue?)

I would keep it to this for the moment.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
…builds.

The build fails on clang10 at compile time:

clang++: /build/gcc/build/contrib/clang-10.0.0/src/clang/10.0.0/tools/clang/lib/CodeGen/CGExpr.cpp:596: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -I/usr/local/cuda-11.0/include/ -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001af5f9a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af5f9a)
 #1 0x0000000001af3d54 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3d54)
 #2 0x0000000001af3fa9 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3fa9)
 #3 0x0000000001a6ed08 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1a6ed08)
 #4 0x00007fd31c178630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fd31ac8c3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fd31ac8dac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fd31ac851a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fd31ac85252 (/lib64/libc.so.6+0x2f252)
 #9 0x000000000203a042 clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(clang::Expr const*) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x203a042)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[1] ('none': scalar, no SIMD)
MatrixElements compiler     = clang 11.0.0
EvtsPerSec[MatrixElems] (3) = ( 1.263547e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.168746 sec
real    0m7.176s
=Symbols in CPPProcess.o= (~sse4: 1241) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
MatrixElements compiler     = clang 11.0.0
EvtsPerSec[MatrixElems] (3) = ( 1.218104e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.455322 sec
real    0m7.463s
=Symbols in CPPProcess.o= (~sse4: 1165) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The build with vectors also still fails on clang11, in the same place:

clang++: /build/dkonst/CONTRIB/build/contrib/clang-11.0.0/src/clang/11.0.0/clang/lib/CodeGen/CGExpr.cpp:613: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -Wall -Wshadow -Wextra -DMGONGPU_COMMONRAND_ONHOST -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001ce208a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1ce208a)
 #1 0x0000000001cdfe94 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1cdfe94)
 #2 0x0000000001c52d98 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1c52d98)
 #3 0x00007f1836000630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #4 0x00007f18350f13d7 raise (/lib64/libc.so.6+0x363d7)
 #5 0x00007f18350f2ac8 abort (/lib64/libc.so.6+0x37ac8)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…y done for C++) - now 'make FPTYPE=f check' succeeds! - see madgraph5#5, madgraph5#212
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…ns is different for fcheck

> ./fcheck.exe  2048 64 10
 GPUBLOCKS=          2048
 GPUTHREADS=           64
 NITERATIONS=          10
WARNING! Instantiate host Bridge (nevt=131072)
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
WARNING! Instantiate host Sampler (nevt=131072)
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Iteration #6
Iteration #7
Iteration #8
Iteration #9
WARNING! flagging abnormal ME for ievt=111162
Iteration #10
 Average Matrix Element:   1.3716954486179133E-002
 Abnormal MEs:           1

> ./check.exe -p  2048 64 10 | grep FLOAT
FP precision                = FLOAT (NaN/abnormal=2, zero=0)

I imagine this is because momenta in Fortran get translated from float to double and back to float, while in C++ they stay in float?
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
@valassi
Member

valassi commented May 11, 2022

For lack of a better ticket (we should open a 'numerical precision saga' covering NaNs, fast math, etc.), note the interesting observations in #417 (comment) when comparing Fortran and cudacpp: there may be large differences, which are bigger in float than in double. The funny thing is that cudacpp seems to be systematically below Fortran (at least, the outliers are more in that direction).

valassi added a commit to valassi/madgraph4gpu that referenced this issue May 20, 2022
…failing

patching file Source/dsample.f
Hunk #3 FAILED at 181.
Hunk #4 succeeded at 197 (offset 2 lines).
Hunk #5 FAILED at 211.
Hunk #6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk #3 FAILED at 61.
Hunk #4 succeeded at 94 (offset 6 lines).
Hunk #5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk #3 succeeded at 1841 (offset 27 lines).
Hunk #4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk #6 FAILED at 301.
Hunk #10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk #11 succeeded at 912 (offset 16 lines).
Hunk #12 succeeded at 958 (offset 16 lines).
Hunk #13 succeeded at 971 (offset 16 lines).
Hunk #14 succeeded at 987 (offset 16 lines).
Hunk #15 succeeded at 1006 (offset 16 lines).
Hunk #16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk #3 succeeded at 290 (offset 7 lines).
Hunk #4 succeeded at 453 (offset 8 lines).
Hunk #5 succeeded at 464 (offset 8 lines).
valassi pushed a commit to valassi/madgraph4gpu that referenced this issue Jul 13, 2023
Transfer changes from master back into gpu_abstraction branch
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue #845 is the following:
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed