
Use single precision or even half precision #5

Open
roiser opened this issue Aug 12, 2020 · 4 comments
Labels
enhancement (A feature we want to develop), upstream (Ready to be included in the MG5 code generator)

Comments

@roiser
Member

roiser commented Aug 12, 2020

Started by OV for the double -> float switch, which reduces register pressure. Now also implemented as a typedef switch (AV).

in eemumu_AV/master

It’s a global switch, in the end one may want to have finer grained usage of single/double precision per calculation.

@roiser roiser added the enhancement (A feature we want to develop) and upstream (Ready to be included in the MG5 code generator) labels on Aug 12, 2020
@roiser
Member Author

roiser commented Aug 12, 2020

see also #6

@valassi
Member

valassi commented Aug 13, 2020

I have integrated this (single precision, not half precision) here: roiser@85201ea

Here I implemented a check to avoid nans from single precision: roiser@6a31ca2

Note that FLOAT is a factor 2.3 faster than DOUBLE. I guess a factor 2 comes from the allowed use of the FP32 capabilities of the GPU; the extra 15% comes from lower memory usage, and possibly from fewer registers and hence better throughput?
See https://docs.google.com/document/d/1g2xwJ2FsSlxHvSUdPZjCyFW7zhsblMQ4g8UHlrkWyVw/edit#

@valassi
Member

valassi commented Aug 19, 2020

Before I move to something else, I dump a few plots from the profiler here. As discussed, moving from DOUBLE to FLOAT increases throughput by more than a naive factor 2 (which one could assume because we now use the FP32 units and not only the FP64 ones). In my test it went from 6.8E8 to 1.64E9, i.e. a factor 2.4. For instance, register usage is much smaller, and this brings a lot of benefits.

This is the overview. Registers are down from 172 to 80. The number of requests is the same, but the number of transactions is halved, as expected.

The roofline shows that FP32 is now well utilized while it was almost unused before. Conversely, FP64 is now AT ZERO.

The workload analysis shows that 32-bit FMA is now the main workhorse. Instead, FP64 has essentially disappeared.

Memory usage mainly shows that the number of MB needed has been halved.

The number of warps per scheduler more than doubles, possibly because of fewer registers used.

Stall math pipe throttle more than doubles, which is good.

The instruction mix sees an explosion of FP32 instructions.

Finally, achieved occupancy almost triples, thanks to the reduced number of registers.

Some conclusions

  • this is just a feasibility study: clearly FLOAT has much better throughput than DOUBLE, but physics validation is needed
  • if we do investigate the FLOAT way, maybe something can be done to still keep some operations on the FP64 pipelines
  • about validation, note that the number of distinct random numbers in [0,1] is O(E7-E8), say at least 10M, as there are ~7 significant digits; however, each particle has 4 random numbers, so there are at least E28 distinct states of one particle, and there are at least four particles, so at least E104 distinct events; so this is not a problem (and along these lines one may indeed consider half precision too, at least for this specific issue?)

I would keep it to this for the moment.

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
…builds.

The build fails on clang10 at compile time:

clang++: /build/gcc/build/contrib/clang-10.0.0/src/clang/10.0.0/tools/clang/lib/CodeGen/CGExpr.cpp:596: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -I/usr/local/cuda-11.0/include/ -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001af5f9a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af5f9a)
 #1 0x0000000001af3d54 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3d54)
 #2 0x0000000001af3fa9 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3fa9)
 #3 0x0000000001a6ed08 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1a6ed08)
 #4 0x00007fd31c178630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fd31ac8c3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fd31ac8dac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fd31ac851a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fd31ac85252 (/lib64/libc.so.6+0x2f252)
 #9 0x000000000203a042 clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(clang::Expr const*) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x203a042)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Internal loops fptype_sv    = VECTOR[1] ('none': scalar, no SIMD)
MatrixElements compiler     = clang 11.0.0
EvtsPerSec[MatrixElems] (3) = ( 1.263547e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.168746 sec
real    0m7.176s
=Symbols in CPPProcess.o= (~sse4: 1241) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
MatrixElements compiler     = clang 11.0.0
EvtsPerSec[MatrixElems] (3) = ( 1.218104e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     7.455322 sec
real    0m7.463s
=Symbols in CPPProcess.o= (~sse4: 1165) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------

The build with vectors also still fails on clang11, in the same place:

clang++: /build/dkonst/CONTRIB/build/contrib/clang-11.0.0/src/clang/11.0.0/clang/lib/CodeGen/CGExpr.cpp:613: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -Wall -Wshadow -Wextra -DMGONGPU_COMMONRAND_ONHOST -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001ce208a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1ce208a)
 #1 0x0000000001cdfe94 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1cdfe94)
 #2 0x0000000001c52d98 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/11.0.0-77a9f/x86_64-centos7/bin/clang+++0x1c52d98)
 #3 0x00007f1836000630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #4 0x00007f18350f13d7 raise (/lib64/libc.so.6+0x363d7)
 #5 0x00007f18350f2ac8 abort (/lib64/libc.so.6+0x37ac8)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…y done for C++) - now 'make FPTYPE=f check' succeeds! - see madgraph5#5, madgraph5#212
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…ns is different for fcheck

> ./fcheck.exe  2048 64 10
 GPUBLOCKS=          2048
 GPUTHREADS=           64
 NITERATIONS=          10
WARNING! Instantiate host Bridge (nevt=131072)
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
WARNING! Instantiate host Sampler (nevt=131072)
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Iteration #6
Iteration #7
Iteration #8
Iteration #9
WARNING! flagging abnormal ME for ievt=111162
Iteration #10
 Average Matrix Element:   1.3716954486179133E-002
 Abnormal MEs:           1

> ./check.exe -p  2048 64 10 | grep FLOAT
FP precision                = FLOAT (NaN/abnormal=2, zero=0)

I imagine this is because momenta in Fortran get translated from float to double and back to float, while in C++ they stay in float?
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
@valassi
Member

valassi commented May 11, 2022

For lack of a better ticket (we should open a 'numerical precision saga' covering NaNs, fast math, etc.), note the interesting observations in #417 (comment) when comparing Fortran and cudacpp: there may be large differences, which are bigger in float than in double. The funny thing is that cudacpp seems to be systematically below Fortran (at least, the outliers are more in that direction).

valassi added a commit to valassi/madgraph4gpu that referenced this issue May 20, 2022
…failing

patching file Source/dsample.f
Hunk #3 FAILED at 181.
Hunk #4 succeeded at 197 (offset 2 lines).
Hunk #5 FAILED at 211.
Hunk #6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk #3 FAILED at 61.
Hunk #4 succeeded at 94 (offset 6 lines).
Hunk #5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk #3 succeeded at 1841 (offset 27 lines).
Hunk #4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk #6 FAILED at 301.
Hunk #10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk #11 succeeded at 912 (offset 16 lines).
Hunk #12 succeeded at 958 (offset 16 lines).
Hunk #13 succeeded at 971 (offset 16 lines).
Hunk #14 succeeded at 987 (offset 16 lines).
Hunk #15 succeeded at 1006 (offset 16 lines).
Hunk #16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk #3 succeeded at 290 (offset 7 lines).
Hunk #4 succeeded at 453 (offset 8 lines).
Hunk #5 succeeded at 464 (offset 8 lines).
valassi pushed a commit to valassi/madgraph4gpu that referenced this issue Jul 13, 2023
Transfer changes from master back into gpu_abstraction branch
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue #845 is the following:
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed