
Use cuComplex instead of thrust::complex #6

Closed
roiser opened this issue Aug 12, 2020 · 4 comments
Assignees
roiser
Labels
enhancement (A feature we want to develop)
upstream (Ready to be included in the MG5 code generator)

Comments

@roiser
Member

roiser commented Aug 12, 2020

Done (SR)

Available in eemumu_auto/mg5Complex.

We also need to include the complex implementation by Laurence.
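
For reference, a minimal sketch of what the switch amounts to, assuming the CUDA toolkit headers and compilation with nvcc; cxtype, cxmake, cxmul and the MGONGPU_USE_CUCOMPLEX switch are illustrative names, not the actual generated code:

```cpp
// Minimal sketch (illustrative, not the generated code): thrust::complex
// comes with operator overloading out of the box, while cuDoubleComplex is a
// plain POD struct whose arithmetic goes through cuCmul/cuCadd helpers (or
// through a hand-written operator-overload wrapper).
#include <cuComplex.h>
#include <thrust/complex.h>

#ifdef MGONGPU_USE_CUCOMPLEX // hypothetical switch, for illustration only
typedef cuDoubleComplex cxtype;
__host__ __device__ inline cxtype cxmake( double r, double i ) { return make_cuDoubleComplex( r, i ); }
__host__ __device__ inline cxtype cxmul( const cxtype& a, const cxtype& b ) { return cuCmul( a, b ); }
#else
typedef thrust::complex<double> cxtype;
__host__ __device__ inline cxtype cxmake( double r, double i ) { return cxtype( r, i ); }
__host__ __device__ inline cxtype cxmul( const cxtype& a, const cxtype& b ) { return a * b; }
#endif
```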

@roiser roiser added the enhancement (A feature we want to develop) and upstream (Ready to be included in the MG5 code generator) labels Aug 12, 2020
@roiser roiser self-assigned this Aug 13, 2020
@valassi
Member

valassi commented Aug 13, 2020

This is now fully integrated here: roiser@bf1b625

But it does seem slower than thrust. See https://docs.google.com/document/d/1g2xwJ2FsSlxHvSUdPZjCyFW7zhsblMQ4g8UHlrkWyVw/edit#

@valassi
Member

valassi commented Aug 14, 2020

When integrating cuComplex, I looked at the header and thought it might have better performance because it includes explicit alignment, which maybe thrust does not.

This is an interesting read, however: https://stackoverflow.com/questions/46965768/optimizing-memory-access-for-complex-numbers. There is a comment from the thrust developer there, saying that he implemented better alignment in thrust.

PS About this specific issue: I am no longer convinced alignment is an issue for our complex type, as registers are vector registers anyway... it may be relevant only if we spill to global memory and/or we fragment the computation, e.g. with CUDA graphs.
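
If one wants to double-check the alignment point, a quick printout is enough; a minimal sketch, assuming the CUDA toolkit headers and nvcc (the actual numbers depend on the toolkit version, which is why this prints them rather than asserting them):

```cpp
// Minimal sketch: inspect size and alignment of the two complex types.
// On recent toolkits both report 16-byte size and alignment for double
// precision, but the point of the test is to check, not to assume.
#include <cstdio>
#include <cuComplex.h>
#include <thrust/complex.h>

int main()
{
  printf( "cuDoubleComplex         : size=%zu align=%zu\n",
          sizeof( cuDoubleComplex ), alignof( cuDoubleComplex ) );
  printf( "thrust::complex<double> : size=%zu align=%zu\n",
          sizeof( thrust::complex<double> ), alignof( thrust::complex<double> ) );
  return 0;
}
```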

@valassi
Member

valassi commented Aug 19, 2020

Before I move to other things, I also dump some plots on cuComplex vs thrust. I confirm that I observe a small degradation of throughput with cuComplex, 6.5E8 instead of 6.8E8.

First, a small degradation is visible in SOL: the SM is a bit worse, while memory is used more. Note that the registers are 184 instead of 172, which may also be the main reason.
[screenshot: GPU Speed Of Light comparison]

The warp state actually seems better, with more math pipe throttle?
[screenshot: warp state statistics]

Well, actually the issue may be that there are many more instructions executed. This may depend on the types themselves, or maybe on the headers with operator overloading (which I copied from Stefan, but I made minor modifications; it may be worth seeing if I made some transcription mistakes...)
[screenshot: instruction statistics]

Registers per thread are higher, but this does not seem to have any negative impact on occupancy
[screenshot: occupancy]

Conclusion:

  • the two main differences seem to be the higher number of registers and the higher number of instructions
  • the number of registers seems to have no negative impact
  • the higher number of instructions may well be the only culprit for the observed slowdown; it may be worth checking if I made a transcription mistake (or judgement error) while copying/modifying Stefan's headers (see the sketch of such a header below)
  • the difference is in any case minimal, and for the moment it is worth staying with thrust
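
For context, the kind of operator-overload header mentioned above looks roughly like this; a minimal sketch with illustrative signatures, not the actual header copied from Stefan:

```cpp
// Minimal sketch (illustrative): a thin operator-overload layer on top of
// cuDoubleComplex. Small choices here (pass by value vs const reference,
// an extra temporary, a missed inline) can change the generated instruction
// count without changing the physics results.
#include <cuComplex.h>

typedef cuDoubleComplex cxtype;

__host__ __device__ inline cxtype operator+( const cxtype& a, const cxtype& b ) { return cuCadd( a, b ); }
__host__ __device__ inline cxtype operator-( const cxtype& a, const cxtype& b ) { return cuCsub( a, b ); }
__host__ __device__ inline cxtype operator*( const cxtype& a, const cxtype& b ) { return cuCmul( a, b ); }
__host__ __device__ inline cxtype operator*( const double a, const cxtype& b )
{
  return make_cuDoubleComplex( a * cuCreal( b ), a * cuCimag( b ) );
}
```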

valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 23, 2021
…builds.

The build fails on clang10 at compilation time

clang++: /build/gcc/build/contrib/clang-10.0.0/src/clang/10.0.0/tools/clang/lib/CodeGen/CGExpr.cpp:596: clang::CodeGen::RValue clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(const clang::Expr*): Assertion `LV.isSimple()' failed.
Stack dump:
0.      Program arguments: /cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -DUSE_NVTX -Wall -Wshadow -Wextra -fopenmp -ffast-math -march=skylake-avx512 -mprefer-vector-width=256 -I/usr/local/cuda-11.0/include/ -c CPPProcess.cc -o CPPProcess.o
1.      <eof> parser at end of file
2.      Per-file LLVM IR generation
3.      ../../src/mgOnGpuVectors.h:59:16: Generating code for declaration 'mgOnGpu::cxtype_v::operator[]'
 #0 0x0000000001af5f9a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af5f9a)
 #1 0x0000000001af3d54 llvm::sys::RunSignalHandlers() (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3d54)
 #2 0x0000000001af3fa9 llvm::sys::CleanupOnSignal(unsigned long) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1af3fa9)
 #3 0x0000000001a6ed08 CrashRecoverySignalHandler(int) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x1a6ed08)
 #4 0x00007fd31c178630 __restore_rt (/lib64/libpthread.so.0+0xf630)
 #5 0x00007fd31ac8c3d7 raise (/lib64/libc.so.6+0x363d7)
 #6 0x00007fd31ac8dac8 abort (/lib64/libc.so.6+0x37ac8)
 #7 0x00007fd31ac851a6 __assert_fail_base (/lib64/libc.so.6+0x2f1a6)
 #8 0x00007fd31ac85252 (/lib64/libc.so.6+0x2f252)
 #9 0x000000000203a042 clang::CodeGen::CodeGenFunction::EmitReferenceBindingToExpr(clang::Expr const*) (/cvmfs/sft.cern.ch/lcg/releases/clang/10.0.0-62e61/x86_64-centos7/bin/clang+++0x203a042)
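
For context, the declaration that triggers the assertion (mgOnGpuVectors.h:59, mgOnGpu::cxtype_v::operator[]) belongs to the vectorized complex type. A minimal sketch of what such a type looks like, assuming gcc/clang vector extensions, with an illustrative vector width and simplified names (not the actual mgOnGpuVectors.h):

```cpp
// Minimal sketch (illustrative): a SIMD complex type holding separate vectors
// of real and imaginary parts (RRRR/IIII layout). operator[] must return a
// scalar complex by value, because there is no contiguous (real,imag) pair in
// memory to bind a reference to.
#include <complex>
#include <cstddef>

typedef double fptype;
typedef fptype fptype_v __attribute__( ( vector_size( 4 * sizeof( fptype ) ) ) ); // 4 doubles (illustrative width)
typedef std::complex<fptype> cxtype;

class cxtype_v
{
public:
  // Subscript access builds a scalar complex from the i-th lanes, by value.
  cxtype operator[]( size_t i ) const { return cxtype( m_real[i], m_imag[i] ); }
  fptype_v m_real; // RRRR
  fptype_v m_imag; // IIII
};
```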
@valassi
Member

valassi commented Oct 21, 2021

I would say that this can be closed.

We still have cuComplex in the headers of the code, even if it is not the default.

We can always test it again if we like. But actually I tend to think that a custom cxtype_s with a simple pair of numbers (like cxtype_v is for RRRRIIII) would be more interesting, as it allows some custom returns by reference (see the sketch below).

I am closing this
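
A minimal sketch of what such a custom scalar type could look like (illustrative only; names and layout are assumptions, this is not an existing implementation):

```cpp
// Minimal sketch (illustrative): a scalar complex as a plain pair of numbers,
// mirroring the RRRR/IIII split of cxtype_v, with real and imaginary parts
// also accessible by non-const reference so callers can write them in place.
typedef double fptype;

class cxtype_s
{
public:
  cxtype_s( fptype r = 0, fptype i = 0 ) : m_real( r ), m_imag( i ) {}
  fptype& real() { return m_real; }       // custom return by reference
  fptype& imag() { return m_imag; }
  const fptype& real() const { return m_real; }
  const fptype& imag() const { return m_imag; }
private:
  fptype m_real;
  fptype m_imag;
};
```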

@valassi valassi closed this as completed Oct 21, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…ns is different for fcheck

> ./fcheck.exe  2048 64 10
 GPUBLOCKS=          2048
 GPUTHREADS=           64
 NITERATIONS=          10
WARNING! Instantiate host Bridge (nevt=131072)
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
WARNING! Instantiate host Sampler (nevt=131072)
Iteration #1
Iteration #2
Iteration #3
Iteration #4
Iteration #5
Iteration #6
Iteration #7
Iteration #8
Iteration #9
WARNING! flagging abnormal ME for ievt=111162
Iteration #10
 Average Matrix Element:   1.3716954486179133E-002
 Abnormal MEs:           1

> ./check.exe -p  2048 64 10 | grep FLOAT
FP precision                = FLOAT (NaN/abnormal=2, zero=0)

I imagine that this is because momenta in Fortran get translated from float to double and back to float, while in C++ they stay in float?
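
To make the hypothesis concrete, a minimal sketch (purely illustrative, unrelated to the actual bridge code): a single float survives a float-to-double-to-float round trip unchanged, so any divergence between the two paths must come from arithmetic done while the values are held in double:

```cpp
// Minimal sketch (illustrative): the round trip of a single value is exact,
// but arithmetic performed in double and then narrowed does not match the
// same arithmetic performed directly in float.
#include <cassert>
#include <cstdio>

int main()
{
  const float p = 0.1f;
  assert( static_cast<float>( static_cast<double>( p ) ) == p ); // lossless round trip

  float sumFloat = 0.f;
  double sumDouble = 0.;
  for( int i = 0; i < 10000000; ++i )
  {
    sumFloat += 0.1f;                         // rounded to float at every step
    sumDouble += static_cast<double>( 0.1f ); // accumulated in double, narrowed only at the end
  }
  printf( "float accumulation: %.9g\n", sumFloat );
  printf( "double accumulation, narrowed to float: %.9g\n", static_cast<float>( sumDouble ) );
  return 0;
}
```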
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 20, 2022
…failing

patching file Source/dsample.f
Hunk #3 FAILED at 181.
Hunk #4 succeeded at 197 (offset 2 lines).
Hunk #5 FAILED at 211.
Hunk #6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk #3 FAILED at 61.
Hunk #4 succeeded at 94 (offset 6 lines).
Hunk #5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk #3 succeeded at 1841 (offset 27 lines).
Hunk #4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk #6 FAILED at 301.
Hunk #10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk #11 succeeded at 912 (offset 16 lines).
Hunk #12 succeeded at 958 (offset 16 lines).
Hunk #13 succeeded at 971 (offset 16 lines).
Hunk #14 succeeded at 987 (offset 16 lines).
Hunk #15 succeeded at 1006 (offset 16 lines).
Hunk #16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk #3 succeeded at 290 (offset 7 lines).
Hunk #4 succeeded at 453 (offset 8 lines).
Hunk #5 succeeded at 464 (offset 8 lines).
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue madgraph5#845 is the following
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+#3  0x7f2a1ae20acc in ???
+#4  0x7f2a1acc4575 in ???
+#5  0x7f2a1ae1d4c9 in ???
+#6  0x7f2a1ae2570d in ???
+#7  0x7f2a1ae2afa1 in ???
+#8  0x43008b in ???
+#9  0x431c10 in ???
+#10  0x432d47 in ???
+#11  0x433b1e in ???
+#12  0x44a921 in ???
+#13  0x42ebbf in ???
+#14  0x40371e in ???
+#15  0x7f2a1a23feaf in ???
+#16  0x7f2a1a23ff5f in ???
+#17  0x403844 in ???
+#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed