kernel launchers and SIMD vectorization #71
Repeating a few points I noted in #82 (comment) (where multithreading is discussed, e.g. using OpenMP), the general parallelization strategy would be:
While playing with pragma omp parallel for, I also saw there is a pragma omp simd, https://bisqwit.iki.fi/story/howto/openmp/#SimdConstructOpenmp%204%200. Maybe that can be a simpler alternative to compiler vector extensions (but I still think I need to pass vectors somehow in and out). To be kept in mind.
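For context, a minimal sketch of what the omp simd construct looks like on an event loop (hypothetical function and array names, not code from this repository); compiling with -fopenmp-simd honours the pragma without enabling OpenMP threading:

#include <cstddef>

// Hypothetical sketch: annotate the innermost event loop with "#pragma omp simd"
// so the compiler vectorizes it across events.
void computeMEs( const double* momenta, double* MEs, std::size_t nevt )
{
#pragma omp simd
  for ( std::size_t ievt = 0; ievt < nevt; ++ievt )
  {
    // placeholder for the real per-event matrix element calculation
    MEs[ievt] = 2.0 * momenta[ievt] * momenta[ievt];
  }
}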
There is an interesting chapter in the Data Parallel C++ book on 'Programming for CPUs'. There is a specific subsection 'SIMD Vectorization on CPU'. You may be interested to take a look.
Thanks Laurence. I assume you mean https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_11. It is interesting. I prefer another reference by Sebastien, however; it contains many more practical details: http://sponce.web.cern.ch/sponce/CSC/slides/PracticalVectorization.booklet.pdf. I also got from him a few useful headers and papers on intrinsics (but I hope I do not need to go that way).
And... it is finally paying off :-) I am now at around a factor 4 speedup gained through SIMD vectorization in C++, from ~0.5E6 to 2E6 MEs/sec. (Maybe more? I am not 100% sure where I was starting from). Now: valassi@5f9ff6d
One week ago: valassi@3178e95
Using also some good advice from @sponce I debugged the issues. Then I slowly added more and more stuff to vectors. I managed to stay with autovectorization on compiler vector extensions, no Vc/VecCore or others. Many more things to clean up and analyse, but this is definitely very promising!
It was faster than I thought to achieve this, bits and pieces of 10 days. This would not have been possible without the work on AOSOA this summer, however.
Excellent, this is really good news and impressive progress.
This is excellent news! Really impressive, I must say.
Thanks a lot Olivier and Sebastien! I was not sure how to answer your question, so I have done a bit more analysis and prototyping with compiler flags, and some cleanup in the code.
I have decided to cleanly hardcode and support only three scenarios: scalar, AVX2 and AVX512 (maybe I will add SSE, but what's the point today). Sebastien, I was a bit inspired by Arthur's work, which we had discussed in the past, https://gitlab.cern.ch/lhcb/LHCb/-/blob/master/Kernel/LHCbMath/LHCbMath/SIMDWrapper.h. What I took is the use of ifdefs, with AVX512F and AVX2. So now I have:
* if AVX512F is defined, use double[8], or better double __attribute__ ((vector_size (64))), for internal loops
* if AVX2 is defined, use double[4], or better double __attribute__ ((vector_size (32))), for internal loops
* if neither is defined, use double for internal loops
A complex vector is then a class wrapping two double vectors, as RRRRIIII. I rely fully on the compiler vector extensions above for floating point; the only boilerplate I need to add for vector types is for complex types (operations on two complex vectors, a complex vector and a complex scalar, a complex vector and a double scalar, a double vector and a complex scalar...).
Then I tried several combinations of Makefiles, all with -O3, and measured the matrix element throughput:
* no additional -m, hence scalar: 5.4E5 MEs/sec valassi@76fc2b8
* with -mavx2: 1.87E6 MEs/sec valassi@5282c92
* with -march=core-avx2: 2.08E6 MEs/sec (fastest, and present default) valassi@59e36b3
* with -mavx512f -mavx512cd -mprefer-vector-width=512: 2.01E6 MEs/sec valassi@7bde798
* with -march=native -mprefer-vector-width=512: also 2.01E6 MEs/sec valassi@898cdfc
* with -march=native and no preferred vector width: 2.06E6 MEs/sec valassi@ba3f4c4
So all in all I would say:
* with -march=core-avx2 I get 2.08E6/5.4E5, i.e. a factor 3.82 (close to the ideal 4), and this comes ONLY from vectorization
* I get around 20% better with -march=core-avx2 than with -mavx2 (I am on a Skylake VM)
* with AVX512 I get almost the same as AVX2, or slightly worse, certainly not better
* with AVX512 I thought that -mprefer-vector-width=512 should have a positive effect (see https://stackoverflow.com/a/52543573), but if anything it is slightly worse
All these numbers must be taken with a pinch of salt as there may be some fluctuations due to the load on the VM (in principle I was told this should not happen, but it's not completely ruled out). Anyway, I repeated these tests a few times in the same time frame, so there should be no fluctuation.
I think that being close to perfect speedup is excellent news, but I am not completely surprised, as I am only timing the number crunching, which is perfectly parallelizable: all events go through exactly the same calculation, so the calculation is fully in lockstep. I might even recover a tiny bit of what is missing between 3.82 and 4 when the last bits and pieces are also vectorized (the amp[2]). Note that on the GPU we have no evidence of thread divergence, and this is exactly the same thing.
If you have any comments about AVX512, please let me know! But all I have heard from @sponce and @hageboeck sounds like it is better to stay at AVX2 and not bother further. I might try a KNL for fun at some point (see https://colfaxresearch.com/knl-avx512), but it is probably pointless.
Ah, another question I have was whether alignas can make any difference here. I have the impression that the double __attribute__ ((vector_size (32))) is already an aligned RRRR. I have added an alignas to the complex vector just in case, but now that I think of it, it is irrelevant by definition (the operations are either on RRRR or on IIII).
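As an illustration, a minimal sketch of the type layout just described; the names (fptype_v, cxtype_v, neppV) and the details are mine, a simplified approximation of the scheme, not the actual branch code:

// Hypothetical sketch of the ifdef scheme described above (simplified).
#ifdef __AVX512F__
typedef double fptype_v __attribute__ ((vector_size (64))); // 8 doubles per SIMD vector
const int neppV = 8;
#elif defined __AVX2__
typedef double fptype_v __attribute__ ((vector_size (32))); // 4 doubles per SIMD vector
const int neppV = 4;
#else
typedef double fptype_v; // scalar fallback
const int neppV = 1;
#endif

// A complex "vector of events": two double vectors, stored as RRRR...IIII...
class cxtype_v
{
public:
  fptype_v m_real;
  fptype_v m_imag;
};

// Example of the boilerplate needed only for complex types
// (the compiler vector extension already provides +,-,* for fptype_v itself).
inline cxtype_v operator+( const cxtype_v& a, const cxtype_v& b )
{
  return cxtype_v{ a.m_real + b.m_real, a.m_imag + b.m_imag };
}
inline cxtype_v operator*( const cxtype_v& a, const cxtype_v& b )
{
  return cxtype_v{ a.m_real * b.m_real - a.m_imag * b.m_imag,
                   a.m_real * b.m_imag + a.m_imag * b.m_real };
}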
My (small) experience with AVX512 is to let the compiler decide when to use it.
It seems that (at least the Intel) compiler knows pretty well when it provides a speed boost (and it therefore avoids it quite often).
I indeed hear a lot of negative feedback about it.
Cheers,
Olivier
Very nice work. And it answers a lot of my questions.
My comment is that middle-aged compilers are not good with it. In RooFit, only very few things profited from AVX512, and not so many "standard" CPUs support it. Oh, and all we ever did was autovectorisation. Recent compilers are great if you write simple code! And lastly, this is the script I use to get the compiler's vectorisation reports:
cd $directory
if [[ "$compileCommand" =~ ^.*clang.*$ ]]; then
clangFlags="-Xclang -fcolor-diagnostics -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize -fno-math-errno"
# Run with diagnostics or on compiler error run without redirecting output
$compileCommand $clangFlags >/tmp/vecReport_all.txt 2>&1 || $compileCommand || exit 1
# Not interested in std library vectorisation reports:
grep -v "/usr/" /tmp/vecReport_all.txt > /tmp/vecReport.txt
sed -nE '/remark.*(not vectorized|vector.*not benef)/{N;{p}}' /tmp/vecReport.txt | sed -n 'N; s/\n/LINEBREAK/p' | sort -u | sed -n 's/LINEBREAK/\n/p'
grep --color "vectorized loop" /tmp/vecReport.txt
else
gccFlags="-ftree-vectorizer-verbose=2 -fdiagnostics-color=always"
$compileCommand $gccFlags -fopt-info-vec-missed 2>&1 | grep -vE "^/usr/|googletest" | sort -u || $compileCommand || exit 1
$compileCommand $gccFlags -fopt-info-vec-optimized 2>&1 | grep --color -E "^/home.*vectorized"
fi
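For reference, a hypothetical way to drive the snippet above, assuming it is saved as vecReport.sh (the directory, file name and compile command are placeholders, not the ones used here):

directory=./src
compileCommand="g++ -O3 -march=native -c CPPProcess.cc"
source ./vecReport.sh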
I realize I forgot a point in my comment (although the last one somehow encompasses it): did you compute how much you gain overall with this vectorization, I mean the full processing time and not only the vectorized part? I ask because I've seen so many cases of perfect vectorization (a factor 4 here) where the final software was slower overall, the reason being that you lose more time later dealing with vectorized data than you gained initially. Of course that all depends on how big the vectorized part is and how badly the vector data is used later (maybe you actually even gain there).
Hi, thanks both :-) @hageboeck, looks like I am more or less along your lines already in the klas branch
@sponce, good points:
Finally, a very good point on the real speedup and Amdahl's law: I know, here I am talking only about the matrix element. But funnily enough, even for a simple LEP eemumu process, on C++/CPU this is the dominant part (on a GPU it is totally negligible). Quoting from memory, something like 12s for the MEs against 1s for the rest, so now reduced to 3s+1s. When we go to LHC processes, the ME will be MUCH larger, both on CPU and GPU, I think. So any speedup from porting this part is really great...
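For completeness, a rough Amdahl's-law estimate using the from-memory numbers above (12s of ME computation sped up by a factor 4, plus roughly 1s that is not vectorized); these figures are indicative only:

$$ S_{\mathrm{overall}} = \frac{t_{\mathrm{ME}} + t_{\mathrm{rest}}}{t_{\mathrm{ME}}/4 + t_{\mathrm{rest}}} = \frac{12\,\mathrm{s} + 1\,\mathrm{s}}{3\,\mathrm{s} + 1\,\mathrm{s}} \approx 3.3 $$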
I moved the momenta array from a C-style AOSOA to an AOSOA where the final "A" is a vector type. I am not sure how, but I seem to have gained another factor 1.5 speedup?... From 2E6 to 3E6. Now valassi@a2f8cf9
Was valassi@2d276bf
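For clarity, a hypothetical sketch (with made-up names) of what "an AOSOA where the final A is a vector type" means, under the assumption of an AVX2 build:

// Hypothetical sketch of the momenta layout change described above (names are mine).
// Before: a C-style AOSOA such as allMomenta[npagM][npar][np4][neppM], where the
// innermost "A" is a plain array of neppM doubles. After: the innermost "A" is the
// SIMD vector type itself, so reading one momentum component for a whole "page" of
// events is a single aligned vector load.
typedef double fptype_v __attribute__ ((vector_size (32))); // 4 doubles with AVX2

constexpr int npar = 4; // external particles in e+ e- -> mu+ mu-
constexpr int np4  = 4; // E, px, py, pz

struct MomentaPageV // one page of events (as many events as fit in one SIMD vector)
{
  fptype_v p[npar][np4]; // p[ipar][ip4] holds that component for all events in the page
};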
A large chunk of work on this issue will soon be merged from PR #152. PR #152 replaces the two previous draft PRs #72 and #132, which I have closed. The reason these two are obsolete is that I completely rebased my SIMD work (which is in epoch1) on epoch2-level code. I have finally completed the "merge" of epoch2 and epoch1 of issue #139 in PR #151 (the "ep12" of "klas2ep12"). Presently epoch1 and epoch2 are identical. I will merge my SIMD work in epoch1 and keep epoch2 as-is, pre-vectorization, as a reference. I am copying here a few comments I made in PR #152:
The CURRENT BASELINE BEFORE VECTORIZATION is that at the end of PR #151:
My CURRENT (WIP) BASELINE WITH VECTORIZATION is that in the FINAL MERGE OF 'origin/ep2to2ep1' into klas2ep12:
So, if I compare the vectorization branch to current master, I see
A few additional comments (not in PR #152):
I will probably merge #152 tomorrow.
I think that this old issue #71 can now be closed. In epochX (issue #244) I have now backported vectorization to the Python code-generating code, and I can now run vectorized C++ not only for the simple eemumu process, but also for the more complex (and more relevant to LHC!) ggttgg process. I observe similar speedups there, or even slightly better, for reasons to be understood. With respect to basic C++ with no SIMD, through the appropriate use of SIMD, e.g. AVX512 in 256 mode (see also #173), and LTO-like aggressive inlining (see #229), I get a factor 4 (~4.2) in double and a factor 8 (~7.8) in float. See for instance the logs in https://github.com/madgraph5/madgraph4gpu/tree/golden_epochX4/epochX/cudacpp/tput/logs_ggttgg_auto
For double, INLINING does not pay off, neither without nor with SIMD; it is worse than no inlining. What is interesting is that 512z is better than 512y in that case.
For float, INLINING eventually gives the same maximum speed as NO INLINING, but the former case is with AVX512/z, the latter with AVX512/y. Strange. In the simpler eemumu process, inlining did seem to provide a major performance boost (which I could not explain). The summary is that we should use ggttgg for real studies - but also that we get VERY promising results there! Anyway, I am closing this and will repost these numbers on the LTO study issue #229 and the AVX512 study issue #173.
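For reference, a hypothetical illustration of what "AVX512 in 256 mode" can mean in terms of gcc flags (these are standard gcc options, but not necessarily the exact flags used in the repository Makefiles): enable the AVX512 instruction set while telling the vectorizer to prefer 256-bit registers.

# Hypothetical compile command (placeholder file name), not the repository's actual Makefile recipe
g++ -O3 -march=skylake-avx512 -mprefer-vector-width=256 -c CPPProcess.cc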
I finally found some time to pursue some earlier tests on an idea I had from the beginning, namely trying to implement SIMD vectorization in the C++ code at the same time as SIMT/SPMD parallelisation on the GPU in CUDA.
The idea is always the same: event-level parallelism, with execution in lockstep (all events go through exactly the same sequence of computations).
I pushed a few initial tests in https://github.com/valassi/madgraph4gpu/tree/klas, I will create a WIP PR about that. @roiser , @oliviermattelaer , @hageboeck , I would especially be interested to have some feedback from you :-)
Implementing SIMD in the C++ is closely linked to the API of kernel launchers (and of the methods the kernels internally call) on the GPU. In my previous eemumu_AV implementation, the signature of some C++ methods was modified by adding nevt (a number of events), or instead ievt (an index over events), with respect to the CUDA signature, but some lines of code (e.g. loops on ievt=1..nevt) were commented out as they were just reminders of possible future changes.
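As an illustration, a hypothetical and much simplified sketch of the kind of signature difference described above (these are not the actual signatures in the code):

// Hypothetical sketch: in CUDA the event index comes implicitly from the grid,
// while in C++ the kernel-like function takes an explicit number of events.
#ifdef __CUDACC__
__global__ void sigmaKin( const double* allMomenta, double* allMEs );      // ievt = blockDim.x * blockIdx.x + threadIdx.x
#else
void sigmaKin( const double* allMomenta, double* allMEs, const int nevt ); // explicit loop over ievt = 0..nevt-1 inside
#endif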
The main idea behind all the changes I did is simple: bring the event loop more and more towards the inside. Eventually, the event loop must be the innermost loop. This is because you eventually want to perform every single floating point addition or multiplication in parallel over several events. In practice, one concrete consequence of this is that I had to invert the order of the helicity loop: so far, there was an outer event loop, with an inner loop over helicities within each event, while now there is an outer helicity loop, with an inner loop over events for each helicity.
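And a hypothetical sketch of the loop inversion itself, with the helicity loop outside and the event loop innermost (again simplified, with a made-up calculate_wavefunctions helper; this is not the actual code):

// Hypothetical sketch of the loop inversion described above: the innermost event
// loop is the one the compiler can turn into SIMD operations.
double calculate_wavefunctions( int ihel, const double* allMomenta, int ievt ); // hypothetical helper

void sigmaKin( const double* allMomenta, double* allMEs, const int nevt )
{
  const int ncomb = 16; // helicity combinations for e+ e- -> mu+ mu- (2^4)
  for ( int ievt = 0; ievt < nevt; ++ievt ) allMEs[ievt] = 0;
  for ( int ihel = 0; ihel < ncomb; ++ihel )    // outer loop: helicities
  {
    for ( int ievt = 0; ievt < nevt; ++ievt )   // inner loop: events (vectorizable, all in lockstep)
    {
      allMEs[ievt] += calculate_wavefunctions( ihel, allMomenta, ievt );
    }
  }
}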
One limitation of the present code (possible in a simple eemumu calculation) is that there is no loop over nprocesses, because nprocesses=1. This was already assumed, but now I made it much more explicit, removing all dead code and adding FIXME warnings.
So far, I got to this point
A lot is still missing
These changes may result in significant changes in the current interfaces, but I think they would normally lead to a better interface and structure also on the GPU. I'll continue in the next few days...