
good helicity #60

Closed
oliviermattelaer opened this issue Nov 23, 2020 · 5 comments

@oliviermattelaer

Hi Andrea,

I'm wondering if you can explain your strategy for automatically filtering helicities?

Thanks,

Olivier

PS: Note that the condition
// NB: calculate_wavefunctions ADDS |M|^2 for a given ihel to the running
// sum of |M|^2 over helicities for the given event
calculate_wavefunctions(ihel, allmomenta, meHelSum[0]);
if (meHelSum[0] > meHelSumLast)
isGoodHel[ihel] = true;
meHelSumLast = meHelSum[0];
}

is not good enough in general, since in some computations "|M|^2" can be negative (ok, in that case it is not technically |M|^2 which is computed), so a "!=" might be more appropriate.
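
For illustration, a minimal sketch of the suggested "!=" variant (the enclosing ihel loop and the initialisation of meHelSumLast are reconstructions, not the exact code):

  // Trial loop over all ncomb helicity combinations: flag a helicity as "good"
  // if it changes the running sum of |M|^2 over helicities, i.e. if its
  // contribution is non-zero. Using "!=" instead of ">" also catches the case
  // where the contribution added for this helicity is negative.
  double meHelSumLast = 0;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
  {
    calculate_wavefunctions( ihel, allmomenta, meHelSum[0] );
    if ( meHelSum[0] != meHelSumLast ) // was ">" in the snippet above
      isGoodHel[ihel] = true;
    meHelSumLast = meHelSum[0];
  }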

@valassi

valassi commented Nov 23, 2020

Hi Olivier,

some of the latest changes I did for helicity filtering are described in #24. I tried to make the memory access more understandable, avoiding "transactions" that fetch from global memory some data related to helicity filtering: I am not sure this was a performance improvement per se, but I found that it drastically reduced the "number of transactions" metric, making it easier to understand.

About your ">" vs "!=": I would need to have a look at the code (my code, and what was there before). Sorry, I am in a meeting now; I will have a look at the code later and reply in more detail.

Cheers
Andrea

@valassi

valassi commented Nov 25, 2020

Hi Olivier, sorry for the delay. I do not know if you still have doubts about this, but I'll try to give a few more details.

  1. Did I introduce helicity masking on GPU or was it there already?

I do not remember at all. But looking at the code, it may be that this is one of the issues we had left for later, and that it was me who added this for the first time, while previously it only existed in the C++ code.

It looks like the first implementation I did was this one on August 5:
5418306
My commit comment says that it gave a factor 2 improvement in throughput, while I was expecting a factor 4 (which apparently is what you get on CPU/C++).

  2. Memory requests and transactions (and coalesced memory access).

After the implementation above on Aug 5, I probably did not touch this until much later, on Aug 17. This was the time when I was experimenting with memory layouts, especially the AOS/SOA studies for the momenta memory layout that I did in #16.

The idea there was to achieve "coalesced" memory access, as discussed at the hackathon. Reading stuff right and left, I got convinced that a useful way to study this issue is to understand "requests" and "transactions" (aka "sectors") in memory access. The number of requests is essentially the number of variables you need to fetch from memory (but a single request is issued for all threads in a 32-thread warp). The number of sectors, or transactions, is then the number of 32-byte cachelines that are physically retrieved from memory, i.e. the number of roundtrips to memory. Memory coalescing means that, for a given number of requests, you minimise the number of transactions by ensuring that data are adjacent in the same cacheline. The two metrics can easily be printed out using a non-GUI version of Nsight Systems, which I wrapped into a script.

Forgetting helicity masks for the moment, with the current code (or at least the latest code I tested...), this is what we get now, as described in #16 (comment):

-p  blocks  threads  iterations  requests  sectors (transactions)
-p       1        4           1        16                      16
-p       1       32           1        16                     128

With 32 threads in one block, we are sure there is one warp only, with 32 threads: this issues 16 requests from memory, I think because each event needs 4 four-momenta. In other words, the warp needs 16 variables (actually, it needs 16 variables in each of 32 events, so 16*32 variables, but you see this as 16 requests). With the latest AOSOA code that ensures coalescing, these 16*32 doubles, which correspond to 8*16*32 = 128*32 bytes (1 double is 8 bytes), are served using 128 transactions: this is because each transaction retrieves 32 bytes in a cacheline. If the memory is not coalesced (try changing neppM and neppR to 1, you get back AOS), then the requests would remain the same but the transactions would increase by a factor 4.
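
As an illustration of the coalescing argument, here is a minimal sketch (not the actual code) of the index computation in an AOSOA layout with neppM events per "page"; the helper name momentumIndex and the ipagM/ieppM names are only indicative:

  // AOSOA ("array of structs of arrays") sketch for the momenta buffer:
  // events are grouped in pages of neppM events, and within a page the
  // component ip4 of particle ipar is stored contiguously for all neppM
  // events. Consecutive threads (consecutive ieppM) then read consecutive
  // doubles, so each 32-byte sector serves 4 threads (4 x 8-byte doubles).
  constexpr int npar = 4; // 4 external particles (e.g. e+ e- -> mu+ mu-)
  constexpr int np4 = 4;  // 4 components per particle (E, px, py, pz)
  __host__ __device__ inline int momentumIndex( int ievt, int ipar, int ip4, int neppM )
  {
    const int ipagM = ievt / neppM; // page containing this event
    const int ieppM = ievt % neppM; // position of the event within the page
    return ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM;
  }
  // With neppM = 1 this degenerates to a plain AOS: consecutive threads then
  // read doubles that are npar * np4 * 8 = 128 bytes apart, so each 32-byte
  // sector serves only one thread and the transactions grow by a factor 4.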

Now put back helicity masks. When I started these studies, I had my helicity mask implementation from Aug 5, and the number of requests I was seeing above was not 16, but rather 274! It took me some time to suspect that there was something suboptimal in my helicity masking, probably these lines:

 __device__ unsigned long long sigmakin_itry = 0; // first iteration over nevt events
 __device__ bool sigmakin_goodhel[ncomb] = { false }; // mask of good helicities (one flag per helicity combination)

I then tried to comment out these lines, and the whole helicity masking, on Aug 17, see 5de007b
This was effective in bringing memory requests and transactions down to a number I could understand, but of course it lost a factor 2 in throughput on GPU (and a factor 4 on CPU), because all helicities are computed.

With all the above in mind, I changed the implementation of helicity masking as in #24, with this commit on Aug 17: 194ffc1
This is ugly but achieves two things at the same time: it gives the factor 2 throughput increase of helicity masking, and it also brings the sectors for sigmakin down to something understandable. I am not 100% sure how this is achieved: partly it is because I now use constant memory (i.e. sigmakin accesses the helicity lists from constant memory, which does not show up in requests?), and partly it is because the actual calculation of the helicity lists is moved to a separate method that is only executed once at startup (so the variables needed for the trial and error do not show up in sigmakin).
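
Schematically, the pattern described here might look like the sketch below; the names cNGoodHel, cGoodHel and copyGoodHelToDevice, and the value of ncomb, are invented for illustration and are not the actual code:

  #include <cuda_runtime.h>
  constexpr int ncomb = 16; // e.g. 16 helicity combinations for a 2->2 fermion process

  // Good helicities live in constant memory, so reads inside sigmakin do not
  // show up as global-memory requests (as suggested above).
  __constant__ int cNGoodHel;       // number of good helicities
  __constant__ int cGoodHel[ncomb]; // indices of the good helicities

  // Executed only once at startup, after a trial over all ncomb helicities has
  // filled isGoodHel[]: the trial-and-error variables stay out of sigmakin.
  void copyGoodHelToDevice( const bool* isGoodHel )
  {
    int goodHel[ncomb] = { 0 };
    int nGoodHel = 0;
    for ( int ihel = 0; ihel < ncomb; ihel++ )
      if ( isGoodHel[ihel] ) goodHel[nGoodHel++] = ihel;
    cudaMemcpyToSymbol( cGoodHel, goodHel, ncomb * sizeof( int ) );
    cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) );
  }

  // Inside sigmakin, the sum then runs only over the good helicities:
  //   for ( int ighel = 0; ighel < cNGoodHel; ighel++ )
  //     calculate_wavefunctions( cGoodHel[ighel], allmomenta, meHelSum[0] );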

NOTE that decreasing the requests from 274 to 16 is probably irrelevant in terms of actual throughput. But I find it much nicer to have code which we understand better, where the requests issued in sigmakin really correspond to the new variables needed to describe each new event, and only those. So I would keep this.

  3. About your observation concerning negative ME, you are certainly right!

I do not remember when/how I copied the C++ implementation into a CUDA one (I guess on Aug 5?). Maybe I changed a "!=" to a ">" because I did not understand it, and I should not have done so. Or maybe I had to interpret what was done and change the logic anyway, and did something wrong en passant. This needs to be changed with a PR.

TODO: fix the ">" back to "!=" in helicity mask implementation.

  4. What is the actual throughput speedup?

One last question I am left with, after re-analysing this work, is why the throughput speedup is a factor 4 on C++ but only a factor 2 on GPU. Maybe there is something I am doing wrong on GPU?! It could be useful to check this again.

TODO: understand why the helicity masking speedup is a factor 4 on C++ but only a factor 2 on GPU.

I suggest keeping this issue open until these two TODOs are done...

Does this clarify, @oliviermattelaer ? Let me know if you have other questions...
Thanks!
Andrea

@valassi

valassi commented Nov 30, 2020

Hi @oliviermattelaer again, please note that I made quite a few additional changes in #71.

In particular, to implement vectorization I had to invert the order of the loops, to make the event loop the innermost loop. Previously there was an outer event loop with an inner helicity loop; now there is an outer helicity loop with an inner event loop. This also affects the way helicity filtering is done in C++ (in CUDA there is no difference, as the event loop is done in CUDA threads and is orthogonal/interchangeable with the helicity loop). The new helicity filtering in C++ uses exactly the same methods/logic as in CUDA: it is a bit of an overkill, but at least the two implementations are now very close, which is one less thing to understand/debug.
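
In rough C++ pseudo-code, the change of loop order looks like this (a sketch only; the calculate_wavefunctions signature with an explicit event index is an assumption, not the actual interface):

  // Hypothetical stand-in for calculate_wavefunctions: it ADDS the |M|^2
  // contribution of helicity ihel for event ievt to the running sum meHelSum.
  void calculate_wavefunctions( int ihel, const double* momenta, int ievt, double& meHelSum );

  // Old C++ loop order: outer loop over events, inner loop over good helicities.
  void sigmaKin_eventOuter( const double* momenta, const int* goodHel, int nGoodHel,
                            double* meHelSum, int nevt )
  {
    for ( int ievt = 0; ievt < nevt; ievt++ )
      for ( int ighel = 0; ighel < nGoodHel; ighel++ )
        calculate_wavefunctions( goodHel[ighel], momenta, ievt, meHelSum[ievt] );
  }

  // New C++ loop order: outer loop over good helicities, inner loop over events.
  // The innermost loop now runs over contiguous events, which is what allows it
  // to be vectorized (SIMD over events), as described in #71.
  void sigmaKin_helicityOuter( const double* momenta, const int* goodHel, int nGoodHel,
                               double* meHelSum, int nevt )
  {
    for ( int ighel = 0; ighel < nGoodHel; ighel++ )
      for ( int ievt = 0; ievt < nevt; ievt++ )
        calculate_wavefunctions( goodHel[ighel], momenta, ievt, meHelSum[ievt] );
  }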

By the way, I have not yet fixed the ">", even there, sorry!

TODO: fix ">" to "!=" BOTH in the present master and in the klas branch for #71

@valassi valassi self-assigned this Dec 9, 2020
@valassi

valassi commented Dec 9, 2020

To be checked.

  • I think I moved the > to != in the klas branch
  • But I think I still need to fix it in the master branch

@valassi

valassi commented Oct 21, 2021

This is an old issue. The latest epochX uses "!=" everywhere.

By the way, @oliviermattelaer - I am not sure if this is what we were discussing here, but the latest helicity filtering I use for the CPU is faster than it was (20% for eemumu), see #179.

I am closing this issue - please reopen if you have questions :-)

@valassi valassi closed this as completed Oct 21, 2021