
good helicity #60

Closed
oliviermattelaer opened this issue Nov 23, 2020 · 5 comments

@oliviermattelaer

Hi Andrea,

I'm wondering if you can explain your strategy for automatically filtering helicities?

Thanks,

Olivier

PS: Note that the condition
// NB: calculate_wavefunctions ADDS |M|^2 for a given ihel to the running
// sum of |M|^2 over helicities for the given event
calculate_wavefunctions(ihel, allmomenta, meHelSum[0]);
if (meHelSum[0] > meHelSumLast)
isGoodHel[ihel] = true;
meHelSumLast = meHelSum[0];
}

is not good enough in general, since in some computations "|M|^2" can be negative (ok, in that case it is not technically |M|^2 which is computed), so a "!=" might be more appropriate.
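
For illustration, a minimal sketch of the suggested "!=" variant (the enclosing ihel loop and the initialisation of meHelSumLast are reconstructions, not the exact code):

  // Trial loop over all ncomb helicity combinations: flag a helicity as "good"
  // if it changes the running sum of |M|^2 over helicities, i.e. if its
  // contribution is non-zero. Using "!=" instead of ">" also catches the case
  // where the contribution added for this helicity is negative.
  double meHelSumLast = 0;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
  {
    calculate_wavefunctions( ihel, allmomenta, meHelSum[0] );
    if ( meHelSum[0] != meHelSumLast ) // was ">" in the snippet above
      isGoodHel[ihel] = true;
    meHelSumLast = meHelSum[0];
  }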

@valassi

valassi commented Nov 23, 2020

Hi Olivier,

some of the latest changes I did for helicity filtering are described in #24. I tried to make the memory access more understandable, avoiding "transactions" that fetch from global memory some data related to helicity filtering: I am not sure this was a performance improvement per se, but I found that it drastically reduced the "number of transactions" metric, making it easier to understand.

About your ">" vs "!=": I would need to have a look at the code (my code, and what was there before). Sorry, I am in a meeting now; I will have a look at the code later and reply in more detail.

Cheers
Andrea

@valassi

valassi commented Nov 25, 2020

Hi Olivier, sorry for the delay. I do not know if you still have doubts about this, but I'll try to give a few more details.

  1. Did I introduce helicity masking on GPU or was it there already?

I do not remember at all. But looking at the code, it may be that this is one of the issues we had left for later, and that it was me who added this for the first time, while previously it only existed in the C++ code.

It looks like the first implementation I did was this one on August 5:
5418306
My commit comment says that it gave a factor 2 improvement in throughput, while I was expecting a factor 4 (which apparently is what you get on CPU/C++).

  2. Memory requests and transactions (and coalesced memory access).

After the implementation above on Aug 5, I probably did not touch this until much later, on Aug 17. This was the time when I was experimenting with memory layouts, especially the AOS/SOA studies for the momenta memory layout that I did in #16.

The idea there was to achieve "coalesced" memory access, as discussed at the hackathon. Reading stuff right and left, I got convinced that a useful way to study this issue is to understand "requests" and "transactions" (aka "sectors") in memory access. The number of requests is essentially the number of variables you need to fetch from memory (but a single request is issued for all threads in a 32-thread warp). The number of sectors, or transactions, is then the number of 32-byte cachelines that are physically retrieved from memory, i.e. the number of roundtrips to memory. Memory coalescing means that, for a given number of requests, you minimise the number of transactions by ensuring that data are adjacent in the same cacheline. The two metrics can easily be printed out using a non-GUI version of Nsight Systems, which I wrapped into a script.

Forgetting helicity masks for the moment, with the current code (or at least the latest code I tested...), this is what we get now, as described in #16 (comment):

-p  blocks  threads  iterations  requests  sectors (transactions)
-p       1        4           1        16                      16
-p       1       32           1        16                     128

With 32 threads in one block, we are sure there is one warp only, with 32 threads: this issues 16 requests from memory, I think because each event needs 4 four-momenta. In other words, the warp needs 16 variables (actually, it needs 16 variables in each of 32 events, so 16*32 variables, but you see this as 16 requests). With the latest AOSOA code that ensures coalescing, these 16*32 doubles, which correspond to 8*16*32 = 128*32 bytes (1 double is 8 bytes), are served using 128 transactions: this is because each transaction retrieves 32 bytes in a cacheline. If the memory is not coalesced (try changing neppM and neppR to 1, you get back AOS), then the requests would remain the same but the transactions would increase by a factor 4.
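
As an illustration of the coalescing argument, here is a minimal sketch (not the actual code) of the index computation in an AOSOA layout with neppM events per "page"; the helper name momentumIndex and the ipagM/ieppM names are only indicative:

  // AOSOA ("array of structs of arrays") sketch for the momenta buffer:
  // events are grouped in pages of neppM events, and within a page the
  // component ip4 of particle ipar is stored contiguously for all neppM
  // events. Consecutive threads (consecutive ieppM) then read consecutive
  // doubles, so each 32-byte sector serves 4 threads (4 x 8-byte doubles).
  constexpr int npar = 4; // 4 external particles (e.g. e+ e- -> mu+ mu-)
  constexpr int np4 = 4;  // 4 components per particle (E, px, py, pz)
  __host__ __device__ inline int momentumIndex( int ievt, int ipar, int ip4, int neppM )
  {
    const int ipagM = ievt / neppM; // page containing this event
    const int ieppM = ievt % neppM; // position of the event within the page
    return ipagM * npar * np4 * neppM + ipar * np4 * neppM + ip4 * neppM + ieppM;
  }
  // With neppM = 1 this degenerates to a plain AOS: consecutive threads then
  // read doubles that are npar * np4 * 8 = 128 bytes apart, so each 32-byte
  // sector serves only one thread and the transactions grow by a factor 4.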

Now put back helicity masks. When I started these studies, I had my helicity mask implementation from Aug 5, and the number of requests I was seeing above was not 16, but rather 274! It took me some time to suspect that there was something suboptimal in my helicity masking, probably these lines:

 __device__ unsigned long long sigmakin_itry = 0; // first iteration over nevt events
 __device__ bool sigmakin_goodhel[ncomb] = { false }; // mask of good helicities (one flag per helicity combination)

I then tried to comment out these lines, and the whole helicity masking, on Aug 17, see 5de007b
This was effective in bringing memory requests and transactions down to a number I could understand, but of course it lost a factor 2 in throughput on GPU (and a factor 4 on CPU), because all helicities are computed.

With all the above in mind, I changed the implementation of helicity masking as in #24, with this commit on Aug 17: 194ffc1
This is ugly but achieves two things at the same time: it gives the factor 2 throughput increase of helicity masking, and it also brings the sectors for sigmakin down to something understandable. I am not 100% sure how this is achieved: partly it is because I now use constant memory (i.e. sigmakin accesses the helicity lists from constant memory, which does not show up in requests?), and partly it is because the actual calculation of the helicity lists is moved to a separate method that is only executed once at startup (so the variables needed for the trial and error do not show up in sigmakin).
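
Schematically, the pattern described here might look like the sketch below; the names cNGoodHel, cGoodHel and copyGoodHelToDevice, and the value of ncomb, are invented for illustration and are not the actual code:

  #include <cuda_runtime.h>
  constexpr int ncomb = 16; // e.g. 16 helicity combinations for a 2->2 fermion process

  // Good helicities live in constant memory, so reads inside sigmakin do not
  // show up as global-memory requests (as suggested above).
  __constant__ int cNGoodHel;       // number of good helicities
  __constant__ int cGoodHel[ncomb]; // indices of the good helicities

  // Executed only once at startup, after a trial over all ncomb helicities has
  // filled isGoodHel[]: the trial-and-error variables stay out of sigmakin.
  void copyGoodHelToDevice( const bool* isGoodHel )
  {
    int goodHel[ncomb] = { 0 };
    int nGoodHel = 0;
    for ( int ihel = 0; ihel < ncomb; ihel++ )
      if ( isGoodHel[ihel] ) goodHel[nGoodHel++] = ihel;
    cudaMemcpyToSymbol( cGoodHel, goodHel, ncomb * sizeof( int ) );
    cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) );
  }

  // Inside sigmakin, the sum then runs only over the good helicities:
  //   for ( int ighel = 0; ighel < cNGoodHel; ighel++ )
  //     calculate_wavefunctions( cGoodHel[ighel], allmomenta, meHelSum[0] );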

NOTE that decreasing the requests from 274 to 16 is probably irrelevant in terms of actual throughput. But I find it much nicer to have code which we understand better, where the requests issued in sigmakin really correspond to the new variables needed to describe each new event, and only those. So I would keep this.

  3. About your observation concerning negative ME, you are certainly right!

I do not remember when/how I copied the C++ implementation into a CUDA one (I guess on Aug 5?). Maybe I changed a "!=" to a ">" because I did not understand it, and I should not have done so. Or maybe I had to interpret what was done and change the logic anyway, and did something wrong en passant. This needs to be changed with a PR.

TODO: fix the ">" back to "!=" in helicity mask implementation.

  4. What is the actual throughput speedup?

One last question I am left with, after re-analysing this work, is why the throughput speedup is a factor 4 on C++ but only a factor 2 on GPU. Maybe there is something I am doing wrong on GPU?! It could be useful to check this again.

TODO: understand why the helicity masking speedup is a factor 4 on C++ but only a factor 2 on GPU.

I suggest keeping this issue open until these two TODOs are done...

Does this clarify, @oliviermattelaer ? Let me know if you have other questions...
Thanks!
Andrea

@valassi

valassi commented Nov 30, 2020

Hi @oliviermattelaer again, please note that I made quite a few additional changes in #71.

In particular, to implement vectorization I had to invert the order of the loops, to make the event loop the innermost loop. Previously there was an outer event loop with an inner helicity loop; now there is an outer helicity loop with an inner event loop. This also affects the way helicity filtering is done in C++ (in CUDA there is no difference, as the event loop is done in CUDA threads and is orthogonal/interchangeable with the helicity loop). The new helicity filtering in C++ uses exactly the same methods/logic as in CUDA: it is a bit of an overkill, but at least the two implementations are now very close, which is one less thing to understand/debug.
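
In rough C++ pseudo-code, the change of loop order looks like this (a sketch only; the calculate_wavefunctions signature with an explicit event index is an assumption, not the actual interface):

  // Hypothetical stand-in for calculate_wavefunctions: it ADDS the |M|^2
  // contribution of helicity ihel for event ievt to the running sum meHelSum.
  void calculate_wavefunctions( int ihel, const double* momenta, int ievt, double& meHelSum );

  // Old C++ loop order: outer loop over events, inner loop over good helicities.
  void sigmaKin_eventOuter( const double* momenta, const int* goodHel, int nGoodHel,
                            double* meHelSum, int nevt )
  {
    for ( int ievt = 0; ievt < nevt; ievt++ )
      for ( int ighel = 0; ighel < nGoodHel; ighel++ )
        calculate_wavefunctions( goodHel[ighel], momenta, ievt, meHelSum[ievt] );
  }

  // New C++ loop order: outer loop over good helicities, inner loop over events.
  // The innermost loop now runs over contiguous events, which is what allows it
  // to be vectorized (SIMD over events), as described in #71.
  void sigmaKin_helicityOuter( const double* momenta, const int* goodHel, int nGoodHel,
                               double* meHelSum, int nevt )
  {
    for ( int ighel = 0; ighel < nGoodHel; ighel++ )
      for ( int ievt = 0; ievt < nevt; ievt++ )
        calculate_wavefunctions( goodHel[ighel], momenta, ievt, meHelSum[ievt] );
  }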

By the way, I have not yet fixed the ">", even there, sorry!

TODO: fix ">" to "!=" BOTH in the present master and in the klas branch for #71

@valassi valassi self-assigned this Dec 9, 2020
@valassi

valassi commented Dec 9, 2020

To be checked.

  • I think I moved the > to != in the klas branch
  • But I think I still need to fix it in the master branch

@valassi

valassi commented Oct 21, 2021

This is an old issue. The latest epochX uses "!=" everywhere.

By the way, @oliviermattelaer - I am not sure if this is what we were discussing here, but the latest helicity filtering I use for the CPU is faster than it was (20% for eemumu), see #179.

I am closing this issue - please reopen if you have questions :-)

@valassi valassi closed this as completed Oct 21, 2021