good helicity #60
Hi Olivier, some of the latest changes I did for helicity filtering are described in #24. I think I tried to make the memory access more understandable, avoiding "transactions" that fetch helicity-filtering data from global memory: I am not sure this was a performance improvement per se, but I found that it drastically reduced the "number of transactions" metric, making it easier to understand. About your ">" vs "!=", I would need to have a look at the code (my code, and what was there before). Sorry, I am in a meeting now; I will have a look at the code later and reply in more detail. Cheers
Hi Olivier, sorry for the delay. I do not know if you still have doubts about this, but I'll try to give a few more details.
I do not remember at all. But looking at the code it may be that this is one of the issues we had left for later, and it was me who added this for the first time, while previously it only existed in the C++. It looks like the first implementation I did was this one on August 5:
After the implementation above on Aug 5, I probably did not touch this until much later, on Aug 17. This was the time I was experimenting with memory layouts, especially the AOS/SOA studies for the momenta memory layout that I did in #16. The idea there was to achieve "coalesced" memory access, as discussed at the hackathon. Reading stuff right and left, I got convinced that a useful way to study this issue is to understand "requests" and "transactions" (aka "sectors") in memory access. The number of requests is essentially the number of variables you need to fetch from memory (but a single request is issued for all threads in a 32-thread warp). The number of sectors or transactions is then the number of 32-byte cachelines that are physically retrieved from memory, i.e. the number of roundtrips to memory. Memory coalescing means that, for a given number of requests, you minimise the number of transactions by ensuring that data are adjacent in the same cacheline. The two metrics can easily be printed out using the command-line (non-gui) version of Nsight Compute, which I wrapped into a script. Forgetting helicity masks for the moment, with the current code (or at least the latest code I tested...) this is what we get now, as described in #16 (comment)
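[Editor's aside, before the actual numbers: a hedged toy example, not the madgraph4gpu code and with hypothetical names, illustrating the two metrics. Each load instruction executed by a 32-thread warp counts as one request; the hardware then fetches as many 32-byte sectors as needed to cover the addresses touched by those 32 threads.]

#include <cuda_runtime.h>

// Hypothetical toy kernels (not the madgraph4gpu code) illustrating requests vs sectors.
__global__ void coalescedLoad( const double* in, double* out )
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Consecutive threads read consecutive doubles: one request per warp per load,
  // touching 32*8 = 256 contiguous bytes, i.e. 8 sectors (32-byte transactions).
  out[i] = in[i];
}

__global__ void stridedLoad( const double* in, double* out, int stride )
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  // With a large stride each thread hits a different cacheline: still one
  // request per warp, but up to 32 sectors, i.e. non-coalesced access.
  out[i] = in[i * stride];
}

int main()
{
  const int nthreads = 32, nblocks = 1, stride = 16;
  double *d_in, *d_out;
  cudaMalloc( &d_in, sizeof( double ) * nthreads * nblocks * stride );
  cudaMalloc( &d_out, sizeof( double ) * nthreads * nblocks );
  coalescedLoad<<<nblocks, nthreads>>>( d_in, d_out );
  stridedLoad<<<nblocks, nthreads>>>( d_in, d_out, stride );
  cudaDeviceSynchronize();
  cudaFree( d_in );
  cudaFree( d_out );
  return 0;
}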
With 32 threads in one block, we are sure there is only one warp, with 32 threads: this issues 16 requests from memory, I think because each event needs 4 four-momenta, i.e. 16 doubles. In other words, the warp needs 16 variables (actually, it needs 16 variables in each of 32 events, so 16*32 variables, but you see this as 16 requests). With the latest AOSOA code that ensures coalescing, these 16*32 doubles, which correspond to 8*16*32 = 128*32 bytes (1 double is 8 bytes), are served using 128 transactions: this is because each transaction retrieves 32 bytes in a cacheline. If the memory is not coalesced (try changing neppM and neppR to 1, and you get back AOS), then the requests would remain the same but the transactions would increase by a factor 4. Now put back helicity masks. When I started these studies, I had my helicity mask implementation from Aug 5, and the number of requests I was seeing above was not 16, but rather 274! It took me some time to suspect that there was something suboptimal in my helicity masking, probably the lines
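[Editor's note: for reference, a hedged sketch of the AOSOA idea; the names and the exact index order here are hypothetical, not the actual madgraph4gpu accessors. Events are grouped in "pages" of neppM events, and within a page the same momentum component of neppM consecutive events is stored contiguously; with neppM = 1 the layout degenerates back to a plain AOS.]

// Hypothetical AOSOA indexing sketch (not the actual madgraph4gpu layout).
constexpr int np4 = 4;   // E, px, py, pz
constexpr int npar = 4;  // external particles in e+ e- -> mu+ mu-
constexpr int neppM = 4; // events per "page" (4 doubles = one 32-byte sector)

inline double& momentum( double* buffer, int ievt, int ipar, int ip4 )
{
  const int ipagM = ievt / neppM; // which page this event belongs to
  const int ieppM = ievt % neppM; // position of the event inside its page
  // Within a page, the neppM values of a given (ipar, ip4) component are
  // contiguous, so each group of neppM consecutive events fits in one 32-byte sector.
  return buffer[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}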
I then tried to comment out these lines, and the whole helicity masking, on Aug 17, see 5de007b. With all the above in mind, I changed the implementation of helicity masking as in #24, with this commit on Aug 17: 194ffc1. NOTE that decreasing the requests from 274 to 16 is probably irrelevant in terms of actual throughput. But I find it much nicer to have code which we understand better, where the requests issued in sigmakin really correspond to the new variables needed to describe each new event, and only those. So I would keep this.
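[Editor's note: to make the idea concrete, a hedged sketch of this kind of change, with hypothetical names and simplified logic; see commit 194ffc1 for the real code. The list of good helicities is copied once into constant memory, so the helicity filtering itself costs no extra global-memory requests inside the per-event kernel.]

#include <cuda_runtime.h>

constexpr int ncomb = 16; // helicity combinations (16 for e+ e- -> mu+ mu-)

__device__ __constant__ int cNGoodHel;       // number of good helicities
__device__ __constant__ int cGoodHel[ncomb]; // indices of the good helicities

// Host side, called once after the initial helicity filtering.
void setGoodHel( const bool* isGoodHel )
{
  int goodHel[ncomb] = { 0 };
  int nGoodHel = 0;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
    if ( isGoodHel[ihel] ) goodHel[nGoodHel++] = ihel;
  cudaMemcpyToSymbol( cNGoodHel, &nGoodHel, sizeof( int ) );
  cudaMemcpyToSymbol( cGoodHel, goodHel, ncomb * sizeof( int ) );
}

__global__ void sigmaKinSketch( double* allMEs )
{
  const int ievt = blockIdx.x * blockDim.x + threadIdx.x;
  double meHelSum = 0;
  // Loop only over the good helicities cached in constant memory: the mask
  // itself requires no global-memory requests inside the event loop.
  for ( int ighel = 0; ighel < cNGoodHel; ighel++ )
  {
    const int ihel = cGoodHel[ighel];
    meHelSum += ihel; // placeholder for the real calculate_wavefunctions( ihel, ... )
  }
  allMEs[ievt] = meHelSum;
}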
I do not remember when/how I copied the c++ implementation into a cuda one (I guess on Aug 5?). Maybe I changed a "!=" to a ">" because I did not understand it, and I should not have done so. Or maybe I had to interpret what was done and change the logic anyway, so I did something wrong en passant. This needs to be changed with a PR. TODO: fix the ">" back to "!=" in the helicity mask implementation.
One last question I have, from re-analysing this work, is why the throughput speedup is a factor 4 on c++ but only a factor 2 on GPU. Maybe there is something I am doing wrong on GPU?! It could be useful to check this again. TODO: understand why the helicity masking speedup is a factor 4 on c++ and 2 on GPU. I suggest keeping this issue open until these two TODOs are done... Does this clarify, @oliviermattelaer? Let me know if you have other questions...
Hi @oliviermattelaer again, please note that I made quite a few additional changes in #71. In particular, to implement vectorization I had to reverse the order of the loops, to make the event loop the innermost loop. Previously there was an outer event loop with an inner helicity loop; now there is an outer helicity loop with an inner event loop (see the sketch below). This also affects the way helicity filtering is done in c++ (in cuda there is no difference, as the event loop is done in cuda threads and is orthogonal/interchangeable with the helicity loop). The new helicity filtering in c++ uses exactly the same methods/logic as in cuda: it is a bit of an overkill, but at least the two implementations are now very close, one less thing to understand/debug. By the way, I have not yet fixed the ">" even there, sorry! TODO: fix ">" to "!=" BOTH in the present master and in the klas branch for #71
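[Editor's note: a hedged sketch of the loop reordering mentioned above, with hypothetical function names and the actual matrix-element call elided.]

// Before: outer event loop, inner helicity loop.
void sigmaKinOldSketch( int nevt, int ncomb )
{
  for ( int ievt = 0; ievt < nevt; ievt++ )
    for ( int ihel = 0; ihel < ncomb; ihel++ )
    {
      // calculate_wavefunctions( ihel, ievt, ... ) would be called here
    }
}

// After: outer helicity loop, inner event loop. The innermost loop now runs
// over contiguous per-event data (in the AOSOA layout), so it can be vectorized.
void sigmaKinNewSketch( int nevt, int ncomb )
{
  for ( int ihel = 0; ihel < ncomb; ihel++ )
    for ( int ievt = 0; ievt < nevt; ievt++ )
    {
      // calculate_wavefunctions( ihel, ievt, ... ) would be called here
    }
}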
To be checked.
This is an old issue. The latest epochX uses "!=" everywhere:
By the way, @oliviermattelaer - I am not sure if this is what we were discussing here, but the latest helicity filtering I use for the CPU is faster than it was (20% for eemumu), see #179. I am closing this issue - please reopen if you have questions :-)
Hi Andrea,
I'm wondering if you can explain your strategy for automatically filtering helicities?
Thanks,
Olivier
PS: Note that the condition
  // NB: calculate_wavefunctions ADDS |M|^2 for a given ihel to the running
  // sum of |M|^2 over helicities for the given event
  calculate_wavefunctions(ihel, allmomenta, meHelSum[0]);
  if (meHelSum[0] > meHelSumLast)
    isGoodHel[ihel] = true;
  meHelSumLast = meHelSum[0];
}
is not good enough in general, since in some computations "|M|^2" can be negative (OK, in that case it is not technically |M|^2 that is computed), so a "!=" might be more appropriate.
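[Editor's note: a minimal sketch of the "!=" variant being suggested, wrapped in a hypothetical standalone function; the real code accumulates |M|^2 through calculate_wavefunctions, for which a callback stands in here.]

// Hypothetical sketch of the suggested fix: flag a helicity as "good" whenever
// it changes the running sum at all, since its contribution is not guaranteed
// to be positive. isGoodHel must be initialised to false by the caller.
void getGoodHelSketch( int ncomb, bool* isGoodHel, double (*addHelContribution)( int ihel ) )
{
  double meHelSum = 0;
  double meHelSumLast = 0;
  for ( int ihel = 0; ihel < ncomb; ihel++ )
  {
    meHelSum += addHelContribution( ihel ); // stands in for calculate_wavefunctions
    if ( meHelSum != meHelSumLast ) isGoodHel[ihel] = true;
    meHelSumLast = meHelSum;
  }
}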