Branch efficiency: check that we have no issues with branch divergence #25
Comments
…oughput12.sh

On itscrd70.cern.ch (V100S-PCIE-32GB):

```text
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.747784 sec
2,606,062,103 cycles # 2.648 GHz
3,536,734,749 instructions # 1.36 insn per cycle
1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum 868,352
                             : smsp__sass_branch_targets_threads_uniform.sum 868,352
                             : smsp__sass_branch_targets_threads_divergent.sum 0
                             : smsp__warps_launched.sum 16,384
-------------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3) = ( 4.397452e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.608068 sec
2,198,704,176 cycles # 2.652 GHz
2,956,510,323 instructions # 1.34 insn per cycle
0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum 9,053,696
                             : smsp__sass_branch_targets_threads_uniform.sum 9,053,696
                             : smsp__sass_branch_targets_threads_divergent.sum 0
                             : smsp__warps_launched.sum 512
=========================================================================
```
… divergence and measure it

On itscrd70.cern.ch (V100S-PCIE-32GB):

```text
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.012377 sec
3,103,258,724 cycles # 2.652 GHz
4,387,995,862 instructions # 1.41 insn per cycle
1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum 1,785,856
                             : smsp__sass_branch_targets_threads_uniform.sum 1,720,320
                             : smsp__sass_branch_targets_threads_divergent.sum 65,536
                             : smsp__warps_launched.sum 16,384
=========================================================================
```
After a few months I have come back to this issue with two improvements:
(1) NEW TESTS AND METRICS

The code is in PR #202 and #203. The main metric is `sm__sass_average_branch_targets_threads_uniform.pct`:

madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/throughput12.sh (line 157 at commit 2510b36)
This metric should be 100% for uniform execution (i.e. no divergence) and less than 100% when there is divergence. Unfortunately, it is difficult to translate a percentage of non-uniformity into a throughput degradation: in the example below I get 96% uniformity, but the throughput degradation is around 20-30%, not just 4%! The test is this:

madgraph4gpu/epoch1/cuda/ee_mumu/SubProcesses/P1_Sigma_sm_epem_mupmum/CPPProcess.cc (line 118 at commit 2510b36)
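For illustration only, here is a minimal sketch of that kind of artificial divergence test, not the actual CPPProcess.cc code: even lanes of each warp take one code path and odd lanes take another, so every warp has to serialise both paths. `waveOptimised` and `waveGeneric` are hypothetical stand-ins for the optimised opzxxx and generic oxxxxx wavefunction helpers (their real signatures are not reproduced here).

```cuda
// Hedged sketch, assuming hypothetical helpers in place of opzxxx/oxxxxx.
__device__ void waveOptimised( const double* p, double* w ) { w[0] = p[0] + p[3]; }
__device__ void waveGeneric( const double* p, double* w )
{
  w[0] = sqrt( p[1] * p[1] + p[2] * p[2] + p[3] * p[3] ); // a longer code path
}

// Launched e.g. with <<<1, 32>>> to mimic the "-p 1 32 1" single-warp configuration.
__global__ void sigmaKinDivergenceTest( const double* momenta, double* out )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x; // one event per thread
  double w[1];
  if ( threadIdx.x % 2 == 0 )      // even lanes: "optimised" path
    waveOptimised( &momenta[ievt * 4], w );
  else                             // odd lanes: "generic" path -> warp divergence
    waveGeneric( &momenta[ievt * 4], w );
  out[ievt] = w[0];
}
```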
Essentially, in half of the threads in a warp I use the default optimized opzxxx, and in the other half I use the non-optimized oxxxxx.

The actual "4%" seems to be computed in the following way: there are a number of "branches" in the code in total, which can be either uniform (taken in the same way by all threads in a warp) or divergent (taken by some threads but not all, essentially). In my test of the current eemumu cuda code, WE HAVE NO DIVERGENCE: there are 53 branches, and all 53 are taken in a uniform way. I guess these 53 include function calls and other possible decision points (or maybe we actually have many ifs...). If I introduce a very silly/simple divergence as above, the number of branches goes from 53 to 109, of which 4 are reported as non-uniform and 105 as uniform. The 105/109 is 96.33%. Not really helpful to translate into throughputs, but that's it.

This is with the artificial divergence
This is without divergence (I also include ggttgg)
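As a cross-check of the arithmetic above, the per-warp branch counts and the uniformity percentage can be recovered directly from the raw `smsp__sass_branch_targets` counters quoted in the two outputs earlier in this thread. A minimal sketch, using only the numbers reported there:

```cuda
#include <cstdio>

int main()
{
  // Default eemumu code (no divergence), 16384 warps launched.
  const double totDef = 868352, uniDef = 868352, warpsDef = 16384;
  // Artificially divergent oxxxxx/opzxxx test, 16384 warps launched.
  const double totDiv = 1785856, uniDiv = 1720320, warpsDiv = 16384;
  printf( "default  : %3.0f branch targets per warp, %.2f%% uniform\n",
          totDef / warpsDef, 100. * uniDef / totDef );   //  53 per warp, 100.00%
  printf( "divergent: %3.0f branch targets per warp, %.2f%% uniform\n",
          totDiv / warpsDiv, 100. * uniDiv / totDiv );   // 109 per warp,  96.33%
  return 0;
}
```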
Note that in the tests above I go for -p 1 32 1, which launches only one warp (32 threads) in total. These tests above are using ncu with the command line interface.

(2) COMMENTS ON THE OLD TESTS IN THIS THREAD

Concerning my previous comments on stalled barriers, this does not seem to be very useful to measure thread divergence. At least, in my simple test with oxxxxx/opzxxx, the ncu metrics about stalled barriers were not helpful.

I did a few more tests with ncu using the GUI. This is also interesting. For instance
ALL IN ALL, THIS SHOWS THAT EVEN A MINIMAL DIVERGENCE CAUSES BIG, BIG ISSUES... Notice that the stalled barrier that I had mentioned before, instead, does not seem to have any relevance: in this example it is zero both for the divergent and the uniform test.

Finally, THREAD DIVERGENCE IS INDICATED IN THE FINAL NCU SECTION, SOURCE COUNTERS. This is also the one that complains about non-coalesced memory access. In the default non-divergent code version, I am told 100% branch efficiency, and I am not told about any non-coalesced memory access.

(3) ABOUT NVVP

About my previous comments on the older nvvp tool, I will not reproduce those tests here. I showed that using ncu, either in command line mode or GUI mode, is enough to check if there is thread divergence.
Finally, a few useful links about branch efficiency
I think that this can be closed
Eventually, if we do start having branch divergence (hopefully not), I think that it should be possible to correlate the drop in branch efficiency to the specific divergent branches in the code (as discussed also in the three links above).

Closing as completed...
PS, one last comment: I also checked the utilization of the ADU pipeline (address divergence unit). However, in my example there does not seem to be a big difference (actually the ADU is slightly more busy with no divergence?).
Just a note as a reminder, following up on 'SIMD/SIMT' issues. After investigating SOA/AOS data access and showing that we have no uncoalesced memory access for momenta (issue #16), I was wondering how to best check in the profiler whether we have issues with divergent branches, i.e. threads in our warps which go out of 'lockstep'.
The only reference I found in the profiler documentation is here:
https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler
If I understand correctly, this means that we should see "Stalled Barrier" in the Warp statistics. This seems to be always at zero.
![image](https://user-images.githubusercontent.com/3473550/90513687-50630c80-e160-11ea-86a2-a6b4c5d2bfc4.png)
I would say that we have no issues with branch divergence. Not surprising really, as all threads are doing exactly the same operations...