
Branch efficiency: check that we have no issues with branch divergence #25

Closed
valassi opened this issue Aug 18, 2020 · 5 comments
Labels: idea (Possible new development, may need further discussion), performance (How fast is it? Make it go faster!)

Comments


valassi commented Aug 18, 2020

Just a note as a reminder, following up on 'SIMD/SIMT' issues. After investigating SOA/AOS data access and showing that we have no uncoalesced memory access for momenta (issue #16), I was wondering how to best check in the profiler whether we have issues with divergent branches, i.e. threads in our warps that go out of 'lockstep'.
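To make the 'lockstep' point concrete, here is a minimal illustrative CUDA sketch (not from our code; the kernel and variable names are made up): a branch on per-thread data can split a warp into two serialized groups, while a branch on a value that is the same for all 32 threads stays uniform.

__global__ void divergentKernel( const double* in, double* out )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( in[tid] > 0. )   // divergent if the sign varies within a warp
    out[tid] = 2. * in[tid];
  else
    out[tid] = 0.;
}

__global__ void uniformKernel( const double* in, double* out, bool flag )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( flag )           // uniform: all 32 threads in a warp take the same path
    out[tid] = 2. * in[tid];
  else
    out[tid] = 0.;
}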

The only reference I found in the profiler doc is here
https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler

[screenshot: Nsight Compute Profiling Guide, statistical sampler section]

If I understand correctly, this means that we should see "Stalled Barrier" in the Warp statistics. This seems to be always at zero.
[screenshot: Warp State Statistics, with Stall Barrier at zero]

I would say that we have no issues with branch divergence. Not surprising really, as all threads are doing exactly the same operations...

@valassi valassi added the idea Possible new development (may need further discussion) label Aug 18, 2020

valassi commented Aug 21, 2020

I tried to understand whether there was any other analysis to look into this. I could only find hints about how to do it with the old tools nvprof and nvvp (I guess the profile files should be .nvvp?).

nvprof -o pippo.prof -a branch ./gcheck.exe -p 65536 128 1 
nvvp

This is a screenshot from nvvp on that profile. It just says that there are no issues with divergent branches, without giving any more details. I guess it uses the same metrics as the stall barrier? Anyway, I think there really are no issues.
[screenshot: nvvp kernel analysis, reporting no issues with divergent branches]


valassi commented Dec 9, 2020

Note that this is also relevant to vectorisation, #71 and #72.

The fact that we get almost the full factor 4 from AVX2 is a sign that we have no divergence on the CPU.

We should keep this open to reevaluate when we add a selection cut.
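Just to illustrate the point about a future selection cut (a hedged sketch only; sigmaKinWithCut, pt and ptcut are hypothetical names, and the body is a trivial stand-in for the real matrix element calculation): as soon as the branch depends on per-event kinematics, events passing and failing the cut can end up in the same warp and the branch diverges.

__global__ void sigmaKinWithCut( const double* pt, double* me, const double ptcut )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if ( pt[ievt] > ptcut )            // data-dependent branch: may diverge within a warp
    me[ievt] = pt[ievt] * pt[ievt];  // stand-in for the real matrix element calculation
  else
    me[ievt] = 0.;                   // event fails the cut
}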

valassi added a commit to valassi/madgraph4gpu that referenced this issue May 12, 2021
…oughput12.sh

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.747784 sec
     2,606,062,103      cycles                    #    2.648 GHz
     3,536,734,749      instructions              #    1.36  insn per cycle
       1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       868,352
                             : smsp__sass_branch_targets_threads_uniform.sum       868,352
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            16,384
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397452e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.608068 sec
     2,198,704,176      cycles                    #    2.652 GHz
     2,956,510,323      instructions              #    1.34  insn per cycle
       0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       9,053,696
                             : smsp__sass_branch_targets_threads_uniform.sum       9,053,696
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            512
=========================================================================
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 12, 2021
… divergence and measure it

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.012377 sec
     3,103,258,724      cycles                    #    2.652 GHz
     4,387,995,862      instructions              #    1.41  insn per cycle
       1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       1,785,856
                             : smsp__sass_branch_targets_threads_uniform.sum       1,720,320
                             : smsp__sass_branch_targets_threads_divergent.sum     65,536
                             : smsp__warps_launched.sum                            16,384
=========================================================================

valassi commented May 13, 2021

After a few months I have come back to this issue with two improvements:

  • one, I think I understand better which are the relevant metrics
  • two, I made a small test that artificially introduces some divergence, just to see what this gives in the profiles

I also make some comments on my previous posts.

(1) NEW TESTS AND METRICS

The code is in PR #202 and #203

The main metric is sm__sass_average_branch_targets_threads_uniform.pct

$(which ncu) --metrics launch__registers_per_thread,sm__sass_average_branch_targets_threads_uniform.pct --target-processes all --kernel-id "::sigmaKin:" --print-kernel-base mangled $exe $args | egrep '(sigmaKin|registers| sm)' | tr "\n" " " | awk '{print $1, $2, $3, $15, $17; print $1, $2, $3, $18, $20$19}'

This metric should be 100% for uniform execution (i.e. no divergence) and less than 100% for divergence.

Unfortunately, it is difficult to translate a percentage of non-uniformity into a throughput degradation. In the example below, I get a 96% uniformity, but the throughput degradation is around 20-30%, not just 4%!

The test is this


Essentially, in half of the threads in a warp I use the default optimized opzxxx, and in the other half I use the non-optimized oxxxxx.
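As a toy model of this test (illustrative only; fastPath, slowPath and halfAndHalf are made-up stand-ins, the real change calling opzxxx/oxxxxx is in PR #202 and #203):

// Even threads take a cheap path, odd threads take an expensive path, so
// every warp of 32 threads contains both branches and is guaranteed to diverge.
__device__ double fastPath( double x )   // stands in for the optimized opzxxx
{
  return 2. * x;
}

__device__ double slowPath( double x )   // stands in for the generic oxxxxx
{
  double r = x;
  for ( int i = 0; i < 8; ++i ) r = r * r + x;
  return r;
}

__global__ void halfAndHalf( const double* in, double* out )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( tid % 2 == 0 )
    out[tid] = fastPath( in[tid] );
  else
    out[tid] = slowPath( in[tid] );
}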

The actual "4%" seems to be computed in the following way: there are a number of "branches" in the code in total, which can be either uniform (taken by all threads in a warp) or divergent (taken by some threads but not all, essentially). In my test, the current eemumu cuda, WE HAVE NO DIVERGENCE, and there are 53 branches, all 53 are taken in a uniform way. I guess these 53 include function calls and other possible decision points (or maybe, we actually have many ifs...). If I introduce a very silly/simple divergence as above, the number of branches goes from 53 to 109, and actually this reports 4 non unfirm branches and 105 uniform branches. The 105/109 is 96.33%. Not really helpful to translate to throughputs, but that's it.
WE SHOULD AIM TO STAY AT 100% UNIFORM BRANCH EXECUTION.
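For reference, the 96.33% quoted by ncu seems to be just the ratio of the two counters; a trivial sanity check (plain C++, values copied from the profile output below):

#include <cstdio>

// How sm__sass_average_branch_targets_threads_uniform.pct is (apparently) computed:
// uniform branch targets over total branch targets.
int main()
{
  const double uniform = 105.; // smsp__sass_branch_targets_threads_uniform.sum
  const double total   = 109.; // smsp__sass_branch_targets.sum
  std::printf( "branch efficiency = %.2f%%\n", 100. * uniform / total ); // prints 96.33%
  return 0;
}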

This is with the artificial divergence
b51bee6

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.745683 sec
     2,603,540,638      cycles                    #    2.655 GHz
     3,537,849,260      instructions              #    1.36  insn per cycle
       1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       109        4.18/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       105        4.03/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     4          153.37/msecond
                             : smsp__warps_launched.sum                            1
=========================================================================

This is without divergence (I also include ggttgg)
aaa28b7

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.741551 sec
     2,589,547,187      cycles                    #    2.655 GHz
     3,537,039,425      instructions              #    1.37  insn per cycle
       1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.602111 sec
     2,193,960,041      cycles                    #    2.654 GHz
     2,948,877,241      instructions              #    1.34  insn per cycle
       0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
=========================================================================

Note that in the tests above I go for -p 1 32 1, which launches only one warp (32 threads) in total.

These tests above are using ncu with the command line interface.

(2) COMMENTS ON THE OLD TESTS IN THIS THREAD

Concerning my previous comments on stalled barriers, these do not seem to be very useful for measuring thread divergence. At least, in my simple test with oxxxxx/opzxxx, the ncu metrics about stalled barriers were not helpful.

I did a few more tests with ncu using the GUI. This is also interesting. For instance

  • The throughput indeed decreases by 26%, i.e. the kernel time increases by that much
  • Memory usage degrades considerably: I now get ncu warnings about non-coalesced memory access, which I was not getting before, and the number of requests and transactions increases by 40% (not clear why?)
  • The number of instructions increases by 10%
  • Memory throughput decreases by 10%
  • Even the number of registers increases slightly! From 120 to 128

ALL IN ALL, THIS SHOWS THAT EVEN A MINIMAL DIVERGENCE CAUSES BIG BIG ISSUES...

[screenshot: ncu GUI comparison of the divergent and non-divergent runs]

Notice that the stalled barrier that I had mentioned before, instead, does not seem to have any relevance: in this example it is zero both for the divergent and the uniform test.

[screenshot: Warp State Statistics, Stall Barrier at zero in both runs]

Finally, THREAD DIVERGENCE IS INDICATED IN THE FINAL NCU SECTION, SOURCE COUNTERS. This is also the section that complains about non-coalesced memory access.

[screenshot: ncu Source Counters section, reporting branch divergence and uncoalesced memory accesses]

In the default non-divergent code version, I am told 100% branch efficiency, and I am not told about any non-coalesced memory access

[screenshot: ncu Source Counters section for the non-divergent version, 100% branch efficiency and no memory access warnings]

(3) ABOUT NVVP

About my previous comments on the older nvvp tool, I will not reproduce the tests here. I showed that using ncu, either in command line mode or GUI mode, is enough to check whether there is thread divergence.

@valassi valassi changed the title Check that we have no issues with barnch divergence Check that we have no issues with branch divergence May 13, 2021
@valassi valassi self-assigned this May 13, 2021
@valassi valassi added the performance How fast is it? Make it go faster! label May 13, 2021

valassi commented May 13, 2021

Finally, a few useful links about branch efficiency

I think that this can be closed

  • the branch efficiency metric is printed out routinely via ncu in my throughput12.sh script; we should check that it is 100%
  • it is also easy to get it from the ncu GUI via the Source Counters section

Eventually, if we do start having branch divergence (hopefully not), I think that it should be possible to correlate the branch divergence to the individual divergent branches in the code (as discussed also in the three links above).

Closing as completed...

@valassi valassi closed this as completed May 13, 2021
@valassi valassi changed the title Check that we have no issues with branch divergence Branch efficiency: check that we have no issues with branch divergence May 13, 2021

valassi commented May 13, 2021

PS last comment: I also checked the utilization of the ADU pipeline (address divergence unit)
https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder

However, in my example there does not seem to be a big difference (actually, the ADU is slightly busier with no divergence?).
[screenshot: ADU pipeline utilization in the two runs]
