
Branch efficiency: check that we have no issues with branch divergence #25

Closed
valassi opened this issue Aug 18, 2020 · 5 comments
Labels: idea (Possible new development, may need further discussion), performance (How fast is it? Make it go faster!)

Comments


valassi commented Aug 18, 2020

Just a note as a reminder, following up on 'SIMD/SIMT' issues. After investigating SOA/AOS data access and showing that we have no uncoalesced memory access for momenta (issue #16), I was wondering how to best check in the profiler whether we have issues with divergent branches, i.e. threads in our warps that go out of 'lockstep'.
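To make the 'lockstep' point concrete, here is a minimal illustrative CUDA sketch (not from our code; the kernel and variable names are made up): a branch on per-thread data can split a warp into two serialized groups, while a branch on a value that is the same for all 32 threads stays uniform.

__global__ void divergentKernel( const double* in, double* out )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( in[tid] > 0. )   // divergent if the sign varies within a warp
    out[tid] = 2. * in[tid];
  else
    out[tid] = 0.;
}

__global__ void uniformKernel( const double* in, double* out, bool flag )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( flag )           // uniform: all 32 threads in a warp take the same path
    out[tid] = 2. * in[tid];
  else
    out[tid] = 0.;
}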

The only reference I found in the profiler doc is here
https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#statistical-sampler

[screenshot: Nsight Compute Profiling Guide, statistical sampler section]

If I understand correctly, this means that we should see "Stalled Barrier" in the Warp statistics. This seems to be always at zero.
[screenshot: Warp State Statistics, with Stall Barrier at zero]

I would say that we have no issues with branch divergence. Not surprising really, as all threads are doing exactly the same operations...

@valassi valassi added the idea Possible new development (may need further discussion) label Aug 18, 2020

valassi commented Aug 21, 2020

I tried to understand whether there was any other analysis to look into this. I could only find hints about how to do it with the old tools nvprof and nvvp (I guess the profile files should be .nvvp?).

nvprof -o pippo.prof -a branch ./gcheck.exe -p 65536 128 1 
nvvp

This is a screenshot from nvvp on that profile. It just says that there are no issues with divergent branches, without giving any more details. I guess it uses the same metrics as the stall barrier? Anyway, I think there really are no issues.
[screenshot: nvvp kernel analysis, reporting no issues with divergent branches]


valassi commented Dec 9, 2020

Note that this is also relevant to vectorisation, #71 and #72.

The fact that we get almost the full factor 4 from AVX2 is a sign that we have no divergence on the CPU.

We should keep this open to reevaluate when we add a selection cut.
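Just to illustrate the point about a future selection cut (a hedged sketch only; sigmaKinWithCut, pt and ptcut are hypothetical names, and the body is a trivial stand-in for the real matrix element calculation): as soon as the branch depends on per-event kinematics, events passing and failing the cut can end up in the same warp and the branch diverges.

__global__ void sigmaKinWithCut( const double* pt, double* me, const double ptcut )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  if ( pt[ievt] > ptcut )            // data-dependent branch: may diverge within a warp
    me[ievt] = pt[ievt] * pt[ievt];  // stand-in for the real matrix element calculation
  else
    me[ievt] = 0.;                   // event fails the cut
}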

valassi added a commit to valassi/madgraph4gpu that referenced this issue May 12, 2021
…oughput12.sh

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.403954e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.747784 sec
     2,606,062,103      cycles                    #    2.648 GHz
     3,536,734,749      instructions              #    1.36  insn per cycle
       1.052060967 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       868,352
                             : smsp__sass_branch_targets_threads_uniform.sum       868,352
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            16,384
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.397452e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.608068 sec
     2,198,704,176      cycles                    #    2.652 GHz
     2,956,510,323      instructions              #    1.34  insn per cycle
       0.892671051 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       9,053,696
                             : smsp__sass_branch_targets_threads_uniform.sum       9,053,696
                             : smsp__sass_branch_targets_threads_divergent.sum     0
                             : smsp__warps_launched.sum                            512
=========================================================================
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 12, 2021
… divergence and measure it

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.811740e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.012377 sec
     3,103,258,724      cycles                    #    2.652 GHz
     4,387,995,862      instructions              #    1.41  insn per cycle
       1.308308716 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       1,785,856
                             : smsp__sass_branch_targets_threads_uniform.sum       1,720,320
                             : smsp__sass_branch_targets_threads_divergent.sum     65,536
                             : smsp__warps_launched.sum                            16,384
=========================================================================

valassi commented May 13, 2021

After a few months I have come back to this issue with two improvements:

  • one, I think I understand better which are the relevant metrics
  • two, I made a small test that artificially introduces some divergence, just to see what this gives in the profiles

I also make some comments on my previous posts.

(1) NEW TESTS AND METRICS

The code is in PR #202 and #203

The main metric is sm__sass_average_branch_targets_threads_uniform.pct

$(which ncu) --metrics launch__registers_per_thread,sm__sass_average_branch_targets_threads_uniform.pct --target-processes all --kernel-id "::sigmaKin:" --print-kernel-base mangled $exe $args | egrep '(sigmaKin|registers| sm)' | tr "\n" " " | awk '{print $1, $2, $3, $15, $17; print $1, $2, $3, $18, $20$19}'

This metric should be 100% for uniform execution (i.e. no divergence) and less than 100% for divergence.

Unfortunately, it is difficult to translate a percentage of non-uniformity into a throughput degradation. In the example below, I get a 96% uniformity, but the throughput degradation is around 20-30%, not just 4%!

The test is this


Essentially, in half of the threads in a warp I use the default optimized opzxxx, and in the other half I use the non-optimized oxxxxx.
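As a toy model of this test (illustrative only; fastPath, slowPath and halfAndHalf are made-up stand-ins, the real change calling opzxxx/oxxxxx is in PR #202 and #203):

// Even threads take a cheap path, odd threads take an expensive path, so
// every warp of 32 threads contains both branches and is guaranteed to diverge.
__device__ double fastPath( double x )   // stands in for the optimized opzxxx
{
  return 2. * x;
}

__device__ double slowPath( double x )   // stands in for the generic oxxxxx
{
  double r = x;
  for ( int i = 0; i < 8; ++i ) r = r * r + x;
  return r;
}

__global__ void halfAndHalf( const double* in, double* out )
{
  const int tid = blockDim.x * blockIdx.x + threadIdx.x;
  if ( tid % 2 == 0 )
    out[tid] = fastPath( in[tid] );
  else
    out[tid] = slowPath( in[tid] );
}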

The actual "4%" seems to be computed in the following way: there are a number of "branches" in the code in total, which can be either uniform (taken by all threads in a warp) or divergent (taken by some threads but not all, essentially). In my test, the current eemumu cuda, WE HAVE NO DIVERGENCE, and there are 53 branches, all 53 are taken in a uniform way. I guess these 53 include function calls and other possible decision points (or maybe, we actually have many ifs...). If I introduce a very silly/simple divergence as above, the number of branches goes from 53 to 109, and actually this reports 4 non unfirm branches and 105 uniform branches. The 105/109 is 96.33%. Not really helpful to translate to throughputs, but that's it.
WE SHOULD AIM TO STAY AT 100% UNIFORM BRANCH EXECUTION.
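For reference, the 96.33% quoted by ncu seems to be just the ratio of the two counters; a trivial sanity check (plain C++, values copied from the profile output below):

#include <cstdio>

// How sm__sass_average_branch_targets_threads_uniform.pct is (apparently) computed:
// uniform branch targets over total branch targets.
int main()
{
  const double uniform = 105.; // smsp__sass_branch_targets_threads_uniform.sum
  const double total   = 109.; // smsp__sass_branch_targets.sum
  std::printf( "branch efficiency = %.2f%%\n", 100. * uniform / total ); // prints 96.33%
  return 0;
}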

This is with the artificial divergence
b51bee6

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 5.711994e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.745683 sec
     2,603,540,638      cycles                    #    2.655 GHz
     3,537,849,260      instructions              #    1.36  insn per cycle
       1.049477458 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 128
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 96.33%
                             : smsp__sass_branch_targets.sum                       109        4.18/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       105        4.03/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     4          153.37/msecond
                             : smsp__warps_launched.sum                            1
=========================================================================

This is without divergence (I also include ggttgg)
aaa28b7

On itscrd70.cern.ch (V100S-PCIE-32GB):
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.425099e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.741551 sec
     2,589,547,187      cycles                    #    2.655 GHz
     3,537,039,425      instructions              #    1.37  insn per cycle
       1.044156654 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       53         2.89/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
-------------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.454874e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.602111 sec
     2,193,960,041      cycles                    #    2.654 GHz
     2,948,877,241      instructions              #    1.34  insn per cycle
       0.885704400 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
                             : smsp__sass_branch_targets.sum                       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_uniform.sum       17,683     1.52/usecond
                             : smsp__sass_branch_targets_threads_divergent.sum     0          0/second
                             : smsp__warps_launched.sum                            1
=========================================================================

Note that in the tests above I go for -p 1 32 1, which launches only one warp (32 threads) in total.

These tests above are using ncu with the command line interface.

(2) COMMENTS ON THE OLD TESTS IN THIS THREAD

Concerning my previous comments on stalled barriers, these do not seem to be very useful for measuring thread divergence. At least, in my simple test with oxxxxx/opzxxx, the ncu metrics about stalled barriers were not helpful.

I did a few more tests with ncu using the GUI. This is also interesting. For instance

  • The throughput indeed decreases by 26%, i.e. the kernel time increases by that much
  • Memory usage degrades considerably: I now get ncu warnings about non-coalesced memory access, which I was not getting before, and the number of requests and transactions increases by 40% (not clear why?)
  • The number of instructions increases by 10%
  • Memory throughput decreases by 10%
  • Even the number of registers increases slightly! From 120 to 128

ALL IN ALL, THIS SHOWS THAT EVEN A MINIMAL DIVERGENCE CAUSES BIG BIG ISSUES...

[screenshot: ncu GUI comparison of the divergent and non-divergent runs]

Notice that the stalled barrier that I had mentioned before, instead, does not seem to have any relevance: in this example it is zero both for the divergent and the uniform test.

[screenshot: Warp State Statistics, Stall Barrier at zero in both runs]

Finally, THREAD DIVERGENCE IS INDICATED IN THE FINAL NCU SECTION, SOURCE COUNTERS. This is also the section that complains about non-coalesced memory access.

[screenshot: ncu Source Counters section, reporting branch divergence and uncoalesced memory accesses]

In the default non-divergent code version, I am told 100% branch efficiency, and I am not told about any non-coalesced memory access

[screenshot: ncu Source Counters section for the non-divergent version, 100% branch efficiency and no memory access warnings]

(3) ABOUT NVVP

About my previous comments on the older nvvp tool, I will not reproduce the tests here. I showed that using ncu, either in command line mode or GUI mode, is enough to check whether there is thread divergence.

@valassi valassi changed the title Check that we have no issues with barnch divergence Check that we have no issues with branch divergence May 13, 2021
@valassi valassi self-assigned this May 13, 2021
@valassi valassi added the performance How fast is it? Make it go faster! label May 13, 2021

valassi commented May 13, 2021

Finally, a few useful links about branch efficiency

I think that this can be closed

  • the branch efficiency metric is printed out routinely via ncu in my throughput12.sh script; we should check that it is 100%
  • it is also easy to get it from the ncu GUI via the Source Counters section

Eventually, if we do start having branch divergence (hopefully not), I think that it should be possible to correlate the branch divergence to the individual divergent branches in the code (as discussed also in the three links above).

Closing as completed...

@valassi valassi closed this as completed May 13, 2021
@valassi valassi changed the title Check that we have no issues with branch divergence Branch efficiency: check that we have no issues with branch divergence May 13, 2021

valassi commented May 13, 2021

PS last comment: I also checked the utilization of the ADU pipeline (address divergence unit)
https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-decoder

However, in my example there does not seem to be a big difference (actually, the ADU is slightly busier with no divergence?).
[screenshot: ADU pipeline utilization in the two runs]
