Comparing madgraph4gpu abstractions for performance
The aim is to compare abstractions, e.g. Alpaka, SYCL, Kokkos. In this wiki the focus is on performance metrics, although a comparison might include other considerations, like the time required for porting or the ease of doing so.
In order to prepare performance metrics, agreement is needed on the software, on the conditions of the tests, and on exactly what is measured. Here we try to explore what needs to be controlled and eventually document what is decided.
To start, the following issues are considered:
- Software version
The abstraction code is understood to be generated by madgraph5 plugins, produced in the usual way for a given process (or processes). It is proposed that the abstractions be tested by running on a given type of GPU, e.g. a V100. It is thought that they can then be compared to a reference (i.e. the CUDA version produced with specified versions of madgraph and the CUDA plugin).
In some cases, e.g. Alpaka, the abstraction plugin was based on a modified CUDA plugin. So the closest version of CUDA could be understood to be the version of the CUDA plugin on which it was based. Changes in the abstraction plugin with respect to the CUDA version are assumed to be a reasonable set needed to make the abstraction work. If there are cases where an abstraction plugin is not directly based on the CUDA one, it has to be established if, and which, CUDA version could be used as a reference.
It seems most likely that the abstractions should be such that the same version of the CUDA reference applies to all.
To establish the current situation, a collection of information about the abstraction plugins is included below; for instance which version of the CUDA plugin (if any) they are based on, and which version of madgraph they operate with. Other notable constraints or differences in how they function may also be relevant:
- Alpaka (which also uses the cupla porting layer)
The alpaka-producing plugin is based on the CUDA plugin from the epochX directory at the golden_epochX4 tag. It has been used with madgraph5 2.7.0-gpu revision 370. It uses alpaka 0.8.0, cupla 0.3.0 and gcc 10.2.0. The alpaka code produced is configured to target GPUs via the CUDA backend, and in this case CUDA is used for compilation. CUDA version 11.6 has been used so far.
Other notable changes: where possible it was thought that defaults should be set so that CUDA isn't specifically required, unless the target is actually an Nvidia GPU, in which case CUDA is needed for the compilation. For example, the default complex number class implementation is a custom one (rather than cuComplex or thrust). The default random generator is the common one (which is host side only) rather than curandGenerateUniformDouble() etc., which can run on either host or device.
Object files have to be generated with the -dc option of nvcc (equivalent to --relocatable-device-code); a minimal illustration of the compile flow is sketched below.
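As a minimal sketch (the file names and functions are hypothetical, not the actual madgraph4gpu sources), relocatable device code is what allows a __device__ function defined in one translation unit to be called from a kernel in another:

```cuda
// Build each object with relocatable device code, then let nvcc device-link:
//   nvcc -dc helpers.cu -o helpers.o
//   nvcc -dc kernel.cu  -o kernel.o
//   nvcc helpers.o kernel.o ... -o app   (nvcc performs the device link step)

// --- helpers.cu: a __device__ function in its own translation unit ---
__device__ double square( double x ) { return x * x; }

// --- kernel.cu: a kernel that calls the function from helpers.cu ---
__device__ double square( double x );   // declaration only, resolved at device link time

__global__ void fill( double* out, int n )
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if( i < n ) out[i] = square( (double)i );
}
```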
- Processes
It has been suggested that ee->mumu, gg->tt and gg->ttgg be used.
- Run conditions
Double precision for primary result, possibly with single precision as a comparison point.
- Run conditions: other
The number of blocks, threads per block and number of iterations to be used are not yet established. For the blocks and threads/block values it is suggested that the set which is optimal (in terms of the metric) for each abstraction could be used. For the number of iterations it is suggested that a single value could be decided and used for all abstractions. A possibility is to determine the number of iterations needed to make the standard error of the mean (assuming this exists) of whatever is measured small; a sketch of that calculation is given below.
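As a rough sketch (the function names and the 1% tolerance are illustrative, not anything agreed), the standard error of the mean of the per-iteration measurements could be computed as follows, and the number of iterations increased until the relative SEM falls below the chosen tolerance:

```cuda
#include <cmath>
#include <vector>

// Standard error of the mean (SEM) of a set of per-iteration measurements,
// e.g. the ME kernel duration of each iteration.
double standardErrorOfMean( const std::vector<double>& x )
{
  const double n = (double)x.size();
  double mean = 0.0;
  for( double v : x ) mean += v;
  mean /= n;
  double var = 0.0;
  for( double v : x ) var += ( v - mean ) * ( v - mean );
  var /= ( n - 1.0 );            // unbiased sample variance
  return std::sqrt( var / n );   // SEM = s / sqrt(n)
}

// Illustrative stopping rule: enough iterations once the relative SEM is below 1%.
bool enoughIterations( const std::vector<double>& x )
{
  double mean = 0.0;
  for( double v : x ) mean += v;
  mean /= (double)x.size();
  return standardErrorOfMean( x ) / mean < 0.01;
}
```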
- Results
It has been suggested that two types of result could be collected. One is a pure ME calculation, i.e. the average duration of the calculation of the matrix element. The other would be a result intended to include the timing from some device<->host transfers, to reflect better the expected situation when in use. It has been suggested that for the pure ME calculation there is no need to coordinate random numbers between abstractions.
The metric for the second type of result could also be a duration, from starting to calculate the matrix element to having it back on the host; a sketch of how both durations might be timed is given below. It is to be decided if this should also include the time to initially generate the random numbers and transfer them to the host. If the random number portion is to be included, it seems it might be necessary to use the same generator, and the same choice of whether it runs on the host or device, across abstractions.
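As a minimal sketch, assuming CUDA events are an acceptable way to take the timestamps (the kernel here is a placeholder, not the real ME code), the two durations could be measured like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for the matrix element kernel.
__global__ void computeME( const double* momenta, double* me, int nevt )
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if( i < nevt ) me[i] = momenta[i] * momenta[i];  // dummy arithmetic
}

int main()
{
  const int nevt = 16384, threads = 256, blocks = nevt / threads;
  const size_t bytes = nevt * sizeof( double );
  double *dMomenta, *dMe, *hMe;
  cudaMalloc( (void**)&dMomenta, bytes );
  cudaMalloc( (void**)&dMe, bytes );
  cudaMallocHost( (void**)&hMe, bytes );
  cudaMemset( dMomenta, 0, bytes );  // stand-in for real input momenta

  cudaEvent_t start, afterKernel, afterCopy;
  cudaEventCreate( &start );
  cudaEventCreate( &afterKernel );
  cudaEventCreate( &afterCopy );

  cudaEventRecord( start );
  computeME<<<blocks, threads>>>( dMomenta, dMe, nevt );
  cudaEventRecord( afterKernel );
  cudaMemcpy( hMe, dMe, bytes, cudaMemcpyDeviceToHost );  // bring MEs back to the host
  cudaEventRecord( afterCopy );
  cudaEventSynchronize( afterCopy );

  float msKernel = 0.f, msTotal = 0.f;
  cudaEventElapsedTime( &msKernel, start, afterKernel );  // "pure ME" duration
  cudaEventElapsedTime( &msTotal, start, afterCopy );     // ME + device->host transfer
  printf( "kernel %.3f ms, kernel+copy %.3f ms\n", msKernel, msTotal );

  cudaFree( dMomenta ); cudaFree( dMe ); cudaFreeHost( hMe );
  return 0;
}
```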
It is unclear if the spread of the result(s) between runs will be large and whether it should be quoted in the final result. The spread might be due not only to the error of the mean but also to factors which might change between runs of the code on the test device. So the variation should be checked. It is proposed that a number of runs be done and the interquartile range of the results be found; a sketch is given below.
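As an illustrative sketch (the quartile convention used here is a simple index-based one, not anything agreed), the interquartile range of the per-run results could be computed as follows:

```cuda
#include <algorithm>
#include <vector>

// Interquartile range of the per-run results (e.g. the mean ME throughput of
// each run), using a simple index-based quartile without interpolation.
double interquartileRange( std::vector<double> runs )
{
  std::sort( runs.begin(), runs.end() );
  auto quartile = [&]( double q ) {
    const size_t idx = (size_t)( q * ( runs.size() - 1 ) );
    return runs[idx];
  };
  return quartile( 0.75 ) - quartile( 0.25 );
}
```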