Better readme
zhanglx13 committed Jul 22, 2024
1 parent cd736ce commit 6ee25cc
Showing 2 changed files with 181 additions and 88 deletions.
261 changes: 177 additions & 84 deletions scripts/amd/gemm/README.md
@@ -1,44 +1,164 @@
# GEMM tuning script v2
# GEMM tuning script (current v3.3)

This is the v2 version of the gemm tuning script, which is based on @scxiao's v1 (https://github.com/ROCmSoftwarePlatform/triton/pull/309) and @alefimov-amd's thread pool https://github.com/ROCmSoftwarePlatform/triton/pull/310
## matmul kernel

### Main features
- `rocprof` is used to measure the time for kernels in the full tuning space
- Each kernel is executed 10 times and the execution time of the last instance is used
- All kernels are compiled in parallel
- Two modes for correctness checking
- During tuning, check correctness with the best perf_config for the current gemm size
- Without tuning, check correctness based on the tuning results, which includes best perf_config for each gemm size
- The process takes about 30 - 40 minutes for the full tuning space with ~15000 configs
- Limitations
- For now, only support fp16 as inputs. It should be trivial to extend to other types, but may require some work for mixed inputs
The matmul kernel implementation can be found in [matmul_kernel.py](https://github.com/ROCm/triton/blob/triton-mlir/scripts/amd/gemm/matmul_kernel.py), which includes the following features:
- grouping order of workgroup id, controlled by `GROUP_SIZE_M`, which
implements the L2 cache optimization introduced in the [tutorial](https://triton-lang.org/main/getting-started/tutorials/03-matrix-multiplication.html#l2-cache-optimizations).
- split-k algorithm, controlled by `SPLIT_K`.
- Bias along the M dim, controlled by `BIAS` and `bias_ptr`.
- Masked load along the K dim inside the loop, controlled by `EVEN_K`.
This means `BLOCK_SIZE_K` does not need to divide the K dim.
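
To illustrate the grouped workgroup ordering, here is a minimal pure-Python sketch of the index arithmetic described in the Triton tutorial. The function name `grouped_tile_index` and its arguments are hypothetical and chosen for illustration; the real kernel computes the same kind of mapping from `tl.program_id` inside Triton and may differ in detail.

```python
def grouped_tile_index(pid, num_pid_m, num_pid_n, group_size_m):
    """Map a linear workgroup id to a (pid_m, pid_n) output tile.

    Workgroups are ordered so that `group_size_m` rows of tiles are
    visited column by column before moving to the next group of rows.
    """
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    # The last group is shorter when group_size_m does not divide num_pid_m.
    rows_in_group = min(num_pid_m - first_pid_m, group_size_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % rows_in_group
    pid_n = (pid % num_pid_in_group) // rows_in_group
    return pid_m, pid_n


# A 4x4 grid of output tiles with GROUP_SIZE_M = 2.
order = [grouped_tile_index(p, 4, 4, 2) for p in range(16)]
print(order[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Consecutive workgroups stay within the same two rows of tiles, so they reuse the same rows of A while sweeping across B.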

### Usage
Go to the script dir
```bash
cd triton/scripts/amd/gemm/
### Differences from the tutorial

Unlike the [matmul tutorial](https://github.com/triton-lang/triton/blob/main/python/tutorials/03-matrix-multiplication.py) (referred to as the tutorial),
the matmul kernel used in the tuning script (referred to as the kernel) does not
guard loads along the M and N dims
([this](https://github.com/triton-lang/triton/blob/main/python/tutorials/03-matrix-multiplication.py#L282-L283) shows how this is done in the tutorial).
When `BLOCK_SIZE_M` or `BLOCK_SIZE_N` does not divide M or N, the kernel will
load out-of-bounds data.
In most cases this is fine, since the kernel does a masked store at the end.
However, this may lead to a GPU memory access fault in some cases, especially
when the tensor is large.
We will fix this issue in the future.


## Tuning script usage

### Tuning mode

The tuning script can take one or more gemm sizes and run tuning for them.
The input gemm sizes are prepared in a yaml file. Here is an example yaml file:
```yaml
- {'M': 4864, 'N': 4096, 'K': 8256, 'rowMajorA': 'T', 'rowMajorB': 'N'}
- {'M': 512, 'N': 512, 'K': 512, 'rowMajorA': 'T', 'rowMajorB': 'N'}
```
1. Tune gemm sizes given in a yaml file and check correctness on the way
```bash
python tune_gemm.py --gemm_size_file input_gemm_sizes.yaml --compare
The tuning script is invoked as follows:
```bash
./tune_gemm --gemm_size_file input.yaml [options]
```
The following `options` are supported in the tuning mode

- Input data types:
- `-dtype_a dtype`, `-dtype_b dtype`, and `-dtype_c dtype`: input and output element type.
- Supported `dtype`: fp16 (default), bf16, fp8, bf8, int8, int32, fp32
- Parallel compilation of kernels:
- `--num_threads n` sets the number of threads used in the compilation stage.
The default value is 32.
- `--no_warmup` skips the compilation stage, so kernels are compiled during
the profiling stage instead. This increases tuning time, but it is required
for some old torch versions that do not support a function used in the
warmup kernel launch.
- Parallel profiling of kernels: The tuning space is first divided into a number
of tasks, which is controlled by `--jobs n`. The tasks can then be profiled in
parallel on a number of GPUs in the system. There are two ways to specify which
GPU(s) to use for profiling. Note that these flags cannot be used together.
By default, only one task is generated and profiled on GPU0.
- `--ngpus n`: GPU 0,1,.., n-1 will be used.
- `--gpu_ids ids`: `ids` are comma separated gpu ids and GPUs in `ids` will be used.
- General tuning control flags
- `--init_type INIT_TYPE` defines how input data are initialized. `INIT_TYPE` can be
- hpl: uniform distribution between -.5 and .5
- trig_float: the distribution of elements in the flattened tensor follows
the `sin` function.
- zeros: initialize all data as 0, i.e. `torch.zeros`
- randn (default): normal distribution, i.e. `torch.randn`
- `--rotating_tensor SIZE`: provide the size of memory used for the rotating tensor.
The default is 0, meaning the rotating tensor is not used.
- `--icache_flush`: If true, the script will generate a kernel to flush the i-cache.
The default is False.
- `--bias_vector`: If true, a bias vector along the M dim is applied.
The default is False.
- Correctness check
- `--compare` will check the correctness of the best config for each gemm size.
- `--compare_wo_tuning` will check the correctness of the config provided in
the yaml file. If this is set, the user needs to provide all the parameters in
the input yaml file. An example can be found in the benchmark mode section.
- Logistics
- `--keep` can be used to keep the files generated during the tuning process.
By default, intermediate files are removed at the end.
- `--time_breakdown`: If set, the script will print out elapsed time during
each stage of the tuning in real-time. The default is False.
- `--verbose` enables more logging messages than `--time_breakdown`, such
as the output from rocprofv2.
- `--o OUTPUT` can be used to control the output filename to store the tuning
result. The default filename is `tuning_results_branchName@gitCommit_timeStamp.yaml`.
Therefore, each time the user runs the tuning script, a different output file
will be generated.
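
The GPU selection rules above (`--ngpus` and `--gpu_ids` are mutually exclusive, and GPU0 is the default) can be sketched as a small helper. This is an illustration only: `resolve_gpus` is a hypothetical name, and the actual logic in `tune_gemm.py` may be structured differently.

```python
def resolve_gpus(ngpus=0, gpu_ids=""):
    """Return the list of GPU ids to profile on.

    Hypothetical helper: `ngpus` mirrors `--ngpus n` and `gpu_ids`
    mirrors the comma-separated string passed to `--gpu_ids`.
    """
    if ngpus and gpu_ids:
        raise ValueError("--ngpus and --gpu_ids cannot be used together")
    if gpu_ids:
        return [int(i) for i in gpu_ids.split(",")]
    if ngpus:
        return list(range(ngpus))
    # Neither flag given: fall back to GPU0 only.
    return [0]


print(resolve_gpus(ngpus=4))            # [0, 1, 2, 3]
print(resolve_gpus(gpu_ids="0,1,3,4"))  # [0, 1, 3, 4]
print(resolve_gpus())                   # [0]
```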

Here are some example usages of running the script for tuning:

Tune some gemm sizes with fp16 input:
```bash
./tune_gemm --gemm_size_file input.yaml --ngpus 8 --jobs 32 --o output.yaml
```
It's recommended to use as many GPUs as possible and set `--jobs` to
a value that is 4 to 6 times the number of GPUs.
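
As a quick worked example of this rule of thumb, the snippet below computes the suggested range for `--jobs`. `recommended_jobs` is a hypothetical helper and not part of the tuning script.

```python
def recommended_jobs(num_gpus):
    """Return the (low, high) range for --jobs: 4x to 6x the GPU count."""
    return 4 * num_gpus, 6 * num_gpus


low, high = recommended_jobs(8)
print(f"--ngpus 8 --jobs between {low} and {high}")  # between 32 and 48
```

With 8 GPUs, the `--jobs 32` used in the example above sits at the lower end of that range.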

2. Tune a single gemm size
```bash
python tune_gemm.py -m 16 -n 16 -k 16
If you are only allowed to use a subset of the GPUs, you can select them with `--gpu_ids`:
```bash
./tune_gemm --gemm_size_file input.yaml --gpu_ids 0,1,3,4 --jobs 32 --o output.yaml
```
This runs the profiling on GPUs 0, 1, 3, and 4.

3. Choose the file to store tuning results
```bash
python tune_gemm.py --gemm_size_file input_gemm_sizes.yaml --o output_tuning.yaml
For bf8 input
```bash
./tune_gemm --gemm_size_file input.yaml --ngpus 8 --jobs 32 -dtype_a bf8 -dtype_b bf8
```

4. Only check correctness given the tuning results
```bash
python tune_gemm.py --gemm_size_file output_tuning.yaml --compare_wo_tuning
Check correctness of the tuned configs
```bash
./tune_gemm --gemm_size_file output.yaml --compare_wo_tuning
```


### Benchmark mode

In benchmark mode, the script will run a single given config multiple times to
collect performance data. The benchmark mode is invoked as follows:
```bash
./tune_gemm --gemm_size_file input.yaml [options] --benchmark
```
Note that the tuning results file is provided as the `gemm_size_file` in this scenario.
The supported `options` are as follows:
- `-dtype_a dtype`, `-dtype_b dtype`, and `-dtype_c dtype`: same as tuning mode.
- `--iters n` controls the number of iterations to run the kernel.
The default value is 1000.


## Tuning script implementation overview

The general idea of the tuning script can be summarized as
- Compile all the kernels in the tuning space in parallel.
- Divide the tuning space into tasks and invoke `rocprofv2` once per
task. This reduces the invocation overhead of the profiler.
- Profile tasks in parallel on multiple GPUs.

For detailed implementation, please refer to the changelog of each version.


# Changelog

## GEMM tuning script v1

Shucai (@scxiao) implemented the first version of gemm tuning script: https://github.com/ROCmSoftwarePlatform/triton/pull/309

## GEMM tuning script v2

This version is based on v1 and @alefimov-amd's thread pool https://github.com/ROCmSoftwarePlatform/triton/pull/310

### Main features
- `rocprof` is used to measure the time for kernels in the full tuning space
- Each kernel is executed 10 times and the execution time of the last instance is used
- All kernels are compiled in parallel
- Two modes for correctness checking
- During tuning, check correctness with the best perf_config for the current gemm size
- Without tuning, check correctness based on the tuning results, which includes best perf_config for each gemm size
- The process takes about 30 - 40 minutes for the full tuning space with ~15000 configs
- Limitations
- For now, only support fp16 as inputs. It should be trivial to extend to other types, but may require some work for mixed inputs

### Overview of implementations

@@ -63,7 +183,7 @@ Workflow of the tuning process
5. Invoke `rocprof` on the generated script
6. Post-process `results.csv` by extracting the execution time of the last instance of each kernel. Pick the best one, write to file, and return.
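
The post-processing in step 6 could look roughly like the sketch below. The column names `KernelName` and `DurationNs` and the function `best_kernel` are assumptions made for illustration; the real `results.csv` layout produced by the profiler is not shown in this document.

```python
import csv
import io


def best_kernel(csv_text, name_col="KernelName", time_col="DurationNs"):
    """Return (kernel_name, time) for the fastest kernel.

    Only the last recorded instance of each kernel is considered.
    """
    last_time = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Later rows overwrite earlier ones, so only the last instance survives.
        last_time[row[name_col]] = int(row[time_col])
    best = min(last_time, key=last_time.get)
    return best, last_time[best]


sample = """KernelName,DurationNs
cfg_a,900
cfg_b,400
cfg_a,420
cfg_b,450
"""
print(best_kernel(sample))  # ('cfg_a', 420)
```

In this sample `cfg_b` has the fastest single run (400), but its last instance (450) is slower than the last instance of `cfg_a` (420), so `cfg_a` is picked.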

# GEMM Tuning Script v3
## GEMM Tuning Script v3

### API changes

@@ -89,66 +209,15 @@ This is necessary to keep each file "small" in terms of execution time.
- Added error recovery. This helps when rocprof crashes in multi-processing mode.


### Example Usage

Let's say we have an input yaml file, named `gemm_input.yaml`, that contains the following configs
```yaml
- {'M': 4864, 'N': 4096, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N'}
- {'M': 8192, 'N': 8192, 'K': 8192, 'rowMajorA': 'T', 'rowMajorB': 'N'}
```
1. Tuning with bf8 input types with gpu 4,5,6,7, and save output to `output.yaml`
```bash
python ./tune_gemm.py --gemm_size_file gemm_input.yaml -dtype_a bf8 -dtype_b bf8 --gpu_ids 4,5,6,7 --o output.yaml
```

2. Check the correctness of the tuned configs
```bash
python ./tune_gemm.py --gemm_size_file output.yaml -dtype_a bf8 -dtype_b bf8 --compare_wo_tuning
```

3. Run benchmark of the tuned configs
```bash
python ./tune_gemm.py --gemm_size_file output.yaml -dtype_a bf8 -dtype_b bf8 --benchmark
```

A sample output from `benchmark` looks like
```bash
Benchmarking gemm with bf8 inputs (peak tflops: 1298)
trans M N K TFLOPS Efficiency
NT 4864 4096 8192 841.22 65%
NT 8192 8192 8192 745.31 57%
```
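
The columns in this sample output can be related with a little arithmetic. The sketch below assumes the usual GEMM operation count of `2 * M * N * K` and takes efficiency to be the ratio of achieved to peak TFLOPS, which matches the sample numbers; the function names are hypothetical and not taken from the script.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS, assuming 2*M*N*K operations per GEMM."""
    return 2 * m * n * k / seconds / 1e12


def efficiency(tflops, peak_tflops):
    """Fraction of the peak throughput that was achieved."""
    return tflops / peak_tflops


# Reproduce the Efficiency column from the sample (peak tflops: 1298).
print(f"{efficiency(841.22, 1298):.0%}")  # 65%
print(f"{efficiency(745.31, 1298):.0%}")  # 57%
```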

# GEMM Tuning Script v3.1
## GEMM Tuning Script v3.1

### API changes

- Added `matrix_instr_nonkdim` into the tuning space. Now we can tune mfma instruction size.


# One config running script

`one_config.py` is a script that runs one given matmul config.
It is an interface to `tune_gemm.py` functionality and could be used for triton debugging.

### Usage

This script supports two methods to specify configuration parameters.

Variant 1: Separate command line attributes.

```bash
python one_config.py -m 256 -n 256 -k 256 --block_m 64 --block_n 64 --block_k 64 --group_m 1 --split_k 2 --num_warps 2 --num_stages 0 --waves_per_eu 0 --matrix_instr_nonkdim 16 --kpack 2
```

Variant 2: one-line config description.
This is how configs are printed by `tune_gemm.py` script

```bash
python one_config.py --config_str M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16
```

# GEMM Tuning Script v3.2
## GEMM Tuning Script v3.2

### API changes

@@ -160,7 +229,8 @@ Rotating tensor and icache flush are to make perf numbers are closer to that in
- Added `--bias_vector` to support kernel execution with bias (bias vector is of the same size as the number of rows of the output matrix,
so each element of the bias vector is added to all elements of the corresponding row of the output matrix.)
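
The bias semantics described above can be shown with a tiny pure-Python example. `add_bias` is a hypothetical helper that only illustrates the arithmetic; the kernel applies the bias on the GPU.

```python
def add_bias(c, bias):
    """Add bias[i] to every element of row i of the output matrix c."""
    assert len(bias) == len(c), "bias length must equal the number of rows"
    return [[x + b for x in row] for row, b in zip(c, bias)]


c = [[1, 2, 3],
     [4, 5, 6]]
bias = [10, 20]
print(add_bias(c, bias))  # [[11, 12, 13], [24, 25, 26]]
```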

# GEMM Tuning Script v3.3

## GEMM Tuning Script v3.3

### API changes

@@ -201,3 +271,26 @@ that cannot divide `K`.
- Tuning result file is opened and closed inside the tuning loop, enabling timely flush
of the tuning results.
- Now we use `rocprofv2` to measure kernel time.


# One config running script

`one_config.py` is a script that runs one given matmul config.
It is an interface to the `tune_gemm.py` functionality and can be used for Triton debugging.

## Usage

This script supports two methods to specify configuration parameters.

Variant 1: Separate command line attributes.

```bash
python one_config.py -m 256 -n 256 -k 256 --block_m 64 --block_n 64 --block_k 64 --group_m 1 --split_k 2 --num_warps 2 --num_stages 0 --waves_per_eu 0 --matrix_instr_nonkdim 16 --kpack 2
```

Variant 2: One-line config description.
This is how configs are printed by the `tune_gemm.py` script.

```bash
python one_config.py --config_str M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16
```
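
To make the one-line format easier to read, here is a sketch that splits such a string into named fields. The mapping from prefixes to names is an assumption inferred by lining up the string with the Variant 1 flags, and `parse_config_str` is a hypothetical helper, not the parser used by `one_config.py`.

```python
import re

# Assumed mapping from config-string prefixes to Variant 1 flag names.
PREFIX_TO_FLAG = {
    "M": "m", "N": "n", "K": "k",
    "BM": "block_m", "BN": "block_n", "BK": "block_k",
    "GM": "group_m", "SK": "split_k",
    "nW": "num_warps", "nS": "num_stages",
    "EU": "waves_per_eu", "kP": "kpack",
    "mfma": "matrix_instr_nonkdim",
}


def parse_config_str(config_str):
    """Split e.g. 'M16_N8_..._mfma16' into a dict of integer fields."""
    config = {}
    for token in config_str.split("_"):
        match = re.fullmatch(r"([A-Za-z]+)(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized token: {token!r}")
        prefix, value = match.groups()
        config[PREFIX_TO_FLAG[prefix]] = int(value)
    return config


cfg = parse_config_str(
    "M16_N8_K128_BM64_BN64_BK64_GM1_SK2_nW2_nS0_EU0_kP2_mfma16")
print(cfg["block_m"], cfg["split_k"], cfg["matrix_instr_nonkdim"])  # 64 2 16
```
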
8 changes: 4 additions & 4 deletions scripts/amd/gemm/tune_gemm.py
@@ -572,7 +572,7 @@ def parse_args():
parser.add_argument("--compare_wo_tuning",
action='store_true',
default=False,
help="Whether check result correctness")
help="Whether check result correctness without tuning.")
parser.add_argument("--benchmark",
action='store_true',
default=False,
@@ -596,11 +596,11 @@ def parse_args():
parser.add_argument("--jobs",
type=int,
default=1,
help="number of generated files")
help="number of tasks during the profiling process")
parser.add_argument("--iters",
type=int,
default=1000,
help="number of generated files")
help="number of iterations used in --benchmark mode")
parser.add_argument(
"--init_type",
type=str,
@@ -626,7 +626,7 @@ def parse_args():
parser.add_argument("--no_warmup",
action='store_true',
default=False,
help="Do not call the warmup kernel")
help="Whether we want to skip the compilation stage")
args = parser.parse_args()
if not args.o:
if args.benchmark: