
[tuning] gemm tuning script v3.3 #606

Merged 12 commits into triton-mlir on Jul 23, 2024

Conversation

@zhanglx13 commented Jun 27, 2024

Please check the README for changes introduced in v3.3.

This PR makes the following changes:

  • Put all the kernels to be compiled in a single file ==> this greatly reduces compilation time. In the example of M=N=K=512, which has about 3800 configs, the compilation time drops from 50 minutes to 1.5 minutes.
  • Extract the GPU kernels of all configs into a separate file and let the compile and profile driver files import it. This way, the compile and profile stages can share the cache.
  • Allow reuse of compiled kernels across different gemm sizes. This is achieved by
    • keeping track of the tuning space across the tuning loop, i.e. at every iteration, we only add new configs from the current gemm size into the tuning space;
    • removing M_N_K from the configStr in the kernel name.
  • Refactor the script: some utility functions are now separated from tune_gemm.py.
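
The tuning-space bookkeeping described in the bullets above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the names `config_key` and `keep_new_configs` are hypothetical. The key idea is that a config's identity excludes M, N, and K, so a kernel compiled for one gemm size can be reused by the next.

```python
def config_key(cfg: dict) -> tuple:
    # A config is identified by its kernel parameters only (BLOCK sizes,
    # num_warps, etc.) -- no M/N/K, so kernels can be shared across sizes.
    return tuple(sorted(cfg.items()))

def keep_new_configs(candidate_cfgs, seen_keys: set):
    """Return only the configs whose kernels were not compiled before,
    updating the running set of already-seen config keys."""
    fresh = []
    for cfg in candidate_cfgs:
        key = config_key(cfg)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(cfg)
    return fresh

seen = set()
size1 = [{"BM": 32, "BN": 32}, {"BM": 64, "BN": 16}]
size2 = [{"BM": 64, "BN": 16}, {"BM": 128, "BN": 32}]
print(len(keep_new_configs(size1, seen)))  # 2: both configs are new
print(len(keep_new_configs(size2, seen)))  # 1: BM64_BN16 was already compiled
```

This is consistent with the observation in the example session below that the second gemm's compile stage is much cheaper than the first one's.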

Example tuning session with 2 gemm sizes:

~/AMD-triton/scripts/amd/gemm $ python tune_gemm.py --gemm_size_file gemm_config.yaml --ngpus 8 --jobs 32
Tuning 2 gemm sizes starts at: 2024-07-20 22:03:12.604555
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 60.47 time(us): 4.44 best_config: BM32_BN32_BK256_GM4_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:14:22.585976 = 0:01:24.212892 (compile) + 0:12:30.447794 (profile) + 0:00:27.836915 (post processing)
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 75.28 time(us): 3.57 best_config: BM64_BN16_BK128_GM1_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:12:35.324931 = 0:00:19.680196 (compile) + 0:11:52.336533 (profile) + 0:00:23.055614 (post processing)
Tuning ends at: 2024-07-20 22:30:11.100077
Total tuning time (h:m:s): 0:26:58.495522

The elapsed time of the kernel is very small, so hardware noise plays a larger role here. This example is meant to demonstrate the compilation time of the tuning process.
One thing to note is that the second gemm's compilation time is much smaller than the first one's, indicating cache reuse between the two gemms.

cc @xiaohuguo2023 You can try this one on your large-sample stream-K tuning to see if it helps.

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 4 times, most recently from 678c36c to f9a7115 Compare June 28, 2024 02:26
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 33436e8 to f1601ef Compare July 17, 2024 19:39
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 2 times, most recently from 2e4ee4d to 3e84965 Compare July 20, 2024 23:53
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 3e84965 to 6138ae6 Compare July 21, 2024 00:00
~/.local/bin/yapf -i --style='{based_on_style: pep8}' tune_gemm.py
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 6138ae6 to cd736ce Compare July 21, 2024 03:53
@zhanglx13 zhanglx13 marked this pull request as ready for review July 21, 2024 04:07
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 7 times, most recently from 591550a to 60760b6 Compare July 22, 2024 00:27
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 60760b6 to 6ee25cc Compare July 22, 2024 00:28
@xiaohuguo2023 (Member) commented Jul 22, 2024

There seems to be an issue when dealing with M=1:

root@smc300x-ccs-aus-GPUF292:/home/work/triton/scripts/amd/gemm# python tune_gemm.py  --gemm_size_file memory_bound_sizes.yaml --ngpus 6 --jobs 24
Tuning 5 gemm sizes starts at: 2024-07-22 12:49:32.056419
SIZE: 1 8192 28672 TN nConfigs: 880 Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28215, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28212, in main
    compile_kernels(1, 8192, 28672, rotating_buffer_size, 1, numThreads)
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 26420, in compile_kernels
    stride_bias = tensors['bias'][0].stride(0) if bias_size > 0 else 0
IndexError: Dimension specified as 0 but tensor has no dimensions
Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 921, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 825, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 233, in tune_gemm_config
    run_bash_command(f"python {fname} -n {num_threads}",
  File "/home/work/triton/scripts/amd/gemm/utils/utils.py", line 45, in run_bash_command
    proc = subprocess.run(commandstring,
  File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python /home/work/triton/scripts/amd/gemm/utils/../compile_driver.py -n 32' returned non-zero exit status 1.

@zhanglx13 (Author):

> There seems to be an issue when dealing with M=1:
> ...
> IndexError: Dimension specified as 0 but tensor has no dimensions

This should be fixed with 1daec1f
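
The failing line guards only on `bias_size`, but for M=1 the bias tensor ends up 0-dimensional, so `.stride(0)` raises. A defensive fix along these lines (an illustrative sketch, not the actual content of commit 1daec1f; `bias_stride` and `FakeTensor` are hypothetical names) also guards on dimensionality:

```python
def bias_stride(bias_tensor, bias_size: int) -> int:
    # Guard on both the size and the dimensionality: a 0-dim tensor
    # (the M=1 case) has no dimension 0 to take a stride of.
    if bias_size > 0 and bias_tensor.dim() > 0:
        return bias_tensor.stride(0)
    return 0

class FakeTensor:
    # Minimal stand-in for a torch tensor, for illustration only.
    def __init__(self, ndim):
        self.ndim = ndim
    def dim(self):
        return self.ndim
    def stride(self, i):
        return 1

print(bias_stride(FakeTensor(0), 1))  # 0: 0-dim bias no longer crashes
print(bias_stride(FakeTensor(1), 8))  # 1: normal 1-dim bias unchanged
```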

```
@@ -41,8 +42,12 @@ def matmul_kernel(
    acc_dtype = tl.float32 if a_ptr.type.element_ty != tl.int8 else tl.int32
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=acc_dtype)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K * SPLIT_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        if EVEN_K:
```

For lines 35 and 36, maybe we can apply modulo by M and N to the offsets added to a_ptr and b_ptr, to support arbitrary M and N values.

@zhanglx13 (Author) replied:
We can, but that will harm performance a lot.
Let me create a ticket for this issue.
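
The modulo trick under discussion can be illustrated outside Triton with NumPy (an assumed illustration, not the kernel's code; `load_block_wrapped` is a hypothetical name): wrapping offsets keeps out-of-range loads in bounds, at the cost of extra integer arithmetic on every pointer computation, which is the performance concern raised above.

```python
import numpy as np

def load_block_wrapped(a: np.ndarray, block_m: int, start: int) -> np.ndarray:
    """Load a BLOCK_SIZE_M-sized block of rows, wrapping offsets modulo M.

    Mirrors the proposed pattern
        offs_am = (start + tl.arange(0, BLOCK_M)) % M
    so out-of-range offsets wrap around instead of reading out of bounds.
    The extra modulo on every offset is the cost mentioned in the reply.
    """
    m = a.shape[0]
    offs = (start + np.arange(block_m)) % m
    return a[offs]

a = np.arange(5)  # M = 5, block of 4 rows starting at row 3
print(load_block_wrapped(a, block_m=4, start=3).tolist())  # [3, 4, 0, 1]
```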

@scxiao commented Jul 22, 2024

Does anything need to change in the script one_config.py?

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from c26e317 to a61967b Compare July 22, 2024 22:04
@zhanglx13 (Author):

> Does anything need to change in the script one_config.py?

No, since we don't change any of the script's APIs.

@zhanglx13 (Author):

@xiaohuguo2023 @vgokhale @scxiao Re: --gpu_ids does not work.

Compilation stage:

During compilation, each thread queries GPU info, such as torch.version.hip and utils.get_device_properties(), to start the compilation flow. Such queries call HIP runtime functions, and since so many threads run in parallel, we see all the GPUs busy at the beginning of the compilation stage.

I tried setting ROCR_VISIBLE_DEVICES=0 to force everyone to use GPU0, but it does not work: all GPUs are still busy.
This is not ideal anyway, since compilation should not need any runtime functions.

Therefore, I introduced a very hacky option, --hack_triton_compiler, which modifies the triton front-end source code to provide a static backend, so that the compilation flow can start without calling any runtime function.

Profiling stage:

This is very tricky. --gpu_ids actually works, but in a surprising way: the mapping from ROCR_VISIBLE_DEVICES to the GPU id reported by rocm-smi is not the identity function, but the following:

| ROCR_VISIBLE_DEVICES | GPU id |
|----------------------|--------|
| 0                    | 3      |
| 1                    | 2      |
| 2                    | 0      |
| 3                    | 1      |
| 4                    | 7      |
| 5                    | 6      |
| 6                    | 4      |
| 7                    | 5      |

This could be a setting in my own docker container, so could you confirm whether this is also the case in your environment?
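
For reference, the observed mapping can be captured in a small helper (a hypothetical sketch; the mapping is copied from the table above and is environment-specific, so `ROCR_TO_SMI` may differ on other machines):

```python
# Observed mapping between ROCR_VISIBLE_DEVICES indices and rocm-smi GPU ids.
# Environment-specific: taken from one docker container, not universal.
ROCR_TO_SMI = {0: 3, 1: 2, 2: 0, 3: 1, 4: 7, 5: 6, 6: 4, 7: 5}

def smi_id(rocr_index: int) -> int:
    """Translate a ROCR_VISIBLE_DEVICES index to the rocm-smi GPU id."""
    return ROCR_TO_SMI[rocr_index]

def rocr_index(smi_gpu_id: int) -> int:
    """Inverse mapping: which ROCR index drives a given rocm-smi GPU id."""
    return {v: k for k, v in ROCR_TO_SMI.items()}[smi_gpu_id]

print(smi_id(0))      # 3: ROCR device 0 shows up as GPU 3 in rocm-smi
print(rocr_index(3))  # 0
```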

Another thing regarding the profiling stage: I found that invoking rocprof/rocprofv2 makes all GPUs busy for a very short period before the kernel starts executing. I suspect this is because rocprof/rocprofv2 queries all GPU information in the system. I'm not sure we can avoid this, but the GPU busy time is definitely insignificant.

@xiaohuguo2023 (Member):

Yeah, I have a similar observation. This is my setting:

export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6

and my rocm-smi output:

xiaohugu@smc300x-ccs-aus-GPUF292:~/openai/triton_bench$ rocm-smi


=================================================== ROCm System Management Interface ===================================================
============================================================= Concise Info =============================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK     MCLK     Fan  Perf              PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
========================================================================================================================================
0       4     0x74a1,   8554   41.0°C      124.0W    NPS1, SPX, 0        249Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
1       5     0x74a1,   19011  40.0°C      117.0W    NPS1, SPX, 0        151Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
2       3     0x74a1,   30036  41.0°C      130.0W    NPS1, SPX, 0        233Mhz   900Mhz   0%   perf_determinism  750.0W  1%     3%
3       2     0x74a1,   23964  40.0°C      294.0W    NPS1, SPX, 0        1402Mhz  1300Mhz  0%   perf_determinism  750.0W  1%     26%
4       8     0x74a1,   1197   40.0°C      114.0W    NPS1, SPX, 0        158Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
5       9     0x74a1,   41351  39.0°C      114.0W    NPS1, SPX, 0        146Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
6       7     0x74a1,   26775  41.0°C      200.0W    NPS1, SPX, 0        430Mhz   1300Mhz  0%   perf_determinism  750.0W  1%     17%
7       6     0x74a1,   45536  38.0°C      117.0W    NPS1, SPX, 0        172Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
========================================================================================================================================
========================================================= End of ROCm SMI Log ==========================================================

@zhanglx13 (Author):

@xiaohuguo2023 Thanks for the confirmation. This is weird; I'll file a ticket for this issue.
If nothing else, can we do a final round of review and merge this PR?

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 4c94471 to 68af100 Compare July 23, 2024 15:01
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 68af100 to e979474 Compare July 23, 2024 15:29
@vgokhale vgokhale self-requested a review July 23, 2024 17:42
@vgokhale (Collaborator) left a comment:

LGTM. Thanks for addressing the comments.

@xiaohuguo2023 (Member) left a comment:

LGTM, excellent work

@xiaohuguo2023 xiaohuguo2023 merged commit cf44637 into triton-mlir Jul 23, 2024
2 of 3 checks passed
brunomazzottiamd added a commit that referenced this pull request Aug 13, 2024
* Copy *tune_gemm* from `triton-mlir` branch to `main_perf` branch

The source commit in `triton-mlir` branch is the following one:
```
commit cf44637
Author: Lixun Zhang <[email protected]>
Date:   Tue Jul 23 14:22:01 2024 -0500

    [tuning] gemm tuning script v3.3 (#606)
```

*tune_gemm* was copied from the source branch directory `scripts/amd/gemm`
to the destination branch directory `python/perf-kernels/tune_gemm`.

The SHA-256 hashes of *tune_gemm* files are the following ones:
```
423aef1deb6c60f6578a1ecfc94d2473f8746b00d0368c553d31641fcfa5e354  README.md
46ab93978fee33f75df23332f12546dae7910478c391f08b7b1ebd415d8266b7  icache_flush.py
f18711544641b810a652e6a6629bfa2b613f6ade87399e88fdf05b81d4af58a4  matmul.py
84a1c80ede36d3154e51188276eda2d2d0f52ed4f496ff69349c390d83b8ec10  matmul_kernel.py
2812b40183637bc8d7e47d283c7d66b1792134a43de76f3eacf7b9b3e1c2431a  one_config.py
0ac09c33b0173cea06ddabbf9f4e3afa1816781dea4fdcce5894a7e7d6a80e19  rocprof_gemm.py
00eff41cf1c0bfc41d623e42b51706af67639fec76146741e2067d2a93e0148a  utils/file_generator.py
cb7afb773ccee835b00396cccf87e0d44fe513131161f031fae42453725b3c82  utils/utils.py
59f23811b660e49e566927853926a21f02a7014bb19c8ea67e6b382db6c59900  tune_gemm.py
e787f35d750b869f113b3c01692f64243a9cb8a71a18ade2f0465f614f7284e4  tune_gemm.sh
```

The files were kept as-is despite `pre-commit` intentions to change them.

After that, references to the *tune_gemm* directory in code and documentation were fixed to reflect
its new location.
micmelesse pushed a commit that referenced this pull request Oct 28, 2024