
[tuning] gemm tuning script v3.3 #606

Merged 12 commits into triton-mlir on Jul 23, 2024

Conversation

@zhanglx13 commented Jun 27, 2024

Please check the README for changes introduced in v3.3.

This PR makes the following changes:

  • Put all the kernels to be compiled in a single file ==> this greatly reduces compilation time. In the example of M=N=K=512, which has about 3800 configs, the compilation time drops from 50 minutes to 1.5 minutes.
  • Extract the GPU kernels of all configs into a separate file and let the compile and profile driver files import it. This way, the compile and profile stages can share the cache.
  • Allow reuse of compiled kernels across different gemm sizes. This is achieved by
    • keeping track of the tuning space across the tuning loop, i.e. at every iteration, we only add new configs from the current gemm size into the tuning space;
    • removing M_N_K from the configStr in the kernel name.
  • Refactor the script: some utility functions are now separated from tune_gemm.py.
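
The tuning-space bookkeeping described in the bullets above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the names `config_key` and `keep_new_configs` are hypothetical. The key idea is that a config's identity excludes M, N, and K, so a kernel compiled for one gemm size can be reused by the next.

```python
def config_key(cfg: dict) -> tuple:
    # A config is identified by its kernel parameters only (BLOCK sizes,
    # num_warps, etc.) -- no M/N/K, so kernels can be shared across sizes.
    return tuple(sorted(cfg.items()))

def keep_new_configs(candidate_cfgs, seen_keys: set):
    """Return only the configs whose kernels were not compiled before,
    updating the running set of already-seen config keys."""
    fresh = []
    for cfg in candidate_cfgs:
        key = config_key(cfg)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(cfg)
    return fresh

seen = set()
size1 = [{"BM": 32, "BN": 32}, {"BM": 64, "BN": 16}]
size2 = [{"BM": 64, "BN": 16}, {"BM": 128, "BN": 32}]
print(len(keep_new_configs(size1, seen)))  # 2: both configs are new
print(len(keep_new_configs(size2, seen)))  # 1: BM64_BN16 was already compiled
```

This is consistent with the observation in the example session below that the second gemm's compile stage is much cheaper than the first one's.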

Example tuning session with 2 gemm sizes:

~/AMD-triton/scripts/amd/gemm $ python tune_gemm.py --gemm_size_file gemm_config.yaml --ngpus 8 --jobs 32
Tuning 2 gemm sizes starts at: 2024-07-20 22:03:12.604555
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 60.47 time(us): 4.44 best_config: BM32_BN32_BK256_GM4_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:14:22.585976 = 0:01:24.212892 (compile) + 0:12:30.447794 (profile) + 0:00:27.836915 (post processing)
SIZE: 512 512 512 TN nConfigs: 3824 TFLOPS: 75.28 time(us): 3.57 best_config: BM64_BN16_BK128_GM1_SK1_nW4_nS0_EU0_kP2_mfma16
>>> Elapsed time: 0:12:35.324931 = 0:00:19.680196 (compile) + 0:11:52.336533 (profile) + 0:00:23.055614 (post processing)
Tuning ends at: 2024-07-20 22:30:11.100077
Total tuning time (h:m:s): 0:26:58.495522

The elapsed time of the kernel is very small, so hardware noise plays a larger role here. This example is meant to demonstrate the compilation time of the tuning process.
One thing to note is that the second gemm's compilation time is much smaller than the first one's, indicating cache reuse between the two gemms.

cc @xiaohuguo2023 You can try this one on your large-sample stream-K tuning to see if it helps.

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 4 times, most recently from 678c36c to f9a7115 Compare June 28, 2024 02:26
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 33436e8 to f1601ef Compare July 17, 2024 19:39
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 2 times, most recently from 2e4ee4d to 3e84965 Compare July 20, 2024 23:53
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 3e84965 to 6138ae6 Compare July 21, 2024 00:00
~/.local/bin/yapf -i --style='{based_on_style: pep8}' tune_gemm.py
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 6138ae6 to cd736ce Compare July 21, 2024 03:53
@zhanglx13 zhanglx13 marked this pull request as ready for review July 21, 2024 04:07
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch 7 times, most recently from 591550a to 60760b6 Compare July 22, 2024 00:27
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 60760b6 to 6ee25cc Compare July 22, 2024 00:28
@xiaohuguo2023 (Member) commented Jul 22, 2024

There seems to be an issue when dealing with M=1:

root@smc300x-ccs-aus-GPUF292:/home/work/triton/scripts/amd/gemm# python tune_gemm.py  --gemm_size_file memory_bound_sizes.yaml --ngpus 6 --jobs 24
Tuning 5 gemm sizes starts at: 2024-07-22 12:49:32.056419
SIZE: 1 8192 28672 TN nConfigs: 880 Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28215, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 28212, in main
    compile_kernels(1, 8192, 28672, rotating_buffer_size, 1, numThreads)
  File "/home/work/triton/scripts/amd/gemm/utils/../compile_driver.py", line 26420, in compile_kernels
    stride_bias = tensors['bias'][0].stride(0) if bias_size > 0 else 0
IndexError: Dimension specified as 0 but tensor has no dimensions
Traceback (most recent call last):
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 921, in <module>
    sys.exit(main())
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 825, in main
    minTime, bestConfig, compile_time, profile_time, post_time = tune_gemm_config(
  File "/home/work/triton/scripts/amd/gemm/tune_gemm.py", line 233, in tune_gemm_config
    run_bash_command(f"python {fname} -n {num_threads}",
  File "/home/work/triton/scripts/amd/gemm/utils/utils.py", line 45, in run_bash_command
    proc = subprocess.run(commandstring,
  File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python /home/work/triton/scripts/amd/gemm/utils/../compile_driver.py -n 32' returned non-zero exit status 1.

@zhanglx13 (Author):

> There seems to be an issue when dealing with M=1:
> ...
> IndexError: Dimension specified as 0 but tensor has no dimensions

This should be fixed with 1daec1f
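
The failing line guards only on `bias_size`, but for M=1 the bias tensor ends up 0-dimensional, so `.stride(0)` raises. A defensive fix along these lines (an illustrative sketch, not the actual content of commit 1daec1f; `bias_stride` and `FakeTensor` are hypothetical names) also guards on dimensionality:

```python
def bias_stride(bias_tensor, bias_size: int) -> int:
    # Guard on both the size and the dimensionality: a 0-dim tensor
    # (the M=1 case) has no dimension 0 to take a stride of.
    if bias_size > 0 and bias_tensor.dim() > 0:
        return bias_tensor.stride(0)
    return 0

class FakeTensor:
    # Minimal stand-in for a torch tensor, for illustration only.
    def __init__(self, ndim):
        self.ndim = ndim
    def dim(self):
        return self.ndim
    def stride(self, i):
        return 1

print(bias_stride(FakeTensor(0), 1))  # 0: 0-dim bias no longer crashes
print(bias_stride(FakeTensor(1), 8))  # 1: normal 1-dim bias unchanged
```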

```
@@ -41,8 +42,12 @@ def matmul_kernel(
    acc_dtype = tl.float32 if a_ptr.type.element_ty != tl.int8 else tl.int32
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=acc_dtype)
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K * SPLIT_K)):
        a = tl.load(a_ptrs)
        b = tl.load(b_ptrs)
        if EVEN_K:
```

For lines 35 and 36, maybe we can apply modulo by M and N to the offsets added to a_ptr and b_ptr, to support arbitrary M and N values.

@zhanglx13 (Author) replied:
We can, but that will harm performance a lot.
Let me create a ticket for this issue.
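
The modulo trick under discussion can be illustrated outside Triton with NumPy (an assumed illustration, not the kernel's code; `load_block_wrapped` is a hypothetical name): wrapping offsets keeps out-of-range loads in bounds, at the cost of extra integer arithmetic on every pointer computation, which is the performance concern raised above.

```python
import numpy as np

def load_block_wrapped(a: np.ndarray, block_m: int, start: int) -> np.ndarray:
    """Load a BLOCK_SIZE_M-sized block of rows, wrapping offsets modulo M.

    Mirrors the proposed pattern
        offs_am = (start + tl.arange(0, BLOCK_M)) % M
    so out-of-range offsets wrap around instead of reading out of bounds.
    The extra modulo on every offset is the cost mentioned in the reply.
    """
    m = a.shape[0]
    offs = (start + np.arange(block_m)) % m
    return a[offs]

a = np.arange(5)  # M = 5, block of 4 rows starting at row 3
print(load_block_wrapped(a, block_m=4, start=3).tolist())  # [3, 4, 0, 1]
```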

@scxiao commented Jul 22, 2024

Does anything need to change in the script one_config.py?

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from c26e317 to a61967b Compare July 22, 2024 22:04
@zhanglx13 (Author):

> Does anything need to change in the script one_config.py?

No, since we don't change any of the script's APIs.

@zhanglx13 (Author):

@xiaohuguo2023 @vgokhale @scxiao Re: --gpu_ids does not work.

Compilation stage:

During compilation, each thread queries GPU info, such as torch.version.hip and utils.get_device_properties(), to start the compilation flow. Such queries call HIP runtime functions, and since so many threads run in parallel, we see all the GPUs busy at the beginning of the compilation stage.

I tried setting ROCR_VISIBLE_DEVICES=0 to force everyone to use GPU0, but it does not work: all GPUs are still busy.
This is not ideal anyway, since compilation should not need any runtime functions.

Therefore, I introduced a very hacky option, --hack_triton_compiler, which modifies the triton front-end source code to provide a static backend, so that the compilation flow can start without calling any runtime function.

Profiling stage:

This is very tricky. --gpu_ids actually works, but in a surprising way: the mapping from ROCR_VISIBLE_DEVICES to the GPU id reported by rocm-smi is not the identity function, but the following:

| ROCR_VISIBLE_DEVICES | GPU id |
|----------------------|--------|
| 0                    | 3      |
| 1                    | 2      |
| 2                    | 0      |
| 3                    | 1      |
| 4                    | 7      |
| 5                    | 6      |
| 6                    | 4      |
| 7                    | 5      |

This could be a setting in my own docker container, so could you confirm whether this is also the case in your environment?
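
For reference, the observed mapping can be captured in a small helper (a hypothetical sketch; the mapping is copied from the table above and is environment-specific, so `ROCR_TO_SMI` may differ on other machines):

```python
# Observed mapping between ROCR_VISIBLE_DEVICES indices and rocm-smi GPU ids.
# Environment-specific: taken from one docker container, not universal.
ROCR_TO_SMI = {0: 3, 1: 2, 2: 0, 3: 1, 4: 7, 5: 6, 6: 4, 7: 5}

def smi_id(rocr_index: int) -> int:
    """Translate a ROCR_VISIBLE_DEVICES index to the rocm-smi GPU id."""
    return ROCR_TO_SMI[rocr_index]

def rocr_index(smi_gpu_id: int) -> int:
    """Inverse mapping: which ROCR index drives a given rocm-smi GPU id."""
    return {v: k for k, v in ROCR_TO_SMI.items()}[smi_gpu_id]

print(smi_id(0))      # 3: ROCR device 0 shows up as GPU 3 in rocm-smi
print(rocr_index(3))  # 0
```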

Another thing regarding the profiling stage: I found that invoking rocprof/rocprofv2 makes all GPUs busy for a very short period before the kernel starts executing. I suspect this is because rocprof/rocprofv2 queries all GPU information in the system. I'm not sure we can avoid this, but the GPU busy time is definitely insignificant.

@xiaohuguo2023 (Member):

Yeah, I have a similar observation. This is my setting:

export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6

and my rocm-smi output:

xiaohugu@smc300x-ccs-aus-GPUF292:~/openai/triton_bench$ rocm-smi


=================================================== ROCm System Management Interface ===================================================
============================================================= Concise Info =============================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK     MCLK     Fan  Perf              PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)
========================================================================================================================================
0       4     0x74a1,   8554   41.0°C      124.0W    NPS1, SPX, 0        249Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
1       5     0x74a1,   19011  40.0°C      117.0W    NPS1, SPX, 0        151Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
2       3     0x74a1,   30036  41.0°C      130.0W    NPS1, SPX, 0        233Mhz   900Mhz   0%   perf_determinism  750.0W  1%     3%
3       2     0x74a1,   23964  40.0°C      294.0W    NPS1, SPX, 0        1402Mhz  1300Mhz  0%   perf_determinism  750.0W  1%     26%
4       8     0x74a1,   1197   40.0°C      114.0W    NPS1, SPX, 0        158Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
5       9     0x74a1,   41351  39.0°C      114.0W    NPS1, SPX, 0        146Mhz   900Mhz   0%   perf_determinism  750.0W  0%     0%
6       7     0x74a1,   26775  41.0°C      200.0W    NPS1, SPX, 0        430Mhz   1300Mhz  0%   perf_determinism  750.0W  1%     17%
7       6     0x74a1,   45536  38.0°C      117.0W    NPS1, SPX, 0        172Mhz   900Mhz   0%   perf_determinism  750.0W  1%     1%
========================================================================================================================================
========================================================= End of ROCm SMI Log ==========================================================

@zhanglx13 (Author):

@xiaohuguo2023 Thanks for the confirmation. This is weird; I'll file a ticket for this issue.
If nothing else, can we do a final round of review and merge this PR?

@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 4c94471 to 68af100 Compare July 23, 2024 15:01
@zhanglx13 zhanglx13 force-pushed the pipeline_tune_gemm branch from 68af100 to e979474 Compare July 23, 2024 15:29
@vgokhale vgokhale self-requested a review July 23, 2024 17:42
@vgokhale (Collaborator) left a comment:

LGTM. Thanks for addressing the comments.

@xiaohuguo2023 (Member) left a comment:

LGTM, excellent work

@xiaohuguo2023 xiaohuguo2023 merged commit cf44637 into triton-mlir Jul 23, 2024
2 of 3 checks passed
brunomazzottiamd added a commit that referenced this pull request Aug 13, 2024
* Copy *tune_gemm* from `triton-mlir` branch to `main_perf` branch

The source commit in `triton-mlir` branch is the following one:
```
commit cf44637
Author: Lixun Zhang <[email protected]>
Date:   Tue Jul 23 14:22:01 2024 -0500

    [tuning] gemm tuning script v3.3 (#606)
```

*tune_gemm* was copied from the source branch directory `scripts/amd/gemm`
to the destination branch directory `python/perf-kernels/tune_gemm`.

The SHA-256 hashes of *tune_gemm* files are the following ones:
```
423aef1deb6c60f6578a1ecfc94d2473f8746b00d0368c553d31641fcfa5e354  README.md
46ab93978fee33f75df23332f12546dae7910478c391f08b7b1ebd415d8266b7  icache_flush.py
f18711544641b810a652e6a6629bfa2b613f6ade87399e88fdf05b81d4af58a4  matmul.py
84a1c80ede36d3154e51188276eda2d2d0f52ed4f496ff69349c390d83b8ec10  matmul_kernel.py
2812b40183637bc8d7e47d283c7d66b1792134a43de76f3eacf7b9b3e1c2431a  one_config.py
0ac09c33b0173cea06ddabbf9f4e3afa1816781dea4fdcce5894a7e7d6a80e19  rocprof_gemm.py
00eff41cf1c0bfc41d623e42b51706af67639fec76146741e2067d2a93e0148a  utils/file_generator.py
cb7afb773ccee835b00396cccf87e0d44fe513131161f031fae42453725b3c82  utils/utils.py
59f23811b660e49e566927853926a21f02a7014bb19c8ea67e6b382db6c59900  tune_gemm.py
e787f35d750b869f113b3c01692f64243a9cb8a71a18ade2f0465f614f7284e4  tune_gemm.sh
```

The files were kept as-is despite `pre-commit` intentions to change them.

After that, references to the *tune_gemm* directory in code and documentation were fixed to reflect
its new location.
micmelesse pushed a commit that referenced this pull request Oct 28, 2024