Skip to content

Commit

Permalink
Extract matmul_kernel_configStr kernels into a separate file
Browse files Browse the repository at this point in the history
  • Loading branch information
zhanglx13 committed Jul 20, 2024
1 parent b3b7d34 commit 3e84965
Show file tree
Hide file tree
Showing 4 changed files with 475 additions and 383 deletions.
35 changes: 29 additions & 6 deletions scripts/amd/gemm/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -162,13 +162,36 @@ so each element of the bias vector is added to all elements of the corresponding

# GEMM Tuning Script v3.3

### API changes

no API changes

### Implementation changes

- A separate file, named `generate_all_kernels.py`, is generated for compilations only.
It contains all the kernels in the tuning space and they will be compiled by 64 threads by default.
Compiling all the kernels in a single file in parallel is faster than splitting them
into multiple files. This can greatly reduce the compile time of the tuning process.
- `configStr` does not contain gemm size anymore. This allows the same matmul_{configStr} kernel
to be reused by different gemm sizes (this is not implemented yet).
- Before the start of the for loop of tuning, we generated a file named myKernels.py
(this is obtained from `get_filename_myKernels()`) which contains all the matmul_kernel
with configs in the full, un-pruned tuning space. This file will be used by this tuning
session shared by all gemm sizes.
- Inside the for loop of tuning, each iteration tunes one gemm size
1. Compilation stage:
- We generate a single compilation driver file, named compile_driver.py (this is
obtained from `get_filename_compile_driver`) which contains the wrapper functions
of all the configs in the **pruned** tuning space for this gemm size.
- All the kernels will be compiled by 32 threads by default. Compiling all the
kernels in a single file in parallel is faster than splitting them into multiple
files. This can greatly reduce the compile time of the tuning process.
- Note that we no longer generate matmul_kernel in this file. Kernels are imported
from myKernels.py.
2. Profile stage
- We generate one task file per job, named `profile_driver_MxNxK_{job_id}.py`
(this is obtained from `get_filename_profile_driver`). The only difference is
that we no longer generate matmul_kernel in this file. Kernels are imported
from myKernels.py.
- `configStr` does not contain gemm size anymore. This allows the same matmul_{configStr}
kernel to be reused by different gemm sizes.
- `configStr` does not contain `_bias` if bias is provided. This is because we do not
expect to compare the same kernel w/ and w/o bias. Therefore, we treat bias in the same
way as gemm sizes.
- Add support for `EVEN_K` in the matmul kernel. Now the kernel support `BLOCK_SIZE_K`
that cannot divide `K`.
- Now we use `rocprofv2` to measure kernel time.
Loading

0 comments on commit 3e84965

Please sign in to comment.