Extract matmul_kernel_configStr kernels into a separate file

ROCm · Jul 20, 2024 · 3e84965 · 3e84965
1 parent b3b7d34
commit 3e84965
Show file tree

Hide file tree

Showing 4 changed files with 475 additions and 383 deletions.
diff --git a/scripts/amd/gemm/README.md b/scripts/amd/gemm/README.md
@@ -162,13 +162,36 @@ so each element of the bias vector is added to all elements of the corresponding
 
 # GEMM Tuning Script v3.3
 
+### API changes
+
+no API changes
+
 ### Implementation changes
 
-- A separate file, named `generate_all_kernels.py`, is generated for compilations only.
-It contains all the kernels in the tuning space and they will be compiled by 64 threads by default.
-Compiling all the kernels in a single file in parallel is faster than splitting them 
-into multiple files. This can greatly reduce the compile time of the tuning process.
-- `configStr` does not contain gemm size anymore. This allows the same matmul_{configStr} kernel
-to be reused by different gemm sizes (this is not implemented yet).
+- Before the start of the for loop of tuning, we generated a file named myKernels.py
+(this is obtained from `get_filename_myKernels()`) which contains all the matmul_kernel
+with configs in the full, un-pruned tuning space. This file will be used by this tuning
+session shared by all gemm sizes.
+- Inside the for loop of tuning, each iteration tunes one gemm size
+  1. Compilation stage:
+    - We generate a single compilation driver file, named compile_driver.py (this is
+    obtained from `get_filename_compile_driver`) which contains the wrapper functions
+    of all the configs in the **pruned** tuning space for this gemm size.
+    - All the kernels will be compiled by 32 threads by default. Compiling all the
+    kernels in a single file in parallel is faster than splitting them into multiple
+    files. This can greatly reduce the compile time of the tuning process.
+    - Note that we no longer generate matmul_kernel in this file. Kernels are imported
+    from myKernels.py.
+  2. Profile stage
+     - We generate one task file per job, named `profile_driver_MxNxK_{job_id}.py`
+     (this is obtained from `get_filename_profile_driver`). The only difference is
+     that we no longer generate matmul_kernel in this file. Kernels are imported
+     from myKernels.py.
+- `configStr` does not contain gemm size anymore. This allows the same matmul_{configStr}
+kernel to be reused by different gemm sizes.
+- `configStr` does not contain `_bias` if bias is provided. This is because we do not
+expect to compare the same kernel w/ and w/o bias. Therefore, we treat bias in the same
+way as gemm sizes.
 - Add support for `EVEN_K` in the matmul kernel. Now the kernel support `BLOCK_SIZE_K`
 that cannot divide `K`.
+- Now we use `rocprofv2` to measure kernel time.