Update composable_kernel to rocm-6.3.1 tag #251
base: master
Conversation
This fixes compilation with recent versions of Clang (Clang 19 specifically). Additionally, since the `DeviceGroupedConvFwdMultipleD_Wmma_CShuffle` API has changed, a new WMMA configuration was needed; the new configuration provides 15% better performance on a 7900 XTX GPU (gfx1100). Closes RenderKit#250

Signed-off-by: Sv. Lockal <[email protected]>
Here are some additional details regarding the WMMA configuration. As the old configuration was incompatible with the new API, I checked all suggested configurations from https://github.com/ROCm/composable_kernel/blob/rocm-6.3.1/library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_instance.hpp#L55. While one blocksize=256 configuration was similar to the existing one in terms of performance, a few blocksize=128 and blocksize=64 configurations were faster. Here are the oidnBenchmark results for the fastest configuration on the 7900 XTX:

Details
So I switched to the blocksize=64 configuration, which is 15% faster than the previously used one.
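For context on what "switching configurations" means here: in composable_kernel, block size and tile shape are template parameters of the device-op instance, so moving from a blocksize=256 to a blocksize=64 instance is a compile-time change, not a runtime setting. The sketch below is a minimal, self-contained illustration of that pattern; the struct name, parameter set, and tile values are hypothetical placeholders, not the actual `DeviceGroupedConvFwdMultipleD_Wmma_CShuffle` parameter list.

```cpp
#include <cstdio>

// Hypothetical tile configuration: in composable_kernel, values like these
// are template arguments of the device-op instance, so changing them selects
// a different kernel at compile time.
template <int BlockSize, int MPerBlock, int NPerBlock, int KPerBlock>
struct TileConfig
{
    static constexpr int block_size  = BlockSize;
    static constexpr int m_per_block = MPerBlock;
    static constexpr int n_per_block = NPerBlock;
    static constexpr int k_per_block = KPerBlock;
};

// An "old-style" large-block config and a "new" small-block config, mirroring
// the blocksize=256 -> blocksize=64 switch in this PR. The M/N/K tile sizes
// are made-up placeholders, not CK's real instance parameters.
using OldConfig = TileConfig<256, 128, 128, 32>;
using NewConfig = TileConfig<64, 32, 64, 32>;

template <class Config>
void describe(const char* name)
{
    std::printf("%s: BlockSize=%d, tile %dx%dx%d\n",
                name,
                Config::block_size,
                Config::m_per_block,
                Config::n_per_block,
                Config::k_per_block);
}

int main()
{
    describe<OldConfig>("old (blocksize=256)");
    describe<NewConfig>("new (blocksize=64)");
    return 0;
}
```

Because each configuration is a distinct type, benchmarking candidates (as done with oidnBenchmark above) means building and timing each instance separately and then committing the fastest one in the source.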