Update composable_kernel to rocm-6.3.1 tag #251

AngryLoki

This fixes compilation with recent versions of Clang (Clang 19 specifically).

Additionally, because the `DeviceGroupedConvFwdMultipleD_Wmma_CShuffle` API changed, the WMMA configuration was updated; the new configuration provides 15% better performance on the 7900XTX GPU (gfx1100).

Closes #250

Signed-off-by: Sv. Lockal <[email protected]>

AngryLoki commented Jan 19, 2025

Here are some additional details regarding the WMMA configuration. Because the old configuration was incompatible with the new API, I checked all the suggested configurations from https://github.com/ROCm/composable_kernel/blob/rocm-6.3.1/library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_instance.hpp#L55. While one blocksize=256 configuration performed similarly to the existing configuration, a few blocksize=128 and blocksize=64 configurations are faster. Here are the oidnBenchmark results for the fastest configurations on a 7900XTX:


        1,    256,   256,    64,    32,  8,    16,   16,       8,       1,     S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1,     S<4, 64, 1>,     S<1, 0, 2>,     S<1, 0, 2>,             2,              8,              8,         1,           1,           1,               S<1, 32, 1, 8>,               8
RT.hdr_alb_nrm.1920x1080 ... 14.1113 msec/image (host 0.0392849 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 13.5 msec/image (host 0.0377627 msec/image)
RT.hdr_calb_cnrm.1920x1080 ... 26.8473 msec/image (host 0.0468468 msec/image)
RT.ldr_calb_cnrm.1920x1080 ... 13.5992 msec/image (host 0.0384625 msec/image)
RT.hdr_alb_nrm.3840x2160 ... 56.4277 msec/image (host 0.0809162 msec/image)
RT.ldr_alb_nrm.3840x2160 ... 55.9654 msec/image (host 0.0786679 msec/image)
RT.hdr_calb_cnrm.3840x2160 ... 112.342 msec/image (host 0.0906975 msec/image)
RT.ldr_calb_cnrm.3840x2160 ... 56.2242 msec/image (host 0.074245 msec/image)
RT.hdr_alb_nrm.1280x720 ... 5.85475 msec/image (host 0.0396092 msec/image)
RT.ldr_alb_nrm.1280x720 ... 5.77316 msec/image (host 0.0378181 msec/image)
RT.hdr_calb_cnrm.1280x720 ... 11.5752 msec/image (host 0.0452481 msec/image)
RT.ldr_calb_cnrm.1280x720 ... 5.8126 msec/image (host 0.0376056 msec/image)
RTLightmap.hdr.2048x2048 ... 26.236 msec/image (host 0.0443479 msec/image)
RTLightmap.hdr.4096x4096 ... 116.579 msec/image (host 0.158215 msec/image)
RTLightmap.hdr.1024x1024 ... 6.73153 msec/image (host 0.0407303 msec/image)

        1,    128,   128,    64,    32,  8,    16,   16,       4,       2,     S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1,     S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,             2,              8,              8,         1,           1,           1,               S<1, 32, 1, 4>,               8
RT.hdr_alb_nrm.1920x1080 ... 12.964 msec/image (host 0.0390827 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 12.1571 msec/image (host 0.037766 msec/image)
RT.hdr_calb_cnrm.1920x1080 ... 24.3372 msec/image (host 0.0442185 msec/image)
RT.ldr_calb_cnrm.1920x1080 ... 12.3045 msec/image (host 0.0384218 msec/image)
RT.hdr_alb_nrm.3840x2160 ... 51.9368 msec/image (host 0.0753012 msec/image)
RT.ldr_alb_nrm.3840x2160 ... 51.5706 msec/image (host 0.0811878 msec/image)
RT.hdr_calb_cnrm.3840x2160 ... 103.407 msec/image (host 0.0916777 msec/image)
RT.ldr_calb_cnrm.3840x2160 ... 52.1188 msec/image (host 0.0750968 msec/image)
RT.hdr_alb_nrm.1280x720 ... 5.25369 msec/image (host 0.0398166 msec/image)
RT.ldr_alb_nrm.1280x720 ... 5.22691 msec/image (host 0.0375846 msec/image)
RT.hdr_calb_cnrm.1280x720 ... 10.6706 msec/image (host 0.044808 msec/image)
RT.ldr_calb_cnrm.1280x720 ... 5.28861 msec/image (host 0.0372429 msec/image)
RTLightmap.hdr.2048x2048 ... 23.9107 msec/image (host 0.0410906 msec/image)
RTLightmap.hdr.4096x4096 ... 105.887 msec/image (host 0.153403 msec/image)
RTLightmap.hdr.1024x1024 ... 5.72713 msec/image (host 0.0395792 msec/image)

        1,    128,   256,    32,    32,  8,    16,   16,       8,       1,     S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1,     S<4, 32, 1>,     S<1, 0, 2>,     S<1, 0, 2>,             2,              8,              8,         1,           1,           1,               S<1, 32, 1, 4>,               8      
RT.hdr_alb_nrm.1920x1080 ... 12.45 msec/image (host 0.0348377 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 11.8295 msec/image (host 0.0346418 msec/image)
RT.hdr_calb_cnrm.1920x1080 ... 25.9901 msec/image (host 0.0402695 msec/image)
RT.ldr_calb_cnrm.1920x1080 ... 11.9377 msec/image (host 0.0336168 msec/image)
RT.hdr_alb_nrm.3840x2160 ... 50.3133 msec/image (host 0.0708822 msec/image)
RT.ldr_alb_nrm.3840x2160 ... 50.0836 msec/image (host 0.0684241 msec/image)
RT.hdr_calb_cnrm.3840x2160 ... 110.263 msec/image (host 0.079265 msec/image)
RT.ldr_calb_cnrm.3840x2160 ... 50.6274 msec/image (host 0.0723901 msec/image)
RT.hdr_alb_nrm.1280x720 ... 5.35126 msec/image (host 0.0347365 msec/image)
RT.ldr_alb_nrm.1280x720 ... 5.36091 msec/image (host 0.032873 msec/image)
RT.hdr_calb_cnrm.1280x720 ... 11.904 msec/image (host 0.0407756 msec/image)
RT.ldr_calb_cnrm.1280x720 ... 5.35253 msec/image (host 0.0336231 msec/image)
RTLightmap.hdr.2048x2048 ... 23.2066 msec/image (host 0.0377071 msec/image)
RTLightmap.hdr.4096x4096 ... 103.029 msec/image (host 0.146602 msec/image)
RTLightmap.hdr.1024x1024 ... 5.79001 msec/image (host 0.0362349 msec/image)

        1,     64,    64,    32,    32,  8,    16,   16,       2,       2,     S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,              2,              8,              8,         1,     S<4, 16, 1>,     S<1, 0, 2>,     S<1, 0, 2>,             2,              8,              8,         1,           1,           1,               S<1, 32, 1, 2>,               8
RT.hdr_alb_nrm.1920x1080 ... 12.5248 msec/image (host 0.0394759 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 11.6173 msec/image (host 0.0374902 msec/image)
RT.hdr_calb_cnrm.1920x1080 ... 26.2325 msec/image (host 0.0461772 msec/image)
RT.ldr_calb_cnrm.1920x1080 ... 11.6801 msec/image (host 0.0382362 msec/image)
RT.hdr_alb_nrm.3840x2160 ... 49.9959 msec/image (host 0.087514 msec/image)
RT.ldr_alb_nrm.3840x2160 ... 49.7982 msec/image (host 0.0728801 msec/image)
RT.hdr_calb_cnrm.3840x2160 ... 113.079 msec/image (host 0.086185 msec/image)
RT.ldr_calb_cnrm.3840x2160 ... 50.0374 msec/image (host 0.072577 msec/image)
RT.hdr_alb_nrm.1280x720 ... 5.15437 msec/image (host 0.0343752 msec/image)
RT.ldr_alb_nrm.1280x720 ... 5.13636 msec/image (host 0.0360057 msec/image)
RT.hdr_calb_cnrm.1280x720 ... 11.6422 msec/image (host 0.0394987 msec/image)
RT.ldr_calb_cnrm.1280x720 ... 5.17078 msec/image (host 0.0332661 msec/image)
RTLightmap.hdr.2048x2048 ... 23.17 msec/image (host 0.0369801 msec/image)
RTLightmap.hdr.4096x4096 ... 102.724 msec/image (host 0.145674 msec/image)
RTLightmap.hdr.1024x1024 ... 5.66252 msec/image (host 0.03511 msec/image)

So I switched to the blocksize=64 configuration, which is 15% faster than the previously used one.
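
For anyone who wants to compare configurations from the logs themselves, here is a minimal sketch (not part of this PR) that parses oidnBenchmark output and prints per-benchmark time ratios. The helper names `parse_benchmark` and `speedups` are hypothetical, and the sample lines are copied from the blocksize=256 and blocksize=64 logs above. It assumes the blocksize=256 numbers are a reasonable stand-in for the previous configuration, which was only described as similar in performance, not measured here.

```python
import re

# Matches oidnBenchmark result lines such as:
#   RT.hdr_alb_nrm.1920x1080 ... 14.1113 msec/image (host 0.0392849 msec/image)
# Only the first (device) time is captured; the host time is ignored.
LINE_RE = re.compile(r"^(\S+) \.\.\. ([0-9.]+) msec/image")


def parse_benchmark(text):
    """Return {benchmark name: time in msec/image} for every result line."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


def speedups(baseline, candidate):
    """Return {name: baseline_time / candidate_time} for shared benchmarks.

    A ratio above 1.0 means the candidate configuration is faster.
    """
    return {
        name: baseline[name] / candidate[name]
        for name in baseline
        if name in candidate
    }


# Excerpts copied from the logs above (blocksize=256 used as a proxy baseline).
BLOCK256 = """\
RT.hdr_alb_nrm.1920x1080 ... 14.1113 msec/image (host 0.0392849 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 13.5 msec/image (host 0.0377627 msec/image)
RTLightmap.hdr.4096x4096 ... 116.579 msec/image (host 0.158215 msec/image)
"""

BLOCK64 = """\
RT.hdr_alb_nrm.1920x1080 ... 12.5248 msec/image (host 0.0394759 msec/image)
RT.ldr_alb_nrm.1920x1080 ... 11.6173 msec/image (host 0.0374902 msec/image)
RTLightmap.hdr.4096x4096 ... 102.724 msec/image (host 0.145674 msec/image)
"""

if __name__ == "__main__":
    ratios = speedups(parse_benchmark(BLOCK256), parse_benchmark(BLOCK64))
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.3f}x")
```

Pasting the full logs into the two strings gives the ratio for every benchmark, which makes it easier to see where a configuration helps and where it does not.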

Successfully merging this pull request may close these issues.

HIP build with >=rocm-6.3.0 fails (composable-kernel update request)