Update composable_kernel to rocm-6.3.1 tag #251
base: master
Conversation
This fixes compilation with recent versions of Clang (Clang 19 specifically). Additionally, since the `DeviceGroupedConvFwdMultipleD_Wmma_CShuffle` API has changed, a new WMMA configuration was needed; the new configuration provides 15% better performance on a 7900 XTX GPU (gfx1100). Closes RenderKit#250

Signed-off-by: Sv. Lockal <[email protected]>
Here are some additional details regarding the WMMA configuration. As the old configuration was incompatible with the new API, I checked all suggested configurations from https://github.com/ROCm/composable_kernel/blob/rocm-6.3.1/library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_instance.hpp#L55. While one blocksize=256 configuration was similar to the existing one in terms of performance, a few blocksize=128 and blocksize=64 configurations were faster. Here are the oidnBenchmark results for the fastest configuration on the 7900 XTX:

Details
So I switched to the blocksize=64 configuration, which is 15% faster than the previously used one.
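For context on what "switching configurations" means here: in composable_kernel, block size and tile shape are template parameters of the device-op instance, so moving from a blocksize=256 to a blocksize=64 instance is a compile-time change, not a runtime setting. The sketch below is a minimal, self-contained illustration of that pattern; the struct name, parameter set, and tile values are hypothetical placeholders, not the actual `DeviceGroupedConvFwdMultipleD_Wmma_CShuffle` parameter list.

```cpp
#include <cstdio>

// Hypothetical tile configuration: in composable_kernel, values like these
// are template arguments of the device-op instance, so changing them selects
// a different kernel at compile time.
template <int BlockSize, int MPerBlock, int NPerBlock, int KPerBlock>
struct TileConfig
{
    static constexpr int block_size  = BlockSize;
    static constexpr int m_per_block = MPerBlock;
    static constexpr int n_per_block = NPerBlock;
    static constexpr int k_per_block = KPerBlock;
};

// An "old-style" large-block config and a "new" small-block config, mirroring
// the blocksize=256 -> blocksize=64 switch in this PR. The M/N/K tile sizes
// are made-up placeholders, not CK's real instance parameters.
using OldConfig = TileConfig<256, 128, 128, 32>;
using NewConfig = TileConfig<64, 32, 64, 32>;

template <class Config>
void describe(const char* name)
{
    std::printf("%s: BlockSize=%d, tile %dx%dx%d\n",
                name,
                Config::block_size,
                Config::m_per_block,
                Config::n_per_block,
                Config::k_per_block);
}

int main()
{
    describe<OldConfig>("old (blocksize=256)");
    describe<NewConfig>("new (blocksize=64)");
    return 0;
}
```

Because each configuration is a distinct type, benchmarking candidates (as done with oidnBenchmark above) means building and timing each instance separately and then committing the fastest one in the source.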