[BUG] Potential CUDA compile time issue with loop unrolling or pow #1246

Closed
cjnolet opened this issue Feb 5, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@cjnolet
Member

cjnolet commented Feb 5, 2023

While working up to the 23.02 release, we started seeing compile times for a single source file increase to over an hour (per CUDA architecture), and @ahendriksen managed to isolate the issue to a CUDA loop unroll that was added to the pairwise distance accumulation computation.

So far we have observed that it doesn't seem to happen on CUDA 11.6 or CUDA 11.5 for the sm_70 architecture, but it does happen on 11.8 and 12.0 for sm_80 and above.

It's unclear to me whether this was related only to the CUDA loop unrolling or whether it was because the loop also contained a pow. We did notice that it was only happening for a specific distance and indexing type (Minkowski with uint32_t).
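
For anyone who hasn't looked at that code path, the pattern under suspicion looks roughly like the sketch below. This is not the actual RAFT code; the names and tile parameters are made up purely for illustration. It just shows the combination being described: a fully unrolled accumulation loop whose body calls powf.

// Hypothetical sketch of an unrolled pairwise-distance accumulation step.
// AccRowsPerTh, AccColsPerTh, and the argument shapes are illustrative only.
template <typename DataT, typename AccT, int AccRowsPerTh, int AccColsPerTh>
__device__ void accumulate_lp_unexp(AccT (&acc)[AccRowsPerTh][AccColsPerTh],
                                    const DataT (&x)[AccRowsPerTh],
                                    const DataT (&y)[AccColsPerTh],
                                    DataT p)
{
#pragma unroll
  for (int i = 0; i < AccRowsPerTh; ++i) {
#pragma unroll
    for (int j = 0; j < AccColsPerTh; ++j) {
      // Every unrolled iteration gets its own copy of the powf expansion
      // in the generated PTX.
      acc[i][j] += powf(fabsf(x[i] - y[j]), p);
    }
  }
}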

I'm just creating this GitHub issue as a placeholder so we can promote this bug through the proper channels once the release is over.

@benfred @vyasr @ahendriksen @sevagh @ajschmidt8 @dantegd please feel free to add or correct any details here.

cjnolet added the bug label Feb 5, 2023
@vyasr
Contributor

vyasr commented Feb 7, 2023

To clarify, this is all happening in device code so it's not OpenMP unrolling, it's CUDA unrolling, right?

In my case I was compiling for sm_75 (also CUDA 11.8).

To consolidate the discussion here, I'll note that in my investigations I found that there was one specific unroll in the reset_accumulator implementation that had the largest impact on compile times for me locally. That unroll was introduced in #837. I was unable to get #837 to compile on my machine (CUDA 11.8, sm_75) even overnight, but with that single unroll disabled, raft compiled in under 20 minutes (I do have a pretty beefy workstation). I am uncertain whether any of these findings conflict with what @ahendriksen reported in #1230, but I can provide extra testing/details if needed to help narrow things down further.
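
For illustration (hypothetical code, not the actual reset_accumulator), disabling a single unroll like that can be done in place with #pragma unroll 1, without touching the surrounding tiling logic:

// Hypothetical zero-initialization loop shaped like a reset_accumulator.
// Changing "#pragma unroll" to "#pragma unroll 1" (or dropping the pragma)
// disables unrolling for just this loop nest.
template <typename AccT, int AccRowsPerTh, int AccColsPerTh>
__device__ void reset_acc_sketch(AccT (&acc)[AccRowsPerTh][AccColsPerTh])
{
#pragma unroll 1  // was: #pragma unroll
  for (int i = 0; i < AccRowsPerTh; ++i) {
#pragma unroll 1
    for (int j = 0; j < AccColsPerTh; ++j) {
      acc[i][j] = AccT(0);
    }
  }
}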

@cjnolet
Member Author

cjnolet commented Feb 7, 2023

Thanks @vyasr for sharing your results. Wow, so you're saying just sm_75 alone is enough to pretty much cause the compiler to hang? That's a pretty useful find!

To clarify, this is all happening in device code so it's not OpenMP unrolling, it's CUDA unrolling, right?

I indeed meant CUDA and not OpenMP. I was thinking CUDA while typing OpenMP. Thanks for clarifying. I adjusted the above description accordingly.

To consolidate the discussion here, I'll note that in my investigations I found that there was one specific unroll in the reset_accumulator implementation that had the largest impact on compile times for me locally.

I have 4 main questions on the topic:

  1. What's really going on with the changes in Move contractions tiling logic outside of Contractions_NT #837 that's causing this?
  • Is this because of a bug in the loop unrolling itself (which seems somewhat unlikely to me)?
  • Is the unrolling causing something like excess register usage, crazy occupancy, or barely satisfiable geometry, which is triggering a bug in the compile-time optimizer?
  • Or are additional calls to an expensive or buggy intrinsic being compiled too many times?

  2. Why are we finding this issue to be much more dramatic for the lp_unexpanded distances (specifically w/ uint32_t indexing)? Perhaps my second question is quite closely related to the first: maybe the additional calls to the intrinsics, in combination w/ the unrolled zero-initialization loops, are causing something like excess register usage.

  3. What changed between the prolog and the reset_accumulator functions that would have caused this?

  4. Is the change fully necessary to implement the masked 1-nn? If so, can it be designed differently so as to have a different effect on the compile times? If this is a genuine bug, we should probably consider the possibility of an alternate approach here so we can still get the masked 1-nn into 23.04 in the meantime.

Perhaps profiling might help answer some or all of these questions. I'd like to get these changes merged into 23.04 somewhat soonish. @ahendriksen, do you have any thoughts on these questions?
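
If it helps, one low-effort way to probe the register-pressure theory (assuming I'm remembering the flags correctly) is to compile just the offending translation unit with -Xptxas -v, which prints per-kernel register and spill counts, and with --time, which appends per-phase compile timings (including ptxas) to a CSV file:

# Hypothetical invocation; the file name and arch are placeholders.
$ nvcc -c -arch=sm_75 -Xptxas -v --time=phases.csv lp_unexpanded_float_float_float_uint32.cu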

@ahendriksen
Contributor

On 1:
The issue is that ptxas chokes on one specific kernel instantiation of lp_unexpanded_float_float_float_uint32, specifically the row-major one with Veclen==4, i.e.,

void raft::distance::detail::pairwise_matrix_kernel<raft::distance::detail::kernel_params_T<float, float, float, unsigned int, raft::linalg::KernelPolicy<float, (int)4, (int)32, (int)4, (int)4, (int)16, (int)16>, raft::distance::detail::ops::minkowski_distance_op<float>, raft::distance::detail::default_fin_op<float, float, unsigned int>, (bool)1>>(const T1::DataT *, const T1::DataT *, const T1::DataT *, const T1::DataT *, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::OutT *, T1::opT, T1::FinOpT)

There are six kernel instantiations. They normally take roughly 10-15s each to compile. The pathological one above takes 41 minutes to compile for sm_70. This is caused by one of the compilation passes enabled at -O2 (the default, which RAFT also uses, is -O3).

$ time ptxas -O0 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    0m10.734s
$ time ptxas -O1 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    0m14.211s
$ time ptxas -O2 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    41m51.240s

# Other unsigned kernels take between 2-17 seconds to assemble:
$ time ptxas --warning-as-error -arch=sm_70 -m64 "ptx_lp_uint_0.ptx" -o "ptx_lp_uint_0.sm_70.cubin"
real    0m2.692s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_1.ptx" -o "ptx_lp_uint_1.sm_70.cubin"
real    0m2.730s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_2.ptx" -o "ptx_lp_uint_2.sm_70.cubin"
real    0m2.667s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_3.ptx" -o "ptx_lp_uint_3.sm_70.cubin"
real    0m2.666s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_5.ptx" -o "ptx_lp_uint_5.sm_70.cubin"
real    0m16.846s

# For comparison, compiling 6 kernels with int instead of uint32_t takes 44 seconds in total:
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose lp_unexpanded_float_float_float_int.ptx -o "lp_unexpanded_float_float_float_int.sm_70.cubin"
real    0m43.699s

Similar behavior does not occur with lp_unexpanded_float_float_float_int, even though the generated PTX file is roughly the same size (36MB) and the number of virtual PTX registers is equal at 120K. I tried diffing the files, but with 183K lines per kernel that is a bit of a hopeless task.
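
For anyone who wants to reproduce this: the PTX files above can be obtained by running nvcc with --keep on the offending translation unit, and a temporary workaround is to dial ptxas back to -O1 for just that file (the file name below is a placeholder):

# Keep intermediate files (including the .ptx) so ptxas can be timed by hand.
$ nvcc -arch=sm_70 --keep -c lp_unexpanded_float_float_float_uint32.cu
# Or, as a stopgap, lower the ptxas optimization level for this one file.
$ nvcc -arch=sm_70 -Xptxas -O1 -c lp_unexpanded_float_float_float_uint32.cu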

@ahendriksen
Contributor

On 2,3:

Any of the changed lines could have caused this. We are finding this issue to be more dramatic because the Minkowski distance uses powf, which requires roughly 20 instructions (even in PTX). With loop unrolling, this generates very large assembly files, which can take a long time to compile. PTX files for the other pairwise distance kernels are up to 72x smaller.
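
As a rough back-of-envelope illustration (the tile sizes here are made up, not the real policy values): if the per-thread accumulator were 4x4 and the inner k-loop were unrolled by another factor of 4, that would be 64 powf call sites per accumulation step; at roughly 20 instructions each, the powf expansion alone contributes on the order of a thousand instructions per unrolled step, versus one or two instructions per element for something like the squared-L2 accumulation.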

On 4:
The change is partly necessary for masked_nn. I would strongly recommend merging it:

  • it removes one particularly nasty footgun: different constructors have subtly different parameter orders (a3b8587)
  • the structure of the code is mirrored in the masked_nn code, making maintenance easier.
  • the masked_nn code relies on the changes to the contractions_NT class.

ahendriksen added a commit to ahendriksen/raft that referenced this issue Feb 7, 2023
As explained in rapidsai#1246 (comment), ptxas chokes on the minkowski distance when VecLen==4 and IdxT==uint32_t.

This PR removes the veclen == 4 specialization for the minkowski distance.
@vyasr
Contributor

vyasr commented Feb 8, 2023

Thanks @vyasr for sharing your results. Wow, so you're saying just sm_75 alone is enough to pretty much cause the compiler to hang? That's a pretty useful find!

Yup, I was compiling for just the local native architecture, an RTX 8000.

rapids-bot pushed a commit that referenced this issue Feb 9, 2023
As explained in #1246 (comment), ptxas chokes on the minkowski distance when `VecLen==4` and `IdxT==uint32_t`.

This PR removes the veclen == 4 specialization for the minkowski distance.

Follow up to: #1239

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Sean Frye (https://github.com/sean-frye)

URL: #1254
achirkin pushed a commit to achirkin/raft that referenced this issue Feb 9, 2023
@ahendriksen
Contributor

Fixed by PRs #1307 and #1335.
