[BUG] Potential CUDA compile time issue with loop unrolling or pow #1246

Closed
cjnolet opened this issue Feb 5, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@cjnolet
Member

cjnolet commented Feb 5, 2023

While working up to the 23.02 release, we started seeing compile times for a single source file increase to over an hour (per CUDA architecture), and @ahendriksen managed to isolate the issue to a CUDA loop unroll that was added to the pairwise distance accumulation computation.

So far we have observed that it doesn't seem to happen on CUDA 11.6 or CUDA 11.5 for the sm_70 architecture, but it does happen on 11.8 and 12.0 for sm_80 and above.

It's unclear to me whether this was related only to the CUDA loop unrolling or whether it was because the loop also contained a pow. We did notice that it was only happening for a specific distance and indexing type (Minkowski with uint32_t).
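
For anyone who hasn't looked at that code path, the pattern under suspicion looks roughly like the sketch below. This is not the actual RAFT code; the names and tile parameters are made up purely for illustration. It just shows the combination being described: a fully unrolled accumulation loop whose body calls powf.

// Hypothetical sketch of an unrolled pairwise-distance accumulation step.
// AccRowsPerTh, AccColsPerTh, and the argument shapes are illustrative only.
template <typename DataT, typename AccT, int AccRowsPerTh, int AccColsPerTh>
__device__ void accumulate_lp_unexp(AccT (&acc)[AccRowsPerTh][AccColsPerTh],
                                    const DataT (&x)[AccRowsPerTh],
                                    const DataT (&y)[AccColsPerTh],
                                    DataT p)
{
#pragma unroll
  for (int i = 0; i < AccRowsPerTh; ++i) {
#pragma unroll
    for (int j = 0; j < AccColsPerTh; ++j) {
      // Every unrolled iteration gets its own copy of the powf expansion
      // in the generated PTX.
      acc[i][j] += powf(fabsf(x[i] - y[j]), p);
    }
  }
}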

I'm just creating this GitHub issue as a placeholder so we can promote this bug through the proper channels once the release is over.

@benfred @vyasr @ahendriksen @sevagh @ajschmidt8 @dantegd please feel free to add or correct any details here.

cjnolet added the bug label Feb 5, 2023
@vyasr
Contributor

vyasr commented Feb 7, 2023

To clarify, this is all happening in device code so it's not OpenMP unrolling, it's CUDA unrolling, right?

In my case I was compiling for sm_75 (also CUDA 11.8).

To consolidate the discussion here, I'll note that in my investigations I found that there was one specific unroll in the reset_accumulator implementation that had the largest impact on compile times for me locally. That unroll was introduced in #837. I was unable to get #837 to compile on my machine (CUDA 11.8, sm_75) even overnight, but with that single unroll disabled, raft compiled in under 20 minutes (I do have a pretty beefy workstation). I am uncertain whether any of these findings conflict with what @ahendriksen reported in #1230, but I can provide extra testing/details if needed to help narrow things down further.
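
For illustration (hypothetical code, not the actual reset_accumulator), disabling a single unroll like that can be done in place with #pragma unroll 1, without touching the surrounding tiling logic:

// Hypothetical zero-initialization loop shaped like a reset_accumulator.
// Changing "#pragma unroll" to "#pragma unroll 1" (or dropping the pragma)
// disables unrolling for just this loop nest.
template <typename AccT, int AccRowsPerTh, int AccColsPerTh>
__device__ void reset_acc_sketch(AccT (&acc)[AccRowsPerTh][AccColsPerTh])
{
#pragma unroll 1  // was: #pragma unroll
  for (int i = 0; i < AccRowsPerTh; ++i) {
#pragma unroll 1
    for (int j = 0; j < AccColsPerTh; ++j) {
      acc[i][j] = AccT(0);
    }
  }
}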

@cjnolet
Member Author

cjnolet commented Feb 7, 2023

Thanks @vyasr for sharing your results. Wow, so you're saying just sm_75 alone is enough to pretty much cause the compiler to hang? That's a pretty useful find!

To clarify, this is all happening in device code so it's not OpenMP unrolling, it's CUDA unrolling, right?

I indeed meant CUDA and not OpenMP. I was thinking CUDA while typing OpenMP. Thanks for clarifying. I adjusted the above description accordingly.

To consolidate the discussion here, I'll note that in my investigations I found that there was one specific unroll in the reset_accumulator implementation that had the largest impact on compile times for me locally.

I have 4 main questions on the topic:

  1. What's really going on with the changes in Move contractions tiling logic outside of Contractions_NT #837 that's causing this?
  • Is this because of a bug in the loop unrolling itself (which seems somewhat unlikely to me)?
  • Is the unrolling causing something like excess register usage, crazy occupancy, or barely satisfiable geometry, which is triggering a bug in the compile-time optimizer?
  • Or are additional calls to an expensive or buggy intrinsic being compiled too many times?

  2. Why are we finding this issue to be much more dramatic for the lp_unexpanded distances (specifically w/ uint32_t indexing)? Perhaps my second question is quite closely related to the first: maybe the additional calls to the intrinsics, in combination w/ the unrolled zero-initialization loops, are causing something like excess register usage.

  3. What changed between the prolog and the reset_accumulator functions that would have caused this?

  4. Is the change fully necessary to implement the masked 1-nn? If so, can it be designed differently so as to have a different effect on the compile times? If this is a genuine bug, we should probably consider the possibility of an alternate approach here so we can still get the masked 1-nn into 23.04 in the meantime.

Perhaps profiling might help answer some or all of these questions. I'd like to get these changes merged into 23.04 somewhat soonish. @ahendriksen, do you have any thoughts on these questions?
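
If it helps, one low-effort way to probe the register-pressure theory (assuming I'm remembering the flags correctly) is to compile just the offending translation unit with -Xptxas -v, which prints per-kernel register and spill counts, and with --time, which appends per-phase compile timings (including ptxas) to a CSV file:

# Hypothetical invocation; the file name and arch are placeholders.
$ nvcc -c -arch=sm_75 -Xptxas -v --time=phases.csv lp_unexpanded_float_float_float_uint32.cu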

@ahendriksen
Contributor

On 1:
The issue is that ptxas chokes on one specific kernel instantiation of lp_unexpanded_float_float_float_uint32, specifically the row-major one with Veclen==4, i.e.,

void raft::distance::detail::pairwise_matrix_kernel<raft::distance::detail::kernel_params_T<float, float, float, unsigned int, raft::linalg::KernelPolicy<float, (int)4, (int)32, (int)4, (int)4, (int)16, (int)16>, raft::distance::detail::ops::minkowski_distance_op<float>, raft::distance::detail::default_fin_op<float, float, unsigned int>, (bool)1>>(const T1::DataT *, const T1::DataT *, const T1::DataT *, const T1::DataT *, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::OutT *, T1::opT, T1::FinOpT)

There are six kernel instantiations. They normally take roughly 10-15s each to compile. The pathological one above takes 41 minutes to compile for sm_70. This is caused by one of the compilation passes enabled at -O2 (the default, which RAFT also uses, is -O3).

$ time ptxas -O0 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    0m10.734s
$ time ptxas -O1 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    0m14.211s
$ time ptxas -O2 --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_4.ptx" -o "out.sm_70.cubin"
real    41m51.240s

# Other unsigned kernels take between 2-17 seconds to assemble:
$ time ptxas --warning-as-error -arch=sm_70 -m64 "ptx_lp_uint_0.ptx" -o "ptx_lp_uint_0.sm_70.cubin"
real    0m2.692s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_1.ptx" -o "ptx_lp_uint_1.sm_70.cubin"
real    0m2.730s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_2.ptx" -o "ptx_lp_uint_2.sm_70.cubin"
real    0m2.667s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_3.ptx" -o "ptx_lp_uint_3.sm_70.cubin"
real    0m2.666s
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose "ptx_lp_uint_5.ptx" -o "ptx_lp_uint_5.sm_70.cubin"
real    0m16.846s

# For comparison, compiling 6 kernels with int instead of uint32_t takes 44 seconds in total:
$ time ptxas --warning-as-error -arch=sm_70 -m64 --verbose lp_unexpanded_float_float_float_int.ptx -o "lp_unexpanded_float_float_float_int.sm_70.cubin"
real    0m43.699s

Similar behavior does not occur with lp_unexpanded_float_float_float_int, even though the generated PTX file is roughly the same size (36MB) and the number of virtual PTX registers is equal at 120K. I tried diffing the files, but with 183K lines per kernel that is a bit of a hopeless task.
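
For anyone who wants to reproduce this: the PTX files above can be obtained by running nvcc with --keep on the offending translation unit, and a temporary workaround is to dial ptxas back to -O1 for just that file (the file name below is a placeholder):

# Keep intermediate files (including the .ptx) so ptxas can be timed by hand.
$ nvcc -arch=sm_70 --keep -c lp_unexpanded_float_float_float_uint32.cu
# Or, as a stopgap, lower the ptxas optimization level for this one file.
$ nvcc -arch=sm_70 -Xptxas -O1 -c lp_unexpanded_float_float_float_uint32.cu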

@ahendriksen
Contributor

On 2,3:

Any of the changed lines could have caused this. We are finding this issue to be more dramatic because the Minkowski distance uses powf, which requires roughly 20 instructions (even in PTX). With loop unrolling, this generates very large assembly files, which can take a long time to compile. PTX files for the other pairwise distance kernels are up to 72x smaller.
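
As a rough back-of-envelope illustration (the tile sizes here are made up, not the real policy values): if the per-thread accumulator were 4x4 and the inner k-loop were unrolled by another factor of 4, that would be 64 powf call sites per accumulation step; at roughly 20 instructions each, the powf expansion alone contributes on the order of a thousand instructions per unrolled step, versus one or two instructions per element for something like the squared-L2 accumulation.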

On 4:
The change is partly necessary for masked_nn. I would strongly recommend merging it:

  • it removes one particularly nasty footgun: different constructors have subtly different parameter orders (a3b8587)
  • the structure of the code is mirrored in the masked_nn code, making maintenance easier.
  • the masked_nn code relies on the changes to the contractions_NT class.

ahendriksen added a commit to ahendriksen/raft that referenced this issue Feb 7, 2023
As explained in rapidsai#1246 (comment), ptxas chokes on the minkowski distance when VecLen==4 and IdxT==uint32_t.

This PR removes the veclen == 4 specialization for the minkowski distance.
@vyasr
Contributor

vyasr commented Feb 8, 2023

Thanks @vyasr for sharing your results. Wow, so you're saying just sm_75 alone is enough to pretty much cause the compiler to hang? That's a pretty useful find!

Yup, I was compiling for just the local native architecture, an RTX 8000.

rapids-bot pushed a commit that referenced this issue Feb 9, 2023
As explained in #1246 (comment), ptxas chokes on the minkowski distance when `VecLen==4` and `IdxT==uint32_t`.

This PR removes the veclen == 4 specialization for the minkowski distance.

Follow up to: #1239

Authors:
  - Allard Hendriksen (https://github.com/ahendriksen)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Sean Frye (https://github.com/sean-frye)

URL: #1254
achirkin pushed a commit to achirkin/raft that referenced this issue Feb 9, 2023
@ahendriksen
Contributor

Fixed by PRs #1307 and #1335.
