[BUG] Potential CUDA compile time issue with loop unrolling or pow #1246
Comments
To clarify, this is all happening in device code, so it's not OpenMP unrolling but CUDA unrolling, right? In my case I was compiling for …

To consolidate the discussion here, I'll note that in my investigations I found that there was one specific unroll in the …
Thanks @vyasr for sharing your results. Wow, so you're saying just …
I indeed meant CUDA and not OpenMP; I was thinking CUDA while typing OpenMP. Thanks for clarifying. I adjusted the description above accordingly.
I have 4 main questions on the topic:
Perhaps profiling might help answer some or all of these questions. I'd like to get these changes merged into 23.04 somewhat soonish. @ahendriksen, do you have any thoughts on these questions?
On 1: the pathological instantiation is

`void raft::distance::detail::pairwise_matrix_kernel<raft::distance::detail::kernel_params_T<float, float, float, unsigned int, raft::linalg::KernelPolicy<float, (int)4, (int)32, (int)4, (int)4, (int)16, (int)16>, raft::distance::detail::ops::minkowski_distance_op<float>, raft::distance::detail::default_fin_op<float, float, unsigned int>, (bool)1>>(const T1::DataT *, const T1::DataT *, const T1::DataT *, const T1::DataT *, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::IdxT, T1::OutT *, T1::opT, T1::FinOpT)`

There are six kernel instantiations, which take roughly 10-15 seconds each to compile. The pathological one above takes 41 minutes to compile for SM70. This is caused by one of the compilation passes in …
Similar behavior does not occur with …
On 2, 3: any of the changed lines could have caused this. We are finding this issue to be more dramatic because the Minkowski distance uses …

On 4: …
Yup, I was compiling for just the local native architecture, an RTX 8000.
As explained in #1246 (comment), ptxas chokes on the Minkowski distance when `VecLen == 4` and `IdxT == uint32_t`. This PR removes the `VecLen == 4` specialization for the Minkowski distance. Follow-up to #1239.

Authors:
- Allard Hendriksen (https://github.com/ahendriksen)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
- Sean Frye (https://github.com/sean-frye)

URL: #1254
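For illustration, the shape of such a fix could be a compile-time dispatch that simply never selects the widest vectorized path for the Minkowski op. The sketch below is hypothetical and not RAFT's actual dispatch code; the op types and the `launch_pairwise` stub are invented stand-ins.

```cuda
// Hypothetical sketch (not RAFT's real dispatch): skip the VecLen == 4
// specialization for the Minkowski op and fall back to VecLen == 2.
#include <cstdio>
#include <type_traits>

struct minkowski_op {};  // stand-in for the real Minkowski distance op
struct euclidean_op {};  // stand-in for any other distance op

template <typename OpT, int VecLen>
void launch_pairwise()  // stub standing in for the real kernel launch
{
  std::printf("VecLen = %d\n", VecLen);
}

template <typename OpT>
void dispatch_by_veclen(int preferred_veclen)
{
  // The problematic combination is Minkowski + VecLen == 4; never instantiate
  // that kernel and use the narrower specialization instead.
  constexpr bool skip_vec4 = std::is_same<OpT, minkowski_op>::value;
  if (preferred_veclen == 4 && !skip_vec4) {
    launch_pairwise<OpT, 4>();
  } else if (preferred_veclen >= 2) {
    launch_pairwise<OpT, 2>();
  } else {
    launch_pairwise<OpT, 1>();
  }
}

int main()
{
  dispatch_by_veclen<euclidean_op>(4);  // uses the VecLen == 4 path
  dispatch_by_veclen<minkowski_op>(4);  // falls back to VecLen == 2
}
```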
While working toward the 23.02 release, we started seeing compile times for a single source file increase to over an hour (per CUDA architecture), and @ahendriksen managed to isolate the issue to a CUDA loop unroll that was added to the pairwise distance accumulation computation.
So far we have observed that it doesn't seem to happen with CUDA 11.5 or 11.6 on the sm_70 architecture, but it does happen with CUDA 11.8 and 12.0 on sm_80 and above.
It's unclear to me whether this was related only to the CUDA loop unrolling or to the fact that the loop also contained a `pow`. We did notice that it was only happening for a specific distance and indexing type (Minkowski with `uint32_t`).
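To separate the two candidate causes, one could compare compile times for the same fully unrolled, `uint32_t`-indexed accumulation loop with and without the `pow` call. The kernels below are a hypothetical sketch of that experiment, not RAFT's actual accumulation code; the loop shape and the `veclen == 4` unroll are assumptions based on the discussion above.

```cuda
// Hypothetical A/B compile-time experiment (not RAFT code): the same fully
// unrolled, uint32_t-indexed loop, once with powf (Minkowski-style) and once
// without (L1-style), to see whether pow or the unroll alone is the trigger.
#include <stdint.h>

__global__ void accum_with_pow(const float* x, const float* y, float* out,
                               uint32_t n, float p)
{
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * 4 + 3 >= n) return;
  float acc = 0.f;
#pragma unroll
  for (int v = 0; v < 4; ++v) {
    acc += powf(fabsf(x[i * 4 + v] - y[i * 4 + v]), p);  // pow inside the unroll
  }
  out[i] = acc;
}

__global__ void accum_without_pow(const float* x, const float* y, float* out,
                                  uint32_t n)
{
  uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i * 4 + 3 >= n) return;
  float acc = 0.f;
#pragma unroll
  for (int v = 0; v < 4; ++v) {
    acc += fabsf(x[i * 4 + v] - y[i * 4 + v]);  // same unroll, no pow
  }
  out[i] = acc;
}
```

Building each kernel in its own translation unit per target architecture (and optionally with the `#pragma unroll` removed) would show whether ptxas time blows up only for the `powf` variant.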
I'm just creating this GitHub issue as a placeholder to promote this bug to the proper channels when the release is over.
@benfred @vyasr @ahendriksen @sevagh @ajschmidt8 @dantegd please feel free to add or correct any details here.