Add Conda packages and CI workflows #33
Conversation
ci/build_conda.sh
Outdated
rapids-print-env

rapids-logger "Install CUDA Toolkit"
source "$(dirname "$0")/install_latest_cuda_toolkit.sh"
This process of installing a system-wide cudatoolkit feels like an antipattern to me. It goes against what conda is good at, and I'm concerned that it might make things more brittle. As such, the work we now need to do to address confusion with finding libraries feels like wasted effort.
The fundamental issue is that appropriate cudatoolkit libraries are not available on Conda, right? Is there a different approach we could take here, where we work on addressing that fundamental issue first and then use more standard conda patterns in this library?
Have I misunderstood what is going on here? Are there other reasons to be using a system install of the CTK?
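For reference, the more standard conda pattern I have in mind would declare the toolkit pieces as recipe requirements instead of installing the CTK system-wide. A rough sketch, assuming conda-forge's cuda-version, cuda-nvcc, and libnvjitlink-dev/-static packages covered the CUDA minor version needed (which, per the discussion below, they currently do not):

# Hypothetical meta.yaml excerpt, not part of this PR; package names are the
# conda-forge CUDA 12 component packages as I understand them.
requirements:
  build:
    - {{ compiler('c') }}
    - {{ compiler('cuda') }}     # resolves to cuda-nvcc for CUDA 12
    - cmake >=3.26.4
    - ninja
  host:
    - python
    - pip
    - scikit-build-core
    - cuda-version 12.3.*        # pin the targeted CUDA minor version
    - libnvjitlink-dev           # headers
    - libnvjitlink-static        # static library to link against
  run:
    - python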
If we want to build with CUDA 12.2, those libraries are available
The main issue is that the package needs to ship the latest version of libnvjitlink_static for it to be able to handle the PTX from the cuDF build. So far we've used the same pattern for the pip packages.
I'm pretty confused here, so please correct me where I have no doubt gotten something wrong.
- We need pynvjitlink in order to finish our CUDA 12.2 release
- We can't use our existing parts of CUDA 12.2 to build pynvjitlink, because we need libnvjitlink from CUDA 12.3
Is that right? If so, I suppose you have no choice but to use the system packages, but it also seems like a really awkward position to be in. How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?
Great questions. I can try to illuminate the goals, design, and requirements, which might help.
nvjitlink is a library added in CUDA 12 that provides "a set of APIs which can be used at runtime to link together GPU device code". pynvjitlink provides Python bindings to nvjitlink. cuDF ships a bundle of PTX code generated at build time that is used by Numba and linked together with PTX generated from user code (via Numba and NVVM). This process needs pynvjitlink in order to provide backward compatibility, if the PTX bundled with cuDF at build time is too new to run on the user's driver.
For example, if we build cuDF with CUDA 12.2, it will generate and bundle some CUDA 12.2 PTX code with the library. Numba wants to link this with user-provided code but the result will not run on driver 525 (which shipped alongside CUDA 12.0) because it doesn't know how to use the newer PTX code when linking it all together. In order for RAPIDS to migrate to building with CUDA 12.x (x > 0), we must depend on pynvjitlink in cuDF.
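To make the role concrete: as I understand the pynvjitlink patch module (the names below are from my reading of it, not from this PR), opting Numba into nvjitlink-based linking is roughly a one-liner:

# Sketch of how a consumer opts in; the exact API may differ.
from pynvjitlink.patch import patch_numba_linker

patch_numba_linker()     # route Numba's device-code linking through nvjitlink

from numba import cuda   # kernels compiled after this point can link PTX that
                         # is newer than the installed driver's own linker supports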
The goal is to release a new pynvjitlink (which statically links libnvjitlink) for each new CUDA Toolkit minor version. (Other reasons to make a new version would include other matrix changes like Python version updates or updates to the pynvjitlink bindings themselves.) This release process is not fundamentally synced to the RAPIDS release cycle, and will likely involve intermittent but off-cycle releases that are not aligned to RAPIDS 24.02, 24.04, etc.
We want (need?) to build with the newest possible version of nvjitlink for compatibility reasons. RAPIDS wants to move to building with CUDA 12.2, so it's reasonable to ask why we can't just use nvjitlink from CUDA 12.2 (which has conda-forge packages). There are some additional bugs in nvjitlink that were only recently resolved in 12.3 (see other thread: #33 (comment)), so this isn't viable. We also want pynvjitlink to be usable by Numba consumers outside of RAPIDS, so supporting PTX from CUDA "12.latest" (and not just limiting to the CUDA build version used by RAPIDS) is important.
Generally, conda packages of CUDA have lagged other channel sources (like RPM packages) in availability. Because we need to build pynvjitlink against CUDA "12.latest" for pip users (who could have any CUDA 12 driver/toolkit versions), we chose to pull nvjitlink from RPM packages in wheel builds. It is possible for conda packages of pynvjitlink to take a different strategy, but my initial hope was that we could just do the same thing and pull the latest CUDA RPM packages for the conda build. Clearly there are hiccups with CMake and conda support around this since it didn't work out of the box. I am certainly open to other strategies here but it will be difficult to maintain pynvjitlink if we are pulling nvjitlink from two different sources with two different versions, because we want to make new pynvjitlink releases for both pip and conda simultaneously.
We require the nvjitlink library from CUDA 12.3+ to proceed. CUDA 12.3 is not on conda-forge yet, but even if it were, that wouldn't necessarily solve our problem if conda-forge continues to lag mainstream availability (we need to release pip and conda support together and want "12.latest" support for pip). Perhaps we could pull nvjitlink from nvidia channel packages or fix the issues with using RPM packages (which I trust to be released on "day one" of a new CUDA minor version).
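Concretely, pulling nvjitlink from the RPM repositories in CI looks roughly like the sketch below (a RHEL 8 x86_64 image and the standard NVIDIA repo layout assumed; the actual install_latest_cuda_toolkit.sh may differ):

#!/bin/bash
# Rough sketch only: grab the newest libnvjitlink from NVIDIA's CUDA RPM repo,
# assuming the devel package carries the static library we link against.
set -euo pipefail

dnf install -y dnf-plugins-core
dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf install -y libnvjitlink-devel-12-3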
How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?
Zooming in on this question because it's a tl;dr for some pieces of my longer comment above.
- pynvjitlink is mandatory for RAPIDS to build with CUDA 12.2 because RAPIDS requires nvjitlink features (called through pynvjitlink) to support older CUDA 12 drivers with PTX built by newer CUDA 12 minor versions (like 12.2).
- nvjitlink isn't usable (at least in the ways that RAPIDS wants) until CUDA 12.3 because of bugs in the library. These bugs were found during the development of pynvjitlink and not fixed until later CUDA Toolkit releases (specifically 12.3).
While discussing offline, @bdice and I realized the following case can occur. Suppose a user has:
- 12.0 Driver (R525)
- 12.2 used to build cuDF
- 12.2u2 nvjitlink used to build pynvjitlink
- 12.3 CTK on the user machine
In this case, even with the presence of pynvjitlink, we will still encounter a failure, since there is a piece of the PTX that the 12.3 CTK generates which will be too new for libnvjitlink from 12.2u2. While conda provides some tools to constrain the target environment such that this would never happen, the same is not the case for pip packages. This provides some motivation to pursue the RPM package route.
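For example, a run-time constraint along these lines in the conda recipe (illustrative only, not from this PR) would keep a pynvjitlink built against 12.2u2 from being installed next to a newer toolkit; pip has no comparable mechanism:

requirements:
  run:
    # cap the user's CUDA toolkit at the nvjitlink version we linked against
    - cuda-version >=12.0,<12.3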
pyproject.toml
Outdated
@@ -1,6 +1,8 @@
[tool.scikit-build]
cmake.minimum-version = "3.26.4"
cmake.verbose = true
cmake.args = ["-DCUDAToolkit_ROOT=/usr/local/cuda-12.3", "--debug-find"]
Though it looks like we are building with CUDA 12.3 here. Do we need CUDA 12.3? Or could we build with CUDA 12.2?
The 12.3 version of libnvjitlink has a couple of bugfixes we could use. cc @gmarkall for details.
nvbug 4375843 is one issue that I know of that was fixed in 12.3 but there may be others.
@brandon-b-miller @gmarkall Is that the same issue tracked in #3 / #4?
Think that is a different bug. My understanding is the memory corruption bug was fixed in 12.2u1 whereas conda-forge packages are on 12.2u2
@jakirkham, you are correct. The bug ids for the memory corruption issue are 4080381 and 4121608. 12.2u1 should have a fix.
Due to cases like this we actually do need to build with 12.3 (12.latest in general) for compatibility reasons. The bug fixes may be irrelevant since we have to use 12.3 (or 12.latest) to provide a proper compatibility story with pynvjitlink.
Since we're already reading from pyproject.toml we may as well read as much metadata as we can from there.
/merge
Thanks all! 🙏 Let's follow up on anything else separately
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: pull-request
this should be build_type: ${{ inputs.build_type || 'branch' }}.
also, it will need the following:
  branch: ${{ inputs.branch }}
  date: ${{ inputs.date }}
  sha: ${{ inputs.sha }}
This is the build-conda portion, which IIUC is just for testing that the packages do build
pynvjitlink/.github/workflows/build.yaml
Lines 31 to 38 in a1479d3
build-conda:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: pull-request
    build_script: "ci/build_conda.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}
AFAICT this matches what the build-wheels section has
pynvjitlink/.github/workflows/build.yaml
Lines 23 to 30 in a1479d3
build-wheels:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: branch
    script: "ci/build_wheel.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}
The publish-conda section has these changes already
pynvjitlink/.github/workflows/build.yaml
Lines 50 to 59 in a1479d3
publish-conda:
  needs:
    - build-conda
  secrets: inherit
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: ${{ inputs.build_type || 'branch' }}
    branch: ${{ inputs.branch }}
    date: ${{ inputs.date }}
    sha: ${{ inputs.sha }}
It appears similar to publish-wheels, which also has them
pynvjitlink/.github/workflows/build.yaml
Lines 39 to 49 in a1479d3
publish-wheels:
  needs:
    - build-wheels
  secrets: inherit
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: ${{ inputs.build_type || 'branch' }}
    branch: ${{ inputs.branch }}
    sha: ${{ inputs.sha }}
    date: ${{ inputs.date }}
    package-name: pynvjitlink
Based on this, do we have what we need already?
Or do both Conda & Wheels sections need some updates?
That said, it is odd that build-conda has build_type: pull-request instead of build_type: branch like in build-wheels. Think we want:

-    build_type: pull-request
+    build_type: branch
Fixing with PR ( #43 )
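For concreteness, the fixed build-conda job would look roughly like this (a sketch only; see #43 for what actually landed):

build-conda:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: branch   # was: pull-request; publish-conda uses ${{ inputs.build_type || 'branch' }}
    build_script: "ci/build_conda.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}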
It looks like the build_type change noted above was all that was needed. Though please let us know if we are missing something.
Now that conda-forge has CUDA 12.3 ( conda-forge/cuda-feedstock#15 ), have submitted PR ( #50 ) to update to CUDA 12.3. Hopefully that closes out a few threads above.
Contributes to rapidsai/build-planning#72. Proposes replacing a use of `.get()` in the conda recipe with `[]`, to ensure that we get a loud build error if the `"project"` table is not present when `conda-build` reads `pyproject.toml`.
## Notes for Reviewers
I know that at some time in the past, the `.get()` was introduced as a possible solution for errors observed at build time: #33 (comment)
Pushed this to see if those issues are still present, and they don't seem to be!
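The change is essentially this (a hedged sketch of the recipe's jinja preamble, assuming it loads the file with conda-build's load_file_data; the real recipe may differ):

{% set pyproject = load_file_data("pyproject.toml") %}
# Before: a missing "project" table silently became an empty dict.
# {% set project = pyproject.get("project", {}) %}
# After: a missing "project" table fails the build loudly.
{% set project = pyproject["project"] %}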
This PR adds build scripts for creating and testing conda packages and the associated CI workflows.