
Add Conda packages and CI workflows #33

Merged
merged 38 commits into main from add-conda-packages
Jan 12, 2024

Conversation

brandon-b-miller
Contributor

This PR adds build scripts for creating and testing conda packages and the associated CI workflows.

rapids-print-env

rapids-logger "Install CUDA Toolkit"
source "$(dirname "$0")/install_latest_cuda_toolkit.sh"

This process of installing a system-wide cudatoolkit feels like an antipattern to me. It goes against what conda is good at, and I'm concerned that it might make things more brittle. As such, the work we now need to do to address confusion with finding libraries feels like wasted effort.

The fundamental issue is that appropriate cudatoolkit libraries are not available on Conda, right? Is there a different approach we could take here, where we work on addressing that fundamental issue first and then use more standard conda patterns in this library?

Have I misunderstood what is going on here? Are there other reasons to be using a system install of the CTK?

Member

If we want to build with CUDA 12.2, those libraries are available

Contributor Author

The main issue is that the package needs to ship the latest version of libnvjitlink_static for it to be able to handle the PTX from the cuDF build. So far we've used the same pattern for the pip packages.


I'm pretty confused here, so please correct me where I have no doubt gotten something wrong.

  • We need pynvjitlink in order to finish our CUDA 12.2 release
  • We can't use our existing parts of CUDA 12.2 to build pynvjitlink, because we need libnvjitlink from CUDA 12.3

Is that right? If so, I suppose you have no choice but the system packages, but it also seems like a really awkward position to be in. How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?

Contributor

@bdice bdice Jan 9, 2024

Great questions. I can try to illuminate the goals, design, and requirements, which might help.

nvjitlink is a library added in CUDA 12 that provides "a set of APIs which can be used at runtime to link together GPU device code". pynvjitlink provides Python bindings to nvjitlink. cuDF ships a bundle of PTX code generated at build time that is used by Numba and linked together with PTX generated from user code (via Numba and NVVM). This process needs pynvjitlink in order to provide backward compatibility, if the PTX bundled with cuDF at build time is too new to run on the user's driver.

For example, if we build cuDF with CUDA 12.2, it will generate and bundle some CUDA 12.2 PTX code with the library. Numba wants to link this with user-provided code but the result will not run on driver 525 (which shipped alongside CUDA 12.0) because it doesn't know how to use the newer PTX code when linking it all together. In order for RAPIDS to migrate to building with CUDA 12.x (x > 0), we must depend on pynvjitlink in cuDF.
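The compatibility constraint described above can be sketched in plain Python. The version pairings below (PTX ISA caps per driver branch, PTX ISA emitted per toolkit) are illustrative assumptions for this sketch, not authoritative toolkit data:

```python
# Illustrative sketch of the forward-compatibility problem: a driver's built-in
# JIT can only consume PTX whose ISA version it knows about. The version
# pairings below are assumptions for illustration only.

# Highest PTX ISA version each driver branch's JIT understands (assumed).
DRIVER_MAX_PTX_ISA = {
    "R525 (CUDA 12.0)": (8, 0),
    "R535 (CUDA 12.2)": (8, 2),
}

# PTX ISA version emitted by each toolkit (assumed).
TOOLKIT_PTX_ISA = {
    "CUDA 12.0": (8, 0),
    "CUDA 12.2": (8, 2),
}

def driver_can_jit(driver: str, toolkit: str) -> bool:
    """Can this driver's built-in JIT link PTX produced by this toolkit?"""
    return TOOLKIT_PTX_ISA[toolkit] <= DRIVER_MAX_PTX_ISA[driver]

# cuDF built with CUDA 12.2, user on an R525 (CUDA 12.0) driver:
print(driver_can_jit("R525 (CUDA 12.0)", "CUDA 12.2"))  # False -> need pynvjitlink
print(driver_can_jit("R535 (CUDA 12.2)", "CUDA 12.2"))  # True
```

This is why the driver's JIT alone is not enough once RAPIDS builds with 12.x (x > 0): the newer nvjitlink bundled in pynvjitlink does the linking instead of the older driver.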

The goal is to release a new pynvjitlink (which statically links libnvjitlink) for each new CUDA Toolkit minor version. (Other reasons to make a new version would include other matrix changes like Python version updates or updates to the pynvjitlink bindings themselves.) This release process is not fundamentally synced to the RAPIDS release cycle, and will likely involve intermittent but off-cycle releases that are not aligned to RAPIDS 24.02, 24.04, etc.

We want (need?) to build with the newest possible version of nvjitlink for compatibility reasons. RAPIDS wants to move to building with CUDA 12.2, so it's reasonable to ask why we can't just use nvjitlink from CUDA 12.2 (which has conda-forge packages). There are some additional bugs in nvjitlink that were only recently resolved in 12.3 (see other thread: #33 (comment)), so this isn't viable. We also want pynvjitlink to be usable by Numba consumers outside of RAPIDS, so supporting PTX from CUDA "12.latest" (and not just limiting to the CUDA build version used by RAPIDS) is important.

Generally, conda packages of CUDA have lagged other channel sources (like RPM packages) in availability. Because we need to build pynvjitlink against CUDA "12.latest" for pip users (who could have any CUDA 12 driver/toolkit versions), we chose to pull nvjitlink from RPM packages in wheel builds. It is possible for conda packages of pynvjitlink to take a different strategy, but my initial hope was that we could just do the same thing and pull the latest CUDA RPM packages for the conda build. Clearly there are hiccups with CMake and conda support around this since it didn't work out of the box. I am certainly open to other strategies here but it will be difficult to maintain pynvjitlink if we are pulling nvjitlink from two different sources with two different versions, because we want to make new pynvjitlink releases for both pip and conda simultaneously.

We require the nvjitlink library from CUDA 12.3+ to proceed. CUDA 12.3 is not on conda-forge yet but that still wouldn't necessarily solve our problem if conda-forge continues to lag mainstream availability (we need to release pip and conda support together and want "12.latest" support for pip). Perhaps we could pull nvjitlink from nvidia channel packages or fix the issues with using RPM packages (which I trust to be released on "day one" of a new CUDA minor version).

Contributor

@bdice bdice Jan 9, 2024

How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?

Zooming in on this question because it's a tl;dr for some pieces of my longer comment above.

  • pynvjitlink is mandatory for RAPIDS to build with CUDA 12.2 because RAPIDS requires nvjitlink features (called through pynvjitlink) to support older CUDA 12 drivers with PTX built by newer CUDA 12 minor versions (like 12.2).
  • nvjitlink isn't usable (at least in the ways that RAPIDS wants) until CUDA 12.3 because of bugs in the library. These bugs were found during the development of pynvjitlink and not fixed until later CUDA Toolkit releases (specifically 12.3).

Contributor Author

@brandon-b-miller brandon-b-miller Jan 9, 2024

While discussing offline, @bdice and I realized the following case can occur. Suppose a user has:

  • a 12.0 driver (R525)
  • CUDA 12.2 used to build cuDF
  • nvjitlink 12.2u2 used to build pynvjitlink
  • the 12.3 CTK on the user machine

In this case, even with pynvjitlink present, we will still encounter a failure, since the 12.3 CTK can generate PTX that is too new for libnvjitlink from 12.2u2. While conda provides tools to constrain the target environment so that this would never happen, the same is not true for pip packages. This provides some motivation to pursue the RPM package route.
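The failure mode in this scenario can be sketched with the same kind of illustrative version comparison (the rule that a statically linked nvjitlink handles PTX no newer than its own CTK version is an assumption of this sketch):

```python
# Illustrative sketch: a pynvjitlink built against nvjitlink from CUDA 12.2u2
# can still fail when user code is compiled with a newer CTK (12.3), because
# the bundled linker does not understand the newer PTX. All versions here are
# assumptions for illustration.

def nvjitlink_can_link(linker_ctk: tuple, ptx_ctk: tuple) -> bool:
    """Assume the statically linked nvjitlink handles PTX up to its own CTK version."""
    return ptx_ctk <= linker_ctk

linker_ctk = (12, 2)      # nvjitlink 12.2u2 statically linked into pynvjitlink
cudf_build_ctk = (12, 2)  # cuDF built with CUDA 12.2
user_ctk = (12, 3)        # CTK installed on the user's machine

# pynvjitlink bridges cuDF's 12.2 PTX down to the old (12.0) driver...
print(nvjitlink_can_link(linker_ctk, cudf_build_ctk))  # True
# ...but fails on PTX generated by the user's 12.3 toolkit:
print(nvjitlink_can_link(linker_ctk, user_ctk))        # False
```

Building pynvjitlink against "12.latest" nvjitlink (e.g. via the RPM packages) makes the first argument as large as possible, closing this gap.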

pyproject.toml Outdated
@@ -1,6 +1,8 @@
[tool.scikit-build]
cmake.minimum-version = "3.26.4"
cmake.verbose = true
cmake.args = ["-DCUDAToolkit_ROOT=/usr/local/cuda-12.3", "--debug-find"]
Member

Though it looks like we are building with CUDA 12.3 here. Do we need CUDA 12.3? Or could we build with CUDA 12.2?

Contributor Author

The 12.3 version of libnvjitlink has a couple of bugfixes we could use. cc @gmarkall for details.

Contributor

@bdice bdice Jan 9, 2024

nvbug 4375843 is one issue that I know of that was fixed in 12.3 but there may be others.

Contributor

@bdice bdice Jan 9, 2024

@brandon-b-miller @gmarkall Is that the same issue tracked in #3 / #4?

Member

I think that is a different bug. My understanding is that the memory corruption bug was fixed in 12.2u1, whereas conda-forge packages are on 12.2u2.

Contributor

@bdice bdice Jan 9, 2024

@jakirkham, you are correct. The bug ids for the memory corruption issue are 4080381 and 4121608. 12.2u1 should have a fix.

Contributor

@bdice bdice Jan 9, 2024

Due to cases like this we actually do need to build with 12.3 (12.latest in general) for compatibility reasons. The bug fixes may be irrelevant since we have to use 12.3 (or 12.latest) to provide a proper compatibility story with pynvjitlink.

@bdice bdice requested review from msarahan and bdice and removed request for bdice January 11, 2024 23:09
Contributor

@vyasr vyasr left a comment

Since we're already reading from pyproject.toml we may as well read as much metadata as we can from there.

@jakirkham
Member

/merge

@vyasr vyasr merged commit c2353cb into main Jan 12, 2024
32 checks passed
@jakirkham
Member

Thanks all! 🙏

Let's follow up on anything else separately

- compute-matrix
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: pull-request
Member

this should be build_type: ${{ inputs.build_type || 'branch' }}.

also, it will need the following:

branch: ${{ inputs.branch }}
date: ${{ inputs.date }}
sha: ${{ inputs.sha }}

Member

This is the build-conda portion, which IIUC is just for testing that the packages do build

build-conda:
needs:
- compute-matrix
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: pull-request
build_script: "ci/build_conda.sh"
matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}

AFAICT this matches what the build-wheels section has

build-wheels:
needs:
- compute-matrix
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: branch
script: "ci/build_wheel.sh"
matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}


The publish-conda section has these changes already

publish-conda:
needs:
- build-conda
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: ${{ inputs.build_type || 'branch' }}
branch: ${{ inputs.branch }}
date: ${{ inputs.date }}
sha: ${{ inputs.sha }}

It appears similar to publish-wheels, which also has them

publish-wheels:
needs:
- build-wheels
secrets: inherit
uses: rapidsai/shared-workflows/.github/workflows/[email protected]
with:
build_type: ${{ inputs.build_type || 'branch' }}
branch: ${{ inputs.branch }}
sha: ${{ inputs.sha }}
date: ${{ inputs.date }}
package-name: pynvjitlink


Based on this, do we have what we need already?

Or do both Conda & Wheels sections need some updates?

Member

@jakirkham jakirkham Jan 13, 2024

That said, it is odd that build-conda has build_type: pull-request instead of build_type: branch like in build-wheels. Think we want

Suggested change
build_type: pull-request
build_type: branch

Fixing with PR ( #43 )

Member

It looks like the build_type change noted above was all that was needed. Though please let us know if we are missing something

jakirkham added a commit that referenced this pull request Jan 13, 2024
* Add Conda packages and CI workflows (#33)
* Set pipefail in wheel build scripts (#34)
* [REVIEW] remove references to setup.py in docs (#36)
* Update README.md (#37)
* Remove usages of rapids-env-update (#41)
@jakirkham
Member

jakirkham commented Feb 8, 2024

Now that conda-forge has CUDA 12.3 ( conda-forge/cuda-feedstock#15 ), have submitted PR ( #50 ) to update to CUDA 12.3. Hopefully that closes out a few threads above

bdice pushed a commit that referenced this pull request Jun 13, 2024
Contributes to rapidsai/build-planning#72.

Proposes replacing a use of `.get()` in the conda recipe with `[]`, to
ensure that we get a loud build error if the `"project"` table is not
present when `conda-build` reads `pyproject.toml`.

## Notes for Reviewers

I know that at some time in the past, the `.get()` was introduced as a
possible solution for errors observed at build time:
#33 (comment)

Pushed this to see if those issues are still present, and they don't
seem to be!
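The fail-loud motivation behind swapping `.get()` for `[]` can be illustrated with plain Python dict access (the `pyproject_data` name below is a hypothetical stand-in for the parsed `pyproject.toml` the recipe reads):

```python
# Why `[]` over `.get()`: indexing raises immediately when the key is missing,
# while `.get()` silently yields None and defers the failure to a confusing
# later point. `pyproject_data` is a hypothetical stand-in for the parsed
# pyproject.toml handed to the conda recipe.

pyproject_data = {}  # simulate a pyproject.toml missing its "project" table

# Silent: returns None, and the build fails later with an unrelated error.
print(pyproject_data.get("project"))  # None

# Loud: fails right here, pointing at the real problem.
try:
    pyproject_data["project"]
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'project'
```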
@jakirkham jakirkham deleted the add-conda-packages branch June 13, 2024 15:23