Add Conda packages and CI workflows #33
Conversation
ci/build_conda.sh
Outdated
rapids-print-env

rapids-logger "Install CUDA Toolkit"
source "$(dirname "$0")/install_latest_cuda_toolkit.sh"
This process of installing a system-wide cudatoolkit feels like an antipattern to me. It goes against what conda is good at, and I'm concerned that it might make things more brittle. As such, the work we now need to do to address confusion with finding libraries feels like wasted effort.
The fundamental issue is that appropriate cudatoolkit libraries are not available on Conda, right? Is there a different approach we could take here, where we work on addressing that fundamental issue first and then use more standard conda patterns in this library?
Have I misunderstood what is going on here? Are there other reasons to be using a system install of the CTK?
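For reference, the more standard conda pattern I have in mind would declare the toolkit pieces as recipe requirements instead of installing the CTK system-wide. A rough sketch, assuming conda-forge's cuda-version, cuda-nvcc, and libnvjitlink-dev/-static packages covered the CUDA minor version needed (which, per the discussion below, they currently do not):

# Hypothetical meta.yaml excerpt, not part of this PR; package names are the
# conda-forge CUDA 12 component packages as I understand them.
requirements:
  build:
    - {{ compiler('c') }}
    - {{ compiler('cuda') }}     # resolves to cuda-nvcc for CUDA 12
    - cmake >=3.26.4
    - ninja
  host:
    - python
    - pip
    - scikit-build-core
    - cuda-version 12.3.*        # pin the targeted CUDA minor version
    - libnvjitlink-dev           # headers
    - libnvjitlink-static        # static library to link against
  run:
    - python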
If we want to build with CUDA 12.2, those libraries are available
The main issue is that the package needs to ship the latest version of libnvjitlink_static for it to be able to handle the PTX from the cuDF build. So far we've used the same pattern for the pip packages.
I'm pretty confused here, so please correct me where I have no doubt gotten something wrong.
- We need pynvjitlink in order to finish our CUDA 12.2 release
- We can't use our existing parts of CUDA 12.2 to build pynvjitlink, because we need libnvjitlink from CUDA 12.3
Is that right? If so, I suppose you have no choice but to use the system packages, but it also seems like a really awkward position to be in. How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?
Great questions. I can try to illuminate the goals, design, and requirements, which might help.
nvjitlink is a library added in CUDA 12 that provides "a set of APIs which can be used at runtime to link together GPU device code". pynvjitlink provides Python bindings to nvjitlink. cuDF ships a bundle of PTX code generated at build time that is used by Numba and linked together with PTX generated from user code (via Numba and NVVM). This process needs pynvjitlink in order to provide backward compatibility, if the PTX bundled with cuDF at build time is too new to run on the user's driver.
For example, if we build cuDF with CUDA 12.2, it will generate and bundle some CUDA 12.2 PTX code with the library. Numba wants to link this with user-provided code but the result will not run on driver 525 (which shipped alongside CUDA 12.0) because it doesn't know how to use the newer PTX code when linking it all together. In order for RAPIDS to migrate to building with CUDA 12.x (x > 0), we must depend on pynvjitlink in cuDF.
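To make the role concrete: as I understand the pynvjitlink patch module (the names below are from my reading of it, not from this PR), opting Numba into nvjitlink-based linking is roughly a one-liner:

# Sketch of how a consumer opts in; the exact API may differ.
from pynvjitlink.patch import patch_numba_linker

patch_numba_linker()     # route Numba's device-code linking through nvjitlink

from numba import cuda   # kernels compiled after this point can link PTX that
                         # is newer than the installed driver's own linker supports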
The goal is to release a new pynvjitlink (which statically links libnvjitlink) for each new CUDA Toolkit minor version. (Other reasons to make a new version would include other matrix changes like Python version updates or updates to the pynvjitlink bindings themselves.) This release process is not fundamentally synced to the RAPIDS release cycle, and will likely involve intermittent but off-cycle releases that are not aligned to RAPIDS 24.02, 24.04, etc.
We want (need?) to build with the newest possible version of nvjitlink for compatibility reasons. RAPIDS wants to move to building with CUDA 12.2, so it's reasonable to ask why we can't just use nvjitlink from CUDA 12.2 (which has conda-forge packages). There are some additional bugs in nvjitlink that were only recently resolved in 12.3 (see other thread: #33 (comment)), so this isn't viable. We also want pynvjitlink to be usable by Numba consumers outside of RAPIDS, so supporting PTX from CUDA "12.latest" (and not just limiting to the CUDA build version used by RAPIDS) is important.
Generally, conda packages of CUDA have lagged other channel sources (like RPM packages) in availability. Because we need to build pynvjitlink against CUDA "12.latest" for pip users (who could have any CUDA 12 driver/toolkit versions), we chose to pull nvjitlink from RPM packages in wheel builds. It is possible for conda packages of pynvjitlink to take a different strategy, but my initial hope was that we could just do the same thing and pull the latest CUDA RPM packages for the conda build. Clearly there are hiccups with CMake and conda support around this since it didn't work out of the box. I am certainly open to other strategies here but it will be difficult to maintain pynvjitlink if we are pulling nvjitlink from two different sources with two different versions, because we want to make new pynvjitlink releases for both pip and conda simultaneously.
We require the nvjitlink library from CUDA 12.3+ to proceed. CUDA 12.3 is not on conda-forge yet, but even if it were, that wouldn't necessarily solve our problem if conda-forge continues to lag mainstream availability (we need to release pip and conda support together and want "12.latest" support for pip). Perhaps we could pull nvjitlink from nvidia channel packages or fix the issues with using RPM packages (which I trust to be released on "day one" of a new CUDA minor version).
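Concretely, pulling nvjitlink from the RPM repositories in CI looks roughly like the sketch below (a RHEL 8 x86_64 image and the standard NVIDIA repo layout assumed; the actual install_latest_cuda_toolkit.sh may differ):

#!/bin/bash
# Rough sketch only: grab the newest libnvjitlink from NVIDIA's CUDA RPM repo,
# assuming the devel package carries the static library we link against.
set -euo pipefail

dnf install -y dnf-plugins-core
dnf config-manager --add-repo \
    https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
dnf install -y libnvjitlink-devel-12-3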
How was pynvjitlink both mandatory to 12.2 and not functional until 12.3?
Zooming in on this question because it's a tl;dr for some pieces of my longer comment above.
- pynvjitlink is mandatory for RAPIDS to build with CUDA 12.2 because RAPIDS requires nvjitlink features (called through pynvjitlink) to support older CUDA 12 drivers with PTX built by newer CUDA 12 minor versions (like 12.2).
- nvjitlink isn't usable (at least in the ways that RAPIDS wants) until CUDA 12.3 because of bugs in the library. These bugs were found during the development of pynvjitlink and not fixed until later CUDA Toolkit releases (specifically 12.3).
While discussing offline, @bdice and I realized the following case can occur. Suppose a user has:
- 12.0 Driver (R525)
- 12.2 used to build cuDF
- 12.2u2 nvjitlink used to build pynvjitlink
- 12.3 CTK on the user machine
In this case, even with the presence of pynvjitlink, we will still encounter a failure, since there is a piece of the PTX that the 12.3 CTK generates which will be too new for libnvjitlink from 12.2u2. While conda provides some tools to constrain the target environment such that this would never happen, the same is not the case for pip packages. This provides some motivation to pursue the RPM package route.
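For example, a run-time constraint along these lines in the conda recipe (illustrative only, not from this PR) would keep a pynvjitlink built against 12.2u2 from being installed next to a newer toolkit; pip has no comparable mechanism:

requirements:
  run:
    # cap the user's CUDA toolkit at the nvjitlink version we linked against
    - cuda-version >=12.0,<12.3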
pyproject.toml
Outdated
@@ -1,6 +1,8 @@
[tool.scikit-build]
cmake.minimum-version = "3.26.4"
cmake.verbose = true
cmake.args = ["-DCUDAToolkit_ROOT=/usr/local/cuda-12.3", "--debug-find"]
Though it looks like we are building with CUDA 12.3 here. Do we need CUDA 12.3? Or could we build with CUDA 12.2?
The 12.3 version of libnvjitlink has a couple of bugfixes we could use. cc @gmarkall for details.
nvbug 4375843 is one issue that I know of that was fixed in 12.3 but there may be others.
@brandon-b-miller @gmarkall Is that the same issue tracked in #3 / #4?
Think that is a different bug. My understanding is the memory corruption bug was fixed in 12.2u1 whereas conda-forge packages are on 12.2u2
@jakirkham, you are correct. The bug ids for the memory corruption issue are 4080381 and 4121608. 12.2u1 should have a fix.
Due to cases like this we actually do need to build with 12.3 (12.latest in general) for compatibility reasons. The bug fixes may be irrelevant since we have to use 12.3 (or 12.latest) to provide a proper compatibility story with pynvjitlink.
Since we're already reading from pyproject.toml we may as well read as much metadata as we can from there.
/merge
Thanks all! 🙏 Let's follow up on anything else separately
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: pull-request
this should be build_type: ${{ inputs.build_type || 'branch' }}.
also, it will need the following:
  branch: ${{ inputs.branch }}
  date: ${{ inputs.date }}
  sha: ${{ inputs.sha }}
This is the build-conda portion, which IIUC is just for testing that the packages do build
pynvjitlink/.github/workflows/build.yaml
Lines 31 to 38 in a1479d3
build-conda:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: pull-request
    build_script: "ci/build_conda.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}
AFAICT this matches what the build-wheels section has
pynvjitlink/.github/workflows/build.yaml
Lines 23 to 30 in a1479d3
build-wheels:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: branch
    script: "ci/build_wheel.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}
The publish-conda section has these changes already
pynvjitlink/.github/workflows/build.yaml
Lines 50 to 59 in a1479d3
publish-conda:
  needs:
    - build-conda
  secrets: inherit
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: ${{ inputs.build_type || 'branch' }}
    branch: ${{ inputs.branch }}
    date: ${{ inputs.date }}
    sha: ${{ inputs.sha }}
It appears similar to publish-wheels, which also has them
pynvjitlink/.github/workflows/build.yaml
Lines 39 to 49 in a1479d3
publish-wheels:
  needs:
    - build-wheels
  secrets: inherit
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: ${{ inputs.build_type || 'branch' }}
    branch: ${{ inputs.branch }}
    sha: ${{ inputs.sha }}
    date: ${{ inputs.date }}
    package-name: pynvjitlink
Based on this, do we have what we need already?
Or do both Conda & Wheels sections need some updates?
That said, it is odd that build-conda has build_type: pull-request instead of build_type: branch like in build-wheels. Think we want:

-    build_type: pull-request
+    build_type: branch
Fixing with PR ( #43 )
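For concreteness, the fixed build-conda job would look roughly like this (a sketch only; see #43 for what actually landed):

build-conda:
  needs:
    - compute-matrix
  uses: rapidsai/shared-workflows/.github/workflows/[email protected]
  with:
    build_type: branch   # was: pull-request; publish-conda uses ${{ inputs.build_type || 'branch' }}
    build_script: "ci/build_conda.sh"
    matrix_filter: ${{ needs.compute-matrix.outputs.BUILD_MATRIX }}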
It looks like the build_type change noted above was all that was needed. Though please let us know if we are missing something.
Now that conda-forge has CUDA 12.3 ( conda-forge/cuda-feedstock#15 ), have submitted PR ( #50 ) to update to CUDA 12.3. Hopefully that closes out a few threads above.
Contributes to rapidsai/build-planning#72. Proposes replacing a use of `.get()` in the conda recipe with `[]`, to ensure that we get a loud build error if the `"project"` table is not present when `conda-build` reads `pyproject.toml`.
## Notes for Reviewers
I know that at some time in the past, the `.get()` was introduced as a possible solution for errors observed at build time: #33 (comment)
Pushed this to see if those issues are still present, and they don't seem to be!
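The change is essentially this (a hedged sketch of the recipe's jinja preamble, assuming it loads the file with conda-build's load_file_data; the real recipe may differ):

{% set pyproject = load_file_data("pyproject.toml") %}
# Before: a missing "project" table silently became an empty dict.
# {% set project = pyproject.get("project", {}) %}
# After: a missing "project" table fails the build loudly.
{% set project = pyproject["project"] %}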
This PR adds build scripts for creating and testing conda packages and the associated CI workflows.