
Move libtorch bits to PREFIX & enable cuDSS, cuSPARSELt, CUPTI on win; fix duplicate .pyc versions #328

Merged
merged 41 commits into conda-forge:main from mgorny:more-win on Jan 25, 2025

Conversation

mgorny
Contributor

@mgorny mgorny commented Jan 19, 2025

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

Fixes #319
Fixes #320
Fixes #321
Fixes #327
Fixes #329

@conda-forge-admin
Contributor

conda-forge-admin commented Jan 19, 2025

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe/meta.yaml:

  • ℹ️ The recipe is not parsable by parser conda-souschef (grayskull). This parser is not currently used by conda-forge, but may be in the future. We are collecting information to see which recipes are compatible with grayskull.
  • ℹ️ The recipe is not parsable by parser conda-recipe-manager. The recipe can only be automatically migrated to the new v1 format if it is parseable by conda-recipe-manager.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12939294964. Examine the logs at this URL for more detail.

@h-vetinari
Member

Would also be good to address #327 in this PR.

@hmaarrfk hmaarrfk marked this pull request as draft January 19, 2025 23:31
@hmaarrfk
Contributor

(draft until all builds are enabled again)

@h-vetinari
Member

So the extra CUDA deps on windows seem to work fine 🥳

@mgorny
Contributor Author

mgorny commented Jan 20, 2025

Okay, I'll try adding some logic for that here later today.

@mgorny
Contributor Author

mgorny commented Jan 20, 2025

Ok, I've decided to instead start by adding error handling to .bat, to see which of the existing commands fail.

By the way, any chance you could give me the power to run CI myself?

@h-vetinari
Member

By the way, any chance you could give me the power to run CI myself?

You already should have AFAIU, and I don't have the power to change it further. This should be a matter of Quansight/open-gpu-server#52 plus a sync of the setup to https://github.com/conda-forge/.cirun (which happened), so I'm really not sure where this goes wrong. Pinging @jaimergp & @aktech

One conceivable angle is that the lack of authorization for the windows agents prevents you from starting any job at all (though IIRC you already weren't able to start things before we merged windows support). For that, you could make a PR like conda-forge/.cirun#27

Member

@h-vetinari h-vetinari left a comment

I'd prefer to add

if %ERRORLEVEL% neq 0 exit 1

to calls that can fail, rather than adding || goto error everywhere. There's no functional difference AFAIK, and although both styles have their pros & cons, the goto-variant is quite rare in conda-forge (a quick search shows 170 feedstocks using goto, while ~1.6k feedstocks use an errorlevel variant).
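
For illustration only, a minimal sketch of the two styles side by side (the cmake call is just a placeholder for any command in bld.bat that can fail):

:: errorlevel style: check the exit code right after each call that can fail
cmake --build . --target install
if %ERRORLEVEL% neq 0 exit 1

:: goto style: jump to a shared error label on any failure
cmake --build . --target install || goto :error
exit /b 0

:error
echo Build step failed
exit 1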

recipe/bld.bat (Outdated)
@aktech

aktech commented Jan 20, 2025

@mgorny You need to add your name here: https://github.com/conda-forge/.cirun/blob/218150c90899c7dfeb771d0bd9b176ca191848f3/.access.yml#L89, like this PR: conda-forge/.cirun#27. This is subject to @wolfv's approval, I believe, as they (prefix.dev) are the ones sponsoring the windows machine.

and you need to be in any of these roles for this repo to be able to trigger workflows that involve GPU CI runs. cc @jaimergp

Member

@h-vetinari h-vetinari left a comment

We're going to need existence tests for the DLLs to be under %LIBRARY_BIN%.

The tests are already there, and just need to be modified, e.g.

- if not exist %SP_DIR%\torch\lib\{{ each_lib }}.dll exit 1 # [win]

(similar below)
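
A hedged sketch of what the adjusted line might look like once the DLLs live directly under %LIBRARY_BIN% (keeping the {{ each_lib }} loop already used in the recipe):

- if not exist %LIBRARY_BIN%\{{ each_lib }}.dll exit 1  # [win]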

@h-vetinari
Member

Thanks for the quick response @aktech!

and you need to be in any of these roles for this repo to be able to trigger workflows that involve GPU CI runs.

Ah, perhaps it's just a question of making @mgorny a maintainer here? That's a hypothesis we can test easily (given your involvement here, I'm assuming you have nothing against this @mgorny) 😉

@h-vetinari
Member

By the way, any chance you could give me the power to run CI myself?

You already should have AFAIU, and I don't have the power to change it further.

Well, perhaps I did have the power after all: 4ff92b3 ;-)

We'll find out on your next push here!

@mgorny
Contributor Author

mgorny commented Jan 20, 2025

Ok, I'll replace || goto :error with if errorlevel checks, then push and take a break.

@h-vetinari
Member

h-vetinari commented Jan 20, 2025

We'll find out on your next push here!

Sorry, I was talking about the linux builds. For the windows agents, we'll definitely still need to wait for conda-forge/.cirun#32

@h-vetinari
Member

Ok, I'll replace || goto :error with if errorlevel checks

You're probably aware, but just for completeness (and because batch is an insane language): if errorlevel is a very different construct from if %errorlevel%. In particular, the more "advanced" relational operators like NEQ are not supported; more details
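
For completeness, a minimal sketch of the difference (the command is a placeholder, not taken from the recipe):

some_build_step.exe

:: "if errorlevel 1" is true whenever the exit code is 1 or higher,
:: and does not accept comparison operators such as NEQ
if errorlevel 1 exit 1

:: "if %ERRORLEVEL% neq 0" expands the variable to its value first, so the
:: comparison operators (EQU, NEQ, LSS, GEQ, ...) are available
if %ERRORLEVEL% neq 0 exit 1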

@mgorny
Contributor Author

mgorny commented Jan 20, 2025

Ah, okay, sorry. Either case, I really need to take a break now, so maybe it'll be done when I get back.

@h-vetinari
Member

Ah, okay, sorry. Either case, I really need to take a break now, so maybe it'll be done when I get back.

No worries. Perhaps until then the PR is merged. ;-)

When you do get back, please squash f9fe7bf & ad726bc to avoid the back-and-forth.

@mgorny
Contributor Author

mgorny commented Jan 23, 2025

Don't you want to enable the remaining Windows builds? I wanted to add removal of TH_BINARY_BUILD too. If that's fine, I'll push it along with rerender to restore all build targets.

@h-vetinari
Member

Don't you want to enable the remaining Windows builds?

Of course, but I wanted to solve all the current issues first, and also ask @hmaarrfk and @isuruf for review, if they have time. CC also @danpetry if interested.

I wanted to add removal of TH_BINARY_BUILD too. If that's fine, I'll push it along with rerender to restore all build targets.

How high is your confidence that this will work first try? I think this PR is already doing a lot, and I'd like to get it out so we can start building dependent packages and test out whether things work as intended. I wouldn't be surprised if follow-ups turn out to be necessary anyway, so I'd prefer to put TH_BINARY_BUILD into the next round.

@h-vetinari h-vetinari changed the title Attempt reenabling cuDSS, cuSPARSELt and CUPTI on Windows Move libtorch bits to PREFIX & enable cuDSS, cuSPARSELt, CUPTI on win; fix duplicate .pyc versions Jan 23, 2025
@h-vetinari
Member

FWIW, #329 is now fixed as well, so I think this is ready (modulo a revert of 3ebf1a9 + a rerender, which I'm holding off on for now, in case we need some more rounds after feedback).

@h-vetinari h-vetinari marked this pull request as ready for review January 23, 2025 22:05
@h-vetinari
Member

Nevermind, I found an issue with the windows CUDA builds that needs fixing anyway, so I'm pushing a rerendered build, but will cancel everything but the windows+CUDA job. And it'll also be good to see if CI on osx passes with the couple of relevant changes here.

@h-vetinari
Member

the win+cuda build passed fine, and so did the osx builds. I've launched aarch+CUDA as a final smoketest, but I expect this PR to be ready. Please let me know if there are comments @conda-forge/pytorch-cpu, otherwise I'm looking to merge this soon.

@mgorny
Contributor Author

mgorny commented Jan 24, 2025

I'd still do the TH_BINARY_BUILD thing, but perhaps after all builds pass, so we don't bump the revision twice (or we can defer that to the anaconda sync PR) — i.e. basically: add it; if CI passes, merge with it; if CI fails, merge the previous commit.

@h-vetinari
Member

Can you raise a PR on your fork against this branch? If you enable actions on your repo, this should actually run on the open-gpu server (at least it does by default on my fork). That way we can keep the CI runs separate and don't have to roll back if it goes wrong.

@mgorny
Contributor Author

mgorny commented Jan 25, 2025

Can you raise a PR on your fork against this branch? If you enable actions on your repo, this should actually run on the open-gpu server (at least it does by default on my fork). That way we can keep the CI runs separate and don't have to roll back if it goes wrong.

Done that in mgorny#4. It seems that it created the 16 jobs, but I don't see the CIrun gating job, so I don't think they'll start. But I have another idea: I'll just open another PR to this repo, adding the extra commits, and you'll merge whichever works.

@mgorny mgorny mentioned this pull request Jan 25, 2025
@h-vetinari
Member

But I have another idea: I'll just open another PR to this repo, adding the extra commits, and you'll merge whichever works.

Sounds good to me!

@h-vetinari
Member

h-vetinari commented Jan 25, 2025

Any comments here @conda-forge/pytorch-cpu? CI in #331 (which adds one more change on top of this PR) is looking good and I'd like to merge this soon.

@h-vetinari
Member

Alrighty! Thanks a lot for the reviews! Post-merge comments are still very welcome; we'll have yet another round with #318 coming right up anyway, where we can take this up. My goal would be to wrap up that PR in a few days as well and then get 2.6 out the door. 🙃

@h-vetinari h-vetinari merged commit 1c3a925 into conda-forge:main Jan 25, 2025
21 of 27 checks passed
@h-vetinari
Member

h-vetinari commented Jan 25, 2025

Ah dammit, the workaround for conda-forge/conda-forge-pinning-feedstock#6910 causes windows to fail here 😑

Checking out the ref
  "C:\Program Files\Git\bin\git.exe" checkout --progress --force -B main refs/remotes/origin/main
  Error: error: unable to create file .ci_support/linux_64_blas_implgenericc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13github_actions_labelscirun-openstack-cpu-2xlargeis_rcFalse.yaml: Filename too long
  Error: error: unable to create file .ci_support/linux_64_blas_implgenericc_compiler_version13channel_targetsconda-forge_maincuda_compilerNonecuda_compiler_versionNonecxx_compiler_version13github_actions_labelscirun-openstack-gpu-2xlargeis_rcFalse.yaml: Filename too long

(logs, originally for commit 1c645fa, now force-pushed away)

@mgorny mgorny deleted the more-win branch January 26, 2025 02:40