Use Pixi for dependency management #4

Merged
merged 30 commits into from
Aug 16, 2024

jayqi
Member

@jayqi jayqi commented Jul 29, 2024

Sets up dependency management with Pixi instead of conda-lock / conda, including both generating lock files and managing the environment. Also prevents logging outside of smoke tests (resolves issue 83)

Generating lockfiles

The lockfile is now generated within a docker container, because of the known Pixi limitation that we can't solve cross platform pypi dependencies. To generate runtime/pixi.lock, the command make update-lockfile:

  1. Builds a docker image based on Dockerfile-lock. This new dockerfile just runs the command to generate pixi.lock. It doesn't install any dependencies or run the submission.
  2. Uses docker create to create a dummy container from the image without running it
  3. Copies pixi.lock from the dummy container back to the host
  4. Deletes the dummy container
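
The four steps above can be sketched as a short Python helper. This is only an illustration of the sequence, not the repository's actual Makefile recipe: the image name, the container name, and the location of pixi.lock inside the container are all hypothetical placeholders.

```python
import shlex
import subprocess


def lockfile_commands(
    image="lock-image",                   # hypothetical image tag
    container="lock-dummy",               # hypothetical container name
    dockerfile="runtime/Dockerfile-lock",
    context="runtime",
    lock_in_container="/tmp/pixi.lock",   # assumed path inside the image
    lock_on_host="runtime/pixi.lock",
):
    """Return the docker commands for the four steps, in order."""
    return [
        # 1. Build the image whose only job is to generate pixi.lock.
        ["docker", "build", "-t", image, "-f", dockerfile, context],
        # 2. Create a container from the image without running it.
        ["docker", "create", "--name", container, image],
        # 3. Copy the lockfile from the container back to the host.
        ["docker", "cp", f"{container}:{lock_in_container}", lock_on_host],
        # 4. Delete the dummy container.
        ["docker", "rm", container],
    ]


def update_lockfile(run=subprocess.run, **kwargs):
    """Run each step, stopping at the first command that fails."""
    for cmd in lockfile_commands(**kwargs):
        print("+", shlex.join(cmd))
        run(cmd, check=True)
```

Calling update_lockfile() would execute the commands with Docker; passing a different run callable makes it possible to inspect the sequence without Docker installed.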

Having a separate Dockerfile allows us to update the lockfile more quickly. If we used the existing Dockerfile, the full submission would run every time we need to update the lockfile.

make update-lockfile runs from scratch (with no existing pixi.lock) in about 2 minutes. I also tested that it works with some pypi packages included.

Outstanding

  • Check whether we still need test_lockfile.py -- try and install a package in both conda and pip, see if pixi lets us --> pixi resolves this for us, we don't need test_lockfile.py

    • The previous test_lockfile.py just checks whether there are both conda and pip versions of the same package.
    • It's possible that Pixi automatically resolves pip dependencies with conda dependencies in a way that conda-lock does not. (pixi list shows one entry for each package that has been installed. That one entry is either installed from pypi or conda.) Scrap work in this commit
  • Make sure entrypoint.sh runs python main.py in the correct environment

    • One change from conda-lock is that we use pixi run to run main.py. This means we also need to tell Pixi which environment to use, so we set CPU_OR_GPU as an environment variable in the Dockerfile and pass it to pixi run.
  • Update README (tracked in separate project issue)
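
As a rough sketch of the entrypoint item above, the command that entrypoint.sh needs to run can be assembled from CPU_OR_GPU like this. It assumes the variable holds the Pixi environment name ("cpu" or "gpu"); the lowercasing and the validation are illustrative choices, not something taken from the repository.

```python
import os

VALID_ENVIRONMENTS = ("cpu", "gpu")  # the two Pixi environments discussed here


def pixi_run_command(environ=None, script="main.py"):
    """Build the `pixi run` command for the environment named by CPU_OR_GPU."""
    environ = os.environ if environ is None else environ
    # Normalizing case is an assumption made for this sketch.
    env_name = environ.get("CPU_OR_GPU", "").strip().lower()
    if env_name not in VALID_ENVIRONMENTS:
        raise ValueError(
            f"CPU_OR_GPU must be one of {VALID_ENVIRONMENTS}, got {env_name!r}"
        )
    # `-e` selects the Pixi environment in which the script is run.
    return ["pixi", "run", "-e", env_name, "python", script]
```

Failing loudly when the variable is missing avoids silently falling back to the default environment, which is not meant to be used on its own.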


Background about Pixi from JQ

Pixi is the new thing in the Conda ecosystem that has been gaining momentum for a while. It's made by the team that makes Mamba. They claim to be production-ready.

What benefits does Pixi have?

  • Lockfiles are a primary part of their default workflow. There's a lot of focus on doing it well.
  • The UX is pretty nice. Conflict messages are pretty clear.
  • It's fast because it's written in Rust.
  • It does minimal updates to lockfiles, i.e., if you change something and existing versions in your lockfile already satisfy the constraint, it won't change them. conda-lock does not do this.

Some notes on implementation:

  • I've shoved common dependencies into a "base" feature that gets inherited by cpu and gpu. The default environment on its own shouldn't be used though.
  • I saw that PyTorch has a new thing where they have pytorch-cuda and cpuonly metapackages to help you pin, so I'm using those. It's how they do it in their official docs.

Running it yourself:

  • Install Pixi. I have it with brew install pixi on macOS.
  • To run the locking, you can run pixi ls or pixi tree which are basically no-op commands normally but will trigger it to check the lockfile for freshness.
    • If nothing happens, it means the lockfile satisfies current constraints.
    • If you want to force it to rerun, rm pixi.lock first.

Some helpful references:

Also, I saw the conda-lock maintainers are considering endorsing Pixi as a better default solution so that feels like more of a turnoff from using conda-lock.

Collaborator

@klwetstone klwetstone left a comment

@jayqi This is excellent. 👏 👏 An enormous thank you for digging into this and finding a much more efficient solution!!

Poking around, I agree and like the idea of switching to Pixi. It's faster and easy to work with, a really great suggestion.

Pixi is a full environment management tool. It supports Conda and PyPI packages but is not interoperable with Conda environments. That means we'll need to update our Dockerfile and entrypoint.sh script to use Pixi. I think this is probably fine: I expect pixi install -e gpu and pixi run -e gpu should work.

Before officially switching over, I'd like to poke around with making these updates to check if we run into any additional issues.

Review thread on pixi.toml (outdated, resolved)
@jayqi
Member Author

jayqi commented Jul 30, 2024

@klwetstone it'll also be good to throw in some PyPI packages just to make sure that works as expected.

@jayqi
Member Author

jayqi commented Jul 30, 2024

@klwetstone you should just take over this branch and push any commits you think make sense.

@klwetstone
Collaborator

Documenting steps updating docker container:

  • updated all commands
  • Installing using pixi leads to an error:
(base) root@02b4239deb18:/tmp# pixi install --manifest-path /tmp/pixi.toml -e gpu
⠴ creating environment 'gpu'
⠤ creating environment 'gpu'
    download & extract   [00:06:10] [━━━━━━━━━━━━━━━━━━━━] 7.89 GiB @ 21.83 MiB/s pytorch
    installing packages  [00:06:10] [━━━━━━━━━━━━━━━━━━━━]   294/294
  × failed to fetch pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0.tar.bz2
  ├─▶ an io error occurred
  ├─▶ failed to unpack `/root/.cache/rattler/cache/pkgs/pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0/info/files`
  ├─▶ failed to unpack `info/files` into `/root/.cache/rattler/cache/pkgs/pytorch-2.1.1-py3.10_cuda11.8_cudnn8.7.0_0/info/files`
  ╰─▶ No space left on device (os error 28)

We may need to change the base image from mambaorg/micromamba:1.5.3-bookworm-slim. With a slimmed-down base image, we may have enough space to install and use Pixi.

@jayqi
Member Author

jayqi commented Jul 30, 2024

@klwetstone Yeah, we don't need the Micromamba image because Pixi is entirely an alternative to Micromamba.

Looks like there are official Pixi Docker images: https://github.com/prefix-dev/pixi-docker

There's a bookworm-slim-based one and also Nvidia-based ones.

@jayqi
Member Author

jayqi commented Jul 30, 2024

(In case you didn't know, bookworm refers to a particular version of Debian, which is a Linux distribution, and slim is just the lightweight version of it.)

@klwetstone
Collaborator

klwetstone commented Jul 30, 2024

So weird -- I searched Docker Hub and that one didn't come up. I ended up making it work using nvidia/cuda:11.8.0-base-ubuntu22.04 as the base image and just installing Pixi (current Dockerfile on this branch).

@jayqi
Member Author

jayqi commented Jul 30, 2024

They're not on Docker Hub, they're on a GitHub Container Registry repository.

@jayqi
Member Author

jayqi commented Jul 30, 2024

@klwetstone I'd recommend using the images they already publish. https://github.com/prefix-dev/pixi-docker/pkgs/container/pixi

Firstly, you won't need to muck with installing Pixi. Secondly, that lets you easily pin to a particular version of Pixi, instead of installing the latest version each time that you build.

@klwetstone
Collaborator

Agreed -- thank you! I didn't think to search Github container registry in addition to docker hub

@jayqi
Member Author

jayqi commented Jul 30, 2024

> Agreed -- thank you! I didn't think to search Github container registry in addition to docker hub

The thing to do here is to just google "whatever docker image" instead of searching on Docker Hub. Docker Hub is a commercial service that costs money, so open source projects won't always be there.

@klwetstone klwetstone marked this pull request as draft July 30, 2024 22:05
@klwetstone
Collaborator

> @klwetstone it'll also be good to throw in some PyPI packages just to make sure that works as expected.

@jayqi I ran into some interesting snags here -- I'm continuing to work on debugging, curious if you have any initial thoughts about the right directions to investigate. When I add in pypi dependencies, pixi only runs in the docker container and errors on my Mac. I haven't fully confirmed, but it doesn't seem like this happens with conda-lock -- I'm not finding great documentation on what conda-lock actually does under the hood, so I'm not sure.

Some hypotheses I have for what might be going on here:

  • There's some other difference between pixi and conda-lock in their ability to check dependencies for a platform different than the current machine. Pixi has a bunch of commands to run things in the specified environment, so I wonder whether it always expects the current machine to be the same as the specified platform.
  • There could also be some snags in pixi's support of pypi.

I added to pixi.toml:

[feature.cpu.pypi-dependencies]
pytest = {version = "*"}
chromadb = { version = "*" }
sacremoses = { version = "*" }

Error when I run on my mac:

$ pixi tree --manifest-path runtime/pixi.toml --platform linux-64
  × Unable to solve pypi dependencies for the cpu environment because no compatible python interpreter can be installed for the current platform
   ╭─[4:13]
 3 │ channels = ["nvidia", "conda-forge", "pytorch", "xformers"]
 4 │ platforms = ["linux-64"]
   ·             ──────┬─────
   ·                   ╰── even though the projects does include support for 'osx-64'
 5 │ 
   ╰────
  help: Try converting your [pypi-dependencies] to conda [dependencies]

(the above runs fine in the docker container)

@jayqi
Member Author

jayqi commented Jul 31, 2024

Okay, quick fix is that I think you're going to have to add osx-64 to the platforms. We're using Pixi in a way that I think isn't 100% intended, so this complaint feels like something that isn't necessary but is just how it's set up right now.

@jayqi
Member Author

jayqi commented Jul 31, 2024

@klwetstone alternative idea: does it make sense to set up a Docker entrypoint that runs the locking inside a container? That way, we can still run the locking without needing to add macOS or Windows to the platforms. (While this PyPI kludge is a thing.) One possible other benefit there is that someone can run the locking without needing to install Pixi.

@jayqi
Member Author

jayqi commented Jul 31, 2024

Okay this is a known limitation in Pixi right now: prefix-dev/pixi#1130

Basically, because of the way Python packages work (packages can run arbitrary Python code to set their package metadata), Pixi needs a Python interpreter to resolve the PyPI packages. Because your OS (osx-64) is not listed in the platforms, Pixi is unable to install a Python that fits the requirements. By my understanding, this is a technically correct thing. I think it's possible they come up with something smart that can work around this in the future, or they relax some assumptions (I think conda-lock and pip-compile make assumptions that platform and/or Python versions don't matter), but for now this is a limitation.

The two approaches here would be:

  1. Add osx-64 (and maybe Windows if we want Windows participants to be able to run locking) to the platforms. This means that it'll make Pixi try to resolve stuff that is able to work on all 2 or 3 platforms unnecessarily (from the POV that we only care about Linux for the code execution).
  2. Run locking in a Linux Docker container.
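
The rule described above (PyPI resolution only works when the host platform is listed in the manifest) can be sketched as a small check that picks between the two approaches. This mirrors the reasoning in this thread only; it does not call Pixi, and the function names are hypothetical.

```python
import platform

# Mapping from (platform.system(), platform.machine()) to conda platform names.
_SUBDIRS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def host_subdir(system=None, machine=None):
    """Return the conda platform name for this machine, or None if unknown."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return _SUBDIRS.get((system, machine))


def locking_strategy(manifest_platforms, host=None):
    """Return "local" if the host is a listed platform, else "docker"."""
    host = host_subdir() if host is None else host
    # With PyPI dependencies, the solve needs a Python interpreter that can be
    # installed on the host, so the host platform must be in the manifest.
    return "local" if host in manifest_platforms else "docker"
```

With platforms = ["linux-64"], a macOS host falls through to "docker", which corresponds to approach 2.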

@klwetstone
Collaborator

klwetstone commented Aug 1, 2024

Thank you this is extremely helpful!!

  1. Add osx-64 (and maybe Windows if we want Windows participants to be able to run locking) to the platforms. (per your suggestion)

I don't think this will work with our GPU environment. When I add osx-64, I'm getting:

  × failed to solve the conda requirements of 'gpu' 'osx-64'
  ╰─▶ Cannot solve the request because of: No candidates were found for cudatoolkit
      ==11.8.

This makes sense, because we can't install cudatoolkit on a mac.

I also see an option 3, which is to stick with conda-lock for now and get it to run more efficiently by significantly simplifying the requirements (a la this comment). I'm going to see if I can get option 3 working for now, mainly in the interest of time so I can test some different node sizes.

I do think option 2 (using pixi) is the better long term solution, because conda-lock might just get extremely slow again once participants start adding packages. We could have another Dockerfile that just updates and checks the lock files, and run that dockerfile with make update-lockfiles.

@jayqi
Member Author

jayqi commented Aug 1, 2024

@klwetstone FWIW, conda-lock being slow like what you experienced is not normal behavior, and should either be considered a bug or a pathological edge case that should be addressed rather than lived with.

@klwetstone
Collaborator

Thanks! For my understanding, by "pathological edge case" do you mean it's more than just unresolvable dependencies, i.e., it may be related to my installation of conda, and not just something that can be solved by loosening dependencies? Either way, I think it's a good idea to test out running conda-lock on some super-basic yamls to check that it runs in a reasonable time.

@jayqi
Member Author

jayqi commented Aug 2, 2024

"related to my installation of conda" = bug

"pathological edge case" = your specific set of dependencies and versions and the solver interact in a specific way where the algorithm works in an especially inefficient way. https://en.wikipedia.org/wiki/Worst-case_complexity

@klwetstone klwetstone marked this pull request as ready for review August 6, 2024 17:53
@klwetstone
Collaborator

@jayqi I think I have everything set up to use pixi instead of conda-lock. Do you have time to review and make sure the new approach makes sense? Details are in the updated PR description.

* Fix Dockerfiles

* Fix permissions and directories and stuff

* Update test command

* Fix command ordering

* Use clean cache command

* Add maximize build space action

* Add more root reserve space

* Remove unwanted software directly

* Add some diagnostics

* Print out pixi info

* Fix typo

---------

Co-authored-by: Jay Qi <[email protected]>
@jayqi
Member Author

jayqi commented Aug 9, 2024

There's still something weird and horrible going on with the GPU image. I've spent too much time on this already, but just some loose ideas:


Test is failing:

   × The platform you are running on should at least have the virtual package
  │ __cuda on version 11.8, build_string: 0

What is this? This error comes from Pixi, and I think __cuda is a Pixi thing: it's a "virtual package" that Pixi uses to track whether CUDA stuff got installed or not. If you run pixi info (I do this in the Docker build command, you can see it in the logs here), it'll list out the virtual packages it detects.

In theory, if we have CUDA installed correctly, Pixi should "detect" this as an available virtual package. If you google about this, there have been bugs in the past but nothing obviously still open that is relevant.
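
One way to check what a detector could see from inside the container is to ask the NVIDIA driver library directly. The sketch below is not Pixi's detection code; it assumes that detecting __cuda depends on the driver library (libcuda.so.1 on Linux) being loadable where Pixi runs, which is typically not the case during a docker build step. The function names are hypothetical.

```python
import ctypes


def decode_cuda_version(raw):
    """Decode CUDA's integer version (1000 * major + 10 * minor)."""
    return raw // 1000, (raw % 1000) // 10


def cuda_driver_version(library="libcuda.so.1"):
    """Return (major, minor) reported by the CUDA driver, or None."""
    try:
        driver = ctypes.CDLL(library)
    except OSError:
        # Driver library not present, e.g. no GPU runtime in this container.
        return None
    # Both driver API calls return 0 (CUDA_SUCCESS) when they succeed.
    if driver.cuInit(0) != 0:
        return None
    raw = ctypes.c_int(0)
    if driver.cuDriverGetVersion(ctypes.byref(raw)) != 0:
        return None
    return decode_cuda_version(raw.value)
```

If cuda_driver_version() returns None inside the container where the test runs, that would be consistent with Pixi not finding a __cuda virtual package there, though it does not prove that is the cause of the failure.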


In general, our GPU stuff is kind of brittle.

  • We're mixing conda-forge and nvidia channel packages which I think may lead to weird things
    • It feels like, from general googling, that people would like to prioritize versions in the nvidia channel over conda-forge (since Nvidia maintains those)
  • This whole ecosystem with conda-forge, nvidia, and pytorch channels seems really confusing and brittle. They also keep changing what metapackages they're using to pin versions so things end up with very brittle compatibility.
    • For example, it seems like cudatoolkit is maybe outdated but also some things like tensorflow-gpu seem to depend on it.
  • Tensorflow via conda-forge gives me less confidence because I don't think Tensorflow maintainers actually maintain it. We might just want to install it with PyPI because that's their official instruction, if tensorflow causes problems.

@r-b-g-b r-b-g-b left a comment

Looks great!

  • I can re-run make update-lockfile and nothing changes
  • If I delete pixi.lock it regenerates without error (nicely, regenerating from scratch will result in a different pixi.lock, proving that running on top of an existing pixi.lock doesn't change anything it doesn't need to)
  • It's fast!

One suggested change, on principle more than anything else. Otherwise, looks good to me!

Review thread on runtime/Dockerfile-lock (outdated, resolved)
@klwetstone
Collaborator

@r-b-g-b ready for a final look!

@r-b-g-b r-b-g-b left a comment

Smallest nit! Looks really good!

Review thread on runtime/Dockerfile-lock (outdated, resolved)
@klwetstone klwetstone merged commit 9824c1f into main Aug 16, 2024
2 checks passed
@klwetstone klwetstone deleted the jyq-pixi branch August 16, 2024 13:22
@klwetstone klwetstone restored the jyq-pixi branch August 16, 2024 15:48
@klwetstone klwetstone deleted the jyq-pixi branch August 16, 2024 15:48