New lock file format #1209

Open
wolfv opened this issue Oct 12, 2021 · 55 comments

Comments

@wolfv
Member

wolfv commented Oct 12, 2021

It would be great to have a new lockfile format.
The current conda lockfile format (the explicit environment format) has several shortcomings: it is an ad-hoc format, and it only supports MD5 checksums (and, I think, not even by default), whereas SHA256 would be much better.
The command to export an explicit environment in conda is conda list --explicit [--md5]

Micromamba already improves on this by changing the command to micromamba env export --explicit [--no-md5] (i.e. it uses the env subcommand and includes MD5 hashes by default).

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: osx-arm64
@EXPLICIT
https://conda.anaconda.org/conda-forge/osx-arm64/argp-standalone-1.3-h3422bc3_0.tar.bz2#b744f29f1ef63fcedcb63b45d7ceed4a
https://conda.anaconda.org/conda-forge/osx-arm64/git-lfs-2.13.3-hce30654_0.tar.bz2#d0a6dda324b5d970c296dca838965193
...
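
For reference, the explicit format shown above is simple enough to parse in a few lines. The following is a minimal Python sketch, assuming each package line is a URL optionally followed by `#` and an MD5 digest. The names `parse_explicit` and `ExplicitEntry` are illustrative and are not part of conda or micromamba.

```python
from typing import NamedTuple, Optional


class ExplicitEntry(NamedTuple):
    url: str
    md5: Optional[str]  # None when the file was exported without hashes


def parse_explicit(text: str) -> list:
    """Parse the text of an @EXPLICIT file into ExplicitEntry items."""
    entries = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines carry no package data
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if not seen_marker:
            raise ValueError("package URL found before the @EXPLICIT marker")
        # Everything after the first '#' is treated as the MD5 digest.
        url, _, md5 = line.partition("#")
        entries.append(ExplicitEntry(url, md5 or None))
    return entries
```

Note that the hash is only attached as a URL fragment, which is part of why the format is hard to extend with SHA256 or signatures.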

I am thinking it would be nice to replace this with a proper YAML based format.

I am proposing something like:

metadata:
  spec: explicit-1.0
  description: ... # optional

name: myenv
# channels: ?
explicit-packages:
  linux-64:
  - name: xyz
    version: 0.15.0
    resolved: https://conda.anaconda.org/conda-forge/linux-64/xyz-0.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
    signature: ... ? # we need to also have certain metadata to validate signatures, though
  - name: pip
    version: 1.15.0
    resolved: https://conda.anaconda.org/conda-forge/noarch/pip-1.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
  osx-64:
  - name: xyz
    version: 0.15.0
    resolved: https://conda.anaconda.org/conda-forge/osx-64/xyz-0.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
  - name: abc
    version: 0.16.0
    resolved: https://conda.anaconda.org/conda-forge/osx-64/abc-0.16.0-had123.tar.bz2
    sha256: 123jk1lk23j1kl2j3kj12k3jlj1lk2

The explicit packages would contain one list per (supported) subdir. Each list would be the full environment resolution (including noarch packages), in the correct installation order (as in today's lockfiles).
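
To illustrate how a consumer might use such a file, here is a hedged Python sketch. It assumes the YAML has already been parsed into a plain dict with the field names from the proposal (`explicit-packages`, `resolved`, `sha256`). The helper names and the example URL are hypothetical, and the digest is computed in place only so the example is self-consistent.

```python
import hashlib

# Hypothetical parsed form of the proposed YAML for a single subdir.
lock = {
    "explicit-packages": {
        "linux-64": [
            {
                "name": "xyz",
                "version": "0.15.0",
                "resolved": "https://example.invalid/linux-64/xyz-0.15.0-0.tar.bz2",
                "sha256": hashlib.sha256(b"xyz-bytes").hexdigest(),
            },
        ]
    }
}


def packages_for(lock: dict, subdir: str) -> list:
    """Return the locked entries for one subdir, in install order."""
    return lock["explicit-packages"].get(subdir, [])


def verify_sha256(entry: dict, payload: bytes) -> bool:
    """Check downloaded bytes against the digest recorded in the lock entry."""
    return hashlib.sha256(payload).hexdigest() == entry["sha256"]
```

Because each subdir list is already in installation order, a consumer would only need to iterate over it, download each `resolved` URL, and verify the digest before installing.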

@mariusvniekerk
Contributor

This WIP PR conda/conda-lock#106 has a bunch of handy things that could also go into metadata

@mariusvniekerk
Contributor

This lockfile spec should have a version and ideally a reference to some standard jsonschema representation of the structure.

version: 1
$schema: https://some/url/for/schema_v1.json
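
A consumer could use the `version` field to refuse files it does not understand. A minimal Python sketch, assuming the document has already been parsed into a dict and that this hypothetical consumer only supports version 1:

```python
SUPPORTED_VERSIONS = {1}  # assumption: this consumer only understands version 1


def check_lockfile_version(doc: dict) -> int:
    """Return the lockfile version, or raise if it is missing or unsupported."""
    version = doc.get("version")
    if version is None:
        raise ValueError("lockfile has no 'version' field")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"unsupported lockfile version {version!r}; "
            f"supported: {sorted(SUPPORTED_VERSIONS)}"
        )
    return version
```

Validation against the JSON schema referenced by `$schema` would be a separate step and is not shown here.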

@maresb
Contributor

maresb commented Oct 12, 2021

Ooh, this is very exciting!!!

My thoughts regarding the lockfile metadata are that I'd like my lockfiles to be self-documenting. For instance, I'd like them to know how they were created, for example with which command. I'd also like to be able to add "comments" as explanation for colleagues. That way by looking at the lockfile, it'll be obvious what it is, where it came from, and how to update it.

I think it's useful to be able to choose which fields to include or exclude. For instance, some people may find it useful to include the timestamp and the username, but others might find the timestamp annoying with git, or might not want to leak their username.

I have also realized that the feature I'd really like is to be able to stick my "non-explicit" dependencies in the metadata so that I can run a command like conda-lock update environment.lock and have it rerun the solver and upgrade the lockfile in-place.

I was not bold enough to propose a new format, so I have been working with a header consisting of commented yaml. Obviously a proper YAML format like this would be better.

I have things mostly implemented; I'm just working on a system for versioning the metadata generation process so that it can be extended or modified in the future. I wanted to finish it this weekend but I ended up not having enough time.

@wolfv
Member Author

wolfv commented Oct 12, 2021

Yeah, we could make this format a superset of the existing YAML environment files, so that you could have

name: ...
metadata:
...
channels:
- conda-forge
- bioconda
dependencies:
- abc >0.5
- xyz =1.15
package-lock:
  linux-64:
     - ...

then some micromamba command could automatically update the lockfiles + certain metadata keys.

@maresb
Contributor

maresb commented Oct 12, 2021

I think I already tested at some point whether or not conda would accept an extra top-level item in the environment file (like metadata), and unfortunately it didn't work. Thus we're unfortunately looking at a breaking change. (But who cares about conda anymore? 😉)

EDIT: Or was it mamba that didn't work??? Sorry, I take it back. I'm not sure anymore...

@wolfv
Member Author

wolfv commented Oct 12, 2021

hmm, I don't think micromamba would complain about extra keys. With mamba, we're just using conda code though, so it might happen, IDK! :)

@maresb
Contributor

maresb commented Oct 12, 2021

For a normal environment file it seems to produce a warning.

$ conda env create -n testenv --file=env.yaml

EnvironmentSectionNotValid: The following section on '/env.yaml' is invalid and will be ignored:
 - metadata

For an explicit lockfile, I'm getting:

CondaValueError: invalid package specification: metadata: asdf

@wolfv
Member Author

wolfv commented Oct 12, 2021

Yeah, in explicit lockfiles you need to use a comment # metadata: whatever...

@mariusvniekerk
Contributor

@maresb yeah, none of this exists yet, we're designing what we want from a lockfile format for conda.

@maresb
Contributor

maresb commented Oct 13, 2021

Here is a summary of what I came up with in my PR:

conda-lock-metadata:
  about: This lockfile was generated by conda-lock to ensure reproducibility.
  comment: |-
    Run the following command to update this project's dependencies.
  command: conda-lock -f environment.yml --metadata=all
  command_with_path: /root/conda/envs/conda-lock/bin/conda-lock -f environment.yml --metadata=all
  conda_lock_version: 0.11.3.dev0+gf2ba8d4.d20210904
  created_by: root
  input_hash: f15a045753a401da73dd7c1693fd031e0ad41c0b4c9ca8545c0a8ab56c21d16c
  platform: win-64
  timestamp: 2021-09-05 23:43:18+02:00
  metadata_version: v1
  dependencies:
    - mamba
    - conda-lock

On the command line, you should specify --metadata=v1,about,platform,command,dependencies or similar to select desired fields and specify their order.
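
The field selection described above could be sketched as follows in Python. This is not the code from the PR. It assumes that a leading token such as `v1` names the metadata version and that every other token is a field name, and the function name `select_metadata` is hypothetical.

```python
def select_metadata(available: dict, spec: str) -> dict:
    """Keep only the requested metadata fields, in the requested order."""
    tokens = [t.strip() for t in spec.split(",") if t.strip()]
    selected = {}
    for token in tokens:
        if token[0] == "v" and token[1:].isdigit():
            # Assumption: tokens like "v1" select the metadata version.
            selected["metadata_version"] = token
        elif token in available:
            selected[token] = available[token]
        else:
            raise KeyError(f"unknown metadata field: {token!r}")
    return selected
```

Since Python dicts preserve insertion order, the order of the tokens determines the order of the fields when the result is written out. Fields that were not requested, such as the username or timestamp, never reach the lockfile.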

@mariusvniekerk
Contributor

mariusvniekerk commented Oct 15, 2021

Thinking a bit through some of the human-consumable parts of this: would we want something like this instead?

metadata:
  spec: explicit-1.0
  description: ... # optional
  channels:
    - conda-forge

name: myenv
packages:
  # probably will be alphabetically ordered
  xyz:
    linux-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/linux-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
      signature: ... ? # we need to also have certain metadata to validate signatures, though
    osx-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  pip:
    linux-64:
      version: 1.15.0
      resolved: https://conda.anaconda.org/conda-forge/noarch/pip-1.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  abc:
    osx-64:
      version: 0.16.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/abc-0.16.0-had123.tar.bz2
      sha256: 123jk1lk23j1kl2j3kj12k3jlj1lk2
install_order:
  linux-64:
    - xyz
    - pip
  osx-64:
    - xyz
    - abc    

Noarch packages will still be repeated per platform, since there may be a platform-specific variant of something that is usually noarch.

By grouping the packages together we make it easier to review updates to lockfiles as when you relock you can ensure that all the versions you expect to move, move.
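
With packages grouped by name, the installer has to combine `packages` and `install_order` to recover the per-platform list. A minimal Python sketch, assuming the YAML above has been parsed into a dict; `resolve_install_list` is a hypothetical helper name.

```python
# Trimmed-down parsed form of the grouped layout proposed above.
lock = {
    "packages": {
        "xyz": {"linux-64": {"version": "0.15.0"}, "osx-64": {"version": "0.15.0"}},
        "pip": {"linux-64": {"version": "1.15.0"}},
        "abc": {"osx-64": {"version": "0.16.0"}},
    },
    "install_order": {"linux-64": ["xyz", "pip"], "osx-64": ["xyz", "abc"]},
}


def resolve_install_list(lock: dict, platform: str) -> list:
    """Return (name, entry) pairs for one platform, in install order."""
    if platform not in lock["install_order"]:
        raise KeyError(f"lockfile has no install_order for platform {platform!r}")
    result = []
    for name in lock["install_order"][platform]:
        entry = lock["packages"].get(name, {}).get(platform)
        if entry is None:
            # install_order and packages disagree: the lockfile is inconsistent.
            raise ValueError(f"{name!r} is ordered but not locked for {platform!r}")
        result.append((name, entry))
    return result
```

The trade-off is that the file is easier to review but carries two structures that must stay consistent, which is why the sketch checks for entries that are ordered but not locked.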

@wolfv
Member Author

wolfv commented Oct 15, 2021

I would also like to make it a superset of a regular yaml environment file, by the way.

@maartenbreddels

you could also consider blake3 instead of sha256, I think it's much faster (parallel/multithreaded).

@wolfv
Member Author

wolfv commented Oct 15, 2021

I don't think the hashing speed matters so much here. One nice thing about sha256 is that we can directly pull from an OCI registry with it.

@mariusvniekerk
Contributor

> I would also like to make it a superset of a regular yaml environment file, by the way.

I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.

@maresb
Contributor

maresb commented Oct 15, 2021

> I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.

@mariusvniekerk I know it's a bit dangerous, and I've been debating this point with myself for a while.

  • I feel convinced that there could be a substantial benefit to having everything in the same file. The benefit is that it's more intuitive to have everything in one place.
  • On the other hand, I think the confusion of having separate files could be mostly mitigated by explanatory comments and metadata. (That was the motivation behind my PR.)

The fundamental problem that I'm trying to solve is as follows:

I'm working on some project, generate a lockfile for it, but then I forget to document how I generated the lockfile. I move on to something else. Some weeks/months later, I return to the project. I'd like to update the lockfile, but I forgot the exact conda-lock command that I ran to generate it. So I have to study the documentation and recreate the command I used to create it.

Put a different way, a lockfile is supposed to guarantee reproducibility of the environment. I think it would be great if the lockfile could also guarantee its own updatability. (For reproducibility I opened #1214.)

@mariusvniekerk
Contributor

Oh, I'm 100% for stuffing as much metadata into the lockfile as possible for reproducibility, but I do not want the ability to accidentally use a lockfile for something it's not for.

Basically every language community that has lockfiles as a core concept makes the output of the locking process a separate file with its own dedicated format (cargo, go.mod, yarn, etc.).

@maresb
Contributor

maresb commented Oct 15, 2021

I'd like to put the environment file into the lockfile's metadata. As soon as this has been done, it becomes extremely tempting to edit that and/or use that copy of the environment file as a new basis for generating the lockfile.

Where do we draw the line? Do we say that we can include a copy of the environment file, but we refuse to acknowledge that copy as machine-readable?

@mariusvniekerk
Contributor

All for dumping the source files into the metadata for the lock. It can even be machine readable. But once you make that editable by a human, bad things will happen.

@maresb
Contributor

maresb commented Oct 15, 2021

I'm not sure I understand your -1 then...

Let's say we define our new lockfile format which includes the source environment.yaml file. Then on the conda-lock side we implement conda-lock update environment.lock.

Unless we somehow provide some deterrent, people will then naturally delete the original environment.yaml file and edit the dependencies from environment.lock. (It's a natural thing to do, especially to maintain a single source of truth.)

You say that bad things will now happen. What specifically, data corruption? How can we prevent/discourage those bad things?

@mariusvniekerk
Contributor

What happens is that users will just edit the user-editable part of the lock file and not update it. At that point the lock is entirely a lie.

@maresb
Contributor

maresb commented Oct 16, 2021

Thanks! Now I understand.

One potential mitigation for this problem could be to include a checksum based on the dependencies from which the lock was generated. Any program which installs a lockfile should verify this checksum. In case it doesn't match, scream "These dependencies are a lie!!!" and refuse to do anything until the lockfile is updated.
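
The checksum idea could look roughly like this in Python. This is a sketch under stated assumptions: the dependencies and channels are stored at the top level of the parsed lockfile, the digest lives in `metadata.input_hash` (a field name borrowed from the metadata example earlier in the thread), and the canonical serialization is an arbitrary choice made here for illustration.

```python
import hashlib
import json


def dependencies_checksum(channels: list, dependencies: list) -> str:
    """Hash the human-edited inputs in a canonical form."""
    canonical = json.dumps(
        # Channel order is kept (it can express priority); specs are sorted.
        {"channels": list(channels), "dependencies": sorted(dependencies)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_lock_is_fresh(lock: dict) -> None:
    """Refuse to proceed if the inputs no longer match the recorded hash."""
    expected = lock["metadata"]["input_hash"]
    actual = dependencies_checksum(lock["channels"], lock["dependencies"])
    if actual != expected:
        raise RuntimeError(
            "dependencies changed since the lock was generated; "
            "update the lockfile before installing"
        )
```

An installer would call `check_lock_is_fresh` before touching the package list, so that an edited but not re-locked file fails loudly instead of installing stale packages.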

This would require the cooperation of any program which can install a lockfile. The programs I'm aware of are Conda, Mamba, Micromamba, and conda-lock. Among the two of you, we have pretty good coverage in here! 🤣

@wolfv
Member Author

wolfv commented Oct 16, 2021

I've just started to work on a cmake-micromamba extension that will allow CMake users to directly call micromamba to create an environment -- and I realized that a lockfile will be quite useful for this! :)

@jvansanten
Contributor

I found my way here via a hint from @wolfv at PackagingCon, and would also like to see a richer, structured, multi-platform lock file. Here are some more fields that would be useful to include for each package, mostly to support extensions to conda-lock.

To support optional subsets of packages that need to be mutually compatible, but that you may not want to install in some contexts (e.g. installing dev dependencies in CI, but only required dependencies in production):

  • optional: bool = False
  • category: str = "main"

(This is mostly relevant in the context of requirements parsed out of a pyproject.toml)

To support pip interoperability:

  • manager: Literal["conda", "pip"] = "conda"

(In pip mode, the url would point to a wheel or sdist rather than a conda package)
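
As a rough illustration of how these fields might be consumed, here is a Python sketch. The `LockedPackage` class and both helpers are hypothetical. The rule that optional packages are installed only when their category is requested is an assumption made for the example, not something specified above.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class LockedPackage:
    name: str
    url: str  # conda package URL, or a wheel/sdist URL when manager == "pip"
    manager: Literal["conda", "pip"] = "conda"
    optional: bool = False
    category: str = "main"


def select(packages: list, extra_categories: tuple = ()) -> list:
    """Keep required packages plus optional ones from the requested categories."""
    return [p for p in packages if not p.optional or p.category in extra_categories]


def split_by_manager(packages: list) -> tuple:
    """Return (conda_packages, pip_packages), preserving the original order."""
    conda = [p for p in packages if p.manager == "conda"]
    pip = [p for p in packages if p.manager == "pip"]
    return conda, pip
```

In production one would call `select(packages)`, while CI would call `select(packages, ("dev",))`. The resulting list can then be split so that conda packages are installed before the pip ones.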

@baszalmstra
Contributor

I think the format should be human-readable. To me, one of the key requirements for the lock file format should be that it's diffable. Lock files tend to become difficult to grasp, but resolving conflicts in them should still be possible. Cargo went through a similar process; I think we can learn from that!

@jvansanten
Contributor

That's a really good point, thanks for pointing to the Cargo discussion!

FYI, here's an example of (and model for) what we settled on for conda-lock after some back-and-forth with @mariusvniekerk and @wolfv. After skimming the Cargo thread, I think we've addressed most of the points raised there, namely:

  • It's now YAML, so fairly readable
  • Package entries are sorted by (manager, name, platform), so updates to e.g. a single conda package will be a contiguous diff, even if it spans multiple platforms
  • Package hashes are in the package entry
  • The metadata section depends only on the dependency specification, so is stable under lock refreshes
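
The ordering rule in the second bullet can be sketched in Python as a plain tuple sort. The entries below are simplified dicts with only the three key fields, and `sort_entries` is an illustrative name rather than the conda-lock implementation.

```python
def sort_entries(entries: list) -> list:
    """Sort package entries by (manager, name, platform) for stable diffs."""
    return sorted(entries, key=lambda e: (e["manager"], e["name"], e["platform"]))


entries = [
    {"manager": "pip", "name": "requests", "platform": "linux-64"},
    {"manager": "conda", "name": "xyz", "platform": "osx-64"},
    {"manager": "conda", "name": "abc", "platform": "osx-64"},
    {"manager": "conda", "name": "xyz", "platform": "linux-64"},
]

for entry in sort_entries(entries):
    print(entry["manager"], entry["name"], entry["platform"])
```

Because all platforms of one package end up adjacent, bumping `xyz` changes one contiguous run of lines instead of one hunk per platform.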

Are there any other considerations we missed?

@wolfv
Member Author

wolfv commented Dec 6, 2021

@jvansanten @mariusvniekerk one small nitpick I have would be that maybe instead of hash it could be md5 OR sha256 (or both) as keys.

Or alternatively it could be hash: md5-xyz or hash: sha256-xyz or some similar format.

Or

hash:
  sha256: xyz
  md5: abc
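
A reader could accept both the prefixed-string form and the mapping form by normalizing them to one shape. A minimal Python sketch, assuming only `md5` and `sha256` are allowed; `normalize_hash` is a hypothetical helper.

```python
ALLOWED_ALGORITHMS = ("md5", "sha256")


def normalize_hash(value) -> dict:
    """Turn 'sha256-xyz' or {'sha256': 'xyz', 'md5': 'abc'} into a dict."""
    if isinstance(value, str):
        algorithm, sep, digest = value.partition("-")
        if not sep or algorithm not in ALLOWED_ALGORITHMS or not digest:
            raise ValueError(f"unrecognized hash string: {value!r}")
        return {algorithm: digest}
    if isinstance(value, dict):
        unknown = set(value) - set(ALLOWED_ALGORITHMS)
        if unknown or not value:
            raise ValueError(f"unrecognized hash mapping: {value!r}")
        return dict(value)
    raise TypeError(f"hash must be a string or mapping, got {type(value).__name__}")
```

The mapping form has the advantage that both digests can be recorded at once, which the prefixed string cannot express.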

@baszalmstra
Contributor

@jvansanten Yes thanks!

Quick side question: Does conda-lock also support minimal updates?

@zmbc

zmbc commented Jan 27, 2022

> For instance, would updating a package be equivalent to iterating on the PACKAGE versions ver from latest to current, and then running conda | mamba install PACKAGE=ver until it succeeds?

@maresb @jvansanten

From this comment in the source code, what I've gleaned is that:

  • Updating a package always updates that package to the latest version compatible with your spec.
  • Dependencies of that package, if still compatible, are not changed.
  • Packages that depend on the updated package, if still compatible, are not changed.

What I still don't understand is what happens when other packages are not compatible with the newly updated package. I see a few things that could be done:

  1. Update them to the latest compatible version. If even more packages need to be updated now, so be it.
  2. Update them to the most recent version that does not require updating any other packages (or, if there is no such version, the most recent that does not require updating two packages, and so on).
  3. Update them the minimum amount required by the new spec. Or maybe the minimum major.minor version, then apply algorithm 2 to the patch versions.

I don't think there's an obvious best way to do this. It's a hard problem that has definitely been grappled with before, and there are some pretty convoluted solutions. For example, Ruby's package manager Bundler updates to the latest if the spec isn't compatible OR there are no transitive dependencies.

I see --update as the key feature of conda-lock. If the updating algorithm was good, easy enough to understand, and clearly stated in the documentation, it would be a huge win.

@dhirschfeld
Contributor

dhirschfeld commented Jan 28, 2022

> Requested specs would be nice so that we can "prune" the environment later on

My understanding of @maresb's request for including the source environment.yaml is for exactly this reason. The lockfile will include all transitive dependencies which the author of the environment.yaml might not care about.

The original environment.yaml is the requested spec and the lockfile is the fulfilment of that spec at a given point in time. If a user wants to update an environment created from a lockfile they will always carry around all the transitive dependencies, even if they're no longer required by the updated requested dependencies.

If the user were able to update the lockfile from the original requested spec (environment.yaml), then they could prune no-longer-required transitive dependencies from their environment. Including the source environment.yaml in the lockfile metadata ensures the lockfile can be updated/recreated from the original spec.
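
Pruning amounts to keeping only the packages reachable from the requested specs. The Python sketch below assumes a dependency map from package name to direct dependencies is available. None of the lockfile drafts in this thread record such edges, so that map is a hypothetical input.

```python
def prune(requested: list, depends_on: dict) -> set:
    """Return the requested packages plus everything they transitively need."""
    keep = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in keep:
            continue  # already visited; also protects against cycles
        keep.add(name)
        stack.extend(depends_on.get(name, ()))
    return keep


# Hypothetical dependency edges for an environment that once needed 'oldpkg'.
depends_on = {
    "pandas": ["numpy", "python"],
    "numpy": ["python"],
    "oldpkg": ["libold"],
    "libold": [],
    "python": [],
}

print(sorted(prune(["pandas"], depends_on)))
```

Anything installed but absent from the returned set (here `oldpkg` and `libold`) is a candidate for removal.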

@dhirschfeld
Contributor

You could then also allow adding new packages to the requested deps or specifying other constraints when updating the lockfile - e.g.

conda-lock update --file=env.lock 'pandas>=1.4' mynewpackage

@shughes-uk
Contributor

shughes-uk commented Aug 1, 2022

What is the status of this feature? #1577 seems to have implemented it, but my attempts to use

micromamba install -f conda-lock.yml -n testing2

result in


                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/

warning  libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
Transaction

  Prefix: /Users/samanthahughes/micromamba/envs/testing2

  Nothing to do


Transaction starting
Transaction finished

Micromamba 0.25

@wolfv
Member Author

wolfv commented Aug 2, 2022

@shughes-uk I think the file has to end with .lock

@shughes-uk
Contributor

> @shughes-uk I think the file has to end with .lock

Mamba hits me with

EnvironmentFileExtensionNotValid: '/Users/samanthahughes/programming/cloud/conda-lock.lock' file extension must be one of '.txt', '.yaml' or '.yml'

micromamba hits me with


                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/

warning  libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
critical libmamba Invalid spec, no package name found: 

@shughes-uk
Contributor

Here's the lockfile (renamed conda-lock.txt from conda-lock.lock so github would let me upload it)

conda-lock.txt

@wolfv
Member Author

wolfv commented Aug 2, 2022

Should've taken a proper look before -- ...-lock.yml or ...-lock.yaml is the magic ending.
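
The naming rule described here can be written down as a small check. This Python sketch only illustrates the rule as stated in the comment above. It is not micromamba's actual detection code, and `looks_like_lockfile` is a hypothetical name.

```python
from pathlib import Path

LOCKFILE_SUFFIXES = ("-lock.yml", "-lock.yaml")


def looks_like_lockfile(path: str) -> bool:
    """Return True when the file name ends in -lock.yml or -lock.yaml."""
    return Path(path).name.endswith(LOCKFILE_SUFFIXES)
```

Under this rule `conda-lock.yml` is treated as a lockfile, while `conda-lock.lock` and `conda-lock.txt` are not.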

@shughes-uk
Contributor

I was using conda-lock.yml in my first attempt. Tried .yaml for good luck but still no joy.

@wolfv
Member Author

wolfv commented Aug 2, 2022

What platform are you on? E.g. I tried your lockfile on an M1 Mac, and nothing got done (because there are no packages in the lockfile for this platform).

@shughes-uk
Contributor

shughes-uk commented Aug 2, 2022

Ahhh that would do it. Thank you!! Perhaps a nice fix here would be to have the lack of a relevant platform section produce an error instead?
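
The suggested behaviour could be sketched like this in Python. It assumes each package entry carries a `platform` field. The exception class and function name are hypothetical and are not taken from micromamba.

```python
class PlatformNotLockedError(Exception):
    """Raised when a lockfile has no packages for the requested platform."""


def packages_for_platform(packages: list, platform: str) -> list:
    """Select entries for one platform, failing loudly if there are none."""
    selected = [p for p in packages if p["platform"] == platform]
    if not selected:
        locked = sorted({p["platform"] for p in packages})
        raise PlatformNotLockedError(
            f"lockfile has no packages for {platform!r}; locked platforms: {locked}"
        )
    return selected
```

Listing the platforms that are locked in the error message would have pointed directly at the cause in the exchange above, instead of a silent "Nothing to do".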

@maresb
Contributor

maresb commented Sep 3, 2022

I've been trying to make serious use of the new lock file format lately. I've encountered various issues with Micromamba's implementation, all of which I've reported.

I think I've also discovered a logical oversight in the specification itself, namely with the category field:

category: str = "main"

Problem: This construction requires that each package belongs to a single category, but packages should be able to belong to multiple categories. (For instance, I want pip to be both a main and dev requirement.)

Suggestion: Convert this to

categories: list[str] = ["main"]

Background: As a refresher, we can have multiple environment files, for instance environment-main.yml and environment-dev.yml. In environment-dev.yml, I can add a top-level category: dev entry. Now when I run conda-lock -f environment-main.yml -f environment-dev.yml, each resulting dev package entry in conda-lock.yml will inherit category: dev.

Let's suppose I'm developing a containerized app. I have a devcontainer where I install both main and dev dependencies, and I have a production container where I install only the main dependencies. In order to be able to run pip install for setting up the production container, I need Pip to be installed. However, if I list pip in environment-dev.yml, it acquires category: dev, and thus it will not be installed in my production container.

As a workaround, I have to remove pip from my environment-dev.yml. But I think I should be able to leave pip in both environment files.
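
To make the suggestion concrete, here is a Python sketch of selection with a `categories` list. The entries are simplified dicts, the default of `["main"]` follows the suggestion above, and `wanted` is a hypothetical helper.

```python
def wanted(package: dict, install_categories: set) -> bool:
    """A package is installed if any of its categories was requested."""
    categories = package.get("categories", ["main"])
    return bool(set(categories) & set(install_categories))


packages = [
    {"name": "pip", "categories": ["main", "dev"]},
    {"name": "pytest", "categories": ["dev"]},
    {"name": "numpy"},  # no explicit categories: defaults to ["main"]
]

production = [p["name"] for p in packages if wanted(p, {"main"})]
devcontainer = [p["name"] for p in packages if wanted(p, {"main", "dev"})]
print(production, devcontainer)
```

With a list, `pip` stays in the production selection even though it is also listed in the dev environment file.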

Question: Does my suggestion make sense, or am I somehow thinking about this in the wrong way?

Thanks!

@jonashaag
Contributor

One way to solve for dev-only, prod-only, and both-dev-and-prod would be to have 3 different categories. But that will require 2^n categories, generally speaking. I agree that it would be nice to be able to attach multiple categories.

@maresb
Contributor

maresb commented Sep 7, 2022

We should probably also include a field for the schema version so that we can recognize when the lockfile's consumer needs to be updated.

@jvansanten
Contributor

I also agree that multiple categories would be handy. All the dependency solvers I'm aware of demand that the total solution is self-consistent, i.e. if you install all packages, you should get no conflicts. The same should be true of any subset, and there's no reason those subsets need to be disjoint, other than the fact that some solvers like poetry treat them that way.

@jvansanten
Contributor

> We should probably also include a field for the schema version so that we can recognize when the lockfile's consumer needs to be updated.

Definitely. Luckily there's https://github.com/conda-incubator/conda-lock/blob/7d9bd6d67d59fdd30d92a2ace3ee344aafa6a2b1/conda_lock/src_parser/lockfile.py#L79

@maresb
Contributor

maresb commented Sep 7, 2022

Thanks @jvansanten, I just noticed that. (I was looking in the metadata section, not the top-level.)

@wholtz
Member

wholtz commented Nov 19, 2022

I was rather surprised that my lock file needed to end with -lock.yaml to work, and the error message was not helpful to me:

critical libmamba Invalid spec, no package name found: 

A better error message or more intelligent parsing of the yaml file to detect it is a lock file would be useful.

@maresb
Contributor

maresb commented Nov 19, 2022

@wholtz, there have also been complaints about this on the conda-lock side (conda/conda-lock#280). Probably the strategy there will be to first attempt to parse as yaml (for new style) and on failure fall back to parsing as explicit ( = old style).

@mfisher87
Contributor

mfisher87 commented Feb 28, 2023

> I was rather surprised that my lock file needed to end with -lock.yaml to work

I was surprised recently in the opposite respect: when I try to install from a standard environment YAML formatted "lock file" (without hashes) generated with conda env export | grep -v "^prefix" > environment-lock.yml, micromamba complains that the file is in the wrong format. This file is accepted by mamba and conda with no issues.

critical libmamba YAML parsing error while reading environment lockfile located at '/tmp/environment-lock.yml' : invalid node; first invalid key: "version"

I work around this by renaming the file, but it was still surprising not to be able to install from a file that follows the standard environment YAML format just because its name matches a certain pattern.
