Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement experimental registry HTTP API from RFC #8890

Closed
wants to merge 86 commits into from

Conversation

jonhoo
Copy link
Contributor

@jonhoo jonhoo commented Nov 24, 2020

This change implements the HTTP registry API from rust-lang/rfcs#2789.

The client implementation supports the changelog extension in the RFC, but the server (that is, GitHub's static file serving of the registry) does not at the time of writing.

This implementation does not download multiple index files concurrently, since the Registry trait doesn't lend itself well to that at the moment.

Jon Gjengset added 19 commits November 19, 2020 15:50
This commit implements the HTTP registry API from
rust-lang/rfcs#2789.

A proof-of-concept server implementation exists at
https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept

The client implementation supports the changelog extension in the RFC,
but the server does not at the time of writing.

This first draft does not take advantage of `Etag` to avoid
re-downloads if the server does not support changelog. It also does not
download multiple index files concurrently, since the Registry trait
doesn't lend itself well to that at the moment.
With this change, a registry that does not support the optional
changelog feature will no longer purge the index on every build.
Instead, it will double-check the current file contents with the server
for each load, which is fairly efficient if a given index file has _not_
changed. The changelog is (unsurprisingly) still much more efficient.
They're handled the same anyway.
@rust-highfive
Copy link

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Nov 24, 2020
@jonhoo
Copy link
Contributor Author

jonhoo commented Nov 24, 2020

The biggest open question at the moment is how to make this effectively utilize HTTP/2 the way the RFC suggests. The RegistryData trait only has a load method which is expected to be run synchronously. To get to pipelined downloads, we'd need to change the trait to allow the loads to be queued and processed later, which is going to require some nontrivial reorganizing of cargo/sources/registry/index.rs, and an update of all the other RegistryData impls to match. Essentially, the resolver will need to be restructured into an event loop that just continuously parses the "next" available index file, and enqueues dependencies as it walks, until it's processed them all.

I'm personally leaning towards landing the experimental registry support in its current state, and then doing that re-architecting in a dedicated follow-up PR, but am happy to do it as part of this PR instead if that's preferable.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 24, 2020

I have only had time to do a cursory skim of the code. But it looks like a lot of very good work. Thank you!

Essentially, the resolver will need to be restructured into an event loop that just continuously parses the "next" available index file, and enqueues dependencies as it walks, until it's processed them all.

I think this is a bit of a non starter. The resolver is already handling a NP-Hard problem, any loss of control increases the chance of exponential blow up. Not to mention it is not the easiest to test, so we don't want to make that harder, for example by making it non deterministic. The RFC describes a Greedy fetch solution to this quandary, that gets (or atleast starts to get) the relevant ranges before the resolver gets started. Of course the optimum strategy is somewhere in between "prefetch everything you could possibly need" and "load only when needed", and we can work on designing that balanced strategy as this gets more established. Of course it is also possible that you have something perfectly reasonable in mind and I am misunderstanding.

I'm personally leaning towards landing the experimental registry support in its current state, and then doing that re-architecting in a dedicated follow-up PR, but am happy to do it as part of this PR instead if that's preferable.

Given what I said above I am definitely for waiting on bigger re-architecting. Let's get this set up enough so users can test if there are bugs, then figure out how to make it efficient. It is also possible that the work on using pubgrub-rs will proceed to the point that we can start testing in Cargo, and when that happens it will probably also need to change the Registry trait. It may be months or it may stall out once again. But if works out it would be nice if the changes were compatible.

One thought I have had for improving performance while we are testing out with both approaches is that Cargo stores a copy on disk after parsing from the git index. It may be cool to find some way for both ways of retrieving the registry to have a shared representation in that dir. I have not yet figured out how to make that technically possible, but it is something to think about.

@alexcrichton
Copy link
Member

IIRC one of the biggest open questions on the RFC was whether this was feasible in terms of how long it might take. Do you have some numbers on how long this implementation takes locally on some crate graphs? For example how long does cargo update take on Cargo itself? Or rust-lang/rust?

I think when this lands will be guided by experimental feedback from the implementation. If this is almost done except for some cleanup and performs well, seems fine to land. If this is orders of magnitude slower than what we have today and has no viable path to improvement without major refactorings, I think it's best to hold off on landing because it's otherwise dead code largely.

@jonhoo
Copy link
Contributor Author

jonhoo commented Nov 24, 2020

Of course it is also possible that you have something perfectly reasonable in mind and I am misunderstanding.

What I was specifically thinking was to change RegistryData to have the methods:

fn enqueue_load(&mut self, root: &Path, path: &Path) -> CargoResult<()>;
fn load_next(&mut self) -> CargoResult<Option<LoadedData<'_>>>;

where

struct LoadedData<'a> {
  path: &'a Path,
  bytes: &'a [u8],
}

And then restructure the resolver so that instead of walking the dependency tree directly, it continuously calls load_next, and uses the new information to figure out additional dependencies to load. Once load_next returns None, all the dependencies have been walked. It does change the design, but I'm not sure if it limits the control of the algorithm -- I'm probably not the best person to evaluate that.

Greedy fetch

So, the reason I didn't implement this immediately was that it feels unfortunate to essentially have two resolvers in the codebase, even though the greedy one is much simpler. The greedy fetch now also has to parse the index files (ideally re-using the cache in the process) and understand basic semver requirements specs. I'm happy to tackle that next though, as I agree that one of the two approaches above likely need to be implemented for this approach to be competitive.

Do you have some numbers on how long this implementation takes locally on some crate graphs? For example how long does cargo update take on Cargo itself? Or rust-lang/rust?

I haven't had a chance to test that yet, but will do it shortly. Without the greedy fetch above it's likely to do a bit worse, though I don't know by how much. It may still beat a complete git clone of the index.

The lack of support for a changelog file in the crates.io registry is also likely to make the evaluation yield incomplete and suboptimal results. I wonder how tricky it'd be to update crates.io to start keeping a changelog already. The changelog is super simple on the server side, so I feel like it shouldn't be too much work to get one included.

I think when this lands will be guided by experimental feedback from the implementation. [..]

Seems reasonable -- I'll go ahead and start benchmarking. I think I'll probably then also implement the Greedy fetch from the RFC, as I think without that it's unlikely to be competitive especially in incremental cases.

@Eh2406
Copy link
Contributor

Eh2406 commented Nov 24, 2020

As one of the people that know the resolvers codebase the best, one of the big things in favor of this RFC was not having to make massive changes to the resolver until this has proved itself effective and is in widespread testing. That being said some integration may be better than a pure "greedy fetch" approach.

An approach toward a next step that may be a plausible diff is:

  • for enqueue_load Somewhere here, instead of returning all the candidates have it return some kind of promise.
  • for load_next Somewhere here, have it pop the best among the ones that have loaded.

But I am open to other reasonably small diffs. If we can't find one, we may have to do a full "greedy fetch" until it is clear that this is a good RFC for a good number of people.

If crates.io wants to start publishing a changelog, we may want to have it hosted on a different branch/fork so if things don't work out or it needs changes it is not in the main line history. The -Z flag can switch what url is used, while we experiment.

@jonhoo
Copy link
Contributor Author

jonhoo commented Nov 24, 2020

I started implementing the greedy fetch as a prefetching step in jonhoo@c5b2468, but I have to say I don't like it much. It causes a lot of what feels like duplication of concern with the resolver, and also introduces a decent amount of complexity in keeping track of which dependencies we have to fetch at which versions. Not to mention that it duplicates much of the logic around accessing the registry cache. Some of that can probably be fixed by tidying up the way the prefetching is done, but it feels like the wrong approach. Perhaps you had a better idea for how this greedy fetching should be implemented that I can try @Eh2406?

Otherwise, I think trying to re-use the existing resolver logic based on the path you suggested above might be a more promising avenue. How exactly to structure this as a promise without cargo actually having async is going to be a bit of a puzzle, but I'll do some thinking and tinkering.

Is it a problem if the changelog appears in the main line history for a short amount of time? Regardless, doing it on a branch should be easy enough (though it means crates.io would have to publish to two branches, which may be a pain) -- you already opt into the HTTP-based access mechanism by using a rfc+ URL (which should likely change before merging).

@alexcrichton
Copy link
Member

I'm not so certain the final implementation is going to really use all that much from this PR itself. This already mostly copies the already-existing parallel download code of packages, and it's not like we need to erase the git history so it'll still be here. I agree that from a peformance standpoint it is not too useful to land this without prefetching, but I think it's more important to land maintainable code than an implementation in an interim state which we know we want to change and/or remove. I don't expect we'll ask users to test this out immediately after landing, what I'd prefer is that implementation work continued in-tree, but things are broken up into PRs for separate review.

I don't personally mind reviewing larger PRs, but a longer discussion can get unwieldy (as is the case with this PR at this point). I'm ok if you'd prefer to land the resolver changes first and then land this afterwards, I am currently of the opinion that we just don't want to land the temporary implementation of prefetching here (since it infects a good number of places rather than being an isolated change).

@jonhoo
Copy link
Contributor Author

jonhoo commented Dec 16, 2020

In that case I'm in favor of holding off on this until the resolver change lands. I suppose another piece of work that could then be parallelized and happen before this PR is to generalize Downloads, though I suspect that will be decently tricky. Not just because of the thread-locals, but also because of different ways to report progress, support for conditional headers, etc. I may be able to find time to do that work, but cannot promise anything as of yet.

@bors
Copy link
Contributor

bors commented Jan 6, 2021

☔ The latest upstream changes (presumably #8994) made this pull request unmergeable. Please resolve the merge conflicts.

@jonhoo
Copy link
Contributor Author

jonhoo commented Jan 6, 2021

I'm going to hold off on dealing with the conflicts until #8985 lands.

@ehuss
Copy link
Contributor

ehuss commented Jan 12, 2021

I could not find a tracking issue for this, so I opened #9069.

@secana
Copy link

secana commented Apr 18, 2021

Hi, I'm looking for an HTTP like API for fetching crates from a registry, like it's done in NuGet for .NET. This seems like the RFC for exactly that feature. I created an issue #9364 for that, which I closed as it duplicates this RFC. Can I help somehow to implement or test the feature? I could implement a prototype in our private registry for testing, if that's easier than adding the feature to crates.io.

Have you though about using the same authorization token that is used for pushing a crate for the pull mechanism? This would be helpful for private registries, where the pull should be authorized, too.

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 18, 2021

I am glad to here that you are excited to help with this experiment! The work in this PR was incredibly helpful in figuring out that we want the RFC, but suggested some big restructurings to Cargo that would make it easier to maintain and faster to run. I got started on the refactor but got distracted ( #8985 ). I am very excited by it and expect to get back to it when I am settled into my new job.

So there is some progress on having a draft of it available on nightly for use with private registries.
The next step after that would be setting up a test server that mirrors the data from Crates.io, so that people can start testing use of the full index in the real world. I know a lot of people that are very excited to get there CI moved over, even if just on nightly.
Then things get more uncertain about what needs to be done and in what order. Changes will need to be made to the exact files that get served, what changes depend on what we learn in testing. A production ready server will need to be incorporated into Crates.io codebase, in such a way that they don't mined maintaining it and so it stay consistent with the Git based one. When it is working better then the Git based one we will switch the default place stable users get there Crates.io index. When we are ready to keep it available forever, it will need to be stabilized for private registries.

Ideas for how to make authorization work better for private registries, would need to be a different RFC. There is previous attempt at rust-lang/rfcs#2719, but I don't yet know enough to evaluate if that is the design we would want. I was hoping to do a deep dive into that design space, so as to be able to talk competently about that, when this RFC is in a good place.

@secana
Copy link

secana commented Apr 19, 2021

@Eh2406 thanks for the fast response. I can imagen that juggling between two dependent components (cargo, crates.io) is tough. I'll checkout the code from @jonhoo and get familiar with the changes, to get an idea on how the API proposal works. Maybe I can help after that :)

@secana
Copy link

secana commented Apr 20, 2021

Hi,
I've looked at the code and have three questions so far:

cargo sets the If-Modified-Since header, which gets the value from the registries Last-Modified header. The header seems to be in the format format!("{:?}", SystemTime) instead of HttpDate, which is usually used for the header. What's the reasoning for that?

Why is the request path for a crate index split, e.g. /te/st/testcrate? It would be enough if the path is only the crate name, e.g. testcrate, as the name is enough to identify the crate and return the correct index, or a 304. We are not bound to a strict folder structure as in git, with the http registry approach, or am I wrong?

How to I get cargo to set the If-Modified-Since and Last-Modified in a request? It seems it either requests without the headers at all, or no request is made as the carte is already locally cached.

not sure who can help @jonhoo or @Eh2406

best regards,
@secana

@pickfire
Copy link
Contributor

Why is the request path for a crate index split, e.g. /te/st/testcrate? It would be enough if the path is only the crate name, e.g. testcrate, as the name is enough to identify the crate and return the correct index, or a 304. We are not bound to a strict folder structure as in git, with the http registry approach, or am I wrong?

That I believe is to improve the indexing indexing and retriving of data stored, most likely for performance reasons or maybe something to do with sharding? Or maybe because it wants to be similar to git clone like mentioned in the RFC?

Changes transport from an ahead-of-time Git clone to HTTP fetch as-needed.

@jonhoo
Copy link
Contributor Author

jonhoo commented Apr 20, 2021

Hi!

cargo sets the If-Modified-Since header, which gets the value from the registries Last-Modified header. The header seems to be in the format format!("{:?}", SystemTime) instead of HttpDate, which is usually used for the header. What's the reasoning for that?

I'm not sure I follow — the idea here is that cargo will echo back the date exactly as set by the server, which should be in the right format. At least my understanding from the HTTP spec(s) is that If-Modified-Since and Last-Modified both have the same format. None of that (should) be tied to SystemTime.

Why is the request path for a crate index split, e.g. /te/st/testcrate? It would be enough if the path is only the crate name, e.g. testcrate, as the name is enough to identify the crate and return the correct index, or a 304. We are not bound to a strict folder structure as in git, with the http registry approach, or am I wrong?

As @pickfire alluded to, the idea here was primarily to be able to serve the index simply by fronting a git checkout of the index with an HTTP server. Of course, you don't need to do that (e.g., you can just use the last path parameter), but if we didn't include them with the request, then it'd be a pain.

How to I get cargo to set the If-Modified-Since and Last-Modified in a request? It seems it either requests without the headers at all, or no request is made as the carte is already locally cached.

When you say "cargo" here do you mean cargo as it is in this PR? With this PR, cargo should always send requests with If-Modified-Since or with If-None-Match if the server supports ETags if it has a locally-cached version of a given index file. Are you not seeing that?

@Eh2406
Copy link
Contributor

Eh2406 commented Apr 20, 2021

Wonderful questions! And grate examples of how we are still figuring out what design we will want. Please help hold us to account and make shore we adequately think about them before we stabilize.

We are not bound to a strict folder structure as in git, with the http registry approach, or am I wrong?

When this PR was opened the RFC wanted to be able to serve an HTTP index by checking out a GIT index and pointing a static file server at it. Maybe even just using raw.githubusercontent.com to point at the current index. We figured out that it is not going to be that simple, although I cant remember why at the moment. If we need to make changes to the structure, then maybe we should simplify this one as well.

@secana
Copy link

secana commented Apr 23, 2021

Thank you all for your explanations. The more I poke around in the code, the more I get the feeling that trying to be as close as possible to the git approach does not work well with an HTTP API.
Instead I think, an HTTP API corresponding to the API defined to upload crates is cleaner and allows for future improvements and additions.

Current workflow

Please correct me, if I got something wrong here.

Add a .cargo/config file with the content:

[source.crates-io]
replace-with = 'alternative-registry'

[source.alternative-registry']
registry = 'sparse+http://my.registry.com'

Cargo pulls the config,json from that URL with the content:

{
  "dl": "http://my.registry.com/api/v1/crates",
  "api": "http://my.registry.com"
}

If a crate dependency aabbcc should be resolved, cargo does a http get to http://my.registry.com/aa/bb/aabbcc to get the index with all metadata for the crate.
Then cargo chooses the best version and does a http get to http://my.registry.com/api/v1/crates/<package>/<version>/download to download the crate.

That's a lot to configure and do for basically two simple get requests.

Proposal

Add a .cargo/config file with the content:

[source.crates-io]
replace-with = 'alternative-registry'

[source.alternative-registry']
http-registry = 'http://my.registry.com' # Replace `sparse` with a new key. I just think that is cleaner and less error prone

Skip the step where cargo pulls the config.json. We already know, where we get the index and the crates from as we do not have the split between the index git and a crate file server anymore. Instead the registry defines an HTTP API to get both.

Get the index with an http get from http://my.registry.com/api/v1/crates/<package>/index. This allows for breaking changes in the future as we have an encoded version as we have to the download part. The URI mirrors the way the crate itself is downloaded.

Download the chosen crate version from http://my.registry.com/api/v1/crates/<package>/<version>/download as usual.

The approach with the ETag and Last-Modified header stays the same to skip the index update if it is already up-to-date on disk.

I find this approach a lot cleaner and easier to understand. What do you think?

@kornelski
Copy link
Contributor

There were two main motivations for the existing URL scheme:

  1. git registries are stable, and available for alternative registries, and to uphold Rust's stability promise they should remain supported in Cargo forever, even after crates.io stops using the git format. Therefore, any deviations from the current git registry design don't enable simplification in Cargo, but add more code instead.

  2. The sparse registry design was motivated by Ruby Bundler's experience. Bundler has intentionally abandoned server-side logic in favor of using completely static files, because static file hosting is much cheaper and trivial to scale. Even something as simple as a redirect or URL rewriting may mean that a static-file host can't be used. This has been a problem for Rust already, e.g. doc.rust-lang.org can't use HTTP redirects.

So the RFC intentionally uses 1:1 layout of the existing registry, so that it can be simply done by putting a CDN in front of the git repo.

The /aa/bb directory prefix is not pretty, but AFAIK it's mainly an aesthetic issue, not a technical problem.

The config needs to remain supported for the git registry, and ability to pass it through as-is may be useful for mirroring git registries via sparse http protocol.

@kornelski
Copy link
Contributor

kornelski commented Apr 24, 2021

As for .cargo/config, the sparse+http protocol prefix in the URL string itself helps keeping all config keys as they are. There's no need to worry how to handle cases where both old and new keys were set. New registry protocols can be added without adding even more keys.

The keys aren't very consistent though. There's registry config:

[registries]
my-registry = { index = "https://my-intranet:8080/git/index" }

but source replacement uses registry instead of index:

[source.the-source-name]
registry = "https://example.com/path/to/index"

so kinda maybe we could bikeshed an alternative property that works everywhere. http-registry isn't that IMHO, especially that git-over-HTTP is supported already, so "http" is ambiguous.

@secana
Copy link

secana commented Apr 25, 2021

@kornelski that makes sense. Thanks for the detailed explanation.

I've tried to add optional authorization to the HTTP registry branch, as such a feature would be useful for private registries where not only the push of crates needs to be restricted, but the pull, too. As it's my first time contributing to Cargo, I'm pretty sure that the code is not the best it can be, so please give me feedback if such a feature is wanted at all and where I can improve the code, if you think optional authorization is a good idea.

PR: jonhoo#1

@jonhoo
Copy link
Contributor Author

jonhoo commented Mar 9, 2022

Now that #10064 has landed, I think this can be picked up again and turned into a more real implementation with fewer hacks!

@arlosi
Copy link
Contributor

arlosi commented Mar 9, 2022

@jonhoo I have an implementation (based on this PR) that I can send out soon. Is that alright? Or would you like to own getting this in?

@jonhoo
Copy link
Contributor Author

jonhoo commented Mar 9, 2022

@arlosi Oh, no, go right ahead if you have something ready!

@arlosi arlosi mentioned this pull request Mar 9, 2022
5 tasks
@jonhoo
Copy link
Contributor Author

jonhoo commented Mar 10, 2022

I'm going to go ahead and close this in favor of #10470.

@jonhoo jonhoo closed this Mar 10, 2022
bors added a commit that referenced this pull request Mar 24, 2022
HTTP registry implementation

Implement HTTP registry support described in [RFC 2789](rust-lang/rfcs#2789).

Adds a new unstable flag `-Z http-registry` which allows cargo to interact with remote registries served over http rather than git. These registries can be identified by urls starting with `sparse+http://` or `sparse+https://`.

When fetching index metadata over http, cargo only downloads the metadata for needed crates, which can save significant time and bandwidth over git.

The format of the http index is identical to a checkout of a git-based index.

This change is based on `@jonhoo's` PR #8890.

cc `@Eh2406`

Remaining items:
- [x] Performance measurements
- [x] Make unstable only
- [x] Investigate unification of download system. Probably best done in separate change.
- [x] Unify registry tests (code duplication in `http_registry.rs`)
- [x] Use existing on-disk cache, rather than adding a new one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.