-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Network based dependency resolution #5597
Comments
I don't think we'll ever get as many packages as NPM. Systems programming is a small field compared to the giant field that npm targets (basically everything). I also think that with a smart (custom) binary format, we can stay in the range of megabytes for tens of millions of packages/package versions. These days when I visit one website like gitter or crates.io I'm downloading as much data already. As a further optimizarion suggestion: Right now every |
There's |
@matklad I think I've already sent one: rust-dev-tools/rls-analysis#132 |
Oh right. I was looking at Anyway, the problem exists with all crates that have test data, not just rls-analysis. My point was only using rls-analysis as an example :p. |
This sounds like an underestimation. It's true that we're different from Node, but Rust is just as flexible as Node, and it will be continuously improved. For example, once the futures ecosystem improves, I'd expect more web related crates.
I experimented with a custom SQLite based format, and the result was that it didn't improve much over the squashed Git/JSON based index. Plus, you need to handle delta transfer somehow. Also tens of millions of entries are definitely unlikely to be compressed to an acceptable file size. tldr, it doesn't scale.
So far, I don't know any usecases apart from crater, thus changing the semantics of a published crate is questionable. |
I've chatted recently with @kornelski, and he claims that clones from GitHub are actually quite slow, compared to just hosting repository yourself. So an interesting middle-ground, before embarking on the implementation of a custom binary protocol and supporting server, is to try to host a mirror of the index outside of GitHub, and see if that would be noticeably faster. |
Part of it is that GitHub discourages use of shallow clones (e.g. brew had to move from shallow to full clones). A shallow clone would go a long way to solving the pain of initial download. Another option that I think would be interesting is to use a non-git protocol for updates, e.g. http://zsync.moria.org.uk/ |
I work on an open source project named Pulp, which is a repository management solution similar to Artifactory. The git repository based index presents a lot of problems for us at a fundamental level. The general idea behind repository management utilities is that they give an admin the ability to do things like:
Cargo provides similar functionality in many cases (e.g. "yank"), which makes a repository management solution a little less useful than with, say, Debian or RPM packages, but there is still a potential value-add. The problem is that by making the repository index a git repository, the general workflow of many of these items becomes difficult or impossible. For instance:
I understand that not every use case listed here is necessarily something you would want to do in a language where the tooling already provides a lot of this functionality, but as I mentioned, there's still a value add, and very very few other types of content are restricted in this same way. I don't see any fundamental reasons why it's important for the index to be a git repository. With that being the case, I'm not really sure how to write a Cargo plugin for Pulp which would not be essentially restricted to mirroring the crates.io API and providing almost no additional functionality, and not integrating with any of the core features. The idea has "here be dragons" written all over it, IMO. edit: I have no specific opinion on the proposal, just echoing an objection to the current git-based solution. |
I don't have major meaningful design input except to say that this has wider implications than just making dependency resolution better in some ways. Due to the thread-bare nature of rust's stdlib, being able to resolve and pull down crates is essential for any would be build tool of Rust. Burying dependency resolution into a backend would add more hoops than already currently exist in front of folks who want to build Rust with something that isn't Cargo. I don't expect this to be a major consideration in either direction -- I just want to put the idea into everyone's minds that some poor fools (like me) might be trying to do crate resolution in weird ways and having that be possible 100% locally is very nice for our usecases. EDIT/Amended: Though, conversely, this might open a door to decoupling some implementations from the cargo binary, which would be interesting. |
While I quite like the idea of moving to something that doesn't require you to down load the entire index in general, it's worth pointing out that it'll make things like docs.rs much harder. docs.rs currently just keeps track of the latest index commit it built, which makes it super easy to keep track of what has or hasn't been built. /cc @onur @QuietMisdreavus |
@dralley technically cargo does a hard checkout of master. The registry need not return the same history every time. It is allowed to diverge from what the clients are expecting. At the coast of losing the efficient diff transfer provided by git. Alternatively, make the history consistent and linear but completely replace the contents of the registry. @jonhoo Yes this would have to go with a crates.io endpoint for monitoring changes. Another design constraint is that most of the time when generating a lockfile we are not just going to look up one package, we are going to look up hundreds to thousands of packages, and that we don't know what we will need to query next. For example: we have to fulfill
Hard details that could make this faster.
|
@Eh2406 Yeah, in the time since that post I did some experimentation and found that while a normal "git pull" took ~30 seconds, at "git pull --depth 1" only took ~2 seconds, so it's not as bad as I was initially thinking. |
@Eh2406 The So, moving on to providing a good user experience, what we need is basically to reduce latency so the queries can run nearly as fast as a local index.
And finally, network latency is still higher than disk in most cases, so as you said we should make Cargo async ready. |
I travel quite a bit and routinely find myself on connections where latency
can easily be 150+ ms. A moderate size dependency graph will quickly rack
up many seconds of latency. There's also a question of how to cache the
results for offline builds and then invalidate those results once back
online.
Putting the registry config inside the index is also going to make moving
away from git painful. How will a client determine that a registry supports
a different mechanism without fetching the v1 index over git?
…On Mon, Oct 22, 2018, 9:23 PM Tatsuyuki Ishi ***@***.***> wrote:
@Eh2406 <https://github.com/Eh2406> The O(n) network round trips is not
much an issue; the n here is normally under a reasonable amount, where we
don't have to use batching. We already make as many requests for
downloading packages. Also, yeah we don't want to do dependency resolution
on server side because it's simply too much computation even if it wasn't a
worst case.
So, moving on to providing a good user experience, what we need is
basically to reduce latency so the queries can run nearly as fast as a
local index.
- Use a CDN to cache the requests for popular packages. NPM recently
switched to CloudFlare (I don't know what they used before).
- If needed, geo-replicate all index data. This can be either done by
replicating DB (NoSQL are better on this), or fully offload to cross region
replicated S3 buckets. This reduces worse case latency.
And finally, network latency is still higher than disk in most cases, so
as you said we should make Cargo async ready.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#5597 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AJBDl1w9P0aSqy0VWTCkFxijoL36KP_6ks5unpmugaJpZM4UVHvL>
.
|
@kc8apf Let's go through them one by one.
That indeed sounds like a problem. Though, I'm not sure what's the "acceptable" line here - given the latency git operations are going to be slow as well, so we should test out to see if we need additional optimizations on this.
I don't think that's necessary - once Cargo.lock is generated we don't need access to index anymore until the dependency tree is modified again.
This is unfortunate indeed since the custom registry RFC has been accepted. Though, moving formats doesn't need to be that hard - I think we can simply add a config property to specify the registry protocol version. |
Perhaps the fix could be to keep resolution driven by the client, but change the way index is downloaded and updated to allow having a partial index on the client. Roughly, change update model from git-like to svn-like. Instead of downloading the whole thing ahead of time, download and cache bits of the index as needed, but still operate on the local copy. I think that would scale even to NPM's size, since the partial index on the client wouldn't have to be much larger than the working set of crates people actually use. Partial fetch of the index as-needed in the worst case would be similar to To make things faster:
|
We often need to query exponentially more crates then we download, ( That is part of the definition of Making the resolver async ready, will require a pretty major rewrite. I will keep in mind this new design constraint in my dreams of a new resolver. So far all such dreams are early conversations of possible vaporware. To be fair I thout the same thing about public & private dependencies and 8 months of thinking about how to do it later it is almost done with out needing a manger design. |
@kc8apf <https://github.com/kc8apf> Let's go through them one by one.
I travel quite a bit and routinely find myself on connections where
latency can easily be 150+ ms. A moderate size dependency graph will
quickly rack up many seconds of latency.
That indeed sounds like a problem. Though, I'm not sure what's the
"acceptable" line here - given the latency git operations are going to be
slow as well, so we should test out to see if we need additional
optimizations on this.
Git's smart HTTP protocol is two round-trips for a pull. The second
round-trip is a large pack containing whatever objects the client needs.
After that, cargo is only accessing the local copy.
How will a client determine that a registry supports a different mechanism
without fetching the v1 index over git?
This is unfortunate indeed since the custom registry RFC has been
accepted. Though, moving formats doesn't need to be that hard - I think we
can simply add a config property to specify the registry protocol version.
Moving the registry config to a separate HTTP request outside the index
would solve a few problems I've run into when writting a Cargo registry
plugin for Sonatype Nexus. Specifically, I dynamically generate the
registry config already because the download and API URLs are determined
from the the client's request. Right now, I construct a git commit on every
index request that attempts to update the URLs.
|
This index downloading problem is starting to feel very similar to HTTP
caching. If cargo used pipelined HTTP requests and cache control headers,
would an update of individual crate metadata files be any worse than
visiting a content-heavy website?
…On Tue, Oct 23, 2018, 7:14 AM Kornel ***@***.***> wrote:
Perhaps the fix could be to keep resolution driven by the client, but
change the way index is downloaded and updated to allow having a *partial
index* on the client.
Roughly, change update model from git-like to svn-like. Instead of
downloading the whole thing ahead of time, download and cache bits of the
index as needed, but still operate on the local copy.
I think that would scale even to NPM's scale, since the partial index on
the client wouldn't have to be much larger than the working set of crates
people actually use.
Partial fetch of the index as-needed in the worst case would be similar to
O(n) network trips resolution method mentioned before, but with local
caching, this would become fast and even work offline after enough of the
index is cached.
To make things faster:
- Cargo could come preloaded with a mini partial version of the index
that contains top N most popular crates.
- The update protocol could allow server to push to the client's cache
more than asked for, e.g. client: "I want re/ge, because I'm looking
up regex"; server: "here's re/ge and ah/o-". This being entirely
opportunistic and client-driven doesn't require server to do NP-Hard
computation.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#5597 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AJBDlyFxyW4OBmRoyKvl3tMOZnorfo-Iks5unyQsgaJpZM4UVHvL>
.
|
I'm assuming that when you get a crate info, you get the same file as currently in the index, which contains dependency information for all versions of the crate. So if you had to make 1 request per 1 version of a crate, that would be lots of requests. But if one request returns info about all version of a crate, I think that's not too bad. While different versions may have different dependencies, it's exponentially bad only in theory. In practice people don't replace all their dependencies on every release with dependencies that replace all their dependencies on every release, etc., so the worst case will be you'll uncover and cache somewhat larger superset of dependencies. |
That's exactly what NPM do. @Eh2406 Resolving dependencies is NP-hard (if there are two dependency requesting conflicting dependencies it may be unable to reason about it in acceptable time), but the amount of index data is not.
@Eh2406 Please do not call it batching because it's confusing. It's just concurrency, we still make the same amount of requests. Batching is more like
@kornelski I'm not sure how it can be achieved. We need consistency to some degree on the index so for example when
This makes sense. I think what you propose is similar to HTTP/2 push, so there's a chance we can implement that optimization on the top of HTTP/2 spec. |
This is indeed tricky. Couple of solutions come to my mind. Revision map solutionLet's assume we have an index The client could be given a mapping between crates and the latest revision in which they were modified. Because it's safe for the mapping to contain a revision that's too high (it'll just cause unnecessary check), it can be compressed in a lossy way, e.g. only send max revision per subdirectory, or max revision based on a short hash of crate name. The mapping itself can be updated incrementally, sending revisions of crates that have been recently published. Merkele hash treeSimilar to mapping between crates and revisions, but it'd use content hashes instead of revisions. The downside is that hashes won't compress in a lossy way, so the index would grow. The upside is that the hashes can be built into a tree, similar to git's tree and commit hashes, so they could be used to verify not only whether crates are up to date, but also verify their integrity. Update-everything-since solutionThe client would check that the oldest crate it has is from revision 3, and ask the server to send metadata of crates that changed between rev. 3 and whatever is latest. The response would contain irrelevant crates (just as git update does today), but the diffs are still small compared to the whole index. The client can choose to ignore or delete irrelevant crates. (I'd go with that solution, since it's easiest to implement) Reverse revision informationEach crate's file could contain a revision number that is the latest revision of any of its dependencies. So if you install The implementation of it would depend on the server updating dependents of a crate every time a crate is published. Doing this is feasible, e.g. on crates.rs I resolve all dependencies for all crates, and build reverse index, so I can tell |
Closing in favor of #9069 |
This is a proposal to move to a API call based dependency resolution, rather than the Git based index clone. This is similar to what NPM uses.
Mainly, git based index clone has scalability issues:
Then, here are some random notes:
crates.io-index
repository, we can have daily database dumps, and offer an API and/or UI to display the latest updated crates.The text was updated successfully, but these errors were encountered: