Upload index metadata to index/ when publishing new crates #4661
Thanks for the PR @arlosi!
There are a few operational issues that we'll need to address before this can be merged.
- We need to invalidate the CloudFront cache for a file after modifying it on S3. Even with a short cache-control interval, files will enter the cache at different times and clients could end up seeing an inconsistent index state (such as a crate with a dependency that does not yet appear to be published). We could then probably increase the max-age allowed to be cached by CF.
- These files should probably be stored in their own S3 bucket. We should also serve them from something like `index.crates.io` instead of `static.crates.io`. I'll discuss with infra at our next meeting.
- We need a way to populate the full repository, as these changes will only upload an index file upon a publish. A command could be added to `src/admin` that publishes all of the crate files. There will probably be some complexity around getting this right: if there is a gap between this task and enabling it in production, then updates will be missed, but if there is overlap, then updates could be overwritten with older data.
- Similarly, we should add logic to the `delete_crate` admin task to delete the index file from S3 and invalidate the CF cache. This happens pretty rarely and we don't automatically update the git index yet, but it would be helpful to update the HTTP index automatically so that we don't forget.
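To make the first point concrete, here is a minimal sketch of the "write to S3, then invalidate that path on the CDN" ordering. It is not the code in this PR: the `ObjectStore` and `Cdn` traits, the in-memory `MemoryStore` and `RecordingCdn` types, and `publish_index_file` are all hypothetical stand-ins so the example runs without any AWS dependency.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over the bucket that holds index files.
trait ObjectStore {
    fn put(&mut self, key: &str, body: &str);
}

/// Hypothetical abstraction over the CDN in front of the bucket.
trait Cdn {
    fn invalidate(&mut self, path: &str);
}

/// In-memory stand-in for S3, used only for illustration.
#[derive(Default)]
struct MemoryStore {
    objects: HashMap<String, String>,
}

impl ObjectStore for MemoryStore {
    fn put(&mut self, key: &str, body: &str) {
        self.objects.insert(key.to_string(), body.to_string());
    }
}

/// Stand-in for CloudFront that only records which paths were invalidated.
#[derive(Default)]
struct RecordingCdn {
    invalidated: Vec<String>,
}

impl Cdn for RecordingCdn {
    fn invalidate(&mut self, path: &str) {
        self.invalidated.push(path.to_string());
    }
}

/// Writes the index file first, then requests invalidation of the same path,
/// so the CDN never re-caches the old object after the invalidation.
fn publish_index_file(store: &mut impl ObjectStore, cdn: &mut impl Cdn, key: &str, body: &str) {
    store.put(key, body);
    // CloudFront invalidation paths begin with a leading slash.
    cdn.invalidate(&format!("/{key}"));
}
```

The same shape would apply to the `delete_crate` case: remove the object first, then invalidate its path.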
On behalf of the cargo team, I would love to be part of meetings about designing this. I'm also not sure about when it's best to hash things out in this PR, or on the zulip conversation, or in a meeting. Having said that I will try and respond to each point as shortly as I can. Happy to talk in depth about any of them when we have picked the right venue.
One overarching thought: the cargo side of this is not ready for stabilization. At this point it's premature for crates.io to be prepared for a fully operationalized HTTP index. As long as we have a plan in place, and each step gets us closer to it, we can take it one step at a time.
That sounds like a great solution!
Thanks for the additional context. I agree that at this point we just need to agree on a plan for long term operations, and can begin iterating towards that.
Okay, it looks like "[o]bject invalidations typically take from 60 to 300 seconds to complete." The PR currently sets the max-age to 600 seconds. For projects that release multiple crates they expect each crate to be published within seconds (to publish crates which depend on it), not minutes, so I wonder if we would need to set a very short max-age and only get the benefits of caching for very popular crates. I'd love to have some estimates of expected traffic and the associated costs of various options to base these design decisions on.
My main concern here is also related to new publishes (which will probably be solved by however we solve the above). If I publish new major versions of a batch of crates that depend on each other, then currently there would be a period of up to 10 minutes where some users may have a broken build if they attempt to upgrade. In contrast, the whole git index is updated atomically and we can ensure that for every dependency in the index cargo will find at least one crate version that satisfies it. (Currently I think the only exceptions are if a crate is deleted, or if versions are yanked without a compatible replacement.)
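The broken-build window can be illustrated with a small sketch. It models a client's view of the index as a map from crate name to visible releases and reports dependencies that no visible release satisfies. The `Release` type, the `IndexView` alias, and `unsatisfied_deps` are hypothetical, and matching on an exact major version is a deliberate simplification of real semver requirements.

```rust
use std::collections::BTreeMap;

/// Simplified release record: a major version plus dependencies expressed as
/// (crate name, required major version). Real index entries carry full
/// semver requirements; this is an assumption made for brevity.
#[derive(Debug, Clone)]
struct Release {
    major: u64,
    deps: Vec<(String, u64)>,
}

/// The index as one client currently sees it, file by file.
type IndexView = BTreeMap<String, Vec<Release>>;

/// Returns (dependent, dependency, required major) for every dependency that
/// no release visible in this view can satisfy.
fn unsatisfied_deps(view: &IndexView) -> Vec<(String, String, u64)> {
    let mut missing = Vec::new();
    for (name, releases) in view {
        for release in releases {
            for (dep, required_major) in &release.deps {
                let satisfied = view
                    .get(dep)
                    .map_or(false, |rs| rs.iter().any(|r| r.major == *required_major));
                if !satisfied {
                    missing.push((name.clone(), dep.clone(), *required_major));
                }
            }
        }
    }
    missing
}
```

With per-file caching, a client can hold a fresh file for a dependent crate and a stale file for its dependency, which is exactly the state this function flags. An atomically updated git index never exposes that combination for a new publish.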
Well, it has an "atomic copy if not exist" command.
That is sort of in the awkward middle. If we want to ensure updates take less than, say, 30 seconds, then a short max-age is the only way to do that. If we are comfortable with updates occasionally taking over 300 seconds, then an infinitely long max-age and invalidations are the best way to do that. I'm not sure what's best, but my understanding is that we can change this pretty easily at any time.
At the moment the traffic should be approximately 0, as the feature is not even available on nightly. As we stabilize it, it should eventually grow to match current traffic for cloning the index. I don't know if the numbers I got from @pietroalbini can be shared publicly. I will start a DM on zulip if that works for you.
Yes, all the oddities occur when trying to build a package within one max-age of a publish. Even with git's nice atomic properties, even if a version does match every requirement, that does not mean that every requirement can be satisfied. Every version that matches could conflict with some other requirement, or the lockfile.
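The trade-off can be written down as simple arithmetic. The sketch below is a rough model, not a measurement: `CacheStrategy`, `worst_case_staleness_secs`, and `meets_target` are hypothetical names, and it assumes the invalidation is requested at upload time and that the quoted 300 seconds is a hard upper bound (the quote only says "typically").

```rust
/// Hypothetical model of the caching options discussed in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheStrategy {
    /// Rely on cache expiry alone.
    MaxAgeOnly { max_age_secs: u64 },
    /// Effectively unbounded max-age, with an invalidation after every upload.
    InvalidateOnly { invalidation_secs: u64 },
    /// Finite max-age plus an invalidation after every upload.
    Both { max_age_secs: u64, invalidation_secs: u64 },
}

/// Upper bound on how long a client may keep seeing the old file after an
/// upload, under the assumptions stated above.
fn worst_case_staleness_secs(strategy: CacheStrategy) -> u64 {
    match strategy {
        CacheStrategy::MaxAgeOnly { max_age_secs } => max_age_secs,
        CacheStrategy::InvalidateOnly { invalidation_secs } => invalidation_secs,
        // The stale entry is dropped by whichever mechanism fires first.
        CacheStrategy::Both { max_age_secs, invalidation_secs } => {
            max_age_secs.min(invalidation_secs)
        }
    }
}

/// Whether a strategy keeps worst-case staleness within the target.
fn meets_target(strategy: CacheStrategy, target_secs: u64) -> bool {
    worst_case_staleness_secs(strategy) <= target_secs
}
```

Plugging in the numbers from this thread: a 600 second max-age alone bounds staleness at 600 seconds, invalidation alone at roughly 300 seconds, and only a max-age of 30 seconds or less meets a 30 second target.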
Created buckets and CDNs both for staging and production, the existing credentials have access to them:
|
20cf719 to 92698e3
The |
You are not missing anything. To quote:
|
Oops, I misread the request. The |
Is someone available to review this? Let me know if there are changes needed.
☔ The latest upstream changes (presumably 531b5c8) made this pull request unmergeable. Please resolve the merge conflicts.
Also provides a new admin tool to bulk upload existing index files.
I'm approving and merging this with the understanding that this is an experimental feature without any stability guarantees for now.
As @jtgeibel pointed out in #4661 (review), there are still a few issues to solve for this implementation, but we agree that we can solve these in an incremental way while the feature is still considered experimental.
Cargo can access http-based registries via rust-lang/cargo#10470.
This change causes crates.io to publish any changed metadata files to `index/` on S3 in addition to the git-based index. The S3 bucket is configured by new environment variables `S3_INDEX_*`.
A new admin tool for bulk-uploading crates is also added.
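For readers unfamiliar with where a crate's metadata file lives, the sketch below computes its path using the directory layout Cargo documents for registry indexes. The function names are hypothetical, and prefixing the result with `index/` to form an S3 key is an assumption based on the description above, not something verified against the merged code.

```rust
/// Returns the path of a crate's metadata file relative to the index root,
/// or `None` for names this sketch does not handle (empty or non-ASCII).
fn index_file_path(crate_name: &str) -> Option<String> {
    if crate_name.is_empty() || !crate_name.is_ascii() {
        return None;
    }
    // Index file names are lowercase even when the crate name is not.
    let name = crate_name.to_ascii_lowercase();
    let path = match name.len() {
        1 => format!("1/{}", name),
        2 => format!("2/{}", name),
        // Three-letter names are grouped by their first character.
        3 => format!("3/{}/{}", &name[..1], name),
        // Longer names use the first two and the next two characters.
        _ => format!("{}/{}/{}", &name[..2], &name[2..4], name),
    };
    Some(path)
}

/// Hypothetical helper: the S3 object key, assuming an `index/` prefix.
fn s3_index_key(crate_name: &str) -> Option<String> {
    index_file_path(crate_name).map(|path| format!("index/{}", path))
}
```

Under this layout, `serde` maps to `se/rd/serde` and `syn` maps to `3/s/syn`.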