Per-user compiled artifact cache #5931
Comments
There have been musings about this historically but never any degree of serious consideration. I've always wanted to explore it though! (I think it's definitely plausible) |
sccache is one option here - it has a local disk cache in addition to the more exotic options to store compiled artifacts in the cloud. |
sccache would be good for the compilation time part, but it'd be nice to also get a handle on the disk size part of it. |
cc #6229 |
I think you can put
in But I have no idea if this mode is officially supported. Is it? |
Yes it is, as is setting it with the corresponding environment variable. However, the problem of cargo never deleting unused artifacts gets dramatic quickly. Hence the connection to #6229 |
@joshtriplett and I had a brainstorming session on this at RustNL last week. It'd be great if cargo could have a very small subset of sccache's logic: per-user caching of intermediate build artifacts. By building this into cargo, we can tie it into all that cargo knows and can make extensions to better support it.
Risks
To mitigate problems with cache poisoning
As a contingency for if the cache is poisoned, we need a way to clear the cache (see also #3289)
To mitigate running out of disk space, we need a GC / prune (see also #6509)
Locking strategy to mitigate race conditions / locking performance
Transition plan (modeled off of sparse registry)
|
Wonder if something like reflink would be useful |
See also #7150 |
fix(embeded): Don't pollute the scripts dir with `target/`
### What does this PR try to resolve?
This PR is part of #12207. This specific behavior was broken in #12268 when we stopped using an intermediate `Cargo.toml` file.
Unlike pre-#12268,
- We are hashing the path, rather than the content, with the assumption that people change content more frequently than the path
- We are using a simpler hash than `blake3` in the hopes that we can get away with it
Unlike the Pre-RFC demo
- We are not forcing a single target dir for all scripts in the hopes that we get #5931
### How should we test and review this PR?
A new test was added specifically to show the target dir behavior, rather than overloading an existing test or making all tests sensitive to changes in this behavior.
### Additional information
In the future, we might want to resolve symlinks before we get to this point
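To illustrate the "hash the path, not the content" idea from the PR description, here is a minimal sketch. It is not Cargo's actual implementation: the helper name `script_target_dir`, the directory layout, and the use of the standard library's `DefaultHasher` as the "simpler hash" are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

/// Hypothetical helper: derive a per-script target dir from the script's path.
/// The script's contents are never read, so editing the script keeps the same dir.
fn script_target_dir(cache_root: &Path, script: &Path) -> PathBuf {
    // Assumption: DefaultHasher stands in for "a simpler hash than blake3".
    // Its output is not guaranteed to be stable across Rust releases.
    let mut hasher = DefaultHasher::new();
    script.hash(&mut hasher);
    let digest = format!("{:016x}", hasher.finish());
    // Split the digest so a single directory does not collect every script.
    cache_root.join("target").join(&digest[..2]).join(&digest[2..])
}

fn main() {
    let root = Path::new("/home/user/.cargo");
    let a = script_target_dir(root, Path::new("/home/user/scripts/a.rs"));
    let b = script_target_dir(root, Path::new("/home/user/scripts/b.rs"));
    println!("{}", a.display());
    println!("{}", b.display());
}
```

Because only the path feeds the hash, moving a script gives it a fresh target dir, which is the trade-off the PR description accepts.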
CARGO_TARGET_DIR
to make it the parent of all target directories
rust-lang/rfcs#3371
Some complications that came up when discussing this with ehuss. First, some background. We track rebuilds in two ways. The first is we have an external Problems
I guess the first question is whether the per-user cache should be organized around |
Cargo uses relative paths to the workspace root for path dependencies to generate stable hashes. This causes an issue (#12516) when sharing target directories between packages with the same name, version, and relative path to the workspace. |
For me, the biggest thing that needs to be figured out before any other progress is worth it is how to get a reasonable amount of value out of this cache. Take my system
This is a "bottom of the stack" package. As you go up the stack, the impact of version combinations grows dramatically. I worry a per-user cache's value will only be slightly more than making |
How did you do that analysis? I'd be interested in running it on my own system. Also. re caching and |
Pre-req: I keep all repos in a single folder.
$ ls */Cargo.lock | wc -l
$ rg 'name = "syn"' */Cargo.lock -l | wc -l
$ rg 'name = "syn"' */Cargo.lock -A 1 | rg version | rg -o '".*"' | sort -u | wc -l (and yes, there are likely |
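For anyone without `rg` handy, the same count can be sketched in Rust. This is an illustrative stand-in for the shell pipeline above, not a tool from the thread: the function name `versions_of` is hypothetical, and like the `-A 1` in the original command it assumes the `version` line directly follows the `name` line in `Cargo.lock`.

```rust
use std::collections::BTreeSet;

/// Collect the distinct versions of `package` recorded in one Cargo.lock's text.
fn versions_of(lock_text: &str, package: &str) -> BTreeSet<String> {
    let wanted = format!("name = \"{}\"", package);
    let mut versions = BTreeSet::new();
    let mut lines = lock_text.lines();
    while let Some(line) = lines.next() {
        if line.trim() == wanted {
            // Assumption: `version = "..."` is the very next line after `name`.
            if let Some(next) = lines.next() {
                if let Some(v) = next.trim().strip_prefix("version = ") {
                    versions.insert(v.trim_matches('"').to_string());
                }
            }
        }
    }
    versions
}

fn main() {
    // Two made-up lockfile fragments standing in for `*/Cargo.lock`.
    let locks = [
        "[[package]]\nname = \"syn\"\nversion = \"1.0.109\"\n",
        "[[package]]\nname = \"syn\"\nversion = \"2.0.48\"\n",
    ];
    let mut all = BTreeSet::new();
    for text in locks {
        all.extend(versions_of(text, "syn"));
    }
    println!("distinct syn versions: {}", all.len());
}
```

To run it over real repos, read each `Cargo.lock` with `std::fs::read_to_string` and merge the returned sets the same way.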
Thanks! I keep my projects in two dirs (approximated active and inactive projects), but running this separately on both I get the following. Also included
|
Also syn is part of a set of crates that gets a lot of little bumps. This is common for dtolnay crates, but not so much for a whole host of other crates -- so I'm not sure this particular test is very representative. (Note that I'm definitely not disagreeing that the utility of a per-user compiled artifact cache might not be as great as hoped.) FWIW, given what I see Rust-Analyzer doing in a large workspace at work (some 670 crates are involved) it seems to be doing a lot of recompilation even with only weekly updates to the dependencies so even within a single workspace there might be some wins? |
Somebody with a deduplicating file system could share their statistics for a theoretical upper bound? |
Yes, we could have locks on a per-cached item basis and read directly from it. Whether we do depends on how much we trust the end-to-end process. |
Hi folks, chiming in here to merge two streams: some of us at Microsoft did a hackathon project to prototype a per-user Cargo cache, late last September. Here's the Zulip chat: https://rust-lang.zulipchat.com/#narrow/stream/246057-t-cargo/topic/Per-user.20build.20caches
Our initial testing of this generally showed a surprisingly small speedup, even when things were entirely cached. It seems that rustc is just really damn fast at crate compilation :-O And that (as we know) the long pole is always the final LLVM binary build.
This change took the approach of creating a user-shared cache using the
I do think this change's approach of having a very narrow cache interface is a good design direction that was reasonably proven by this experiment. Happy to discuss our approach on any level, hope it is useful to people wanting to move this further forwards. |
But the problem being solved here is not only the speed, but the amount of storage space being consumed. When you have dozens of projects each compiling the same crates and each easily taking up 2GB, your drive starts filling up very quickly! |
Exactly. I've now got a Ryzen 5 7600. Combined with mold, |
The write cycles issue is very pertinent! |
Came across this while reading the design document, and decided to share my thoughts here. Hopefully this will be good input.
This makes several assumptions about CI behavior, probably based on how GH Actions behaves. Note that other runners behave differently; for example, by default GitLab does not upload the cache anywhere. I'm not even sure if the cache is compressed. This shouldn't be an issue, but it is a set of bad assumptions in the design doc and could theoretically lead to bad design. I would also like to add that CI interactions should support multiple CI runners from the get-go. GitHub is, for better or worse, dominant, especially in mindshare, and the Rust project shouldn't be furthering its monopoly.
Hashes and fingerprinting
Stale caches are a major pain, to the point that I have learned to recognize the signs working with at least two build systems. Currently, Cargo does not even use the whole hash, only 64 bits (sixteen hex digits) of it. I'm worried that with user-wide caches, collisions may happen. To that end, I'd prefer there was a plaintext copy of all the data going into the cache, so it can be verified that it is indeed the correct cache entry. Or at least use the full length of the hash.
Garbage collection
I've seen someone mention clearing old stuff by access time. This would work, in theory, but is also something to be careful around. For example, a lot of desktop systems use Personally, I would love to see something more advanced, with data tracking.
Setting the cache directory
It is a feature that would greatly improve flexibility, while being fairly simple. I know people who would put the build cache in a ramdisk. I can envision a situation where the cache is put on a network share, to provide rudimentary sharing between people. Just allowing the user to configure the cache directory would make it much easier to set up. The default should be under |
This is mentioning a use case. Nothing in the design is "github specific"
We are doing our own access time tracking in an sqlite database, see https://doc.rust-lang.org/nightly/cargo/reference/unstable.html#gc |
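As a rough picture of what "our own access time tracking" means, independent of filesystem atime, here is a minimal sketch. It is not Cargo's code: the thread says Cargo uses an sqlite database, while this uses an in-memory `HashMap`, and the `LastUse` type with its `touch` and `prune` methods is hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, SystemTime};

/// Hypothetical in-memory stand-in for a last-use database.
struct LastUse {
    entries: HashMap<String, SystemTime>,
}

impl LastUse {
    fn new() -> Self {
        LastUse { entries: HashMap::new() }
    }

    /// Record that `key` was used at `now` (called by the build tool itself,
    /// so it does not depend on the filesystem's atime settings).
    fn touch(&mut self, key: &str, now: SystemTime) {
        self.entries.insert(key.to_string(), now);
    }

    /// Drop and return every key whose last use is older than `max_age`.
    fn prune(&mut self, now: SystemTime, max_age: Duration) -> Vec<String> {
        let mut stale = Vec::new();
        self.entries.retain(|key, used| {
            // A last-use time in the future (clock skew) is treated as fresh.
            let expired = now
                .duration_since(*used)
                .map(|age| age > max_age)
                .unwrap_or(false);
            if expired {
                stale.push(key.clone());
            }
            !expired
        });
        stale.sort();
        stale
    }
}

fn main() {
    let day = Duration::from_secs(24 * 60 * 60);
    let start = SystemTime::now();
    let mut db = LastUse::new();
    db.touch("syn-1.0.109", start);
    db.touch("syn-2.0.48", start + day * 90);
    // Pretend 100 days have passed and prune anything unused for 30 days.
    let removed = db.prune(start + day * 100, day * 30);
    println!("would delete: {:?}", removed);
}
```

A real implementation would persist the table and delete the matching files on disk; this only shows the bookkeeping.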
Looks like with 64 bits a collision only starts becoming a realistic possibility when you get close to a billion entries. From: https://en.wikipedia.org/wiki/Birthday_problem#Probability_table |
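The estimate can be checked with the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / (2·2^bits)). The sketch below assumes the hashes behave like uniformly random 64-bit values, which is an idealization; the function name `collision_probability` is just for illustration.

```rust
/// Approximate probability of at least one collision among `n`
/// uniformly random `bits`-bit hashes (birthday approximation).
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    let x = -n * (n - 1.0) / (2.0 * space);
    // 1 - e^x, written with exp_m1 to stay accurate when x is tiny.
    -x.exp_m1()
}

fn main() {
    for n in [1e6, 1e8, 1e9, 5e9] {
        println!("{:e} entries: p = {:.3e}", n, collision_probability(n, 64));
    }
}
```

Under that assumption, a billion entries gives a collision chance of a few percent, while a million entries is far below one in a million.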
I was wondering if anyone has contemplated somehow sharing compiled crates. If I have a number of projects on disk that often have similar dependencies, I'm spending a lot of time recompiling the same packages. (Even correcting for features, compiler flags and compilation profiles.) Would it make sense to store symlinks in ~/.cargo or equivalent pointing to compiled artefacts?