-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compiler Performance Tracking Issue #48547
Comments
Expanding on the RLS scenario: this is kind of an extra dimension rather than a separate scenario because we care about initial build time (which is a 'from scratch' build) and about the incremental case (which is the 'small change' build). We do care more about the 'small change' case since that is fundamental to the RLS, but the startup time is really important too. So, the differences between RLS and regular build:
And some areas which I think would be desirable for the RLS:
|
Excellent summary, @nrc! I don't have anything else to add really, but I would like to give some input about relative priorities of features.
I think this should be prioritized, as it's very important for the overall RLS architecture. My understanding is that save analysis at the moment is the bottleneck, which cancels "on-demand" features of rustc.
This suggests to cut the work along compiler phases. A more important thing to do is to cut along the orthogonal dimension, and run all compiler phases, but only for some code. This what "'very lazy' compilation" bullet is about I think. And end goal here is to show all compilation errors for a single (currently opened in the editor) file, and do not comute most of the errors for other fiels! A useful analogy here is a parallel pipeline vs a data parallel task: the first scales up to the number of pipeline steps, which is fixed, the second is emnarassigly parallel. Ultimatelly, queries and on-demand compilation should allow to cut work along both dimentions, if queries are exposed to RLS in some form (i.e, if they don't go through save-analysis bottleneck).
It seems to me that initial build for RLS could do much less work than the usual build: it should be enough to collect the items across the dependency graph, without checking their bodies. |
This is a great point! Though not quite so simple - we do need data about function internals for doing find all refs, etc. inside a function, but we could do that on demand if it were quick enough (which I think if we could just do a single function body at a time, it would be, but that is predicated on very incremental compilation). However, for dependent crates, we never need the function body information. We could make |
Thanks for the detailed feedback, @nrc & @matklad! I'll update the description of the RLS scenario. There's definitely more to it than to I'll open an sub-issue soon that will define the concrete benchmarks for measuring success in each of the scenarios. I suggest we continue discussion about how to measure the RLS case there. For concrete ideas on how to improve performance, I suggest that we open a separate tracking issue for RLS specific performance questions. I'd add this to the "Concrete Performance Improvement Initiatives" section then. |
For "Sharing closures among generic instances" I think it'd be good to generate just one struct per polymorphic function but give the closure type a type parameter iff it needs it. This would mean that closures would benefit from any work done on polymorphic functions. So this: fn do_a_thing<T>(foo: T) -> T {
let my_fn = || foo;
my_fn()
} would desugar to this: fn do_a_thing<T>(foo: T) -> T {
struct Closure<U>(U);
impl<U> FnOnce<()> for Closure<U> {
type Output = U;
fn call_once(self, _: ()) -> U {
self.0
}
}
let my_fn = Closure::<T>(foo);
my_fn()
} i.e. closures would get desugared before monomorphisation. I kinda just assumed this was how it worked anyway before seeing this issue. I don't know whether all that would be needed would be removing redundant type parameters from the generated closure struct, but a remove-redundant-type-parameter pass would lead to even bigger wins than just my change proposed here. |
@Vurich That's how it works already, more or less. And yes, closures and polymorphic functions would benefit the same from "unused type parameters" analyses. |
https://rustc-dev-guide.rust-lang.org/profiling.html has instructions to profile the compiler |
Visiting as part of backlog bonanza. The topic of compiler performance will probably never be "solved", except maybe if the performance improves so much that we stop worrying about it. In any case, I think the information captured in this issue description is super-valuable; it just should probably live somewhere else, more prominent, like the rustc-dev-guide. So, my recommendation is that we port the information into this description into the rustc-dev-guide, and after that is done, we close this tracking issue (with a comment that points to that new page/chapter in the rustc-dev-guide). @rustbot label: S-tracking-impl-incomplete |
If anyone does that, here are some useful updates from 2023:
|
This issue is the landing page for all things compilation speed related. We define the most important usage scenarios, as seen by the @rust-lang/compiler team and the @rust-lang/wg-compiler-performance working group. Benchmarks based on these scenarios are then used as a metric for how well the compiler is doing, compile-time-wise. We then establish a model of the compilation process and use it to estimate the effect of optimization categories on the various usage scenarios. Finally, we list concrete optimization ideas and initiatives, relate them to categories and usage scenarios, and link to their various specific GH issues.
Usage Scenarios
We track compiler performance in a number of use cases in order to gauge how compile times in the most important and common scenarios develop. We continuously measure how the compiler performs on perf.rust-lang.org. #48750 defines which concrete benchmarks go into measuring each usage scenario. In the following, "project" means a medium to large codebase consisting of dozens of crates, most of which come from crates.io.
FROM-SCRATCH - Compiling a project from scratch (P-high)
This scenario occurs when a project is compiled for the first time, during CI builds, and when compiling after running
cargo clean
,make clean
or something similar. The targeted runtime performance for builds like this is usually "fast enough", that is, the compiled program should execute with performance that does not hinder testing but it is not expected to have absolute peak performance as you would expect from a release build.SMALL-CHANGE - Re-Compiling a project after a small change (P-high)
This scenario is the most common during the regular edit-compile-debug cycle. Low compile times are the top-most priority in this scenario. The compiled runtime performance, again, should be good enough not to be an obstacle.
RLS - Continuously re-compiling a project for the Rust Language Server (P-high)
This scenario covers how the Rust compiler is used by the RLS. Again, low compile times are the top-most priority here. The only difference to the previous scenario is that no executable program needs to be generated at all. The output here is diagnostics and the RLS specific
save-analysis
data.DIST - Compiling a project for maximum runtime performance (P-medium)
This scenario covers the case of generating release artifacts meant for being distributed and for reflecting runtime performance as perceived by end users. Here, compile times are of secondary importance -- they should be low if possible (especially for running benchmarks) but if there is a trade-off between compile time and runtime performance then runtime performance wins.
Sometimes we also see people "measuring" Rust's compile times by compiling a "Hello World" program and comparing how long that takes in other languages. While we do have such a benchmark in our suite, we don't consider it one of the important usage scenarios.
A Model of the Compilation Process
The compiler does lots of different things while it is compiling a program. Making any one of those things faster might improve the compile time in one of the scenarios above or it might not. This section will establish a few categories of tasks that the compiler executes and then will map each category to the scenarios that it impacts.
Compilation Phases / Task Categories
The compiler works in four large phases these days:
Pre-Query Analysis (pre-query) -- This roughly includes parsing, macro expansion, name resolution, and lowering to HIR. The tasks in this phase still follow the bulk processing paradigm and the results it produces cannot be cached for incremental compilation. This phase is executed on a single thread.
Query-based Analysis (query) -- This includes type checking and inference, lowering to MIR, borrow checking, MIR optimization, and translation to LLVM IR. The various sub-tasks in this phase are implemented as so-called queries, which are computations that form a DAG and the results of which can be tracked and cached for incremental compilation. Queries are also only executed if their result is requested, so in theory this system would allow for partially compiling code. This phase too is executed on a single thread.
LLVM-based optimization and code generation (llvm) -- Once the compiler has generated the LLVM IR representation of a given crate, it lets LLVM optimize and then translate it to a number of object files. For optimized builds this usually takes up 65-80% of the overall compilation time. This phase can be run on multiple threads in parallel, depending on compiler settings. Incremental compilation allows to skip LLVM work but is less effective than for queries because caching is more coarse-grained.
Linking (link) -- Finally, after LLVM has translated the program into object files, the output is linked into the final binary. This is done by an external linker which
rustc
takes care of invoking.Note that this describes the compilation process for a single crate. However, in the scenarios given above, we always deal with a whole graph of crates. Cargo will coordinate the build process for a graph of crates, only compiling crates the code of which has changed. For the overall compile time of a whole project, it is important to note that Cargo will compile multiple crates in parallel, but can only start compiling a crate once all its dependencies have been compiled. The crates in a project form a directed acyclic graph.
Using Task Categories To Estimate Optimization Impact On Scenarios
From the description above we can infer which optimizations will affect which scenarios:
Concrete Performance Improvement Initiatives
There are various concrete initiatives of various sizes that strive to improve the compiler's performance. Some of them are far along, some of them are just ideas that need to be validated before pursuing them further.
Incremental compilation
Work on supporting incremental re-compilation of programs has been ongoing for quite a while now and it is available on stable Rust. However, there are still many opportunities for improving it.
Query parallelization
Currently to compiler can evaluate queries (which comprise a large part of the non-LLVM compilation phases) in a single-threaded fashion. However, since queries have a clear evaluation model which structures computations into a directed acyclic graph, it seems feasible to implement parallel evaluation for queries at the framework level. @Zoxc even has done a proof-of-concept implementation. This would potentially help with usage scenarios since all of them have to execute the query-part of compilation.
MIR-only RLIBs
"MIR-only RLIBs" is what we call the idea of not generating any LLVM IR or machine code for RLIBs. Instead, all of this would be deferred to when executables, dynamic libraries, or static C libraries are generated. This potentially reduces the overall workload for compiling a whole crate graph and has some non-performance related benefits too. However, it might be detrimental in some other usage scenarios, especially if incremental compilation is not enabled.
ThinLTO
ThinLTO is an LLVM mode that allows to perform whole program optimization in a mostly parallel fashion. It is currently supported by the Rust compiler and even enabled by default in some cases. It tries to reduce compile times by distributing the LLVM workload to more CPU cores. At the same time the overall workload increases, so it can also have detrimental effects.
Sharing generic code between crates
Currently, the compiler will duplicate the machine code for generic functions within each crate that uses a specific monomorphization of a given function. If there is a lot of overlap this potentially means lots of redundant work. It should be investigated how much work and compile time could be saved by re-using monomorphizations across crate boundaries.
Sharing closures among generic instances
We duplicate code for closures within generic functions even if they do not depend on the generic parameters of the enclosing function. This leads to redundant work. We should try to be smarter about it.
Perform inlining at the MIR-level
Performing at least some amount of inlining at the MIR-level would potentially reduce the pressure on LLVM. It would also reduce the overall amount of work to be done because inlining would only have to be done once for all monomorphizations of a function while LLVM has to redo the work for each instance. There is an experimental implementation of MIR-inlining but it is not production-ready yet.
Provide tools and instructions for profiling the compiler
Profiling is an indispensable part of performance optimization. We should make it as easy as possible to profile the compiler and get an idea what it is spending its time on. That includes guides on how to use external profiling tools and improving the compiler internal profiling facilities.
Build released compiler binaries as optimized as possible
There is still headroom for turning on more optimizations for building the
rustc
release artifacts. Right now this is blocked by a mix of CI restrictions and possibly outdated restrictions for when build Rust dylibs.Feel free to leave comments below if there's anything you think is pertinent to this topic!
The text was updated successfully, but these errors were encountered: