-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not inlining array/slice access and iteration at opt-level=z causes knock-on effects on size #123345
Comments
@cbiffle While I would first check to see if anyone else has tried something like this in another PR first... yes, we are generally open to people opening such PRs! The main question is whether they do something like "regress compiler performance" or such, so they generally get thrown against rustc-perf first to examine that. It's also not-atypical for In any case, ideally they come with a codegen test (see tests/codegen and codegen-tests in the rustc-dev-guide) that verifies that the emitted IR has the properties you want for the code you wanted optimized. That way whatever is implemented for the optimization can get removed later hypothetically if LLVM ever learns to Just Do The Right Thing. |
@cbiffle Also, a bit obvious but: I strongly recommend investigating the latest behaviors on 1.77 and nightly! There has been a lot of effort, much of it very recent, to make rustc automatically inline more code by doing so on the MIR level, before LLVM has a chance to make decisions (and thus, possibly avoid making good ones, although this sometimes proves too over-eager). And it'd be great to land codegen or assembly tests even if this behavior is fixed for you. |
I've also encountered similar, and it still persists on nightly. It seems worse on opt-level=z than opt-level=s, but in my case the overall code size is smaller with |
I think the only solution to problems like this is
This is decently well-known, and is the reason I and some others have been surprised at the continuing popularity of Anyone working on this should be aware of the unfortunate interaction with ISPCCP and codegen tests, I'm quite sure that our current and only |
@workingjubilee - I can confirm @kevinmehall's observation that the behavior mostly continues on nightly. The (This has the exciting side effect that appeasing clippy, by using enumerate instead of iterating over a range and indexing, actually makes the code bigger and performance worse!) re: opt-level=3 being smaller sometimes:
Well, the "sometimes" here is the problem. We currently have images that will fit in a system with Side note: what I really want is the ability to provide an "inlining hints" file for a given application that applies the equivalent of |
Thanks for the reference on ISPCCP, I was unfamiliar with it, and it explains several cases where my 15-year-old mental model of LLVM's behavior wasn't matching what I was seeing. |
@cbiffle Well, you can still pass llvm-args to the backend. One thing that might be interesting to do, that a few people suggested long ago but that hasn't really been brought up since, is to pass special instructions to the pass manager to cope with the fact that Rust generates a lot of IR yet also benefits a lot from inlining, potentially much more on both counts than, say, clang. Given that LLVM recently-ish changed their pass manager interface a lot (and by recently I mean multiple versions ago For Good), it might benefit significantly from someone taking a whack at that, especially for |
@cbiffle One thing that's really missing here is that we have no tests at all for the code size of realistic codebases produced by |
Could cranelift be more easily tweaked for code size? In my naive understanding the egraph concept seems more amenable to the iterative inline, optimize, check size, repeat ... pick the best result approach that would be difficult for LLVM.
Does PGO help? It may not be designed to help with size, but for iterators it might help coincidentally since they also tend to be hot. |
@the8472 Currently, Cranelift is somewhat moot in this case: it does not support 32-bit Arm. |
@the8472 - unfortunately I have no idea how to apply PGO to a bare-metal embedded platform. |
IIRC Cranelift doesn't do inlining (only MIR inlining) and considers most interprocedural optimizations out of scope. But yes, it seems like egraphs could be much better at this. Is PGO feasible on Cortex-M? Instrumented build might be too much overhead to even fit on the device, and would need a way to get the profile back to the host. Perhaps something like ARM ETM could capture an execution trace and post-process it into a profile, but it doesn't seem like anyone has implemented such a thing. |
Cranelift doesn't support optimizing away loads and stores beyond very limited cases. And doesn't support inlining and other inter-procedural optimizations. As such it is almost certainly going to result in larger binaries than LLVM. |
For Hubris the OS image consists of a bunch of independent executables, right? Would it be possible to share some functions using dylib like functionality? You could put the "dylib" at a fixed address to avoid all complications of a dynamic linker if you make sure that the "dylib" doesn't reference the main executable. You only need to ensure that the "dylib" is mapped at the expected address by the kernel. Implementing this in a way that doesn't require a (relatively) huge dylib covering the entirety of libcore and liballoc including all unused functions may be non-trivial though. Maybe you can compile the entire system once, keep a record of all actually used symbols from the dylib and then recompile the dylib with a version script making only those symbols public + |
Bit off topic, but -- we looked into that early on, and the rustc/LLVM support for RWPI (read-write position independence), so that we could share the read-only text and not any read-write data segments, wasn't there yet. (Dylibs on Big Computers tend to assume things like a fixed offset from text to data, or that all data is next to each other, both of which are problematic for our use case.) I haven't seen signs that RWPI is stable so far. We're comfortable with our current approach for now. |
Libcore and liballoc only have read-only data afaik, so sharing all data too for the dylib shouldn't be a problem, right? You could add more crates to the dylib if you make sure they don't have any mutable data either. |
Alright, after some further analysis on programs here, I think as of current nightly, my main foe is
|
In case anyone wonders about that and tries it themselves: Yes, you can write your own It turns out that, annoyingly, It's off the topic of my original post but I'm now looking at |
Maybe those paths chould be outlined and put into In the meantime using build-std + panic_immediate_abort might help. |
That's hilarious. Anyways thanks for noting all this, @cbiffle, one of the things we've noticed is that we apparently don't enable a number of MIR opts with |
Summary
Hi! We've been seeing some surprising codegen decisions in Hubris programs for a while, and I wanted to surface it upstream in case there's anything to be done.
We build primarily at
opt-level="z"
, though many of these behaviors seem to happen atopt-level="s"
too. (z
code is very slightly smaller.)There are a handful of routines in
core
that rustc often decides not to inline which have significant knock-on effects on LLVM's ability to reason about panics, integer ranges for bounds checks, and the like. The problem primarily arises with:<Enumerate<core::slice::iter::Iter<T>> as Iterator>>::next
- not inlining this appears to hamper the ability of other parts of the compiler to notice that the enumerated indices it's producing are in a predictable range<Index<usize> for [T]>::index
or<Index<Range<usize>> for [T]>::index
- the non-inlined version of this does an unconditional bounds check, when in many cases there is information available at the callsite to eliminate it were it inlined<core::slice::Iter<T> as Iterator>::next
All of these routines, when inlined, tend to produce either one or zero machine instructions -- one in the case of
index
which is typically an indexed addressing mode load, zero in the case ofEnumerate::next
where the information is often already available in the calling frame for other reasons, and generally one in the case ofslice::Iter::next
.A concrete example from 1.75.0-ish
Here is a concrete example from one of our current programs.
The originating Rust code is (simplified):
Here's the ARMv7E-M code as produced by objdump from the output of
nightly-2023-12-21
(chosen to be approximately equivalent to 1.75.0).r6
here holds the length being produced by the inlined operation; along all code paths the length it produces is either 0, 1, or 8, all literal constants. I've included the simplest one here, which sets it to literal 1.Given that we are indexing a fixed-length 8-element array normally rustc would be very good about eliding the bounds check, since 0, 1, and 8 are all valid options here.
Instead we have the argument setup and a call to the non-inlined index function, which contains a bounds check and conditional panic:
Impact
In many of our programs, this has a pretty significant impact. In at least one case,
opt-level=3
winds up being smaller because of more aggressive inlining causing checks to be elided. Because panic sites are relatively expensive -- we format and record panic messages, so they pull in formatting machinery that might not otherwise be needed -- in programs that otherwise contain no panics, a shift in the inlining decisions around something likeindex
can cause a program's size to grow by 4 kiB pretty easily.For context, that's a 50-100% size increase on most of our programs.
A possibly clumsy solution
So, for the specific cases I described above -- indexing arrays with usizes or ranges of usizes, and
core::slice::Iter
-- does it ever make sense to not inline these routines? I'm wondering if folks would be open to me going through and carefully applying someinline(always)
attributes in this neighborhood. (This might not be sufficient for theEnumerate<T>
case but could definitely be applied to the array/slice Index impls.)If so, let me know what tests you'd like me to perform on the result. I would obviously measure the impact on various Hubris-based firmware, but perhaps there are other contexts where this could be a bad thing that you'd want me to check.
My LLVM experience is limited and rusty, so there might be a cleverer thing that could be done in the compiler, but I wouldn't know what to suggest.
rustc version
The examples above were generated by:
but I've been observing this behavior consistently since at least 2021, and just wasn't sure if an upstream bug report made sense.
The text was updated successfully, but these errors were encountered: