-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Micro-optimize the heck out of LEB128 reading and writing. #69050
Conversation
This commit makes the following writing improvements: - Removes the unnecessary `write_to_vec` function. - Reduces the number of conditions per loop from 2 to 1. - Avoids a mask and a shift on the final byte. And the following reading improvements: - Removes an unnecessary type annotation. - Fixes a dangerous unchecked slice access. Imagine a slice `[0x80]` -- the current code will read past the end of the slice some number of bytes. The bounds check at the end will subsequently trigger, unless something bad (like a crash) happens first. The cost of doing bounds check in the loop body is negligible. - Avoids a mask on the final byte. And the following improvements for both reading and writing: - Changes `for` to `loop` for the loops, avoiding an unnecessary condition on each iteration. This also removes the need for `leb128_size`. All of these changes give significant perf wins, up to 5%.
@bors try @rust-timer queue |
Awaiting bors try build completion |
⌛ Trying commit ad7802f with merge d902ca046d0a8cc72dd69a16627fa5da540030f1... |
Local
The biggest improvements are on "clean incremental" runs, followed by "patched incremental". |
☀️ Try build successful - checks-azure |
Queued d902ca046d0a8cc72dd69a16627fa5da540030f1 with parent dc4242d, future comparison URL. |
That's interesting. I remember that switching the code from Anyway, I'm happy to get any kind of improvement here. And it's even more safe than before 🎉 (In case someone is interested in the past of this implementation: https://github.com/michaelwoerister/encoding-bench contains a number of different versions that I tried out. It's rather messy as it's essentially a private repo but an interesting aspect is the test data files that are generated from actual rustc invocations) |
Finished benchmarking try commit d902ca046d0a8cc72dd69a16627fa5da540030f1, comparison URL. |
@bors r+ Thanks, @nnethercote! |
📌 Commit ad7802f has been approved by |
I was just finding it strange that the most significant bit was cleared out |
@nnethercote If you're bored, I wonder how this implementation compares to the pre-#59820 one in It definitely feels like your new version here is close to mine, but without checking I can't tell which one LLVM will prefer (or if they compile all the same). EDIT: also, has anyone considered using SIMD here, like @BurntSushi and others have employed for handling UTF-8/regexes etc.? I'm asking because UTF-8 is like a more complex LEB128. |
UTF-8 validation handles a lot of codepoints every call, while these read and write methods only handle a single LEB128 int per call, so SIMD is likely not useful. |
May not be relevant, but the serialized data is basically a sequence of |
If you are willing to do processor-specific tuning, |
See also Masked VByte [arXiv]. |
I tried the read and write implementations from |
Thanks for the link, I will take a look... but not in this PR :) |
@bors r=michaelwoerister |
📌 Commit ad7802f has been approved by |
BTW, in case anyone is curious, here's how I approached this bug. From profiling with Callgrind I saw that
These are instruction counts, and the percentages sum to about 11%. Lots of different functions are involved because the LEB128 functions are inlined, but the file is
Those percentages also add up to about 11%. Plus I poked around a bit at call sites and found this in a different file (
which is another 2.38%. So it was clear that LEB128 reading/writing was hot. I then tried gradually improving the code. I ended up measuring 18 different changes to the code. 10 of them were improvements (which I kept), and 8 were regressions (which I discarded). The following table shows the notes I took. The descriptions of the changes are a bit cryptic, but the basic technique should be clear.
Every iteration took about 6.5 minutes to recompile, and about 2 minutes to measure with Cachegrind. I interleaved these steps with other work, so in practice each iteration took anywhere from 10-30 minutes, depending on context-switching delays. The measurements in the notes are close to those from the CI run, which indicate the following for
Instruction counts are almost deterministic and highly reliable. Cycle counts are more variable but still reasonable. Wall-time is highly variable and barely trustworthy. But they're all pointing in the same direction, which is encouraging. Looking at the instruction counts, we saw that LEB128 operations were about 11-13% of instructions originally, and instruction counts went down by about 5%, which suggests that the LEB128 operations are a bit less than twice as fast as they were. Pretty good. |
…r=michaelwoerister Micro-optimize the heck out of LEB128 reading and writing. This commit makes the following writing improvements: - Removes the unnecessary `write_to_vec` function. - Reduces the number of conditions per loop from 2 to 1. - Avoids a mask and a shift on the final byte. And the following reading improvements: - Removes an unnecessary type annotation. - Fixes a dangerous unchecked slice access. Imagine a slice `[0x80]` -- the current code will read past the end of the slice some number of bytes. The bounds check at the end will subsequently trigger, unless something bad (like a crash) happens first. The cost of doing bounds check in the loop body is negligible. - Avoids a mask on the final byte. And the following improvements for both reading and writing: - Changes `for` to `loop` for the loops, avoiding an unnecessary condition on each iteration. This also removes the need for `leb128_size`. All of these changes give significant perf wins, up to 5%. r? @michaelwoerister
Rollup of 9 pull requests Successful merges: - #67642 (Relax bounds on HashMap/HashSet) - #68848 (Hasten macro parsing) - #69008 (Properly use parent generics for opaque types) - #69048 (Suggestion when encountering assoc types from hrtb) - #69049 (Optimize image sizes) - #69050 (Micro-optimize the heck out of LEB128 reading and writing.) - #69068 (Make the SGX arg cleanup implementation a NOP) - #69082 (When expecting `BoxFuture` and using `async {}`, suggest `Box::pin`) - #69104 (bootstrap: Configure cmake when building sanitizer runtimes) Failed merges: r? @ghost
☔ The latest upstream changes (presumably #69118) made this pull request unmergeable. Please resolve the merge conflicts. |
In response to earlier comments, PEXT can be used to encode with something like (untested)
You can do a similar thing with PDEP for decoding. Encoding larger integers is probably just best off using a branch to handle full chunks of 56 bits (with |
@fitzgen tried using PEXT a while back in a different project. For the common case (small integers that fit in 1 byte) it was a slight slowdown: |
Also, on intel chips, pext is implemented in hardware and super fast (one or two cycles iirc), but on amd it is implemented in microcode and is muuuuuch slower (150-300 cycles). Would have to be careful with it. |
@nnethercote The thing I would worry about with PEXT is the copy; if you do that byte-at-a-time (or with |
PR rust-lang#69050 changed LEB128 reading and writing. After it landed I did some double-checking and found that the writing changes were universally a speed-up, but the reading changes were not. I'm not exactly sure why, perhaps there was a quirk of inlining in the particular revision I was originally working from. This commit reverts some of the reading changes, while still avoiding `unsafe` code. I have checked it on multiple revisions and the speed-ups seem to be robust.
Tweak LEB128 reading some more. PR #69050 changed LEB128 reading and writing. After it landed I did some double-checking and found that the writing changes were universally a speed-up, but the reading changes were not. I'm not exactly sure why, perhaps there was a quirk of inlining in the particular revision I was originally working from. This commit reverts some of the reading changes, while still avoiding `unsafe` code. I have checked it on multiple revisions and the speed-ups seem to be robust. r? @michaelwoerister
#92604 is a successor to this PR, for those who like LEB128 micro-optimizations. |
As it turns out, the Rust compiler uses variable length LEB128 encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path. With this change we replace our decoding routines with these optimized ones. To make that happen more easily (and to gain some base line speed up), also remove the "shift" return from the respective methods. As a result of these changes, we see a respective speed up: Before: test util::tests::bench_u64_leb128_reading ... bench: 128 ns/iter (+/- 10) After: test util::tests::bench_u64_leb128_reading ... bench: 103 ns/iter (+/- 5) Gsym decoding, which uses these routines, improved as follows: main/symbolize_gsym_multi_no_setup time: [146.26 µs 146.69 µs 147.18 µs] change: [−7.2075% −5.7106% −4.4870%] (p = 0.00 < 0.02) Performance has improved. [0] rust-lang/rust#69050 [1] rust-lang/rust#69157 Signed-off-by: Daniel Müller <[email protected]>
As it turns out, the Rust compiler uses variable length LEB128 encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path. With this change we replace our decoding routines with these optimized ones. To make that happen more easily (and to gain some base line speed up), also remove the "shift" return from the respective methods. As a result of these changes, we see a respectable speed up: Before: test util::tests::bench_u64_leb128_reading ... bench: 128 ns/iter (+/- 10) After: test util::tests::bench_u64_leb128_reading ... bench: 103 ns/iter (+/- 5) Gsym decoding, which uses these routines, improved as follows: main/symbolize_gsym_multi_no_setup time: [146.26 µs 146.69 µs 147.18 µs] change: [−7.2075% −5.7106% −4.4870%] (p = 0.00 < 0.02) Performance has improved. [0] rust-lang/rust#69050 [1] rust-lang/rust#69157 Signed-off-by: Daniel Müller <[email protected]>
As it turns out, the Rust compiler uses variable length LEB128 encoded integers internally. It so happens that they spent a fair amount of effort micro-optimizing the decoding functionality [0] [1], as it's in the hot path. With this change we replace our decoding routines with these optimized ones. To make that happen more easily (and to gain some base line speed up), also remove the "shift" return from the respective methods. As a result of these changes, we see a respectable speed up: Before: test util::tests::bench_u64_leb128_reading ... bench: 128 ns/iter (+/- 10) After: test util::tests::bench_u64_leb128_reading ... bench: 103 ns/iter (+/- 5) Gsym decoding, which uses these routines, improved as follows: main/symbolize_gsym_multi_no_setup time: [146.26 µs 146.69 µs 147.18 µs] change: [−7.2075% −5.7106% −4.4870%] (p = 0.00 < 0.02) Performance has improved. [0] rust-lang/rust#69050 [1] rust-lang/rust#69157 Signed-off-by: Daniel Müller <[email protected]>
This commit makes the following writing improvements:
write_to_vec
function.And the following reading improvements:
[0x80]
--the current code will read past the end of the slice some number of
bytes. The bounds check at the end will subsequently trigger, unless
something bad (like a crash) happens first. The cost of doing bounds
check in the loop body is negligible.
And the following improvements for both reading and writing:
for
toloop
for the loops, avoiding an unnecessarycondition on each iteration. This also removes the need for
leb128_size
.All of these changes give significant perf wins, up to 5%.
r? @michaelwoerister