Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277
Hi! I left a comment in the other issue; it seems that issue will not appear when using …. I tried your benchmark, and here are the results I got:
However, after changing the three functions to do four `u8` reads each, e.g.:

```rust
use std::io::{self, Cursor};
use byteorder::ReadBytesExt;

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    // ret += cursor.read_u16::<LittleEndian>()? as u64;
    // ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```

I got:
I think the change is pretty significant compared to the two tests I did not change.
I got different results with …, but they are gone after making the above change (doing four `u8` reads). So, in short: I think this may be related to LLVM's inlining strategy; reducing the number of distinct functions called helps.
Weird... the calls were indeed appearing in …. But note that even after you made a change that got the compiler to inline them, the code is still slow, which is what this issue is about.
After half a year, it looks like the generated code is much better, and most of the benchmarks I mentioned above are nearly equal; in particular, my "hand-optimizations" no longer appear to be necessary. The idiomatic `Cursor` code is still ~20% slower than the slice code in my microbenchmark, but I'm guessing that's just unavoidable with an extra abstraction layer, so I'm going to close this issue.
This basically started with investigating byteorder's `read_u16`-style extensions. They are more or less meant to be used roughly as in the sketch below.
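A minimal sketch of the usual pattern (the function and field names here are illustrative, not from my benchmark):

```rust
use std::io::Cursor;
use byteorder::{LittleEndian, ReadBytesExt};

// Hypothetical example: read two little-endian fields from a byte buffer.
fn parse_header(data: &[u8]) -> std::io::Result<(u16, u32)> {
    let mut cursor = Cursor::new(data);
    let version = cursor.read_u16::<LittleEndian>()?;
    let length = cursor.read_u32::<LittleEndian>()?;
    Ok((version, length))
}
```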
You can see that they are implemented in terms of `Read`'s `read_exact` method:
https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7
However, it seems that when used with `Cursor<&[u8]>` or just `&[u8]`, they are way slower than I'd expect them to be.

When I build a test program, I encounter this bug: #47321, which is pretty bad, especially as `read_exact` will call `memcpy` for all sizes > 1 byte. This causes, in particular, noticeable overhead on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to `read_exact`, and the inlined code contains no calls except to `io::Error::new`. However, even when inlined, the code is still suboptimal.

Normally, I'd assume that `cursor.read_u16()` does a range check, then a 2-byte read and a cursor increment, and that's it. But here it appears that more work is being done, probably redundant slice range checks.

Here is my base test case:
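A sketch of its shape (the exact code is in the benchmark repo linked at the bottom; the mixed-size reads here are assumed from the commented-out lines in the discussion above):

```rust
use std::io::{self, Cursor};
use byteorder::{LittleEndian, ReadBytesExt};

// Reads 8 bytes total: 1 + 1 + 2 + 4.
// #[inline(never)] keeps the function visible in the generated code.
#[inline(never)]
pub fn sum_byteorder_cursor(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```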
In my benchmarks, it takes 7.5 ns/call.
But when I manually hoist the slice range check out of the reads, so that the compiler knows the slice contains the required number of bytes:
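Roughly like the following sketch (illustrative, not the exact benchmark code): the remaining length is checked once up front, covering all 8 bytes the function consumes:

```rust
use std::io::{self, Cursor};
use byteorder::{LittleEndian, ReadBytesExt};

#[inline(never)]
pub fn sum_byteorder_cursor_checked(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    // Single range check for the 8 bytes consumed below, instead of
    // one check inside every read_u* call.
    let pos = cursor.position() as usize;
    if cursor.get_ref().len() < pos + 8 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "input too short"));
    }
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```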
The time improves to 5.3 ns/call, and AFAIK the behavior is identical (aside from a different error kind).
This still generated range checks that seemed redundant (I may be wrong here), so I introduced `unsafe`:
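Again only a sketch of the idea: after the explicit length check, the reads use `get_unchecked` and the cursor position is bumped manually:

```rust
use std::io::{self, Cursor};

#[inline(never)]
pub fn sum_cursor_unchecked(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let pos = cursor.position() as usize;
    let buf: &[u8] = cursor.get_ref();
    if buf.len() < pos + 8 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "input too short"));
    }
    // SAFETY: the check above guarantees at least 8 readable bytes at `pos`.
    let ret = unsafe {
        *buf.get_unchecked(pos) as u64
            + *buf.get_unchecked(pos + 1) as u64
            + u16::from_le_bytes([*buf.get_unchecked(pos + 2), *buf.get_unchecked(pos + 3)]) as u64
            + u32::from_le_bytes([
                *buf.get_unchecked(pos + 4),
                *buf.get_unchecked(pos + 5),
                *buf.get_unchecked(pos + 6),
                *buf.get_unchecked(pos + 7),
            ]) as u64
    };
    cursor.set_position((pos + 8) as u64);
    Ok(ret)
}
```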
Time improved even more, to 4.1 ns/call. (This is also similar to the time I got from an equivalent C++ program with a simple range check on every `read_u*` call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:
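For the slice variant the idea is the same (a sketch; `&[u8]` implements `Read`, so byteorder's extension methods work on it directly and advance the slice):

```rust
use std::io;
use byteorder::{LittleEndian, ReadBytesExt};

#[inline(never)]
pub fn sum_byteorder_slice(mut data: &[u8]) -> io::Result<u64> {
    // Hoisted range check: afterwards the compiler can prove that the
    // 8 bytes consumed below are in bounds.
    if data.len() < 8 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "input too short"));
    }
    let mut ret = 0;
    ret += data.read_u8()? as u64;
    ret += data.read_u8()? as u64;
    ret += data.read_u16::<LittleEndian>()? as u64;
    ret += data.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}
```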
The iteration speed improves by nearly 2x.
Now, I don't really know how much this has to do with #47321, and how much with other optimizations. It's also a bit awkward that #47321 triggers in the production code, so until that bug is fixed, I won't really see the benefits of fixing this one anyway.
I posted my benchmark at https://github.com/adrian17/rust-byteorder-benchmarks.