
Suboptimal code generation when using read_exact() on &[u8]/Cursor<&[u8]> #80277

Closed
adrian17 opened this issue Dec 21, 2020 · 3 comments
Labels
A-io Area: `std::io`, `std::fs`, `std::net` and `std::path` C-enhancement Category: An issue proposing an enhancement or a PR with one. I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

adrian17 commented Dec 21, 2020

This basically started with investigating byteorder's read_u16-style extensions. They are more or less meant to be used like this:

use std::io::Cursor;
use byteorder::{BigEndian, ReadBytesExt};

let mut rdr = Cursor::new(vec![2, 5, 3, 0]);
assert_eq!(517, rdr.read_u16::<BigEndian>().unwrap());
assert_eq!(768, rdr.read_u16::<BigEndian>().unwrap());

You can see that they are implemented in terms of Read's read_exact method:
https://github.com/BurntSushi/byteorder/blob/f10179b1c72f80d33f4bedf6ad3cf110ba2dac0a/src/io.rs#L110
And similarly in other projects:
https://github.com/image-rs/image-png/blob/ccd274fe028e5cb3247eaaf6d30ab83b5bde76e7/src/traits.rs#L7

However, when used with Cursor<&[u8]> or plain &[u8], they turn out to be much slower than I'd expect.

When I build a test program, I run into #47321, which is pretty bad, especially since read_exact calls memcpy for all sizes > 1 byte. This causes noticeable overhead, in particular on wasm.

When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact, and the only calls left in the inlined code are to io::Error::new. However, even when inlined, the generated code still looks suboptimal.
Normally, I'd expect cursor.read_u16() to do a range check, then a 2-byte read and a cursor increment, and that's it. But here it appears that more work is being done, probably redundant slice range checks.
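
To be concrete, the lowering I'd hope for is something like the following hand-written sketch (for illustration only -- the helper name is made up and this is not byteorder's implementation):

use std::io::{self, Cursor};

// Hand-written sketch of the "ideal" little-endian u16 read from a
// Cursor<&[u8]>: one range check, one 2-byte load, one position bump.
fn read_u16_le_sketch(cursor: &mut Cursor<&[u8]>) -> io::Result<u16> {
    let pos = cursor.position() as usize;
    let buf = *cursor.get_ref();
    if buf.len() < pos + 2 {
        // Same error that read_exact reports on a short read.
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let value = u16::from_le_bytes([buf[pos], buf[pos + 1]]);
    cursor.set_position((pos + 2) as u64);
    Ok(value)
}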

Here is my base test case:

use std::io::{self, Cursor};
use byteorder::{LittleEndian, ReadBytesExt};

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u16::<LittleEndian>()? as u64;
    ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}

In my benchmarks, this takes about 7.5 ns/call.

But when I manually hoist the slice range check out of the read, so that the compiler knows the slice contains the required number of bytes, for example:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = &cursor.get_ref()[cursor.position() as usize..cursor.position() as usize + 2];
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improves to about 5.3 ns/call, and as far as I know the behavior is identical (aside from the different error kind).

This still generated range checks that seemed redundant to me (I may be wrong here), so I introduced unsafe:

ret += cursor.read_u16::<LittleEndian>()? as u64;

->

if cursor.position() as usize + 1 >= cursor.get_ref().len() {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
let mut data = unsafe { cursor.get_ref().get_unchecked(cursor.position() as usize..cursor.position() as usize + 2) };
ret += data.read_u16::<LittleEndian>()? as u64;
cursor.set_position(cursor.position()+2);

The time improves even more, to about 4.1 ns/call. (This is also similar to the time I got from an equivalent C++ program with a simple range check on every read_u* call.)

Similar behavior occurs when I use a slice instead of a cursor. If I simply add a range check:

// data: &mut &[u8]
ret += data.read_u16::<LittleEndian>()? as u64;

->

if data.len() < 2 {
    return Err(io::Error::new(io::ErrorKind::Other, "failed to fill whole buffer"));
}
ret += data.read_u16::<LittleEndian>()? as u64;

The iteration speed improves by nearly 2x.
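
For reference, the Read impl for &[u8] implements read_exact roughly like this (paraphrased from memory, not the verbatim std source), which is also where the memcpy mentioned earlier comes from; presumably the hoisted length check lets the compiler fold away the check and error branch inside it:

use std::io;

// Simplified paraphrase of <&[u8] as io::Read>::read_exact, for reference only.
fn slice_read_exact(src: &mut &[u8], buf: &mut [u8]) -> io::Result<()> {
    if buf.len() > src.len() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "failed to fill whole buffer",
        ));
    }
    let (head, tail) = src.split_at(buf.len());
    // The real implementation special-cases 1-byte reads; anything larger
    // goes through copy_from_slice, i.e. a memcpy, as noted earlier.
    if buf.len() == 1 {
        buf[0] = head[0];
    } else {
        buf.copy_from_slice(head);
    }
    *src = tail;
    Ok(())
}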

Now I don't really know how much of this is related to #47321 and how much to other optimizations. It's also a bit awkward that #47321 triggers in the production code, so until that bug is fixed I won't really see the benefits of fixing this one anyway.

I posted my benchmark on https://github.com/adrian17/rust-byteorder-benchmarks .
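
For anyone reproducing this: a minimal #[bench] harness along these lines (a sketch -- the linked repo has more variants and may differ in details) is enough to see the difference. Run it with cargo +nightly bench:

#![feature(test)] // nightly-only bench harness
extern crate test;

use std::io::Cursor;
use test::Bencher;

#[bench]
fn bench_byteorder_cursor(b: &mut Bencher) {
    // Assumes sum_byteorder_cursor_impl from the base test case is in scope.
    // Arbitrary input size; each call consumes 8 bytes of the buffer.
    let data: Vec<u8> = (0..64 * 1024).map(|i| i as u8).collect();
    b.iter(|| {
        let mut cursor = Cursor::new(&data[..]);
        let mut total = 0u64;
        while let Ok(v) = sum_byteorder_cursor_impl(&mut cursor) {
            total += v;
        }
        test::black_box(total);
    });
}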

adrian17 added the C-bug (Category: This is a bug.) label Dec 21, 2020
jonas-schievink added the A-io, C-enhancement, and I-slow labels and removed the C-bug label Dec 21, 2020
wecing (Contributor) commented Dec 22, 2020

Hi! I left a comment in the other issue -- it seems that issue does not appear when building with --release.

I tried your benchmark and here are the results I got:

test tests::bench_byteorder_cursor           ... bench:      47,294 ns/iter (+/- 1,410)
test tests::bench_byteorder_cursor_v2        ... bench:      39,374 ns/iter (+/- 2,657)
test tests::bench_byteorder_cursor_v2_unsafe ... bench:      36,669 ns/iter (+/- 1,620)
test tests::bench_byteorder_slice            ... bench:      53,106 ns/iter (+/- 3,186)
test tests::bench_byteorder_slice_v2         ... bench:      42,819 ns/iter (+/- 1,803)

However, after changing the three sum_byteorder_cursor_impl* functions to always read a u8 four times, e.g.:

#[inline(never)]
pub fn sum_byteorder_cursor_impl(cursor: &mut Cursor<&[u8]>) -> io::Result<u64> {
    let mut ret = 0;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    ret += cursor.read_u8()? as u64;
    // ret += cursor.read_u16::<LittleEndian>()? as u64;
    // ret += cursor.read_u32::<LittleEndian>()? as u64;
    Ok(ret)
}

I got:

test tests::bench_byteorder_cursor           ... bench:      42,126 ns/iter (+/- 2,335)
test tests::bench_byteorder_cursor_v2        ... bench:      36,841 ns/iter (+/- 1,431)
test tests::bench_byteorder_cursor_v2_unsafe ... bench:      29,907 ns/iter (+/- 1,424)
test tests::bench_byteorder_slice            ... bench:      54,108 ns/iter (+/- 16,327)
test tests::bench_byteorder_slice_v2         ... bench:      42,847 ns/iter (+/- 1,505)

I think the difference is pretty significant compared to the two benchmarks I did not change.

> When I build it as a lib (or benchmark), the functions appear to get inlined: I can't find any calls to read_exact

I got different results with cargo +nightly rustc --release -- --emit asm using cargo 1.50.0-nightly (bfca1cd22 2020-11-24); the generated assembly does contain 4 calls to read_exact:

        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzbl  1(%rsp), %r12d
        movb    $0, 1(%rsp)
        leaq    1(%rsp), %rsi
        movl    $1, %r15d
        movl    $1, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzbl  1(%rsp), %r13d
        movw    $0, 2(%rsp)
        leaq    2(%rsp), %rsi
        movl    $2, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        movzwl  2(%rsp), %ebp
        movl    $0, 4(%rsp)
        leaq    4(%rsp), %rsi
        movl    $4, %edx
        movq    %r14, %rdi
        callq   _ZN66_$LT$std..io..cursor..Cursor$LT$T$GT$$u20$as$u20$std..io..Read$GT$10read_exact17hc00b0cae91b17a7bE
        cmpb    $3, %al
        jne     .LBB12_5
        addq    %r12, %r13
        addq    %r13, %rbp
        movl    4(%rsp), %eax
        addq    %rbp, %rax
        movq    %rax, 8(%rbx)
        xorl    %r15d, %r15d
        jmp     .LBB12_6

but they are gone after making the above change (doing 4 u8 reads).

So, in short: I think this may be related to the inlining strategy that LLVM takes; reducing the number of different functions called helps.

adrian17 (Author) commented

> I got different results with cargo +nightly rustc --release -- --emit asm using cargo 1.50.0-nightly (bfca1cd22 2020-11-24); the generated assembly does contain 4 calls to read_exact:

Weird... the calls were indeed appearing in the .s output, but not when I objdump'd the built rlib. After I deleted the target directory and rebuilt, the calls now show up in the built rlib too. Anyway, that is #47321.

But note that even after your change got the compiler to inline them, the code is still slow, which is what this issue is about.

adrian17 (Author) commented Jul 4, 2021

After half a year, it looks like the generated code is much better, and most of the benchmarks I mentioned above are now nearly equal - in particular, my "hand-optimizations" no longer appear to be necessary.

The idiomatic Cursor code is still ~20% slower than the slice code in my microbenchmark, but I'm guessing that's just the unavoidable cost of the extra abstraction layer, so I'm going to close this issue.

adrian17 closed this as completed Jul 4, 2021