Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce unnecessary Vec allocations and indirections #77

Merged
merged 1 commit into from
Sep 3, 2022

Conversation

chyyran
Copy link
Contributor

@chyyran chyyran commented Aug 6, 2022

Pull Request Overview

  • Changed literal_probs array from a Vec<Vec<u16>> to a Vec2D backed by a contiguous Vec<u16>.
  • BitTrees in LenDecoder and DecoderState are now stored inline as an array. The actual BitTree data is still on the heap.
  • DecoderState::reset now just recreates stack-allocated data, dropping the existing ones. I found this to be faster than using slice::fill. The heap-allocated arrays still use fill to keep their allocation.

This gives a small but free performance boost and reduces overall allocations.

6e1f0d7

running 8 tests
test compress_65536                  ... bench:   1,406,069 ns/iter (+/- 60,172)
test compress_empty                  ... bench:         734 ns/iter (+/- 36)
test compress_hello                  ... bench:       1,037 ns/iter (+/- 31)
test decompress_after_compress_65536 ... bench:   2,708,511 ns/iter (+/- 298,490)
test decompress_after_compress_empty ... bench:       2,243 ns/iter (+/- 1,853)
test decompress_after_compress_hello ... bench:       2,869 ns/iter (+/- 680)
test decompress_big_file             ... bench:   4,097,344 ns/iter (+/- 182,653)
test decompress_huge_dict            ... bench:       2,791 ns/iter (+/- 183)

After this PR (notice the tighter variance in particular)

running 8 tests
test compress_65536                  ... bench:   1,391,247 ns/iter (+/- 38,388)
test compress_empty                  ... bench:         725 ns/iter (+/- 17)
test compress_hello                  ... bench:       1,035 ns/iter (+/- 36)
test decompress_after_compress_65536 ... bench:   2,441,367 ns/iter (+/- 53,371)
test decompress_after_compress_empty ... bench:       1,974 ns/iter (+/- 66)
test decompress_after_compress_hello ... bench:       2,459 ns/iter (+/- 291)
test decompress_big_file             ... bench:   3,849,870 ns/iter (+/- 192,785)
test decompress_huge_dict            ... bench:       2,427 ns/iter (+/- 126)

Testing Strategy

This pull request was tested by...

  • Added relevant unit tests.

@chyyran chyyran force-pushed the patch-reduce-allocs branch 3 times, most recently from c4a996a to 9f44242 Compare August 6, 2022 07:52
Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see some performance improvements here! Ultimately, const generics (#19) would go further into this direction.

src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/decode/lzma.rs Outdated Show resolved Hide resolved
src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/macros.rs Outdated Show resolved Hide resolved
src/decode/vec2d.rs Outdated Show resolved Hide resolved
src/decode/lzma.rs Outdated Show resolved Hide resolved
src/decode/lzma.rs Outdated Show resolved Hide resolved
@chyyran chyyran force-pushed the patch-reduce-allocs branch 2 times, most recently from 70194d7 to 038d597 Compare August 6, 2022 16:11
@chyyran chyyran requested a review from gendx August 6, 2022 16:12
@chyyran chyyran force-pushed the patch-reduce-allocs branch 3 times, most recently from e2f0650 to b5c8f04 Compare August 6, 2022 16:20
@chyyran chyyran mentioned this pull request Aug 9, 2022
@chyyran chyyran force-pushed the patch-reduce-allocs branch 2 times, most recently from b1a798a to a010cc0 Compare August 9, 2022 06:33
@chyyran
Copy link
Contributor Author

chyyran commented Aug 9, 2022

I found that it was faster to just recreate stack-allocated arrays in DecoderState than bother with filling them. Heap-backed arrays still use fill to keep their allocation.

@chyyran chyyran force-pushed the patch-reduce-allocs branch 2 times, most recently from 885a14f to b90b484 Compare August 17, 2022 00:50
@chyyran
Copy link
Contributor Author

chyyran commented Aug 17, 2022

b90b484 removes the explicit bounds checks in Index and IndexMut for Vec2D as well as the rows argument which isn't necessary afterwards. We can rely on the bounds checks for slice that can be better optimized by rustc instead on Release builds.

Looks like clippy is failing on an unrelated file with a new lint, so I'm going to leave that as is.

@gendx gendx mentioned this pull request Aug 23, 2022
1 task
Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the lints in #80, please rebase and re-run the workflows.

Comment on lines +227 to +238
self.pos_decoders = [0x400; 115];
self.is_match = [0x400; 192];
self.is_rep = [0x400; 12];
self.is_rep_g0 = [0x400; 12];
self.is_rep_g1 = [0x400; 12];
self.is_rep_g2 = [0x400; 12];
self.is_rep_0long = [0x400; 192];
self.state = 0;
self.rep.fill(0);
self.rep = [0; 4];
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this brings any readability benefit over fill. The compiler should be able to optimize both versions of the code regardless.

I see you however observed otherwise:

I found that it was faster to just recreate stack-allocated arrays in DecoderState than bother with filling them. Heap-backed arrays still use fill to keep their allocation.

Can you add a comment here to explain that?

(IMO, this might also be worth filing a bug to the Rust compiler, to understand if fill being slower is intentional or not a missed optimization somewhere in the compiler.)

src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Show resolved Hide resolved
src/util/vec2d.rs Show resolved Hide resolved
@chyyran chyyran force-pushed the patch-reduce-allocs branch from b90b484 to 8e08e96 Compare August 24, 2022 02:13
@chyyran
Copy link
Contributor Author

chyyran commented Aug 24, 2022

Addressed nits, and reformatted imports as per #78 (comment) as well.

@chyyran chyyran requested a review from gendx August 24, 2022 02:14
@chyyran chyyran force-pushed the patch-reduce-allocs branch 4 times, most recently from 583006a to 58360eb Compare August 26, 2022 21:31
@chyyran chyyran closed this Aug 26, 2022
@chyyran chyyran force-pushed the patch-reduce-allocs branch from 58360eb to c80e4aa Compare August 26, 2022 22:20
@chyyran chyyran reopened this Aug 26, 2022
@chyyran chyyran force-pushed the patch-reduce-allocs branch from 086d897 to 444e68c Compare August 26, 2022 22:32
@chyyran
Copy link
Contributor Author

chyyran commented Aug 26, 2022

Accidentally closed because of a faulty rebase, 444e68c should be properly rebased on current master now.

Copy link
Owner

@gendx gendx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've re-run the benchmarks, which show no significant change. However this is still a valuable cleanup - assuming the Vec2D implementation is correct with all the needed bounds checks!

src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Outdated Show resolved Hide resolved
src/util/vec2d.rs Show resolved Hide resolved
* Changed literal_probs array from a Vec<Vec<u16>> to a Vec2D backed by a contiguous allocation
* BitTrees in LenDecoder and DecoderState are now stored inline. The actual BitTree data still
  lives in a Vec but one level of indirection is reduced.
* Don't bother with filling stack-allocated DecoderState arrays on reset, and just recreate the
  arrays dropping the existing ones.
@chyyran chyyran force-pushed the patch-reduce-allocs branch from 444e68c to 48b66fa Compare August 28, 2022 23:35
@chyyran chyyran requested a review from gendx August 28, 2022 23:37
@chyyran
Copy link
Contributor Author

chyyran commented Aug 28, 2022

Fixed nits in 48b66fa

@gendx gendx merged commit 5ad34d7 into gendx:master Sep 3, 2022
@chyyran chyyran deleted the patch-reduce-allocs branch September 3, 2022 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants