Performance: Require `BufRead` instead of just `Read` for inputs. #427
This PR changes the decoder from pulling fixed size 32KB chunks of input data to having the caller push arbitrary sized chunks of data into the decoder. The default for

From testing locally on the QOI benchmark suite, I'm seeing a 1-2% performance regression with a (default sized)

I wonder if the performance gains you've seen on this were caused by avoiding the 32KB allocation+memset on very small images? If so, we could think about changes to control the size of the buffer based on the image size (likely by switching away from internally using a
I wonder if adding a benchmark that uses
I think your experiment supports merging this PR:
I think the data from the commit message suggests the opposite - the performance gains were much more pronounced for bigger images. I speculate that this is because the savings come from avoiding copying all of the image data, rather than coming from the initial, fixed-size data or memset. According to the commit message the savings looked as follows:
Sorry, I should have been more clear. The performance regression I found only happens with this PR but not on the main branch. That was what concerned me.

Running a different experiment, I tried making this patch to the benchmarks:

    --- a/benches/decoder.rs
    +++ b/benches/decoder.rs
    @@ -48,7 +48,7 @@ fn bench_file(c: &mut Criterion, data: Vec<u8>, name: String) {
         group.throughput(Throughput::Bytes(info.buffer_size() as u64));
         group.bench_with_input(name, &data, |b, data| {
             b.iter(|| {
    -            let decoder = Decoder::new(data.as_slice());
    +            let decoder = Decoder::new(std::io::BufReader::new(data.as_slice()));
                 let mut decoder = decoder.read_info().unwrap();
                 decoder.next_frame(&mut image).unwrap();
             })

I then ran each benchmark against the main branch and again against this PR:
I'll also note I'm somewhat biased here. This API change would have some ripple effects for parts of the main
@fintelia - could you please upload your benchmarking setup to a github repo + point out the hashes of the before/after commits? This would help me understand what is being compared in your experiment. Is it maybe like this?:
Yes, this is true. I think I'd prefer to require all users of I am also biased - the speed of decoding of in-memory PNGs is most important to me, because this is the form of PNG input in Chromium's I think that understanding the performance impact of this PR is important, so let's continue the discussion here. At the same time, maybe the 2 PRs at #429 and #428 are less controversial and maybe this is where we can focus the review efforts for now? Landing those other 2 PRs first may be desirable to:
BTW, one additional argument for not merging this PR yet is that so far Chromium has not landed any code that uses the Still, I am curious about the benchmarking results above, and would like to debug and understand them better. (Although as I said before, this seems lower priority than discussing the other 2, less controversial PRs.)
f24d4ff to e2eb3f8
e2eb3f8 to c67d90e
For full transparency, let me share some recent observations. I have rebased this PR on top of the latest changes, but when rerunning the benchmarks I've realized that for some testcases (e.g. noncompressed, 2048x2048 image, split across 64kB IDAT chunks) I observe a significant regression. The binary size and the number of instructions go down (6550545993 instructions => 6358944886 instructions - 3% improvement), but the stalled backend cycles increase from 5.37% backend cycles idle to 40.61% backend cycles idle (which translates into a regression of the overall runtime). I think that getting rid of the intermediate
Any chance your CPU has a 64KB L1 cache? The current approach does three copies with size=32KB: input -> BufReader -> out_buffer -> data_stream, which should stay inside the L1 cache. This PR changes it to two copies input -> out_buffer -> data_stream, but in the process makes the copy size depend on the underlying image's IDAT sizes. Which with 64KB IDATs means that the working set for each copy becomes 128KB |
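The arithmetic behind that estimate can be sketched as follows. This is only a rough model, assuming that a single copy touches both its source and its destination, so its working set is about twice the copy size. The 64KB L1 size is the hypothetical value from the question above, and `copy_working_set` and `fits_in_cache` are illustrative names, not anything from the crate.

```rust
const KIB: usize = 1024;

/// Rough model: one copy touches `chunk` source bytes plus `chunk` destination bytes.
fn copy_working_set(chunk: usize) -> usize {
    2 * chunk
}

/// True if the modelled working set of one copy fits in a cache of `cache` bytes.
fn fits_in_cache(chunk: usize, cache: usize) -> bool {
    copy_working_set(chunk) <= cache
}

fn main() {
    let l1 = 64 * KIB; // hypothetical L1 data cache size
    for chunk in [32 * KIB, 64 * KIB] {
        println!(
            "copy size {} KiB: working set {} KiB, fits in L1: {}",
            chunk / KIB,
            copy_working_set(chunk) / KIB,
            fits_in_cache(chunk, l1)
        );
    }
}
```

Under this model the fixed 32KB copies stay within a 64KB cache, while copies sized by 64KB IDAT chunks do not.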
Let me convert this PR to a "draft", so that it won't get accidentally merged before we understand the performance impact better. There are multiple factors at play, so I think that (instead of continuing the discussion here) I'll try to post something to #416.
This commit makes a breaking API change - it changes the `R: Read` constraint (in `Decoder` and `Reader` structs) to the `R: BufRead` constraint. This helps performance by avoiding copying the input data into an additional, intermediate `BufReader` that used to be stored in the (internal) `ReadDecoder::reader` field (after these changes that field is `R` rather than `BufReader`). In particular, some input types (e.g. when decoding from a `&[u8]`) already implement `BufRead` and for such types it is wasteful to introduce additional buffering via `BufReader`.

The impact of the change is significant, but relatively small - this means that it mostly shows up in `noncompressed...` benchmarks which magnify the cost of code paths that are not related to `fdeflate` nor `unfilter`.

Impact on benchmark runtime looks as follows (run once, and then rerun after compiling before+after with a fresh nightly `rustc`):

* kodim02.png:
    - No change in performance detected (p = 0.08 > 0.05)
    - [+1.3713% +1.7241% +2.0960%] (p = 0.00 < 0.05)
* kodim07.png:
    - [-1.1466% -0.6693% -0.2705%] (p = 0.00 < 0.05)
    - No change in performance detected. (p = 0.35 > 0.05)
* kodim17.png:
    - [-2.3062% -1.2878% +0.1746%] (p = 0.05 < 0.05)
    - [-2.7355% -1.9939% -0.7986%] (p = 0.00 < 0.05)
* kodim23.png:
    - No change in performance detected. (p = 0.51 > 0.05)
    - [-1.4834% -1.0648% -0.6692%] (p = 0.00 < 0.05)
* Lohengrin...png:
    - [-2.0606% -1.7935% -1.4756%] (p = 0.00 < 0.05)
    - [-4.2412% -3.6723% -3.0327%] (p = 0.00 < 0.05)
* Transparency.png:
    - [+1.4991% +1.8812% +2.3429%] (p = 0.00 < 0.05)
    - [-0.7939% -0.5746% -0.3590%] (p = 0.00 < 0.05)
* noncompressed-8x8.png:
    - [-2.2881% -1.4801% -0.4110%] (p = 0.00 < 0.05)
    - [-7.5687% -7.2013% -6.8838%] (p = 0.00 < 0.05)
* noncompressed-128x128.png:
    - [-12.495% -12.132% -11.760%] (p = 0.00 < 0.05)
    - [-10.597% -10.230% -9.8399%] (p = 0.00 < 0.05)
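To illustrate why an `R: BufRead` bound lets in-memory inputs skip a copy, here is a minimal sketch that uses only the standard library. The `checksum` function is a hypothetical stand-in for a decoder and is not part of the `png` crate. It reads through `fill_buf`/`consume`, so a `&[u8]` input is inspected in place, while wrapping the same slice in a `BufReader` first copies the bytes into the wrapper's internal buffer.

```rust
use std::io::{BufRead, BufReader};

/// Hypothetical consumer: sums all input bytes by borrowing the reader's own buffer.
fn checksum<R: BufRead>(mut input: R) -> std::io::Result<u64> {
    let mut sum = 0u64;
    loop {
        // `fill_buf` exposes the reader's internal buffer; nothing is copied out of it.
        let chunk = input.fill_buf()?;
        if chunk.is_empty() {
            return Ok(sum); // end of input
        }
        sum += chunk.iter().map(|&b| u64::from(b)).sum::<u64>();
        let consumed = chunk.len();
        input.consume(consumed);
    }
}

fn main() -> std::io::Result<()> {
    let data: Vec<u8> = (0..=255u8).cycle().take(100_000).collect();

    // `&[u8]` implements `BufRead`: `fill_buf` hands back the remaining slice itself.
    let direct = checksum(data.as_slice())?;

    // Wrapping in `BufReader` still works, but every byte is first copied
    // into the wrapper's internal buffer before `checksum` sees it.
    let wrapped = checksum(BufReader::new(data.as_slice()))?;

    assert_eq!(direct, wrapped);
    println!("both paths agree on checksum {direct}");
    Ok(())
}
```

With a plain `R: Read` bound, the consumer would have to add its own buffering even when the caller already holds the whole input in memory.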
Similarly to image-rs@1636b55 this commit tries to ensure that the working set fits into the L1 cache. Before this commit, the whole `ZlibStream::out_buffer` could be filled out and this buffer is potentially bigger than the typical 32kB of the L1 cache. After this commit, `MAX_INCREMENTAL_DECOMPRESSION_SIZE` limits how many bytes can be written to `ZlibStream::out_buffer` in a single call to `fdeflate::Decompressor::read`.
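The general idea of capping how much is written per step can be sketched with the standard library alone. This is not the crate's implementation: `STEP_LIMIT` is a hypothetical stand-in for `MAX_INCREMENTAL_DECOMPRESSION_SIZE` (its value here is made up), a plain slice copy stands in for decompression, and the `consume` callback stands in for whatever processes the freshly written bytes.

```rust
/// Hypothetical cap on bytes written per step (not the value used by the crate).
const STEP_LIMIT: usize = 16 * 1024;

/// Appends `input` to `out` in steps of at most `STEP_LIMIT` bytes and hands each
/// freshly written step to `consume` before producing the next one.
/// Returns the number of steps taken.
fn copy_in_bounded_steps(
    input: &[u8],
    out: &mut Vec<u8>,
    mut consume: impl FnMut(&[u8]),
) -> usize {
    let mut steps = 0;
    for step in input.chunks(STEP_LIMIT) {
        let start = out.len();
        out.extend_from_slice(step);
        // Only the bytes written in this step are touched before moving on.
        consume(&out[start..]);
        steps += 1;
    }
    steps
}

fn main() {
    let input = vec![0xABu8; 100_000];
    let mut out = Vec::with_capacity(input.len());
    let mut largest = 0usize;
    let steps = copy_in_bounded_steps(&input, &mut out, |fresh| {
        largest = largest.max(fresh.len());
    });
    assert!(largest <= STEP_LIMIT);
    assert_eq!(out, input);
    println!("{steps} steps, largest step {largest} bytes");
}
```

Because each step is consumed before the next is produced, the data touched per iteration is bounded by the cap rather than by the size of the whole output buffer.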
After this commit the size of the initial allocation of `ZlibStream::out_buffer` should be big enough to avoid having to grow the capacity of the buffer backing this vector of bytes. Removing this overhead is expected to have a positive impact on runtime performance.
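As a small illustration of the overhead being removed, the sketch below counts how often a `Vec<u8>` changes capacity while it is filled. The `fill` helper is hypothetical and unrelated to the crate's code; it only relies on the documented guarantee that `Vec::with_capacity(n)` can hold at least `n` elements without reallocating.

```rust
/// Pushes `total` bytes into `buf` and counts how many times its capacity changed,
/// i.e. how many times the backing allocation had to grow.
fn fill(mut buf: Vec<u8>, total: usize) -> (Vec<u8>, usize) {
    let mut reallocations = 0;
    let mut cap = buf.capacity();
    for i in 0..total {
        buf.push(i as u8);
        if buf.capacity() != cap {
            reallocations += 1;
            cap = buf.capacity();
        }
    }
    (buf, reallocations)
}

fn main() {
    let total = 64 * 1024;

    // Starting empty: the buffer grows repeatedly as bytes are pushed.
    let (_, grown) = fill(Vec::new(), total);

    // Sized up front: no growth is needed while staying within the initial capacity.
    let (_, presized) = fill(Vec::with_capacity(total), total);

    assert_eq!(presized, 0);
    println!("reallocations: {grown} when starting empty, {presized} when pre-sized");
}
```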
c67d90e to b2495a3
PTAL?