-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memoization of RGBA palette when expanding palette indices into RGB8 or RGBA8 #462
Conversation
This commit moves `expand_paletted_into_rgb8` and `expand_paletted_into_rgba8` (and their unit tests) into a separate `transform/palette.rs` module. This prepares room for encapsulating extra complexity in this module in follow-up commits, where we will start to precompute and memoize some data when creating a `TransformFn`. This commit just moves the code around - it should have no impact on correctness or performance.
The `PLTE` chunk's size should be a multiple of 3 (since it contains RGB entries - 3 bytes per entry). Additionally, taking 10000 samples in the `bench_create_fn` benchmarks is a bit excessive after memoization.
This commit changes the `TransformFn` type alias from `fn(...)` into `Box<dyn Fn(...)>`. This allows the `TransformFn` to have store some precomputer, memoized state that we plan to add in follow-up commits. In theory this commit may have negative performance impact, but in the grand scheme of things it disappears into the measurement noise. In particular, when there is no state, then `Box` shouldn't allocate.
Before this commit `expand_paletted_into_rgba8` would: * Perform 2 lookups - `palette.get(i)` and `trns.get(i)` * Check via `unwrap_or` if `i` was within the bounds of `palette`/`trns` This commit introduces `create_rgba_palette` which combines `palette` and `trns` into a fixed-size `[[u8;4]; 256]` look-up table (called `rgba_palette` in the code). After this commit `expand_paletted_into_rgba8` only needs to perform a single look-up and doesn't need to check the bounds. This helps to improve the expansion time by 60+%: - expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461: [-60.208% -60.057% -59.899%] (p = 0.00 < 0.05) - expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461: [-77.520% -77.407% -77.301%] (p = 0.00 < 0.05) `expand_paletted_into_rgb8` performs only a single lookup before and after this commit, but avoiding bounds checks still helps to improve the expansion time by ~12%: - expand_paletted(exec)/trns=no/src_bits=4/src_size=5461: [-12.357% -12.005% -11.664%] (p = 0.00 < 0.05) - expand_paletted(exec)/trns=no/src_bits=8/src_size=5461: [-13.135% -12.584% -12.092%] (p = 0.00 < 0.05) Understandably, this commit regresses the time of `create_transform_fn`. Future commits will reduce this regression 2-4 times: - expand_paletted(ctor)/plte=256/trns=256: [+3757.2% +3763.8% +3770.5%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=224/trns=32: [+3807.3% +3816.2% +3824.6%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=16/trns=1: [+1672.0% +1675.0% +1678.1%] (p = 0.00 < 0.05)
Before this commit `expand_into_rgb8` would copy 3 bytes at a time into the output. After this commit it copies 4 bytes at a time (possibly cloberring pixels that will be populated during the next iteration - this is ok). This improved the performance as follows: expand_paletted(exec)/trns=no/src_bits=8/src_size=5461 time: [-23.852% -23.593% -23.319%] (p = 0.00 < 0.05)
This improves the performance as follows: - expand_paletted(ctor)/plte=256/trns=256 [-40.581% -40.396% -40.211%] (p = 0.00 < 0.05) - expand_paletted(ctor)/plte=224/trns=32 [-24.070% -23.840% -23.592%] (p = 0.00 < 0.05) Small palettes are mostly unaffected: - expand_paletted(ctor)/plte=16/trns=1 [-0.2525% +0.0338% +0.3239%] (p = 0.81 > 0.05)
@fintelia, can you PTAL? /cc @Shnatsel (thanks again for pointing out my earlier benchmarking mistake; I read your article on eliminating bounds checks, but wasn’t able to successfully apply it - more details below)
SIMDMemoizationMy attempt to SIMD-ify Short link to the non-SIMD-ified code can be found here: https://godbolt.org/z/KeKrfWcs5. There is no auto-vectorization AFAICT. ExecutionMy attempt to SIMD-ify The non-SIMD-ified code can be found here: https://godbolt.org/z/T9zv891dc. There is no auto-vectorization AFAICT - this makes me a bit surprised that this is faster than the SIMD-ified version. The SIMD-ified version can be found here: https://godbolt.org/z/P19TPP33G. AFAICT it does in fact compile into SIMD instructions. There are some inefficiencies that (I think) may be addressed by (somehow) hoisting bounds checks outside of the loop (both in the non-SIMD-ified and the SIMD-ified version):
I don’t know how to address the inefficiencies / how to do the hoisting. (I’ve tried some Break-even pointMemoization is a trade-off:
If we process B bytes, then:
The breakeven point can be calculated with B=X/Y. Note that X and Y are typically measure via microbenchmarks which are somewhat unrealistic MeasurementsWithout memoization we get:
With memoization we get:
ResultsThe measurements above give the following thresholds:
Such small images seem relatively rare. IMHO it is unjustified (in terms of extra code complexity, binary size, etc.) to implement falling back to non-memoized implementation for these kinds of images. |
I am surprised the crate isn't doing this already! Turning the chunks into a lookup table seems like a no-brainer. I'll take a look at the bounds checks. |
What machine did you run those benchmarks on? I am getting dramatically higher numbers on a Zen 4 CPU. Benchmark results on Zen 4``` expand_paletted(exec)/trns=no/src_bits=4/src_size=5461 time: [7.0350 µs 7.0351 µs 7.0352 µs] thrpt: [4.3376 GiB/s 4.3376 GiB/s 4.3377 GiB/s] Found 73 outliers among 500 measurements (14.60%) 7 (1.40%) low severe 35 (7.00%) low mild 13 (2.60%) high mild 18 (3.60%) high severeexpand_paletted(exec)/trns=no/src_bits=8/src_size=5461 expand_paletted(exec)/trns=yes/src_bits=4/src_size=5461 expand_paletted(exec)/trns=yes/src_bits=8/src_size=5461 expand_paletted(ctor)/plte=256/trns=256 expand_paletted(ctor)/plte=224/trns=32 expand_paletted(ctor)/plte=16/trns=1
|
Just so you know, with I suspect the expansion function ends up getting called a lot on very small chunks, so the vector code is wasted and only gets in the way. I'll try to confirm this hunch. |
Hmm, that function seems to be called for every row. I am surprised that autovectorization actually hurts it even on rather large images. |
Zen4 has a somewhat weird AVX-512 implementation, so I wonder if LLVM just isn't realizing that the autovectorized version will be slower. As another datapoint, I tested on a Zen3 CPU and don't see any huge performance differences compiling with |
I couldn't convince the compiler to remove the bounds checks from the loop. But it seems to speculate right past them without too much trouble. A naive iterator version that removes bounds checks but copies 3 bytes instead of 4 is 25% slower than this code. |
{ | ||
let mut palette_iter = palette; | ||
let mut rgba_iter = &mut rgba_palette[..]; | ||
while palette_iter.len() >= 4 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while palette_iter.len() >= 4 { | |
while palette_iter.len() >= 4 && !rgba_iter.is_empty() { |
Modifying https://godbolt.org/z/KeKrfWcs5, I haven't benchmarked or tested it however.
Removes the panic but doesn't unroll the loop like the original https://rust.godbolt.org/z/53MaeEano
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This results in a 25% regression.
The compiler is aware that the branches leading to the panic are cold, and so decides to unroll the loop. The CPU should speculate right past the branch leading to a panic. So it is best to keep it as a clearly-cold panic rather than a regular branch.
input = &input[1..]; | ||
output = &mut output[3..]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing these with get_unchecked()
showed exactly zero performance difference on the benchmarks. It seems eliminating bounds checks here is not worthwhile.
AMD EPYC 7B12. `lscpu` output$ lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 128 On-line CPU(s) list: 0-127 Vendor ID: AuthenticAMD Model name: AMD EPYC 7B12 CPU family: 23 Model: 49 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 Stepping: 0 BogoMIPS: 4499.99 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc c puid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3d nowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nr ip_save umip rdpid Virtualization features: Hypervisor vendor: KVM Virtualization type: full Caches (sum of all): L1d: 2 MiB (64 instances) L1i: 2 MiB (64 instances) L2: 32 MiB (64 instances) L3: 256 MiB (16 instances) NUMA: NUMA node(s): 2 NUMA node0 CPU(s): 0-31,64-95 NUMA node1 CPU(s): 32-63,96-127 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Spec rstack overflow: Vulnerable: Safe RET, no microcode Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Srbds: Not affected Tsx async abort: Not affected FWIW, I also didn't use |
Ah, that runs at half the frequency of Zen4 desktop parts, so those numbers make sense. I did not use I suppose we could use multiversioning if wider SIMD helped, but at least in palette expansion it doesn't. |
Before this CL, `png::Transformations::EXPAND` was used unconditionally, which meant that the `png` crate was responsible for 1) expanding to at least 8 bits, 2) expanding `ColorType::Rgb` to `Rgba` (and `Grayscale` to `GrayscaleAlpha`) based on `tRNS` chunk, and 3) expanding `ColorType::Indexed` based on `PLTE` and `tRNS` chunks. After this CL, expanding `ColorType::Indexed` images is done on Skia side instead. This has some performance benefits: * Performance measurements show that the total runtime of decoding PNG images gathered from top 500 websites goes down - the runtime after this CL is ~94% of the runtime before this CL. For comparison with other potential interventions, please see the document here: https://docs.google.com/document/d/16P0I_4AglenbkU1IBM1Qfe1zAQrMXa5NoEazZqOTHZE/edit?usp=sharing * Arm-chair analysis: - There are less transformation "hops" after this CL: - Before this CL: indexed data ==png-crate==> rgb or rgba ==SkSwizzler-and/or-skcms_Transform==> final data - After this CL: indexed data ==SkSwizzler==> final data - The indexed=>rgb/rgba transformation is still happening, but it is happening within SkSwizzler, so there is (maybe?) less loop overhead and less memory pressure (although one image row is not _that_ much) - The skcms_Transform may still be needed, but it can be applied to the palette (at most 256 rgba pixels) rather than to all the image pixels. (This may also apply for some stages handles by SkSwizzler like alpha-premultiplication.) - I note that the indexed=>rgb/rgba transformation depends on auto-vectorization both before and after this CL. So far the `png` crate didn't even see improvements from using `std::simd` (see the notes in image-rs/image-png#462 (comment)). Similarily, `SkSwizzler` doesn't provide SIMD-based routines for this transformation (see `swizzle_small_index_to_n32` and `swizzle_index_to_n32`). There may be additional performance improvement opportunities here, because `libpng` expansion does use SIMD intrinsics - see the notes in https://crbug.com/40512957. Fixed: chromium:356882657 Change-Id: I4bf63bd126140401ae1f38f82e613ffe7d9d13e8 Reviewed-on: https://skia-review.googlesource.com/c/skia/+/911438 Reviewed-by: Daniel Dilan <[email protected]> Commit-Queue: Łukasz Anforowicz <[email protected]>
No description provided.