
decode_coefs experiment #1325

Open · wants to merge 11 commits into base: main

Conversation

@lqd (Author) commented Jul 17, 2024:

Here's a draft PR for a checkpoint of where I am w.r.t. decode_coefs, so you can tell me what you think and see if my results hold up for you as well.

We discussed how accessing the TxfmInfos via memory reads was less optimizable than treating them as the constants they actually are, in particular in @nnethercote's experiment, albeit at a cost in binary size.

Now, there are many ways to do this, and what I've done here is one example: I've only made the discriminant const generic, instead of the entire t_dim as Nick had. That's also how you've applied the same idea in decode_coefs_class.

In my experiments, it wasn't the absolute fastest way to do it, and I've seen slightly better numbers via macros instead of an #[inline(always)] trampoline. But this way has somewhat nicer ergonomics, works, and could also be micro-optimized in the future.
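To make the trade-off concrete, here is a minimal, self-contained sketch of the two dispatch styles discussed above. All names (`TxfmInfo` fields, the table, the functions) are illustrative stand-ins, not the real rav1d definitions:

```rust
// Stand-in for the real TxfmInfo; fields are illustrative only.
#[derive(Clone, Copy, PartialEq, Debug)]
struct TxfmInfo {
    w: u8,
    h: u8,
}

const TXFM_DIMENSIONS: [TxfmInfo; 2] = [
    TxfmInfo { w: 4, h: 4 },
    TxfmInfo { w: 8, h: 8 },
];

// Memory-read style: the table is indexed at runtime, so the values stay
// opaque to the optimizer.
fn t_dim_runtime(discr: usize) -> TxfmInfo {
    TXFM_DIMENSIONS[discr]
}

// Const-generic style: only the discriminant is generic; the lookup folds
// to a constant in each monomorphized copy, at a binary-size cost.
const fn t_dim_const<const DISCR: usize>() -> TxfmInfo {
    TXFM_DIMENSIONS[DISCR]
}

// An #[inline(always)] trampoline dispatching the runtime discriminant to
// the const-generic instantiations.
#[inline(always)]
fn dispatch(discr: usize) -> TxfmInfo {
    match discr {
        0 => t_dim_const::<0>(),
        1 => t_dim_const::<1>(),
        _ => unreachable!(),
    }
}

fn main() {
    assert_eq!(dispatch(0), t_dim_runtime(0));
    assert_eq!(dispatch(1), t_dim_runtime(1));
}
```

Each arm of the `match` is a separate monomorphization, which is where the binary-size growth comes from.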

What I didn't expect, however, is that it would unlock more optimization here: in particular, we can force inlining on a couple of functions that weren't completely inlined before, for improved codegen. Of note, these 2 functions are completely inlined in the C version as far as I can tell.

The binary size cost here is 5.8%, from 4010088 bytes to 4243128. It's more than what Nick was reporting, so we could also try his approach and see if we can unlock the same gains with it as well.

(I've also removed the bounds check I mentioned last time for slightly better results, if my benchmarking is accurate enough.)

Here are the results on 8-bit vs the master commit 74f485, around -1.7%, if again the benchmarking is accurate enough:

```
Benchmark 1: target/release/dav1d -q -i ./Chimera-AV1-8bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     17.746 s ±  0.098 s    [User: 130.155 s, System: 1.196 s]
  Range (min … max):   17.653 s … 17.889 s    5 runs

Benchmark 2: binaries/dav1d-74f485 -q -i ./Chimera-AV1-8bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     18.057 s ±  0.116 s    [User: 131.783 s, System: 1.175 s]
  Range (min … max):   17.876 s … 18.148 s    5 runs
```

and on 10-bit, where it seems less effective:

```
Benchmark 1: target/release/dav1d -q -i ./Chimera-AV1-10bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     41.158 s ±  0.083 s    [User: 324.702 s, System: 0.936 s]
  Range (min … max):   41.026 s … 41.235 s    5 runs

Benchmark 2: binaries/dav1d-74f485 -q -i ./Chimera-AV1-10bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     41.250 s ±  0.135 s    [User: 325.357 s, System: 0.929 s]
  Range (min … max):   41.019 s … 41.353 s    5 runs
```

(On this machine, the overhead on 8-bit looked to be 4% and is closer to 2% with this, but take this with a grain of salt: I don't know whether I need to enable LTO myself when compiling the C version, or whether it was added to its build options when LTO was added to the Rust version. Still, whatever the baseline is, the difference on the Rust code between this PR and master seems to be 2% locally.)

I don't expect this to land as-is: I've tried using macros to make it easier to read, it's not documented, etc. But I wasn't sure what you'd think about all this, so I decided to open a PR so you can see how this experiment looks in practice.

I've also quickly looked at doing this in read_coef_blocks and the like, instead of decode_coefs, but there are two different t_dims there, and it would result in a lot more code for unclear gains and diminishing returns. A better approach, if one is possible, may handle those cases more cleanly.

@rinon (Collaborator) left a comment:

Thanks! Looks promising! I'll test this out and see what performance looks like on my machine.

src/tables.rs (outdated):

```
@@ -192,6 +192,211 @@ impl BlockSize {
    }
}

pub const fn dav1d_txfm_size<const txsize_discr: usize>() -> TxfmSize {
    let Some(size) = TxfmSize::from_repr(txsize_discr) else {
```
Collaborator:
Any reason not to use .unwrap_unchecked() here?

@lqd (Author) replied Jul 17, 2024:

I haven't tried, actually; you can't unwrap in a const fn without a nightly feature, and I expected it'd be the same for the _unchecked version, but I can check.

I also haven't tried using a non-const function; I really wanted to make sure the MIR const optimizations would be able to chew on this. I haven't checked any alternatives other than turning each decode_coefs call into a macro call that dispatches over all TxfmSizes in a match...
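That macro-dispatch alternative could look roughly like this hedged sketch, where `decode_coefs_inner` and `dispatch_txfm!` are hypothetical names standing in for the real code:

```rust
// Hypothetical specialized inner function; in the real code this would be
// the decode_coefs body monomorphized for one TxfmSize discriminant.
fn decode_coefs_inner<const TX: usize>() -> usize {
    TX * 2
}

// Macro that expands a runtime discriminant into a match dispatching to
// the const-generic instantiations.
macro_rules! dispatch_txfm {
    ($tx:expr) => {
        match $tx {
            0 => decode_coefs_inner::<0>(),
            1 => decode_coefs_inner::<1>(),
            2 => decode_coefs_inner::<2>(),
            _ => unreachable!("unknown TxfmSize discriminant"),
        }
    };
}

fn main() {
    let tx = 2usize; // runtime value
    assert_eq!(dispatch_txfm!(tx), 4);
}
```

This achieves the same specialization as the #[inline(always)] trampoline, just with the dispatch expanded at each call site.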

It's very likely we can clean this up a lot if the results hold up.

Collaborator:

Ahhh it looks like it is still unstable. Gotcha.

Collaborator:

> Ahhh it looks like it is still unstable. Gotcha.

Finally stable in 1.83!

Author:

Were you maybe waiting for this to merge the PR?

src/recon.rs — resolved review thread (outdated)
src/tables.rs:

```
        },
    }
}

pub static dav1d_txfm_dimensions: [TxfmInfo; TxfmSize::COUNT] = {
```
Collaborator:
Does this need to be a static? Can it be const so we don't have to duplicate it in the const switch above? I don't see anywhere that really relies on being able to take the address of these elements (i.e. they don't seem to be used from ASM?)

@lqd (Author) replied Jul 17, 2024:
I hadn't seen a need for it to remain a static, but I preferred to leave that change out of the experiment to get your opinion on the const generics first.

Collaborator:
You could also use enum_map! to use the above fn to generate this table (and make the key typed).

Author:
I did try enum_map!, but then my measurements seemed slightly worse. It also required implementing DefaultValue for TxfmInfo, which I'm not sure can be correct; then again, it won't be used by the enum map, so it may be fine?

Since the measurements seemed slightly worse (and I haven't had the time to check the assembly), though still within noise most of the time, I didn't add this change to this PR. I can definitely push it to another branch if you'd like to test it yourself.

I also tried completely removing the static and using the (modified) const fn for the runtime values outside of decode_coefs, but that also seemed to yield worse numbers, so I've left it out of this PR. I did push a commit removing the duplication by using the new function, though.
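The deduplication idea discussed above — keeping the static but building its entries from the same const fn used by the const-generic path — can be sketched like this, with illustrative stand-in names and values rather than the real TxfmInfo table:

```rust
// Number of entries; stands in for TxfmSize::COUNT.
const N: usize = 3;

// Stand-in for the per-size const fn: a single source of truth for the data.
const fn txfm_area(discr: usize) -> usize {
    // Illustrative computation only (4<<d squared).
    (4 << discr) * (4 << discr)
}

// The static table is populated from the const fn at compile time, so the
// const-generic lookups and the runtime table can't drift apart.
static TXFM_AREAS: [usize; N] = {
    let mut table = [0; N];
    let mut i = 0;
    while i < N {
        table[i] = txfm_area(i);
        i += 1;
    }
    table
};

fn main() {
    assert_eq!(TXFM_AREAS[0], 16);
    assert_eq!(TXFM_AREAS[1], 64);
}
```

The `while` loop is used because `for` loops aren't allowed in const initializers on stable Rust.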

src/recon.rs — resolved review thread (outdated)
@fbossen (Collaborator) commented Jul 17, 2024:

@lqd Out of curiosity: what processor do you use for benchmarking? On an AMD 5600G (6 cores), I am seeing the overhead vs dav1d being reduced from an average of +7.8% to an average of +7.7%. Note that the average also includes the two summer nature test streams, which have higher overhead.

@lqd (Author) commented Jul 17, 2024:

I mostly use a Zen 2 EPYC 7502P, but also tried on an i9-13900KF, where results were similar.

Edit: "similar"-ish, but a lot closer (where each run is also a lot faster). The overhead there doesn't go down as much, as you said, compared to what seems to be a bigger reduction on the 7502P.

Nick also had a 1% improvement in his experiment, so maybe there's something there (and hopefully not only on our machines 😓), which is why I wanted to check with you all.

src/recon.rs (outdated):

```
@@ -1013,34 +1084,59 @@ fn decode_coefs<BD: BitDepth>(
        let cf = &mut cf;
        (rc, dc_tok) = match tx_class {
            TxClass::TwoD => decode_coefs_class::<{ TxClass::TwoD as _ }, BD>(
-               ts_c, t_dim, chroma, scratch, eob, tx, dbg, cf,
+               ts_c,
+               &tdim!(),
```
Collaborator:
Should this be passed as a const generic, too?

Author:
We can, and it did seem to work equally well. I can push this change here if you'd like; I'm trying to avoid pushing too many changes so as not to incur many more rounds of measurements for everybody.
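For illustration, the suggestion of passing the size as a const generic too could look roughly like this hedged sketch, where the function signature, tables, and discriminant values are all hypothetical stand-ins for the rav1d ones:

```rust
// Illustrative stand-ins for the real dimension tables.
const TXFM_WIDTHS: [usize; 3] = [4, 8, 16];
const TXFM_HEIGHTS: [usize; 3] = [4, 8, 32];

// Both the tx class and the txfm-size discriminant are const generics, so
// the dimension lookups fold to compile-time constants inside each
// instantiation, instead of being read through a &TxfmInfo at runtime.
fn decode_coefs_class<const TX_CLASS: usize, const TX: usize>() -> usize {
    TXFM_WIDTHS[TX] * TXFM_HEIGHTS[TX]
}

fn main() {
    // e.g. TX_CLASS = TxClass::TwoD as usize, TX = a size discriminant.
    assert_eq!(decode_coefs_class::<0, 1>(), 64);
}
```

The trade-off is the same as before: one more monomorphization axis, so more copies of the function in the binary.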

src/recon.rs — resolved review thread (outdated)
@rinon (Collaborator) commented Jul 17, 2024:

This is a 1.2% speedup on 8-bit and no change on 10-bit on my Ryzen 7700X.

@lqd marked this pull request as ready for review July 18, 2024 21:07
Resolved review threads (outdated): src/recon.rs (×2), src/tables.rs
5 participants