
decode_coefs experiment #1325

Open · wants to merge 11 commits into base: main

Conversation

@lqd (Author) commented Jul 17, 2024:

Here's a draft PR for a checkpoint of where I am w.r.t. decode_coefs, so you can tell me what you think and see if my results hold up for you as well.

We discussed how accessing the TxfmInfos via memory reads was less optimizable than treating them as the constants they actually are, in particular in @nnethercote's experiment, albeit at a cost in binary size.

Now, there are many ways to do this, and what I've done here is one example: I've only made the discriminant const generic, instead of the entire t_dim as Nick had. That's also how you've applied the same idea in decode_coefs_class.

In my experiments, it wasn't the absolute fastest way to do it, and I've seen slightly better numbers via macros instead of an #[inline(always)] trampoline. But this way has somewhat nicer ergonomics, works, and could also be micro-optimized in the future.
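To make the trade-off concrete, here is a minimal, self-contained sketch of the two dispatch styles discussed above. All names (`TxfmInfo` fields, the table, the functions) are illustrative stand-ins, not the real rav1d definitions:

```rust
// Stand-in for the real TxfmInfo; fields are illustrative only.
#[derive(Clone, Copy, PartialEq, Debug)]
struct TxfmInfo {
    w: u8,
    h: u8,
}

const TXFM_DIMENSIONS: [TxfmInfo; 2] = [
    TxfmInfo { w: 4, h: 4 },
    TxfmInfo { w: 8, h: 8 },
];

// Memory-read style: the table is indexed at runtime, so the values stay
// opaque to the optimizer.
fn t_dim_runtime(discr: usize) -> TxfmInfo {
    TXFM_DIMENSIONS[discr]
}

// Const-generic style: only the discriminant is generic; the lookup folds
// to a constant in each monomorphized copy, at a binary-size cost.
const fn t_dim_const<const DISCR: usize>() -> TxfmInfo {
    TXFM_DIMENSIONS[DISCR]
}

// An #[inline(always)] trampoline dispatching the runtime discriminant to
// the const-generic instantiations.
#[inline(always)]
fn dispatch(discr: usize) -> TxfmInfo {
    match discr {
        0 => t_dim_const::<0>(),
        1 => t_dim_const::<1>(),
        _ => unreachable!(),
    }
}

fn main() {
    assert_eq!(dispatch(0), t_dim_runtime(0));
    assert_eq!(dispatch(1), t_dim_runtime(1));
}
```

Each arm of the `match` is a separate monomorphization, which is where the binary-size growth comes from.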

What I didn't expect, however, is that it would unlock more optimization here: in particular, we can force inlining on a couple of functions that weren't completely inlined before, for improved codegen. Of note, these 2 functions are completely inlined in the C version as far as I can tell.

The binary size cost here is 5.8%, from 4010088 bytes to 4243128. It's more than what Nick was reporting, so we could also try his approach and see if we can unlock the same gains with it as well.

(I've also removed the bounds check I mentioned last time for slightly better results, if my benchmarking is accurate enough.)

Here are the results on 8-bit vs the master commit 74f485, around -1.7%, if again the benchmarking is accurate enough:

```
Benchmark 1: target/release/dav1d -q -i ./Chimera-AV1-8bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     17.746 s ±  0.098 s    [User: 130.155 s, System: 1.196 s]
  Range (min … max):   17.653 s … 17.889 s    5 runs

Benchmark 2: binaries/dav1d-74f485 -q -i ./Chimera-AV1-8bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     18.057 s ±  0.116 s    [User: 131.783 s, System: 1.175 s]
  Range (min … max):   17.876 s … 18.148 s    5 runs
```

and on 10-bit, where it seems less effective:

```
Benchmark 1: target/release/dav1d -q -i ./Chimera-AV1-10bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     41.158 s ±  0.083 s    [User: 324.702 s, System: 0.936 s]
  Range (min … max):   41.026 s … 41.235 s    5 runs

Benchmark 2: binaries/dav1d-74f485 -q -i ./Chimera-AV1-10bit.ivf -o /dev/null --threads 8
  Time (mean ± σ):     41.250 s ±  0.135 s    [User: 325.357 s, System: 0.929 s]
  Range (min … max):   41.019 s … 41.353 s    5 runs
```

(On this machine, the overhead on 8-bit looked to be 4% and is closer to 2% with this, but take this with a grain of salt: I don't know whether I need to enable LTO myself when compiling the C version, or whether it was added to its build options when LTO was added to the Rust version. Still, whatever the baseline is, the difference on the Rust code between this PR and master seems to be 2% locally.)

I don't expect this to land as-is: I've tried using macros to make it easier to read, it's not documented, etc. But I wasn't sure what you'd think about all this, so I decided to open a PR so you can see how this experiment looks in practice.

I've also quickly looked at doing this in read_coef_blocks and the like, instead of decode_coefs, but there are two different t_dims there, and it would result in a lot more code for unclear gains and diminishing returns. A better approach, if one is possible, may handle those cases more cleanly.

@rinon (Collaborator) left a comment:

Thanks! Looks promising! I'll test this out and see what performance looks like on my machine.

src/tables.rs (outdated):

```
@@ -192,6 +192,211 @@ impl BlockSize {
    }
}

pub const fn dav1d_txfm_size<const txsize_discr: usize>() -> TxfmSize {
    let Some(size) = TxfmSize::from_repr(txsize_discr) else {
```
Collaborator:
Any reason not to use .unwrap_unchecked() here?

@lqd (Author) replied Jul 17, 2024:

I haven't tried, actually; you can't unwrap in a const fn without a nightly feature, and I expected it'd be the same for the _unchecked version, but I can check.

I also haven't tried using a non-const function; I really wanted to make sure the MIR const optimizations would be able to chew on this. I haven't checked any alternatives other than turning each decode_coefs call into a macro call that dispatches over all TxfmSizes in a match...
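That macro-dispatch alternative could look roughly like this hedged sketch, where `decode_coefs_inner` and `dispatch_txfm!` are hypothetical names standing in for the real code:

```rust
// Hypothetical specialized inner function; in the real code this would be
// the decode_coefs body monomorphized for one TxfmSize discriminant.
fn decode_coefs_inner<const TX: usize>() -> usize {
    TX * 2
}

// Macro that expands a runtime discriminant into a match dispatching to
// the const-generic instantiations.
macro_rules! dispatch_txfm {
    ($tx:expr) => {
        match $tx {
            0 => decode_coefs_inner::<0>(),
            1 => decode_coefs_inner::<1>(),
            2 => decode_coefs_inner::<2>(),
            _ => unreachable!("unknown TxfmSize discriminant"),
        }
    };
}

fn main() {
    let tx = 2usize; // runtime value
    assert_eq!(dispatch_txfm!(tx), 4);
}
```

This achieves the same specialization as the #[inline(always)] trampoline, just with the dispatch expanded at each call site.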

It's very likely we can clean this up a lot if the results hold up.

Collaborator:

Ahhh it looks like it is still unstable. Gotcha.

Collaborator:

> Ahhh it looks like it is still unstable. Gotcha.

Finally stable in 1.83!

Author:

Were you maybe waiting for this to merge the PR?

src/recon.rs — resolved review thread (outdated)
src/tables.rs:

```
        },
    }
}

pub static dav1d_txfm_dimensions: [TxfmInfo; TxfmSize::COUNT] = {
```
Collaborator:
Does this need to be a static? Can it be const so we don't have to duplicate it in the const switch above? I don't see anywhere that really relies on being able to take the address of these elements (i.e. they don't seem to be used from ASM?)

@lqd (Author) replied Jul 17, 2024:
I hadn't seen a need for it to remain a static, but I preferred to leave that change out of the experiment to get your opinion on the const generics first.

Collaborator:
You could also use enum_map! to use the above fn to generate this table (and make the key typed).

Author:
I did try enum_map!, but then my measurements seemed slightly worse. It also required implementing DefaultValue for TxfmInfo, which I'm not sure can be correct; then again, it won't be used by the enum map, so it may be fine?

Since the measurements seemed slightly worse (and I haven't had the time to check the assembly), though still within noise most of the time, I didn't add this change to this PR. I can definitely push it to another branch if you'd like to test it yourself.

I also tried completely removing the static and using the (modified) const fn for the runtime values outside of decode_coefs, but that also seemed to yield worse numbers, so I've left it out of this PR. I did push a commit removing the duplication by using the new function, though.
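The deduplication idea discussed above — keeping the static but building its entries from the same const fn used by the const-generic path — can be sketched like this, with illustrative stand-in names and values rather than the real TxfmInfo table:

```rust
// Number of entries; stands in for TxfmSize::COUNT.
const N: usize = 3;

// Stand-in for the per-size const fn: a single source of truth for the data.
const fn txfm_area(discr: usize) -> usize {
    // Illustrative computation only (4<<d squared).
    (4 << discr) * (4 << discr)
}

// The static table is populated from the const fn at compile time, so the
// const-generic lookups and the runtime table can't drift apart.
static TXFM_AREAS: [usize; N] = {
    let mut table = [0; N];
    let mut i = 0;
    while i < N {
        table[i] = txfm_area(i);
        i += 1;
    }
    table
};

fn main() {
    assert_eq!(TXFM_AREAS[0], 16);
    assert_eq!(TXFM_AREAS[1], 64);
}
```

The `while` loop is used because `for` loops aren't allowed in const initializers on stable Rust.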

src/recon.rs — resolved review thread (outdated)
@fbossen (Collaborator) commented Jul 17, 2024:

@lqd Out of curiosity: what processor do you use for benchmarking? On an AMD 5600G (6 cores), I am seeing the overhead vs dav1d being reduced from an average of +7.8% to an average of +7.7%. Note that the average also includes the two summer nature test streams, which have higher overhead.

@lqd (Author) commented Jul 17, 2024:

I mostly use a Zen 2 EPYC 7502P, but also tried on an i9-13900KF, where results were similar.

Edit: "similar"-ish, but a lot closer (where each run is also a lot faster). The overhead there doesn't go down as much, as you said, compared to what seems to be a bigger reduction on the 7502P.

Nick also had a 1% improvement in his experiment, so maybe there's something there (and hopefully not only on our machines 😓), which is why I wanted to check with you all.

src/recon.rs (outdated):

```
@@ -1013,34 +1084,59 @@ fn decode_coefs<BD: BitDepth>(
        let cf = &mut cf;
        (rc, dc_tok) = match tx_class {
            TxClass::TwoD => decode_coefs_class::<{ TxClass::TwoD as _ }, BD>(
-               ts_c, t_dim, chroma, scratch, eob, tx, dbg, cf,
+               ts_c,
+               &tdim!(),
```
Collaborator:
Should this be passed as a const generic, too?

Author:
We can, and it did seem to work equally well. I can push this change here if you'd like; I'm trying to avoid pushing too many changes so as not to incur many more rounds of measurements for everybody.
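For illustration, the suggestion of passing the size as a const generic too could look roughly like this hedged sketch, where the function signature, tables, and discriminant values are all hypothetical stand-ins for the rav1d ones:

```rust
// Illustrative stand-ins for the real dimension tables.
const TXFM_WIDTHS: [usize; 3] = [4, 8, 16];
const TXFM_HEIGHTS: [usize; 3] = [4, 8, 32];

// Both the tx class and the txfm-size discriminant are const generics, so
// the dimension lookups fold to compile-time constants inside each
// instantiation, instead of being read through a &TxfmInfo at runtime.
fn decode_coefs_class<const TX_CLASS: usize, const TX: usize>() -> usize {
    TXFM_WIDTHS[TX] * TXFM_HEIGHTS[TX]
}

fn main() {
    // e.g. TX_CLASS = TxClass::TwoD as usize, TX = a size discriminant.
    assert_eq!(decode_coefs_class::<0, 1>(), 64);
}
```

The trade-off is the same as before: one more monomorphization axis, so more copies of the function in the binary.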

src/recon.rs — resolved review thread (outdated)
@rinon (Collaborator) commented Jul 17, 2024:

This is a 1.2% speedup on 8-bit and no change on 10-bit on my Ryzen 7700X.

@lqd marked this pull request as ready for review July 18, 2024 21:07
Resolved review threads (outdated): src/recon.rs (×2), src/tables.rs
5 participants