decode_coefs experiment #1325
base: main
Conversation
these are also inlined in C
Thanks! Looks promising! I'll test this out and see what performance looks like on my machine.
src/tables.rs
Outdated
@@ -192,6 +192,211 @@ impl BlockSize {
    }
}

pub const fn dav1d_txfm_size<const txsize_discr: usize>() -> TxfmSize {
    let Some(size) = TxfmSize::from_repr(txsize_discr) else {
Any reason not to use .unwrap_unchecked() here?
I haven't actually tried; you can't unwrap in a const fn without a nightly feature, and I expected it would be the same for the _unchecked version, but I can check.
I also haven't tried using a non-const function. I really wanted to make sure the MIR const optimizations would be able to chew on this, and I haven't checked any alternatives other than turning each decode_coefs call into a macro call that dispatches all TxfmSizes in a match...
It's very likely we can clean this up a lot if the results hold up.
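A self-contained toy sketch of the "macro call that dispatches all TxfmSizes in a match" idea described above; the enum, variants, and function names are stand-ins, not the real rav1d types, and only two variants are shown:

```rust
// Toy stand-ins for TxfmSize and decode_coefs; only two variants are shown.
#[derive(Copy, Clone)]
enum Size {
    S4x4 = 0,
    S8x8 = 1,
}

// Generic over the discriminant, so each instantiation sees its transform
// size as a compile-time constant that MIR/LLVM can fold.
fn decode<const SIZE: usize>(input: u32) -> u32 {
    input + SIZE as u32
}

// The macro turns a runtime `Size` into a call to the matching const-generic
// instantiation: one match arm (and one monomorphized copy) per variant.
macro_rules! dispatch_size {
    ($size:expr, $input:expr) => {
        match $size {
            Size::S4x4 => decode::<{ Size::S4x4 as usize }>($input),
            Size::S8x8 => decode::<{ Size::S8x8 as usize }>($input),
        }
    };
}

fn main() {
    let size = Size::S8x8; // known only at runtime
    assert_eq!(dispatch_size!(size, 10), 11);
}
```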
Ahhh it looks like it is still unstable. Gotcha.
> Ahhh it looks like it is still unstable. Gotcha.

Finally stable in 1.83!
Were you maybe waiting for this to merge the PR?
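As a reference point, a minimal sketch of what the unwrap-based shape could look like; Size and from_repr are stand-ins for the strum-derived code on TxfmSize, and the "works on stable" part rests on the 1.83 stabilization mentioned in the comment above:

```rust
// Stand-in for TxfmSize with a strum-style const `from_repr`.
#[derive(Copy, Clone, Debug)]
enum Size {
    S4x4 = 0,
    S8x8 = 1,
}

impl Size {
    const fn from_repr(discr: usize) -> Option<Self> {
        match discr {
            0 => Some(Size::S4x4),
            1 => Some(Size::S8x8),
            _ => None,
        }
    }
}

// With `Option::unwrap` usable in a const fn (stable as of 1.83 per the
// comment above), the `let Some(..) = .. else { .. }` fallback can go away;
// an out-of-range discriminant fails const evaluation when the result is
// used in a const, as with `S` below.
const fn size_of_discr<const DISCR: usize>() -> Size {
    Size::from_repr(DISCR).unwrap()
}

fn main() {
    const S: Size = size_of_discr::<1>();
    println!("{:?}", S);
}
```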
    },
}
}

pub static dav1d_txfm_dimensions: [TxfmInfo; TxfmSize::COUNT] = {
Does this need to be a static? Can it be const so we don't have to duplicate it in the const switch above? I don't see anywhere that really relies on being able to take the address of these elements (i.e. they don't seem to be used from ASM?)
I hadn't seen a need for it to remain a static, but preferred to leave this out of the experiment and get your opinion on the const generics first.
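A toy sketch of the "make it a const built from the same const fn" direction being discussed here; Info, info_for, COUNT, and DIMENSIONS are stand-ins for TxfmInfo, the new const fn, TxfmSize::COUNT, and dav1d_txfm_dimensions:

```rust
// Stand-in for TxfmInfo.
#[derive(Copy, Clone, Debug, PartialEq)]
struct Info {
    w: u8,
    h: u8,
}

const COUNT: usize = 2; // stand-in for TxfmSize::COUNT

// Single source of truth: the per-discriminant const fn.
const fn info_for(discr: usize) -> Info {
    match discr {
        0 => Info { w: 4, h: 4 },
        _ => Info { w: 8, h: 8 },
    }
}

// A `const` table derived from the same function, so the const-generic path
// and the runtime-indexed path cannot drift apart, and nothing relies on the
// table having a stable address (which a `static` would provide).
const DIMENSIONS: [Info; COUNT] = {
    let mut out = [Info { w: 0, h: 0 }; COUNT];
    let mut i = 0;
    while i < COUNT {
        out[i] = info_for(i);
        i += 1;
    }
    out
};

fn main() {
    assert_eq!(DIMENSIONS[1], Info { w: 8, h: 8 });
}
```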
You could also use enum_map! with the above fn to generate this table (and make the key typed).
I did try enum_map!, but then my measurements seemed slightly worse. It also required implementing DefaultValue for TxfmInfo in a way I'm not sure can be correct, but then again it won't be used by the enum map, so it may be fine?
Since the measurements seemed slightly worse (and I haven't had the time to check the assembly), though still within noise most of the time, I didn't add this change to this PR. I can definitely push it to another branch if you'd like to test it yourself.
I also tried completely removing the static and using the (modified) const fn for the runtime values outside of decode_coefs, but that also seemed to yield worse numbers, so I've left it out of this PR. I did push a commit removing the duplication by using the new function, though.
@lqd Out of curiosity: what processor do you use for benchmarking? On an AMD 5600G (6 cores) I am seeing the overhead vs dav1d being reduced from an average of +7.8% to an average of 7.7%. Note that the average also includes the two summer nature test streams which have higher overhead.
I mostly use a Zen 2 EPYC 7502P, but also tried on an i9-13900KF, where results were similar. Edit: "similar"-ish, but a lot closer (where each run is also a lot faster). The overhead there doesn't go down as much, as you said, compared to what seems to be a bigger reduction on the 7502P. Nick also had a 1% improvement in his experiment, so maybe there's something there (and hopefully not only on our machines 😓), which is why I wanted to check with you all.
src/recon.rs
Outdated
@@ -1013,34 +1084,59 @@ fn decode_coefs<BD: BitDepth>(
     let cf = &mut cf;
     (rc, dc_tok) = match tx_class {
         TxClass::TwoD => decode_coefs_class::<{ TxClass::TwoD as _ }, BD>(
-            ts_c, t_dim, chroma, scratch, eob, tx, dbg, cf,
+            ts_c,
+            &tdim!(),
Should this be passed as a const generic, too?
We can, and it did seem to work equally well. I can push this change here if you'd like. I'm trying to avoid pushing too many changes, so as not to incur many more rounds of measurements for everybody.
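In toy form, this is the difference between passing the dimensions by reference and baking them in via a const generic; the real decode_coefs_class has many more parameters, and the names below are stand-ins:

```rust
// Stand-ins for TxfmInfo and the per-discriminant const fn from tables.rs.
#[derive(Copy, Clone)]
struct Info {
    w: u8,
    h: u8,
}

const fn info_for(discr: usize) -> Info {
    match discr {
        0 => Info { w: 4, h: 4 },
        _ => Info { w: 8, h: 8 },
    }
}

// Reference version: the dimensions are runtime data loaded through a pointer.
fn decode_class_ref(info: &Info, coeffs: &[i32]) -> usize {
    coeffs.len().min(info.w as usize * info.h as usize)
}

// Const-generic version: the discriminant is a compile-time constant, so the
// dimensions (and anything derived from them, such as loop bounds) can fold
// away in each monomorphized copy.
fn decode_class_const<const SIZE: usize>(coeffs: &[i32]) -> usize {
    let info = info_for(SIZE);
    coeffs.len().min(info.w as usize * info.h as usize)
}

fn main() {
    let coeffs = [0i32; 100];
    assert_eq!(decode_class_ref(&info_for(1), &coeffs), 64);
    assert_eq!(decode_class_const::<1>(&coeffs), 64);
}
```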
This is a 1.2% speedup on 8-bit and no change on 10-bit on my Ryzen 7700X.
Here's a draft PR for a checkpoint of where I am w.r.t. decode_coefs, so you can tell me what you think, and see if my results hold up for you as well.

We discussed how accessing the TxfmInfos via memory reads was less optimizable than via the actual constants that they are, in particular in @nnethercote's experiment, albeit at a cost in binary size. Now, there are many ways to do this here, and what I've done is an example of that where I've only made the discriminant const generic instead of the entire tdim as Nick had. That's also how you've done the same idea in decode_coefs_class.

In my experiments, it wasn't the absolute fastest way to do it, and I've seen slightly better numbers via macros instead of using an #[inline(always)] trampoline. But this way has somewhat nicer ergonomics and works (and could also be micro-optimized in the future).

What I didn't expect, however, is that it would unlock more optimization here: in particular, we can force inlining on a couple of functions which weren't completely inlined before, for improved codegen. Of note, these 2 functions are completely inlined in the C version as far as I can tell.
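For contrast with the macro-dispatch sketch earlier in the thread, this is roughly what the #[inline(always)] trampoline shape looks like in toy form (stand-in names again); the trade-off described above is the ergonomics of a plain call site versus the slightly better numbers seen with macros:

```rust
// Toy stand-ins; only two variants shown.
#[derive(Copy, Clone)]
enum Size {
    S4x4 = 0,
    S8x8 = 1,
}

// The specialized worker: one monomorphized copy per discriminant.
fn decode<const SIZE: usize>(input: u32) -> u32 {
    input + SIZE as u32
}

// The trampoline: an ordinary function taking the runtime `Size`, forced
// inline so the dispatch `match` lands directly in the caller. Call sites
// stay plain function calls, unlike the macro version.
#[inline(always)]
fn decode_dispatch(size: Size, input: u32) -> u32 {
    match size {
        Size::S4x4 => decode::<{ Size::S4x4 as usize }>(input),
        Size::S8x8 => decode::<{ Size::S8x8 as usize }>(input),
    }
}

fn main() {
    assert_eq!(decode_dispatch(Size::S8x8, 10), 11);
}
```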
The binary size cost here is 5.8%, from 4010088 bytes to 4243128 bytes. It's more than what Nick was reporting, so we could also try his approach and see if we can then unlock the same gains again.
(I've also removed the bounds check I mentioned last time for slightly better results, if my benchmarking is accurate enough.)
Here are the results, on 8bit vs the master commit 74f485, around -1.7% if again the benchmarking is accurate enough:
and on 10bit, where it seems less effective:
(On this machine, the overhead on 8bit looked to be 4% and is closer to 2% with this, but take this with a grain of salt: I don't know if I need to compile the C version with LTO enabled myself, or if it's been added to the build options when LTO was added to the rust version. Still, whatever the baseline is, the difference on the rust code between this PR and master seems to be 2% locally.)
I don't expect this to land as-is: I've tried using macros to make it easier to read, it's not documented, etc. But I wasn't sure what you'd think about all this, so I decided to open a PR so you can see how this experiment looks in practice.
I've also quickly looked at doing this in read_coef_blocks and the like, instead of decode_coefs, but there are two different tdims there and it would result in a lot more code, for unclear gains and diminishing returns. A different approach, if one exists, may handle that case better.