Optimise xorBytes - huge gains for SRTP AES-CM #204

adriancable · 2022-05-21T02:52:43Z

name                           old time/op    new time/op    delta
EncryptRTP/CTR-100-12             852ns ± 1%     758ns ± 2%  -11.09%
EncryptRTP/CTR-1000-12           3.98µs ± 1%    2.96µs ± 2%  -25.69%
EncryptRTPInPlace/CTR-100-12      824ns ± 1%     734ns ± 4%  -10.93%
EncryptRTPInPlace/CTR-1000-12    3.89µs ± 5%    2.85µs ± 1%  -26.77%
DecryptRTP/CTR-12                 518ns ± 1%     515ns ± 1%   -0.70%
Write/CTR-100-12                  863ns ± 1%     762ns ± 1%  -11.71%
Write/CTR-1000-12                3.91µs ± 2%    2.90µs ± 2%  -25.85%
WriteRTP/CTR-100-12               816ns ± 4%     694ns ± 2%  -14.93%
WriteRTP/CTR-1400-12             3.95µs ± 3%    2.79µs ± 1%  -29.21%

name                           old speed      new speed      delta
EncryptRTP/CTR-100-12           131MB/s ± 1%   148MB/s ± 2%  +12.48%
EncryptRTP/CTR-1000-12          254MB/s ± 1%   342MB/s ± 2%  +34.58%
EncryptRTPInPlace/CTR-100-12    136MB/s ± 1%   153MB/s ± 3%  +12.29%
EncryptRTPInPlace/CTR-1000-12   260MB/s ± 5%   355MB/s ± 1%  +36.51%
DecryptRTP/CTR-12              54.0MB/s ± 1%  54.4MB/s ± 1%   +0.70%
Write/CTR-100-12                130MB/s ± 1%   147MB/s ± 1%  +13.25%
Write/CTR-1000-12               259MB/s ± 2%   349MB/s ± 2%  +34.85%
WriteRTP/CTR-100-12             137MB/s ± 4%   161MB/s ± 2%  +17.52%
WriteRTP/CTR-1400-12            256MB/s ± 3%   362MB/s ± 1%  +41.22%

codecov · 2022-05-21T02:53:26Z

Codecov Report

Merging #204 (7d96e6e) into master (b74938d) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #204      +/-   ##
==========================================
+ Coverage   75.16%   75.22%   +0.06%     
==========================================
  Files          17       17              
  Lines        1224     1227       +3     
==========================================
+ Hits          920      923       +3     
  Misses        208      208              
  Partials       96       96

Flag	Coverage Δ
go	`75.22% <100.00%> (+0.06%)`	⬆️
wasm	`74.73% <100.00%> (+0.06%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
crypto.go	`94.73% <100.00%> (+0.45%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b74938d...7d96e6e. Read the comment docs.

``` name old time/op new time/op delta EncryptRTP/CTR-100-12 852ns ± 1% 758ns ± 2% -11.09% EncryptRTP/CTR-1000-12 3.98µs ± 1% 2.96µs ± 2% -25.69% EncryptRTPInPlace/CTR-100-12 824ns ± 1% 734ns ± 4% -10.93% EncryptRTPInPlace/CTR-1000-12 3.89µs ± 5% 2.85µs ± 1% -26.77% DecryptRTP/CTR-12 518ns ± 1% 515ns ± 1% -0.70% Write/CTR-100-12 863ns ± 1% 762ns ± 1% -11.71% Write/CTR-1000-12 3.91µs ± 2% 2.90µs ± 2% -25.85% WriteRTP/CTR-100-12 816ns ± 4% 694ns ± 2% -14.93% WriteRTP/CTR-1400-12 3.95µs ± 3% 2.79µs ± 1% -29.21% name old speed new speed delta EncryptRTP/CTR-100-12 131MB/s ± 1% 148MB/s ± 2% +12.48% EncryptRTP/CTR-1000-12 254MB/s ± 1% 342MB/s ± 2% +34.58% EncryptRTPInPlace/CTR-100-12 136MB/s ± 1% 153MB/s ± 3% +12.29% EncryptRTPInPlace/CTR-1000-12 260MB/s ± 5% 355MB/s ± 1% +36.51% DecryptRTP/CTR-12 54.0MB/s ± 1% 54.4MB/s ± 1% +0.70% Write/CTR-100-12 130MB/s ± 1% 147MB/s ± 1% +13.25% Write/CTR-1000-12 259MB/s ± 2% 349MB/s ± 2% +34.85% WriteRTP/CTR-100-12 137MB/s ± 4% 161MB/s ± 2% +17.52% WriteRTP/CTR-1400-12 256MB/s ± 3% 362MB/s ± 1% +41.22% ```

Sean-Der · 2022-05-21T03:32:19Z

That's really exciting! @adriancable how/why it is faster? I don't understand what this is doing fully. Mind updating the commit message to explain why?

Ah I see it is operating 4 bytes at a time instead one. That's a pretty cool trick. Wonder if this is possible via unsafe/other methods. I think it is worth investigating. Sorry to put extra work on you. I just worry this will cause frustration for implementations/users off the beaten path. I am not sure what can happen, but unsafe always has a habit of causing problems :(

@jech Do you have any concerns about this/see anything that could go wrong?

adriancable · 2022-05-21T04:15:47Z

It’s definitely worth getting extra eyes on, but the performance gain is just so huge. I guess I have done a lot of embedded work where everything is unsafe and these kinds of tricks are used all the time, and so in fact when things operate one byte at a time (like it used to) it’s a red flag for me.

I actually think this change is actually conpletely safe and is functionally identical to the original.

jech · 2022-05-21T08:22:42Z

Good catch, it's definitely an optimisation worth doing. Sean, Adrian's version is moving data 4 bytes at a time, which divides the number of instructions by four.

Adrian's version doesn't handle the case where the buffers are not aligned, which might possibly cause difficult to reproduce issues on arches that don't handle unaligned stores (at least ARMv5, MIPS and Sparc, don't know about RISC-V). (I'm not saying it will actually cause trouble, I'm just saying that the possibility makes me nervous).

However, it turns out that we already have an optimised version of xorBytes that handles this case correctly, and it's also more efficient than Adrian's version, since it does 8-byte copies on 64-bit arches. It's here: https://github.com/pion/stun/blob/master/xor.go. Unfortunately, it doesn't currently do ARMv7, as discussed here: pion/stun#73 (comment).

(Note further that the Go stdlib sources contain a version of xorBytes that's written in assembler, which used 128-bit copies on ARM (using NEON) and 256-bit copies on AMD64 (using AVX). It's under src/crypto/cypher. Unfortunately, it's not exported.)

So I suggest the following approach:

move xorBytes from pion/stun into a repository on which both stun and srtp may depend, and use the same function in both places;
extend the condition at https://github.com/pion/stun/blob/master/xor.go#L14 to include ARMv6 and ARMv7 (but not ARMv5);
consider adding assembler versions from the Go sources if benchmarks show that it is worthwile and Sean agrees.

Ideally, of course, we'd convince the Go people to export their version.

jech · 2022-05-21T08:51:27Z

golang/go#53021

adriancable · 2022-05-21T13:50:28Z

Juliusz - thanks for the great feedback. I looked at the sec/crypto/cipher implementation and unfortunately this doesn’t include ARMv7 or ARMv6 when setting unalignedAccess. Forget about assembly implementations for a moment - on ARM32 they’re not even enabling the unaligned access optimisation. Do you know why this might be? Is this a golang bug/oversight? If so I will file an issue.

jech · 2022-05-21T13:59:51Z

No idea. Will you file an issue, or shall I?

adriancable · 2022-05-21T14:02:01Z

Juliusz - very happy if you can file it, as I am OOP for the next couple of days. If you don’t get time, let me know and I will do it later in the week but very happy for you to file it.

adriancable · 2022-05-21T14:06:22Z

@Sean-Der - do you agree with the idea of creating a pion/crypto repo, and then we can move this common stuff that’s currently in crypto.go to that instead of duplicating between srtp and turn? Once the structural stuff is done, I’m happy to then file PRs to add the optimised versions of the functions.

jech · 2022-05-21T14:06:27Z

Ok. When you have time, please move our copy of xorBytes into a package that both srtp and stun depend on, and I'll optimise it for ARMv7. (I didn't bother optimising it for AR Mv7 earlier because it was only used by STUN/TURN, but if it's used for CTR, it's time spending some effort on it.)

jech · 2022-05-21T14:13:44Z

golang/go#53023

Sean-Der · 2022-05-21T18:21:20Z

It sounds like this could end up in /x that is pretty exciting!

@adriancable What do you think of putting it in https://github.com/pion/transport/ since everyone already depends on that. If you want to start github.com/pion/crypto I am up for that as well. The ops burden is pretty significant with all the repos we have today already.

adriancable · 2022-05-21T18:47:06Z

Let’s do it in pion/transport to save the ops burden. I’m happy to volunteer to get the structure set up and a draft PR with an initial take on the code. I would also like to work on the ARMv7 NEON implementation with @jech as that will be a huge performance win. We can then submit that as the reference implementation to the golang repo further down the line.

Sean-Der · 2022-05-21T19:20:16Z

@adriancable sounds like a great plan to me! Open away and happy to approve :)

adriancable · 2022-07-08T19:07:39Z

Replaced by #211.

adriancable force-pushed the optimise-xorbytes branch from 2230169 to 7d96e6e Compare May 21, 2022 02:54

adriancable requested a review from Sean-Der May 21, 2022 02:56

adriancable closed this Jul 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise xorBytes - huge gains for SRTP AES-CM #204

Optimise xorBytes - huge gains for SRTP AES-CM #204

adriancable commented May 21, 2022

codecov bot commented May 21, 2022 •

edited

Loading

Sean-Der commented May 21, 2022 •

edited

Loading

adriancable commented May 21, 2022

jech commented May 21, 2022 •

edited

Loading

jech commented May 21, 2022

adriancable commented May 21, 2022

jech commented May 21, 2022

adriancable commented May 21, 2022

adriancable commented May 21, 2022

jech commented May 21, 2022

jech commented May 21, 2022

Sean-Der commented May 21, 2022

adriancable commented May 21, 2022

Sean-Der commented May 21, 2022

adriancable commented Jul 8, 2022

Optimise xorBytes - huge gains for SRTP AES-CM #204

Optimise xorBytes - huge gains for SRTP AES-CM #204

Conversation

adriancable commented May 21, 2022

codecov bot commented May 21, 2022 • edited Loading

Codecov Report

Sean-Der commented May 21, 2022 • edited Loading

adriancable commented May 21, 2022

jech commented May 21, 2022 • edited Loading

jech commented May 21, 2022

adriancable commented May 21, 2022

jech commented May 21, 2022

adriancable commented May 21, 2022

adriancable commented May 21, 2022

jech commented May 21, 2022

jech commented May 21, 2022

Sean-Der commented May 21, 2022

adriancable commented May 21, 2022

Sean-Der commented May 21, 2022

adriancable commented Jul 8, 2022

codecov bot commented May 21, 2022 •

edited

Loading

Sean-Der commented May 21, 2022 •

edited

Loading

jech commented May 21, 2022 •

edited

Loading