Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise xorBytes - huge gains for SRTP AES-CM #204

Closed
wants to merge 1 commit into from

Conversation

adriancable
Copy link
Contributor

name                           old time/op    new time/op    delta
EncryptRTP/CTR-100-12             852ns ± 1%     758ns ± 2%  -11.09%
EncryptRTP/CTR-1000-12           3.98µs ± 1%    2.96µs ± 2%  -25.69%
EncryptRTPInPlace/CTR-100-12      824ns ± 1%     734ns ± 4%  -10.93%
EncryptRTPInPlace/CTR-1000-12    3.89µs ± 5%    2.85µs ± 1%  -26.77%
DecryptRTP/CTR-12                 518ns ± 1%     515ns ± 1%   -0.70%
Write/CTR-100-12                  863ns ± 1%     762ns ± 1%  -11.71%
Write/CTR-1000-12                3.91µs ± 2%    2.90µs ± 2%  -25.85%
WriteRTP/CTR-100-12               816ns ± 4%     694ns ± 2%  -14.93%
WriteRTP/CTR-1400-12             3.95µs ± 3%    2.79µs ± 1%  -29.21%

name                           old speed      new speed      delta
EncryptRTP/CTR-100-12           131MB/s ± 1%   148MB/s ± 2%  +12.48%
EncryptRTP/CTR-1000-12          254MB/s ± 1%   342MB/s ± 2%  +34.58%
EncryptRTPInPlace/CTR-100-12    136MB/s ± 1%   153MB/s ± 3%  +12.29%
EncryptRTPInPlace/CTR-1000-12   260MB/s ± 5%   355MB/s ± 1%  +36.51%
DecryptRTP/CTR-12              54.0MB/s ± 1%  54.4MB/s ± 1%   +0.70%
Write/CTR-100-12                130MB/s ± 1%   147MB/s ± 1%  +13.25%
Write/CTR-1000-12               259MB/s ± 2%   349MB/s ± 2%  +34.85%
WriteRTP/CTR-100-12             137MB/s ± 4%   161MB/s ± 2%  +17.52%
WriteRTP/CTR-1400-12            256MB/s ± 3%   362MB/s ± 1%  +41.22%

@codecov
Copy link

codecov bot commented May 21, 2022

Codecov Report

Merging #204 (7d96e6e) into master (b74938d) will increase coverage by 0.06%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #204      +/-   ##
==========================================
+ Coverage   75.16%   75.22%   +0.06%     
==========================================
  Files          17       17              
  Lines        1224     1227       +3     
==========================================
+ Hits          920      923       +3     
  Misses        208      208              
  Partials       96       96              
Flag Coverage Δ
go 75.22% <100.00%> (+0.06%) ⬆️
wasm 74.73% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
crypto.go 94.73% <100.00%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b74938d...7d96e6e. Read the comment docs.

```
name                           old time/op    new time/op    delta
EncryptRTP/CTR-100-12             852ns ± 1%     758ns ± 2%  -11.09%
EncryptRTP/CTR-1000-12           3.98µs ± 1%    2.96µs ± 2%  -25.69%
EncryptRTPInPlace/CTR-100-12      824ns ± 1%     734ns ± 4%  -10.93%
EncryptRTPInPlace/CTR-1000-12    3.89µs ± 5%    2.85µs ± 1%  -26.77%
DecryptRTP/CTR-12                 518ns ± 1%     515ns ± 1%   -0.70%
Write/CTR-100-12                  863ns ± 1%     762ns ± 1%  -11.71%
Write/CTR-1000-12                3.91µs ± 2%    2.90µs ± 2%  -25.85%
WriteRTP/CTR-100-12               816ns ± 4%     694ns ± 2%  -14.93%
WriteRTP/CTR-1400-12             3.95µs ± 3%    2.79µs ± 1%  -29.21%

name                           old speed      new speed      delta
EncryptRTP/CTR-100-12           131MB/s ± 1%   148MB/s ± 2%  +12.48%
EncryptRTP/CTR-1000-12          254MB/s ± 1%   342MB/s ± 2%  +34.58%
EncryptRTPInPlace/CTR-100-12    136MB/s ± 1%   153MB/s ± 3%  +12.29%
EncryptRTPInPlace/CTR-1000-12   260MB/s ± 5%   355MB/s ± 1%  +36.51%
DecryptRTP/CTR-12              54.0MB/s ± 1%  54.4MB/s ± 1%   +0.70%
Write/CTR-100-12                130MB/s ± 1%   147MB/s ± 1%  +13.25%
Write/CTR-1000-12               259MB/s ± 2%   349MB/s ± 2%  +34.85%
WriteRTP/CTR-100-12             137MB/s ± 4%   161MB/s ± 2%  +17.52%
WriteRTP/CTR-1400-12            256MB/s ± 3%   362MB/s ± 1%  +41.22%
```
@adriancable adriancable requested a review from Sean-Der May 21, 2022 02:56
@Sean-Der
Copy link
Member

Sean-Der commented May 21, 2022

That's really exciting! @adriancable how/why it is faster? I don't understand what this is doing fully. Mind updating the commit message to explain why?

Ah I see it is operating 4 bytes at a time instead one. That's a pretty cool trick. Wonder if this is possible via unsafe/other methods. I think it is worth investigating. Sorry to put extra work on you. I just worry this will cause frustration for implementations/users off the beaten path. I am not sure what can happen, but unsafe always has a habit of causing problems :(

@jech Do you have any concerns about this/see anything that could go wrong?

@adriancable
Copy link
Contributor Author

It’s definitely worth getting extra eyes on, but the performance gain is just so huge. I guess I have done a lot of embedded work where everything is unsafe and these kinds of tricks are used all the time, and so in fact when things operate one byte at a time (like it used to) it’s a red flag for me.

I actually think this change is actually conpletely safe and is functionally identical to the original.

@jech
Copy link
Member

jech commented May 21, 2022

Good catch, it's definitely an optimisation worth doing. Sean, Adrian's version is moving data 4 bytes at a time, which divides the number of instructions by four.

Adrian's version doesn't handle the case where the buffers are not aligned, which might possibly cause difficult to reproduce issues on arches that don't handle unaligned stores (at least ARMv5, MIPS and Sparc, don't know about RISC-V). (I'm not saying it will actually cause trouble, I'm just saying that the possibility makes me nervous).

However, it turns out that we already have an optimised version of xorBytes that handles this case correctly, and it's also more efficient than Adrian's version, since it does 8-byte copies on 64-bit arches. It's here: https://github.com/pion/stun/blob/master/xor.go. Unfortunately, it doesn't currently do ARMv7, as discussed here: pion/stun#73 (comment).

(Note further that the Go stdlib sources contain a version of xorBytes that's written in assembler, which used 128-bit copies on ARM (using NEON) and 256-bit copies on AMD64 (using AVX). It's under src/crypto/cypher. Unfortunately, it's not exported.)

So I suggest the following approach:

  • move xorBytes from pion/stun into a repository on which both stun and srtp may depend, and use the same function in both places;
  • extend the condition at https://github.com/pion/stun/blob/master/xor.go#L14 to include ARMv6 and ARMv7 (but not ARMv5);
  • consider adding assembler versions from the Go sources if benchmarks show that it is worthwile and Sean agrees.

Ideally, of course, we'd convince the Go people to export their version.

@jech
Copy link
Member

jech commented May 21, 2022

golang/go#53021

@adriancable
Copy link
Contributor Author

Juliusz - thanks for the great feedback. I looked at the sec/crypto/cipher implementation and unfortunately this doesn’t include ARMv7 or ARMv6 when setting unalignedAccess. Forget about assembly implementations for a moment - on ARM32 they’re not even enabling the unaligned access optimisation. Do you know why this might be? Is this a golang bug/oversight? If so I will file an issue.

@jech
Copy link
Member

jech commented May 21, 2022

No idea. Will you file an issue, or shall I?

@adriancable
Copy link
Contributor Author

Juliusz - very happy if you can file it, as I am OOP for the next couple of days. If you don’t get time, let me know and I will do it later in the week but very happy for you to file it.

@adriancable
Copy link
Contributor Author

@Sean-Der - do you agree with the idea of creating a pion/crypto repo, and then we can move this common stuff that’s currently in crypto.go to that instead of duplicating between srtp and turn? Once the structural stuff is done, I’m happy to then file PRs to add the optimised versions of the functions.

@jech
Copy link
Member

jech commented May 21, 2022

Ok. When you have time, please move our copy of xorBytes into a package that both srtp and stun depend on, and I'll optimise it for ARMv7. (I didn't bother optimising it for AR Mv7 earlier because it was only used by STUN/TURN, but if it's used for CTR, it's time spending some effort on it.)

@jech
Copy link
Member

jech commented May 21, 2022

golang/go#53023

@Sean-Der
Copy link
Member

It sounds like this could end up in /x that is pretty exciting!

@adriancable What do you think of putting it in https://github.com/pion/transport/ since everyone already depends on that. If you want to start github.com/pion/crypto I am up for that as well. The ops burden is pretty significant with all the repos we have today already.

@adriancable
Copy link
Contributor Author

Let’s do it in pion/transport to save the ops burden. I’m happy to volunteer to get the structure set up and a draft PR with an initial take on the code. I would also like to work on the ARMv7 NEON implementation with @jech as that will be a huge performance win. We can then submit that as the reference implementation to the golang repo further down the line.

@Sean-Der
Copy link
Member

@adriancable sounds like a great plan to me! Open away and happy to approve :)

@adriancable
Copy link
Contributor Author

Replaced by #211.

@adriancable adriancable closed this Jul 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants