-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise xorBytes - huge gains for SRTP AES-CM #204
Conversation
adriancable
commented
May 21, 2022
Codecov Report
@@ Coverage Diff @@
## master #204 +/- ##
==========================================
+ Coverage 75.16% 75.22% +0.06%
==========================================
Files 17 17
Lines 1224 1227 +3
==========================================
+ Hits 920 923 +3
Misses 208 208
Partials 96 96
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
``` name old time/op new time/op delta EncryptRTP/CTR-100-12 852ns ± 1% 758ns ± 2% -11.09% EncryptRTP/CTR-1000-12 3.98µs ± 1% 2.96µs ± 2% -25.69% EncryptRTPInPlace/CTR-100-12 824ns ± 1% 734ns ± 4% -10.93% EncryptRTPInPlace/CTR-1000-12 3.89µs ± 5% 2.85µs ± 1% -26.77% DecryptRTP/CTR-12 518ns ± 1% 515ns ± 1% -0.70% Write/CTR-100-12 863ns ± 1% 762ns ± 1% -11.71% Write/CTR-1000-12 3.91µs ± 2% 2.90µs ± 2% -25.85% WriteRTP/CTR-100-12 816ns ± 4% 694ns ± 2% -14.93% WriteRTP/CTR-1400-12 3.95µs ± 3% 2.79µs ± 1% -29.21% name old speed new speed delta EncryptRTP/CTR-100-12 131MB/s ± 1% 148MB/s ± 2% +12.48% EncryptRTP/CTR-1000-12 254MB/s ± 1% 342MB/s ± 2% +34.58% EncryptRTPInPlace/CTR-100-12 136MB/s ± 1% 153MB/s ± 3% +12.29% EncryptRTPInPlace/CTR-1000-12 260MB/s ± 5% 355MB/s ± 1% +36.51% DecryptRTP/CTR-12 54.0MB/s ± 1% 54.4MB/s ± 1% +0.70% Write/CTR-100-12 130MB/s ± 1% 147MB/s ± 1% +13.25% Write/CTR-1000-12 259MB/s ± 2% 349MB/s ± 2% +34.85% WriteRTP/CTR-100-12 137MB/s ± 4% 161MB/s ± 2% +17.52% WriteRTP/CTR-1400-12 256MB/s ± 3% 362MB/s ± 1% +41.22% ```
2230169
to
7d96e6e
Compare
That's really exciting! @adriancable how/why it is faster? I don't understand what this is doing fully. Mind updating the commit message to explain why? Ah I see it is operating 4 bytes at a time instead one. That's a pretty cool trick. Wonder if this is possible via unsafe/other methods. I think it is worth investigating. Sorry to put extra work on you. I just worry this will cause frustration for implementations/users off the beaten path. I am not sure what can happen, but unsafe always has a habit of causing problems :( @jech Do you have any concerns about this/see anything that could go wrong? |
It’s definitely worth getting extra eyes on, but the performance gain is just so huge. I guess I have done a lot of embedded work where everything is unsafe and these kinds of tricks are used all the time, and so in fact when things operate one byte at a time (like it used to) it’s a red flag for me. I actually think this change is actually conpletely safe and is functionally identical to the original. |
Good catch, it's definitely an optimisation worth doing. Sean, Adrian's version is moving data 4 bytes at a time, which divides the number of instructions by four. Adrian's version doesn't handle the case where the buffers are not aligned, which might possibly cause difficult to reproduce issues on arches that don't handle unaligned stores (at least ARMv5, MIPS and Sparc, don't know about RISC-V). (I'm not saying it will actually cause trouble, I'm just saying that the possibility makes me nervous). However, it turns out that we already have an optimised version of xorBytes that handles this case correctly, and it's also more efficient than Adrian's version, since it does 8-byte copies on 64-bit arches. It's here: https://github.com/pion/stun/blob/master/xor.go. Unfortunately, it doesn't currently do ARMv7, as discussed here: pion/stun#73 (comment). (Note further that the Go stdlib sources contain a version of xorBytes that's written in assembler, which used 128-bit copies on ARM (using NEON) and 256-bit copies on AMD64 (using AVX). It's under So I suggest the following approach:
Ideally, of course, we'd convince the Go people to export their version. |
Juliusz - thanks for the great feedback. I looked at the |
No idea. Will you file an issue, or shall I? |
Juliusz - very happy if you can file it, as I am OOP for the next couple of days. If you don’t get time, let me know and I will do it later in the week but very happy for you to file it. |
@Sean-Der - do you agree with the idea of creating a pion/crypto repo, and then we can move this common stuff that’s currently in crypto.go to that instead of duplicating between srtp and turn? Once the structural stuff is done, I’m happy to then file PRs to add the optimised versions of the functions. |
Ok. When you have time, please move our copy of xorBytes into a package that both srtp and stun depend on, and I'll optimise it for ARMv7. (I didn't bother optimising it for AR Mv7 earlier because it was only used by STUN/TURN, but if it's used for CTR, it's time spending some effort on it.) |
It sounds like this could end up in @adriancable What do you think of putting it in https://github.com/pion/transport/ since everyone already depends on that. If you want to start github.com/pion/crypto I am up for that as well. The ops burden is pretty significant with all the repos we have today already. |
Let’s do it in pion/transport to save the ops burden. I’m happy to volunteer to get the structure set up and a draft PR with an initial take on the code. I would also like to work on the ARMv7 NEON implementation with @jech as that will be a huge performance win. We can then submit that as the reference implementation to the golang repo further down the line. |
@adriancable sounds like a great plan to me! Open away and happy to approve :) |
Replaced by #211. |