-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chacha20: add SSE2 accelerated variant #61
Conversation
f890007
to
ff26e7c
Compare
Awesome, this looks good at first glance.
Yes. It might make sense to look at ways to improve/simplify the Additionally note we haven't implemented an optimization where the ChaCha (or Salsa) where the input array / block state is computed once, and then copied with only the counter updated to avoid setting it up again over and over for each block, as described in RFC 8439 Section 3
...so in addition to computing multiple blocks in parallel, they can all be initialized by copying them from the same pre-initialized block, and then setting their counters accordingly. AVX2 sounds good. As for AVX-512, I had suspected that in most situations the performance tradeoffs for AVX-512 aren't worth it only to have that later confirmed by Cloudflare engineers. An AVX-512 implementation of ChaCha20 is really only useful as something like a CSPRNG feeding another SIMD workload. As soon as you try to combine it into the ChaCha20Poly1305 AEAD, performance suffers, so I'd recommend just stopping with AVX2. Let me get back to your questions on gating / facade / testing. I had lots of fun with these in the |
Answering your remaining questions... Gating
I'd suggest getting everything working with compile-time gating only and then circling back on runtime gating. I ended up abandoning runtime feature detection for PCLMULQDQ in the
I'm fine with pulling in Facade
In I think you could potentially have an Testing
Add whatever That said, I originally did something similar in the tl;dr: I've gone back and forth on a lot of this stuff, and don't have good answers or strong preferences either way. |
Thanks for the feedback! After trying a few different approaches, I've now settled on the following choices:
I think I'm quite happy with the changes for now, and would defer to you for any change requests you may have :) |
Looks good to me and I'll go ahead and merge. Some notes:
Yeah, this was a big pain in the ass for my (abandoned)
Oof, yeah at some point I think we can look into doing a 2018 edition upgrade for select crates. It was easier to keep them consistent at first, but it's certainly getting annoying to deal with such old Rust versions. |
@srijs do you want to do an AVX2 implementation before I do another release? I'm going to throw |
Also, looking at the optimization described in RFC8439 Section 3, it'd probably be good to do that before adding any additional backends, since that impacts how they're structured. |
I don't have a strong opinion on the release question. Supposedly releasing the sse2 changes would just be a patch version bump, which comes "for free" for any downstream consumer? Agreed that it probably makes sense to tackle the state copying optimization first, so I'm happy to give that a go. I'm not entirely sure I fully understand the intend behind it though... is the difference really just between copying the full state vs. copying consts, key, iv and counter individually? |
Looking over it again, I think we already compute part of what it's talking about in advance, which is here: https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/lib.rs#L104 For the full effect, it seems like we might compute the initial state array sans counter, then clone that |
Right, so I've prototyped this idea, but according to criterion there is no statistically significant difference between the two approaches. My guess is that the part of the optimization that's already in place (which you linked to) achieves the main part of the speed up, to the point where implementing the last part of the optimization really only results in one or two fewer loads and stores, which is not noticeable compared to the time spent in rest of the block function. |
FWIW I've also looked into So it may be useful to release the |
Yeah, honestly I'd probably just rewrite it before beginning to add an optimization like that. It feels like it's written at the wrong level of abstraction to me since it caches things in terms of the Something more like one of these implementations would make it a bit more straightforward: https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/ctr32.rs |
Sure. I was thinking about just bumping it to v0.3.0 so the SSE-enabled version had a nice round number (the main downstream consumer is the |
This implements an SSE2-based variant of ChaCha20.
I'd also be keen to help adding support for AVX2 and AVX512, but it looks like this may need larger changes (possibly in the
salsa20_core
crate) to allow for processing multiple blocks simultaneously?On my machine (early 2016 MacBook, 1.1GHz M3), the SSE2 variant shows a good improvement over the scalar code, speeding up the benchmarks from 260MB/s to 327MB/s.
Aside from the core algorithm, here are some of the things I could use feedback on:
Gating: I've only implemented compile-time feature detection for now, as that seemed easier. I could potentially add runtime detection as well, if you think that is useful? Additionally, I was considering to pull in
cfg-if
to simplify the detection logic, but wasn't sure if you wanted to have an additional dependency. Lastly, do we need to be backwards compatible to Rust versions beforecore::arch
got stabilized?Facade: I'm currently switching between the two implementations inside the
cipher
module, but I wonder if there is a better way? Potentially, theblock
module could become a facade forblock/scalar
andblock/sse2
, which would abstract away the fact that there are multiple implementations?Testing: I've opted to pregenerate random inputs that I use in the tests, which seemed like an okay approach. I was considering pulling in
proptest
to essentially randomly test the new variant against the scalar implementation, but I wasn't quite sure if you wanted to have it as a dependency.Before:
After: