Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chacha20: add SSE2 accelerated variant #61

Merged
merged 1 commit into from
Oct 19, 2019

Conversation

srijs
Copy link
Contributor

@srijs srijs commented Oct 18, 2019

This implements an SSE2-based variant of ChaCha20.

I'd also be keen to help adding support for AVX2 and AVX512, but it looks like this may need larger changes (possibly in the salsa20_core crate) to allow for processing multiple blocks simultaneously?

On my machine (early 2016 MacBook, 1.1GHz M3), the SSE2 variant shows a good improvement over the scalar code, speeding up the benchmarks from 260MB/s to 327MB/s.

Aside from the core algorithm, here are some of the things I could use feedback on:

  • Gating: I've only implemented compile-time feature detection for now, as that seemed easier. I could potentially add runtime detection as well, if you think that is useful? Additionally, I was considering to pull in cfg-if to simplify the detection logic, but wasn't sure if you wanted to have an additional dependency. Lastly, do we need to be backwards compatible to Rust versions before core::arch got stabilized?

  • Facade: I'm currently switching between the two implementations inside the cipher module, but I wonder if there is a better way? Potentially, the block module could become a facade for block/scalar and block/sse2, which would abstract away the fact that there are multiple implementations?

  • Testing: I've opted to pregenerate random inputs that I use in the tests, which seemed like an okay approach. I was considering pulling in proptest to essentially randomly test the new variant against the scalar implementation, but I wasn't quite sure if you wanted to have it as a dependency.


Before:

test bench1_10     ... bench:          48 ns/iter (+/- 22) = 208 MB/s
test bench2_100    ... bench:         392 ns/iter (+/- 15) = 255 MB/s
test bench3_1000   ... bench:       4,043 ns/iter (+/- 642) = 247 MB/s
test bench4_10000  ... bench:      38,370 ns/iter (+/- 1,619) = 260 MB/s
test bench5_100000 ... bench:     383,558 ns/iter (+/- 9,745) = 260 MB/s

After:

test bench1_10     ... bench:          40 ns/iter (+/- 3) = 250 MB/s
test bench2_100    ... bench:         317 ns/iter (+/- 20) = 315 MB/s
test bench3_1000   ... bench:       3,082 ns/iter (+/- 117) = 324 MB/s
test bench4_10000  ... bench:      30,681 ns/iter (+/- 808) = 325 MB/s
test bench5_100000 ... bench:     305,280 ns/iter (+/- 14,035) = 327 MB/s

@srijs srijs force-pushed the chacha20-sse2 branch 3 times, most recently from f890007 to ff26e7c Compare October 18, 2019 12:36
@tarcieri
Copy link
Member

tarcieri commented Oct 18, 2019

SSE2 variant shows a good improvement over the scalar code, speeding up the benchmarks from 260MB/s to 327MB/s.

Awesome, this looks good at first glance.

I'd also be keen to help adding support for AVX2 and AVX512, but it looks like this may need larger changes (possibly in the salsa20_core crate) to allow for processing multiple blocks simultaneously?

Yes. It might make sense to look at ways to improve/simplify the salsa20-core crate as well, as its implementation is rather complex and not particularly optimal. Perhaps it would benefit from taking a 32-bit counter and outputting 512-bit blocks as [u8; 64], rather than the current approach which operates directly on the [u32; 8] output array.

Additionally note we haven't implemented an optimization where the ChaCha (or Salsa) where the input array / block state is computed once, and then copied with only the counter updated to avoid setting it up again over and over for each block, as described in RFC 8439 Section 3

Each block of ChaCha20 involves 16 move operations and one increment operation for loading the state, 80 each of XOR, addition and roll operations for the rounds, 16 more add operations and 16 XOR operations for protecting the plaintext. Section 2.3 describes the ChaCha block function as "adding the original input words". This implies that before starting the rounds on the ChaCha state, we copy it aside, only to add it in later. This is correct, but we can save a few operations if we instead copy the state and do the work on the copy. This way, for the next block you don't need to recreate the state, but only to increment the block counter. This saves approximately 5.5% of the cycles.

...so in addition to computing multiple blocks in parallel, they can all be initialized by copying them from the same pre-initialized block, and then setting their counters accordingly.

AVX2 sounds good. As for AVX-512, I had suspected that in most situations the performance tradeoffs for AVX-512 aren't worth it only to have that later confirmed by Cloudflare engineers. An AVX-512 implementation of ChaCha20 is really only useful as something like a CSPRNG feeding another SIMD workload. As soon as you try to combine it into the ChaCha20Poly1305 AEAD, performance suffers, so I'd recommend just stopping with AVX2.

Let me get back to your questions on gating / facade / testing. I had lots of fun with these in the polyval crate, but don't have time to write them up right now.

@tarcieri
Copy link
Member

Answering your remaining questions...

Gating

I've only implemented compile-time feature detection for now, as that seemed easier. I could potentially add runtime detection as well, if you think that is useful?

I'd suggest getting everything working with compile-time gating only and then circling back on runtime gating. I ended up abandoning runtime feature detection for PCLMULQDQ in the polyval crate, but may end up circling back on it eventually.

Additionally, I was considering to pull in cfg-if to simplify the detection logic, but wasn't sure if you wanted to have an additional dependency. Lastly, do we need to be backwards compatible to Rust versions before core::arch got stabilized?

I'm fine with pulling in cfg-if if you find it helpful. However, I tried to use it to simplify the gating in the polyval PR linked above only to discover it was too inexpressive to really help simplify the gating there.

Facade

I'm currently switching between the two implementations inside the cipher module, but I wonder if there is a better way? Potentially, the block module could become a facade for block/scalar and block/sse2, which would abstract away the fact that there are multiple implementations?

In polyval there's a field::Element facade/newtype, but I've been wondering if that might make sense to replace with a trait or thereabouts.

I think you could potentially have an SseCipher type which impls the same API, which the ChaCha20 type consumes directly.

Testing

I've opted to pregenerate random inputs that I use in the tests, which seemed like an okay approach. I was considering pulling in proptest to essentially randomly test the new variant against the scalar implementation, but I wasn't quite sure if you wanted to have it as a dependency.

Add whatever dev-dependencies you like. That sounds interesting.

That said, I originally did something similar in the polyval crate only to abandon it after changing the conditional gating so only one implementation ends up getting compiled into the crate. I tried to support multiple implementations in the runtime feature detection PR, but abandoned it.

tl;dr: I've gone back and forth on a lot of this stuff, and don't have good answers or strong preferences either way.

@srijs
Copy link
Contributor Author

srijs commented Oct 19, 2019

Thanks for the feedback! After trying a few different approaches, I've now settled on the following choices:

  • Moved the SSE2 implementation to be a submodule of the block module.
  • Added a comparison test to check the SIMD implementation against the scalar one.
  • Kept the feature detection static, because making the #[no_std] directive optional via feature flag is probably a bit more work.
  • Kept the feature branching inside cipher, because the module seemed simple enough and I'm not quite sure this really needs more facade.
  • Did not pull in cfg-if, because it is incompatible with Rust 1.27 (CI was failing).
  • Did not pull in proptest, since it needs std to work properly (or at least alloc on nightly).

I think I'm quite happy with the changes for now, and would defer to you for any change requests you may have :)

@tarcieri
Copy link
Member

Looks good to me and I'll go ahead and merge. Some notes:

Kept the feature detection static, because making the #[no_std] directive optional via feature flag is probably a bit more work.

Yeah, this was a big pain in the ass for my (abandoned) polyval runtime feature detection for PCLMULQDQ PR as well. I had the std feature off-by-default before that.

Did not pull in cfg-if, because it is incompatible with Rust 1.27 (CI was failing).

Oof, yeah at some point I think we can look into doing a 2018 edition upgrade for select crates. It was easier to keep them consistent at first, but it's certainly getting annoying to deal with such old Rust versions.

@tarcieri tarcieri merged commit 890823a into RustCrypto:master Oct 19, 2019
@tarcieri
Copy link
Member

@srijs do you want to do an AVX2 implementation before I do another release?

I'm going to throw criterion with the cycles-per-byte plugin on this to get some better benchmarks.

@tarcieri
Copy link
Member

Also, looking at the optimization described in RFC8439 Section 3, it'd probably be good to do that before adding any additional backends, since that impacts how they're structured.

@srijs srijs deleted the chacha20-sse2 branch October 20, 2019 01:05
@srijs
Copy link
Contributor Author

srijs commented Oct 20, 2019

I don't have a strong opinion on the release question. Supposedly releasing the sse2 changes would just be a patch version bump, which comes "for free" for any downstream consumer?

Agreed that it probably makes sense to tackle the state copying optimization first, so I'm happy to give that a go. I'm not entirely sure I fully understand the intend behind it though... is the difference really just between copying the full state vs. copying consts, key, iv and counter individually?

@tarcieri
Copy link
Member

Looking over it again, I think we already compute part of what it's talking about in advance, which is here:

https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/lib.rs#L104

For the full effect, it seems like we might compute the initial state array sans counter, then clone that

@srijs
Copy link
Contributor Author

srijs commented Oct 20, 2019

Right, so I've prototyped this idea, but according to criterion there is no statistically significant difference between the two approaches.

My guess is that the part of the optimization that's already in place (which you linked to) achieves the main part of the speed up, to the point where implementing the last part of the optimization really only results in one or two fewer loads and stores, which is not noticeable compared to the time spent in rest of the block function.

@srijs
Copy link
Contributor Author

srijs commented Oct 20, 2019

FWIW I've also looked into salsa20-core a bit, to try and understand what it would take to allow processing two blocks simultaneously using AVX2, and I think it may take me a little bit longer to figure out.

So it may be useful to release the sse2 improvements in the meantime, and then ship the AVX2 implementation once it's ready.

@tarcieri
Copy link
Member

FWIW I've also looked into salsa20-core a bit, to try and understand what it would take to allow processing two blocks simultaneously using AVX2, and I think it may take me a little bit longer to figure out.

Yeah, honestly I'd probably just rewrite it before beginning to add an optimization like that. It feels like it's written at the wrong level of abstraction to me since it caches things in terms of the u32 output array, rather than a byte-oriented interface.

Something more like one of these implementations would make it a bit more straightforward:

https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/ctr32.rs
https://github.com/RustCrypto/stream-ciphers/blob/master/ctr/src/lib.rs

@tarcieri
Copy link
Member

So it may be useful to release the sse2 improvements in the meantime, and then ship the AVX2 implementation once it's ready.

Sure. I was thinking about just bumping it to v0.3.0 so the SSE-enabled version had a nice round number (the main downstream consumer is the chacha20poly1305 crate, where we can do a minor bump)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants