-
-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use c2-chacha for stable, runtime-dispatched SIMD #789
Conversation
Performance: gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)] Minor version bump: the only breaking change is that no-std builds now require default-features=false (std is required by default for runtime cpu detection; no_std builds will use the best implementation supported by the target-features/target-cpu enabled at compile time) New functionality: ChaChaXRng is parameterized by round count at compile time. Convenient aliases for the typical 20/12/8 round implementations exposed. ChaChaRng is aliased to ChaCha20Rng for backward compatibility. Closes rust-random#667
More familiar numbers; evenness requirement enforced by PartialDiv.
Xeon 1231 aka Haswell:
Certainly a nice improvement! Any thoughts @kazcw on which should be our default for BTW I pushed additional benchmark code to my fork. |
@burdges would you be able to review this? |
Those are some very nice performance improvements! HC128 being faster than ChaCha20 and ChaCha12 while being slower than ChaCha8 is consistent with the eBACS benchmarks. However, it seems like the speed difference between ChaCha20 and HC128 is less than in our case, so I wonder whether there is still some potential for optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
One concern is the number of crates this depends on (c2-chacha
, byteorder
, generic-array
, typenum
, ppv-lite86
). We've had a few complaints that Rand pulls in too many dependencies; if we now use this for StdRng
then we won't be reducing the number of dependencies as previously promised. This is not a blocking issue but we should consider whether there are good alternatives.
Not really; ignoring initialisation time ChaCha20 is more than twice as fast as HC128 on several of those benchmarks. |
Did you mean including initialization time? |
IETF ChaCha has an initialization phase, but not ChaCha20. I'm not sure if I'm the ideal person to review this really. |
Who is the ideal person to review this then? You don't have to worry so much about correctness since we have test vectors; it's more about the API, approach and style. |
@dhardy Although the security margin of ChaCha12 is thought to be solid, AFAIK no one is currently using ChaCha with less than 20 rounds for anything important. My inclination would be to default to ChaCha20, to avoid being on the forefront of trusting anything out-of-the-box. @vks There's room for more performance improvement: the AVX2 backend of ppv-lite86 is currently disabled, so right now only machine features up through SSE4.1 are in use. Once I get that code updated to the new ppv-lite API, performance should be within spitting distance of hand-optimized assembly for this algo. @dhardy I can cut down the dependencies. I'll see what I can do. |
Eliminated all transitive deps for rand_chacha except c2-chacha (common impl for rand and stream-cipher) and ppv-lite86 (SIMD implementation).
miri can't handle libstd's CPU detection. I presume it can't do the SIMD intrinsics either but I'll address that in ppv-lite.
I don't know if any user would want this but it's needed for miri
Thanks @kazcw. Rand is in the interesting position of trying to provide a "general purpose generator", and because of that we don't necessarily want to make the same compromises as others. I wish we could allow configuration by the end user, but the only option we have for that are feature-flags, which are not ideal for this kind of configuration. (I suppose we could have several flags like |
Interface looks fine. It's basically the same, no? It's fine using distinct types for round count. I avoid dependencies on RustCrypto whenever possible myself, but maybe not sensible here. Ideally rust would've implicit feature flags other crates in the same build, so RustCrypto would only become a dependency if it was a dependency anyways. I have not dug into the c2_chacha code. |
avx2 has landed in the SIMD backend, so that should make your fancy-schmancy Haswells happy 😄. Benchmark comparison from a GCE instance:
|
Certianly looks good in the micro-benches 😄
We ought to test non-Intel CPUs too; I guess ChaCha20 is a lot slower on ARMv7? Anyway, I think this PR is ready now? Thanks for all the contributions @kazcw! |
Dependencies are now down to |
|
||
[build-dependencies] | ||
autocfg = "0.1" | ||
|
||
[features] | ||
default = ["std"] | ||
default = ["std", "simd"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a breaking change, because the simd
feature does not change the API, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, disabling simd
just forces use of the same portable implementation as architectures that don't have SIMD backends.
Performance:
gen_bytes_chacha20: 254 MB/s -> 603 MB/s [on a Xeon L5630 (SSE4.1)]
Minor version bump:
the only breaking change is that no-std builds now require
default-features=false (std is required by default for runtime cpu
detection; no_std builds will use the best implementation supported by
the target-features/target-cpu enabled at compile time)
New functionality:
ChaChaXRng is parameterized by round count at compile time. Convenient
aliases for the typical 20/12/8 round implementations exposed. ChaChaRng
is aliased to ChaCha20Rng for backward compatibility.
Closes #667