You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While profiling IdpfPoplar, I've noticed significant contributions from memcpy of AES internal states and allocation and freeing of a Vec<u8> buffer. (memcpy took up 20% of a profile all-told. Allocation and freeing of the buffer mentioned below took 3%.) We could cut back on the CPU time needed for these overhead tasks with some API and internal changes, I think.
The PrgAes128 struct is 752 bytes. It wraps a Cmac<Aes128>, which consists of expanded round keys, a block for the CMAC state, and another block and u8 offset for buffering input. Looking at assembly listings, callers of PrgAes128::into_seed_stream() perform a memcpy of 752 bytes prior to the call, and then the listing of PrgAes128::into_seed_stream() contains another memcpy of 752 bytes. Similarly, calls to Prng::from_seed_stream(seed_stream) are preceded by a memcpy of 768 bytes. (the size of SeedStreamAes128)
I think we want to keep the semantics that into_seed_stream() consumes the PRG state. (which rules out changing it to take &mut self) We could box these RustCrypto types, which would make PrgAes128 and SeedStreamAes128 much smaller, and eliminate memcpys when destructuring the structs or passing them between functions.
Prng contains a SeedStream, a buffer, and some integers. The buffer is sized to hold 128 encoded field elements. For Field64, this is 1KiB. Poplar1 only ever needs two field elements from each seed input, so significant work goes to waste allocating, zeroing, and deallocating this. We could add a capacity argument to the constructors, to right-size this buffer. Prio3 may benefit from this change as well. (I think Prio3 instantiations use fewer field elements than Prio2 in typical use) Prng is a pub(crate) API, so we only need to update internal uses.
One extra optimization: I think we could replace Cmac<Aes128> with Cmac<Aes128Enc>, and Ctr64BE<Aes128> with Ctr64BE<Aes128Enc>, to avoid deriving and storing an extra set of expanded round keys. This was incorrect, Aes128, Aes128Enc, and Aes128Dec all hold the same set of round keys. They only differ in what ZST tokens they hold to control whether they implement encryption and decryption traits.
The text was updated successfully, but these errors were encountered:
Great analysis! However I would suggest holding off on further work on this for now.
For security reasons, in VDAF-04 we will replace PrgAes128 with an extensible output hash function, such as SHA-3 or KangarooTwelve: cfrg/draft-irtf-cfrg-vdaf#106.
One concern about this change is that it'll incur a performance penalty. The cost is negligible for Prio3 (see benchmarks reported in that issue), as the CPU time seems to be dominated by field arithmetic. It remains to be seen whether Poplar1 will be performant enough with SHA-3; but even if we want a PRG based on AES, we plan to move to some fixed-key mode: cfrg/draft-irtf-cfrg-vdaf#32.
While profiling IdpfPoplar, I've noticed significant contributions from
memcpy
of AES internal states and allocation and freeing of aVec<u8>
buffer. (memcpy
took up 20% of a profile all-told. Allocation and freeing of the buffer mentioned below took 3%.) We could cut back on the CPU time needed for these overhead tasks with some API and internal changes, I think.The
PrgAes128
struct is 752 bytes. It wraps aCmac<Aes128>
, which consists of expanded round keys, a block for the CMAC state, and another block andu8
offset for buffering input. Looking at assembly listings, callers ofPrgAes128::into_seed_stream()
perform a memcpy of 752 bytes prior to the call, and then the listing ofPrgAes128::into_seed_stream()
contains another memcpy of 752 bytes. Similarly, calls toPrng::from_seed_stream(seed_stream)
are preceded by a memcpy of 768 bytes. (the size ofSeedStreamAes128
)I think we want to keep the semantics that
into_seed_stream()
consumes the PRG state. (which rules out changing it to take&mut self
) We could box these RustCrypto types, which would makePrgAes128
andSeedStreamAes128
much smaller, and eliminate memcpys when destructuring the structs or passing them between functions.Prng
contains aSeedStream
, a buffer, and some integers. The buffer is sized to hold 128 encoded field elements. ForField64
, this is 1KiB. Poplar1 only ever needs two field elements from each seed input, so significant work goes to waste allocating, zeroing, and deallocating this. We could add a capacity argument to the constructors, to right-size this buffer. Prio3 may benefit from this change as well. (I think Prio3 instantiations use fewer field elements than Prio2 in typical use)Prng
is apub(crate)
API, so we only need to update internal uses.One extra optimization: I think we could replaceThis was incorrect,Cmac<Aes128>
withCmac<Aes128Enc>
, andCtr64BE<Aes128>
withCtr64BE<Aes128Enc>
, to avoid deriving and storing an extra set of expanded round keys.Aes128
,Aes128Enc
, andAes128Dec
all hold the same set of round keys. They only differ in what ZST tokens they hold to control whether they implement encryption and decryption traits.The text was updated successfully, but these errors were encountered: