Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Closed
wants to merge 1,931 commits into from

Conversation

Ko-
Copy link
Contributor

@Ko- Ko- commented Aug 19, 2018

This fixes #1636 and greatly speeds up bignum operations on ARMv8-A / Aarch64 platforms. It still passes all tests.

Benchmark ($ ./programs/test/benchmark) done on a Pixel XL:

                                 before              just 64-bit limbs      64-bit limbs + assembly multiplication
  RSA-2048                 :    2883  public/s         3678  public/s         12679  public/s
  RSA-2048                 :      87 private/s          115 private/s           355 private/s
  RSA-4096                 :     740  public/s          951  public/s          3478  public/s
  RSA-4096                 :      13 private/s           16 private/s            56 private/s
  DHE-2048                 :      13 handshake/s         16 handshake/s          57 handshake/s
  DH-2048                  :      25 handshake/s         31 handshake/s         111 handshake/s
  DHE-3072                 :       4 handshake/s          5 handshake/s          18 handshake/s
  DH-3072                  :       8 handshake/s          9 handshake/s          35 handshake/s
  ECDSA-secp521r1          :     272 sign/s             334 sign/s              438 sign/s
  ECDSA-brainpoolP512r1    :      29 sign/s              51 sign/s               57 sign/s
  ECDSA-secp384r1          :     463 sign/s             539 sign/s              644 sign/s
  ECDSA-brainpoolP384r1    :      59 sign/s             115 sign/s              127 sign/s
  ECDSA-secp256r1          :     664 sign/s             755 sign/s              794 sign/s
  ECDSA-secp256k1          :     670 sign/s             723 sign/s              821 sign/s
  ECDSA-brainpoolP256r1    :      97 sign/s             193 sign/s              207 sign/s
  ECDSA-secp224r1          :     948 sign/s            1056 sign/s             1170 sign/s
  ECDSA-secp224k1          :     787 sign/s             830 sign/s              938 sign/s
  ECDSA-secp192r1          :    1199 sign/s            1415 sign/s             1566 sign/s
  ECDSA-secp192k1          :     950 sign/s            1013 sign/s             1144 sign/s
  ECDSA-secp521r1          :      75 verify/s            85 verify/s            120 verify/s
  ECDSA-brainpoolP512r1    :       6 verify/s            11 verify/s             12 verify/s
  ECDSA-secp384r1          :     127 verify/s           146 verify/s            179 verify/s
  ECDSA-brainpoolP384r1    :      13 verify/s            25 verify/s             28 verify/s
  ECDSA-secp256r1          :     194 verify/s           217 verify/s            230 verify/s
  ECDSA-secp256k1          :     200 verify/s           215 verify/s            240 verify/s
  ECDSA-brainpoolP256r1    :      24 verify/s            49 verify/s             52 verify/s
  ECDSA-secp224r1          :     285 verify/s           321 verify/s            353 verify/s
  ECDSA-secp224k1          :     232 verify/s           240 verify/s            270 verify/s
  ECDSA-secp192r1          :     381 verify/s           426 verify/s            459 verify/s
  ECDSA-secp192k1          :     286 verify/s           309 verify/s            343 verify/s
  ECDHE-secp521r1          :      79 handshake/s         92 handshake/s         130 handshake/s
  ECDHE-brainpoolP512r1    :       6 handshake/s         11 handshake/s          12 handshake/s
  ECDHE-secp384r1          :     137 handshake/s        157 handshake/s         194 handshake/s
  ECDHE-brainpoolP384r1    :      13 handshake/s         26 handshake/s          27 handshake/s
  ECDHE-secp256r1          :     205 handshake/s        227 handshake/s         253 handshake/s
  ECDHE-secp256k1          :     213 handshake/s        225 handshake/s         258 handshake/s
  ECDHE-brainpoolP256r1    :      24 handshake/s         49 handshake/s          51 handshake/s
  ECDHE-secp224r1          :     312 handshake/s        343 handshake/s         380 handshake/s
  ECDHE-secp224k1          :     247 handshake/s        252 handshake/s         290 handshake/s
  ECDHE-secp192r1          :     415 handshake/s        469 handshake/s         522 handshake/s
  ECDHE-secp192k1          :     304 handshake/s        324 handshake/s         372 handshake/s
  ECDHE-Curve25519         :     157 handshake/s        175 handshake/s         200 handshake/s
  ECDH-secp521r1           :     106 handshake/s        118 handshake/s         174 handshake/s
  ECDH-brainpoolP512r1     :       8 handshake/s         15 handshake/s          16 handshake/s
  ECDH-secp384r1           :     179 handshake/s        210 handshake/s         256 handshake/s
  ECDH-brainpoolP384r1     :      17 handshake/s         34 handshake/s          35 handshake/s
  ECDH-secp256r1           :     281 handshake/s        312 handshake/s         344 handshake/s
  ECDH-secp256k1           :     291 handshake/s        312 handshake/s         359 handshake/s
  ECDH-brainpoolP256r1     :      33 handshake/s         67 handshake/s          69 handshake/s
  ECDH-secp224r1           :     420 handshake/s        463 handshake/s         520 handshake/s
  ECDH-secp224k1           :     335 handshake/s        346 handshake/s         402 handshake/s
  ECDH-secp192r1           :     565 handshake/s        629 handshake/s         712 handshake/s
  ECDH-secp192k1           :     418 handshake/s        449 handshake/s         508 handshake/s
  ECDH-Curve25519          :     324 handshake/s        364 handshake/s         411 handshake/s

Similar improvements were measured on a big Qualcomm server. On a Raspberry Pi3 (Cortex-A53) only enabling 64-bit limbs improved performance.

@RonEld RonEld added enhancement CLA requested component-crypto Crypto primitives and low-level interfaces labels Aug 20, 2018
@RonEld
Copy link
Contributor

RonEld commented Aug 20, 2018

@Ko- Thank you for your contribution!
It seems that many can benefit from it.

In addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.

If this is a personal contribution, the easiest way to do this is if you create an mbed account and accept this click through agreement. Alternatively, you can find a slightly different agreement to sign here, which can be signed and returned to us, and is applicable if you don't want to create an mbed account or alternatively if this is a corporate contribution.

Thanks for your understanding and again, thanks for the contribution!

@Ko-
Copy link
Contributor Author

Ko- commented Aug 20, 2018

I signed the CLA yesterday.

@RonEld
Copy link
Contributor

RonEld commented Aug 20, 2018

@Ko- Thank you for signing the CLA!

Did you sign it with your Mbed account?
Unfortunately, the Mbed accounts and github accounts are not linked, and we need to manually search your user name. Could you please refer us to your Mbed user?

@RonEld
Copy link
Contributor

RonEld commented Aug 20, 2018

@Ko- Thank you for your your information. As mentioned, we can't link between github account.
We also can't search Mbed account through Email. Can you please refer us to your Mbed account?

@Ko-
Copy link
Contributor Author

Ko- commented Aug 20, 2018

It's andrikos1 I believe

@@ -135,9 +135,8 @@
typedef unsigned int mbedtls_t_udbl __attribute__((mode(TI)));
#define MBEDTLS_HAVE_UDBL
#endif /* !MBEDTLS_NO_UDBL_DIVISION */
#elif defined(__ARMCC_VERSION) && defined(__aarch64__)
#elif defined(__aarch64__)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah maybe you would actually prefer to add __aarch64__ to the __GNUC__ list because of the __uint128_t versus __attribute__((mode(TI)))? I could change that. Although I believe that both GCC and Clang have defined __uint128_t on 64-bit builds for ages, so I'm not sure if it would matter.

Copy link
Contributor

@Patater Patater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution. I have a few questions and comments.

@@ -135,9 +135,8 @@
typedef unsigned int mbedtls_t_udbl __attribute__((mode(TI)));
#define MBEDTLS_HAVE_UDBL
#endif /* !MBEDTLS_NO_UDBL_DIVISION */
#elif defined(__ARMCC_VERSION) && defined(__aarch64__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not add __aarch64__ in the #elif defined(__GNUC__) section with the other 64-bit architectures? GCC and ARMCC handle MBEDTLS_HAVE_UDBL differently, so __aarch64__ should be added to the GCC section. GCC shouldn't use the ARMCC section.

Copy link
Contributor Author

@Ko- Ko- Aug 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Changed.

"str x5, [%1], #8 \n\t"

#define MULADDC_STOP \
: "+r" (c), "=r" (d), "=r" (s) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need the "+r" (c) when you already have it listed in the input list? Wouldn't = be okay to use here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently yes. I first had = but then the tests failed. Maybe it would be fine to only have the read-write output register but to not list it separately as an input register. I'm not sure if there's a difference in semantics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No difference that I'm aware of, although some old compilers might not support +. Verying between architectures, probably based on who wrote it, we've used = and repeated the register in both input and output lists, or used + and avoided repeating the register in the input list.

#define MULADDC_STOP \
: "+r" (c), "=r" (d), "=r" (s) \
: "r" (s), "r" (d), "r" (c), "r" (b) \
: "x4", "x5", "x6", "x7", "cc" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At which optimization levels does this inline assembly work properly at? The set of registers available for use when using -O0 should be enough, but wouldn't hurt to have checked.

Copy link
Contributor Author

@Ko- Ko- Aug 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was with -O2. I now see that it fails with -O0. I'll investigate this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consulted the documentation of the extended asm syntax and pushed a fix. Now both debug and release builds pass the tests.

Patater
Patater previously approved these changes Aug 21, 2018
Copy link
Contributor

@Patater Patater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Needs @yanesca review still

@Ko-
Copy link
Contributor Author

Ko- commented Sep 3, 2018

@yanesca Any idea when you have time to review this?

@yanesca
Copy link
Contributor

yanesca commented Sep 3, 2018

@Ko- I have a quite long review queue at the moment, so I can't tell when can I get to this PR, but I will review it as soon as I can!

yanesca
yanesca previously approved these changes Mar 8, 2019
Copy link
Contributor

@yanesca yanesca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@simonbutcher
Copy link
Contributor

retest

@simonbutcher simonbutcher added the needs-ci Needs to pass CI tests label Apr 2, 2019
@lnksz
Copy link

lnksz commented Jul 17, 2019

This is a totally cool feature, why is it pending still?

@RonEld
Copy link
Contributor

RonEld commented Jul 17, 2019

@lnksz Thank you for the reminder!
I have put this PR to our review backlog

Pass information via a key attribute structure rather than as separate
parameters to psa_crypto_storage functions. This makes it easier to
maintain the code when the metadata of a key evolves.

This has negligible impact on code size (+4B with "gcc -Os" on x86_64).
Stored keys must contain lifetime information. The lifetime used to be
implied by the location of the key, back when applications supplied
the lifetime value when opening the key. Now that all keys' metadata
are stored in a central location, this location needs to store the
lifetime explicitly.
For a key in a secure element, persist the key slot.

This is implemented in the nominal case. Failures may not be handled
properly.
Store the persistent data of secure element drivers.

This is fully implemented, but not at all tested.
athoelke and others added 11 commits August 22, 2019 15:52
Clarify the documented behavior and lifetime of key handles [v2]
Remove a redundant function call
Remove unused TG variable in mbedtls_mpi_gcd()
The crypto.c implementation of psa_inject_entropy() didn't match the
declaration in crypto_extra.h. Use a const seed in both files.
Use the PSA_KEY_POLICY_INIT macro in the definition of
PSA_CORE_KEY_ATTRIBUTES_INIT in order to avoid duplicating the key
policy initializer.
A macro useful for initializing psa_key_id_t, whether
MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER is set or not. Without this
macro, it is necessary to know if
MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER as with it the key ID is
non-scalar and needs to be initialized with {0, 0}, and 0 otherwise when
key ID is scalar.
Avoid compiler errors when MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER
is set by using the application ID type.

    [Error] psa_crypto_slot_management.c@175,9: used type 'psa_key_id_t' (aka 'psa_key_file_id_t') where arithmetic or pointer type is required
@Patater
Copy link
Contributor

Patater commented Aug 29, 2019

Please resolve conflicts and ensure testing passes, then we can merge.

Patater and others added 9 commits August 29, 2019 11:31
…-legacy-psa-key-derivation

Remove legacy psa key derivation
Make fixes related to using Mbed Crypto as a service
According to SP800-90A, the DRBG seeding process should use a nonce
of length `security_strength / 2` bits as part of the DRBG seed. It
further notes that this nonce may be drawn from the same source of
entropy that is used for the first `security_strength` bits of the
DRBG seed. The present HMAC DRBG implementation does that, requesting
`security_strength * 3 / 2` bits of entropy from the configured entropy
source in total to form the initial part of the DRBG seed.

However, some entropy sources may have thresholds in terms of how much
entropy they can provide in a single call to their entropy gathering
function which may be exceeded by the present HMAC DRBG implementation
even if the threshold is not smaller than `security_strength` bits.
Specifically, this is the case for our own entropy module implementation
which only allows requesting at most 32 Bytes of entropy at a time
in configurations disabling SHA-512, and this leads to runtime failure
of HMAC DRBG when used with Mbed Crypto' own entropy callbacks in such
configurations.

This commit fixes this by splitting the seed entropy acquisition into
two calls, one requesting `security_strength` bits first, and another
one requesting `security_strength / 2` bits for the nonce.

Fixes Mbed-TLS#237.
…mpatibility_fix-crypto

HMAC DRBG: Split entropy-gathering requests to reduce request sizes
GCC and Clang do not define __ARMCC_VERSION when building for Aarch64.
Yet they should also use 64-bit limbs for Aarch64 builds.
x0-x3 are skipped such that function parameters to not have to be moved.
MULADDC_INIT and MULADDC_STOP are mostly empty because it is more
efficient to keep everything in registers (and that should easily be
possible). I considered a MULADDC_HUIT implementation, but could not
think of something that would be more efficient than basically 8
consecutive MULADDC_CORE. You could combine the loads and stores, but
it's probably more efficient to interleave them with arithmetic,
depending on the specific microarchitecture. NEON allows to do a
64x64->128 bit multiplication (and optional accumulation) in one
instruction, but is not great at handling carries.
@Ko- Ko- dismissed stale reviews from yanesca and Patater via cb260bb September 2, 2019 11:50
@Ko-
Copy link
Contributor Author

Ko- commented Sep 2, 2019

This PR should now be made to mbed-crypto instead. I'll create a new one there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component-crypto Crypto primitives and low-level interfaces enhancement needs-ci Needs to pass CI tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make GCC armv8 (aarch64) compilation use 64bit bignum