Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Ko- · 2018-08-19T19:06:44Z

This fixes #1636 and greatly speeds up bignum operations on ARMv8-A / Aarch64 platforms. It still passes all tests.

Benchmark ($ ./programs/test/benchmark) done on a Pixel XL:

                                 before              just 64-bit limbs      64-bit limbs + assembly multiplication
  RSA-2048                 :    2883  public/s         3678  public/s         12679  public/s
  RSA-2048                 :      87 private/s          115 private/s           355 private/s
  RSA-4096                 :     740  public/s          951  public/s          3478  public/s
  RSA-4096                 :      13 private/s           16 private/s            56 private/s
  DHE-2048                 :      13 handshake/s         16 handshake/s          57 handshake/s
  DH-2048                  :      25 handshake/s         31 handshake/s         111 handshake/s
  DHE-3072                 :       4 handshake/s          5 handshake/s          18 handshake/s
  DH-3072                  :       8 handshake/s          9 handshake/s          35 handshake/s
  ECDSA-secp521r1          :     272 sign/s             334 sign/s              438 sign/s
  ECDSA-brainpoolP512r1    :      29 sign/s              51 sign/s               57 sign/s
  ECDSA-secp384r1          :     463 sign/s             539 sign/s              644 sign/s
  ECDSA-brainpoolP384r1    :      59 sign/s             115 sign/s              127 sign/s
  ECDSA-secp256r1          :     664 sign/s             755 sign/s              794 sign/s
  ECDSA-secp256k1          :     670 sign/s             723 sign/s              821 sign/s
  ECDSA-brainpoolP256r1    :      97 sign/s             193 sign/s              207 sign/s
  ECDSA-secp224r1          :     948 sign/s            1056 sign/s             1170 sign/s
  ECDSA-secp224k1          :     787 sign/s             830 sign/s              938 sign/s
  ECDSA-secp192r1          :    1199 sign/s            1415 sign/s             1566 sign/s
  ECDSA-secp192k1          :     950 sign/s            1013 sign/s             1144 sign/s
  ECDSA-secp521r1          :      75 verify/s            85 verify/s            120 verify/s
  ECDSA-brainpoolP512r1    :       6 verify/s            11 verify/s             12 verify/s
  ECDSA-secp384r1          :     127 verify/s           146 verify/s            179 verify/s
  ECDSA-brainpoolP384r1    :      13 verify/s            25 verify/s             28 verify/s
  ECDSA-secp256r1          :     194 verify/s           217 verify/s            230 verify/s
  ECDSA-secp256k1          :     200 verify/s           215 verify/s            240 verify/s
  ECDSA-brainpoolP256r1    :      24 verify/s            49 verify/s             52 verify/s
  ECDSA-secp224r1          :     285 verify/s           321 verify/s            353 verify/s
  ECDSA-secp224k1          :     232 verify/s           240 verify/s            270 verify/s
  ECDSA-secp192r1          :     381 verify/s           426 verify/s            459 verify/s
  ECDSA-secp192k1          :     286 verify/s           309 verify/s            343 verify/s
  ECDHE-secp521r1          :      79 handshake/s         92 handshake/s         130 handshake/s
  ECDHE-brainpoolP512r1    :       6 handshake/s         11 handshake/s          12 handshake/s
  ECDHE-secp384r1          :     137 handshake/s        157 handshake/s         194 handshake/s
  ECDHE-brainpoolP384r1    :      13 handshake/s         26 handshake/s          27 handshake/s
  ECDHE-secp256r1          :     205 handshake/s        227 handshake/s         253 handshake/s
  ECDHE-secp256k1          :     213 handshake/s        225 handshake/s         258 handshake/s
  ECDHE-brainpoolP256r1    :      24 handshake/s         49 handshake/s          51 handshake/s
  ECDHE-secp224r1          :     312 handshake/s        343 handshake/s         380 handshake/s
  ECDHE-secp224k1          :     247 handshake/s        252 handshake/s         290 handshake/s
  ECDHE-secp192r1          :     415 handshake/s        469 handshake/s         522 handshake/s
  ECDHE-secp192k1          :     304 handshake/s        324 handshake/s         372 handshake/s
  ECDHE-Curve25519         :     157 handshake/s        175 handshake/s         200 handshake/s
  ECDH-secp521r1           :     106 handshake/s        118 handshake/s         174 handshake/s
  ECDH-brainpoolP512r1     :       8 handshake/s         15 handshake/s          16 handshake/s
  ECDH-secp384r1           :     179 handshake/s        210 handshake/s         256 handshake/s
  ECDH-brainpoolP384r1     :      17 handshake/s         34 handshake/s          35 handshake/s
  ECDH-secp256r1           :     281 handshake/s        312 handshake/s         344 handshake/s
  ECDH-secp256k1           :     291 handshake/s        312 handshake/s         359 handshake/s
  ECDH-brainpoolP256r1     :      33 handshake/s         67 handshake/s          69 handshake/s
  ECDH-secp224r1           :     420 handshake/s        463 handshake/s         520 handshake/s
  ECDH-secp224k1           :     335 handshake/s        346 handshake/s         402 handshake/s
  ECDH-secp192r1           :     565 handshake/s        629 handshake/s         712 handshake/s
  ECDH-secp192k1           :     418 handshake/s        449 handshake/s         508 handshake/s
  ECDH-Curve25519          :     324 handshake/s        364 handshake/s         411 handshake/s

Similar improvements were measured on a big Qualcomm server. On a Raspberry Pi3 (Cortex-A53) only enabling 64-bit limbs improved performance.

RonEld · 2018-08-20T06:40:37Z

@Ko- Thank you for your contribution!
It seems that many can benefit from it.

In addition, unfortunately our policy is to not accept contributions, without a Contributor’s Licence Agreement (CLA) signed or authorised by yourself or your employer.

If this is a personal contribution, the easiest way to do this is if you create an mbed account and accept this click through agreement. Alternatively, you can find a slightly different agreement to sign here, which can be signed and returned to us, and is applicable if you don't want to create an mbed account or alternatively if this is a corporate contribution.

Thanks for your understanding and again, thanks for the contribution!

Ko- · 2018-08-20T06:46:48Z

I signed the CLA yesterday.

RonEld · 2018-08-20T06:49:22Z

@Ko- Thank you for signing the CLA!

Did you sign it with your Mbed account?
Unfortunately, the Mbed accounts and github accounts are not linked, and we need to manually search your user name. Could you please refer us to your Mbed user?

RonEld · 2018-08-20T06:58:26Z

@Ko- Thank you for your your information. As mentioned, we can't link between github account.
We also can't search Mbed account through Email. Can you please refer us to your Mbed account?

Ko- · 2018-08-20T07:26:10Z

It's andrikos1 I believe

Ko- · 2018-08-20T10:56:39Z

include/mbedtls/bignum.h

@@ -135,9 +135,8 @@
            typedef unsigned int mbedtls_t_udbl __attribute__((mode(TI)));
            #define MBEDTLS_HAVE_UDBL
        #endif /* !MBEDTLS_NO_UDBL_DIVISION */
-    #elif defined(__ARMCC_VERSION) && defined(__aarch64__)
+    #elif defined(__aarch64__)


Ah maybe you would actually prefer to add __aarch64__ to the __GNUC__ list because of the __uint128_t versus __attribute__((mode(TI)))? I could change that. Although I believe that both GCC and Clang have defined __uint128_t on 64-bit builds for ages, so I'm not sure if it would matter.

Patater

Thanks for your contribution. I have a few questions and comments.

Patater · 2018-08-20T09:25:01Z

include/mbedtls/bignum.h

@@ -135,9 +135,8 @@
            typedef unsigned int mbedtls_t_udbl __attribute__((mode(TI)));
            #define MBEDTLS_HAVE_UDBL
        #endif /* !MBEDTLS_NO_UDBL_DIVISION */
-    #elif defined(__ARMCC_VERSION) && defined(__aarch64__)


Why not add __aarch64__ in the #elif defined(__GNUC__) section with the other 64-bit architectures? GCC and ARMCC handle MBEDTLS_HAVE_UDBL differently, so __aarch64__ should be added to the GCC section. GCC shouldn't use the ARMCC section.

Thanks. Changed.

Patater · 2018-08-20T10:27:57Z

include/mbedtls/bn_mul.h

+        "str x5, [%1], #8   \n\t"
+
+#define MULADDC_STOP                            \
+         : "+r" (c),  "=r" (d), "=r" (s)        \


Do you need the "+r" (c) when you already have it listed in the input list? Wouldn't = be okay to use here?

Apparently yes. I first had = but then the tests failed. Maybe it would be fine to only have the read-write output register but to not list it separately as an input register. I'm not sure if there's a difference in semantics.

No difference that I'm aware of, although some old compilers might not support +. Verying between architectures, probably based on who wrote it, we've used = and repeated the register in both input and output lists, or used + and avoided repeating the register in the input list.

Patater · 2018-08-20T10:49:22Z

include/mbedtls/bn_mul.h

+#define MULADDC_STOP                            \
+         : "+r" (c),  "=r" (d), "=r" (s)        \
+         : "r" (s), "r" (d), "r" (c), "r" (b)   \
+         : "x4", "x5", "x6", "x7", "cc"         \


At which optimization levels does this inline assembly work properly at? The set of registers available for use when using -O0 should be enough, but wouldn't hurt to have checked.

This was with -O2. I now see that it fails with -O0. I'll investigate this.

I consulted the documentation of the extended asm syntax and pushed a fix. Now both debug and release builds pass the tests.

Patater

LGTM

Needs @yanesca review still

Ko- · 2018-09-03T08:31:59Z

@yanesca Any idea when you have time to review this?

yanesca · 2018-09-03T08:48:17Z

@Ko- I have a quite long review queue at the moment, so I can't tell when can I get to this PR, but I will review it as soon as I can!

yanesca

Looks good to me.

simonbutcher · 2019-04-02T16:07:53Z

retest

lnksz · 2019-07-17T07:12:10Z

This is a totally cool feature, why is it pending still?

RonEld · 2019-07-17T07:22:37Z

@lnksz Thank you for the reminder!
I have put this PR to our review backlog

Pass information via a key attribute structure rather than as separate parameters to psa_crypto_storage functions. This makes it easier to maintain the code when the metadata of a key evolves. This has negligible impact on code size (+4B with "gcc -Os" on x86_64).

Stored keys must contain lifetime information. The lifetime used to be implied by the location of the key, back when applications supplied the lifetime value when opening the key. Now that all keys' metadata are stored in a central location, this location needs to store the lifetime explicitly.

For a key in a secure element, persist the key slot. This is implemented in the nominal case. Failures may not be handled properly.

Store the persistent data of secure element drivers. This is fully implemented, but not at all tested.

Clarify the documented behavior and lifetime of key handles [v2]

Remove a redundant function call

Remove unused TG variable in mbedtls_mpi_gcd()

The crypto.c implementation of psa_inject_entropy() didn't match the declaration in crypto_extra.h. Use a const seed in both files.

Use the PSA_KEY_POLICY_INIT macro in the definition of PSA_CORE_KEY_ATTRIBUTES_INIT in order to avoid duplicating the key policy initializer.

A macro useful for initializing psa_key_id_t, whether MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER is set or not. Without this macro, it is necessary to know if MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER as with it the key ID is non-scalar and needs to be initialized with {0, 0}, and 0 otherwise when key ID is scalar.

Avoid compiler errors when MBEDTLS_PSA_CRYPTO_KEY_FILE_ID_ENCODES_OWNER is set by using the application ID type. [Error] psa_crypto_slot_management.c@175,9: used type 'psa_key_id_t' (aka 'psa_key_file_id_t') where arithmetic or pointer type is required

Patater · 2019-08-29T08:58:53Z

Please resolve conflicts and ensure testing passes, then we can merge.

…-legacy-psa-key-derivation Remove legacy psa key derivation

Make fixes related to using Mbed Crypto as a service

Everest integration

According to SP800-90A, the DRBG seeding process should use a nonce of length `security_strength / 2` bits as part of the DRBG seed. It further notes that this nonce may be drawn from the same source of entropy that is used for the first `security_strength` bits of the DRBG seed. The present HMAC DRBG implementation does that, requesting `security_strength * 3 / 2` bits of entropy from the configured entropy source in total to form the initial part of the DRBG seed. However, some entropy sources may have thresholds in terms of how much entropy they can provide in a single call to their entropy gathering function which may be exceeded by the present HMAC DRBG implementation even if the threshold is not smaller than `security_strength` bits. Specifically, this is the case for our own entropy module implementation which only allows requesting at most 32 Bytes of entropy at a time in configurations disabling SHA-512, and this leads to runtime failure of HMAC DRBG when used with Mbed Crypto' own entropy callbacks in such configurations. This commit fixes this by splitting the seed entropy acquisition into two calls, one requesting `security_strength` bits first, and another one requesting `security_strength / 2` bits for the nonce. Fixes Mbed-TLS#237.

…mpatibility_fix-crypto HMAC DRBG: Split entropy-gathering requests to reduce request sizes

GCC and Clang do not define __ARMCC_VERSION when building for Aarch64. Yet they should also use 64-bit limbs for Aarch64 builds.

x0-x3 are skipped such that function parameters to not have to be moved. MULADDC_INIT and MULADDC_STOP are mostly empty because it is more efficient to keep everything in registers (and that should easily be possible). I considered a MULADDC_HUIT implementation, but could not think of something that would be more efficient than basically 8 consecutive MULADDC_CORE. You could combine the loads and stores, but it's probably more efficient to interleave them with arithmetic, depending on the specific microarchitecture. NEON allows to do a 64x64->128 bit multiplication (and optional accumulation) in one instruction, but is not great at handling carries.

…Aarch64 builds.

Ko- · 2019-09-02T11:59:02Z

This PR should now be made to mbed-crypto instead. I'll create a new one there.

RonEld added enhancement CLA requested component-crypto Crypto primitives and low-level interfaces labels Aug 20, 2018

RonEld added CLA valid and removed CLA requested labels Aug 20, 2018

RonEld requested review from Patater and yanesca August 20, 2018 07:33

Ko- commented Aug 20, 2018

View reviewed changes

Patater suggested changes Aug 20, 2018

View reviewed changes

RonEld mentioned this pull request Aug 20, 2018

compiling Mbed TLS for cortex-m0 with optimization and ARMC6 compiler results in error #1077

Closed

Patater previously approved these changes Aug 21, 2018

View reviewed changes

yanesca previously approved these changes Mar 8, 2019

View reviewed changes

simonbutcher added the needs-ci Needs to pass CI tests label Apr 2, 2019

gilles-peskine-arm added 5 commits July 23, 2019 11:29

Make whitespace consistent

274a263

SE keys: implement persistent storage

1df83d4

For a key in a secure element, persist the key slot. This is implemented in the nominal case. Failures may not be handled properly.

SE drivers: implement persistent storage

8b96cad

Store the persistent data of secure element drivers. This is fully implemented, but not at all tested.

athoelke and others added 11 commits August 22, 2019 15:52

Cross reference 'key handles' from INVALID_HANDLE

3c2b803

Editorial fixes.

8824dae

Merge pull request Mbed-TLS#235 from athoelke/at-key-handles

b14a48b

Clarify the documented behavior and lifetime of key handles [v2]

Merge pull request Mbed-TLS#225 from RonEld/iotssl_2739

85c78b4

Remove a redundant function call

Merge pull request Mbed-TLS#224 from tempesta-tech/development

932e496

Remove unused TG variable in mbedtls_mpi_gcd()

getting_started: Update for PSA Crypto API 1.0b3

884738a

crypto_extra: Use const seed for entropy injection

c7529c9

The crypto.c implementation of psa_inject_entropy() didn't match the declaration in crypto_extra.h. Use a const seed in both files.

psa: Don't duplicate policy initializer

f89cc69

Use the PSA_KEY_POLICY_INIT macro in the definition of PSA_CORE_KEY_ATTRIBUTES_INIT in order to avoid duplicating the key policy initializer.

psa: Adapt set_key_id() for when owner is included

e3cdf28

Patater and others added 9 commits August 29, 2019 11:31

Merge pull request Mbed-TLS#229 from k-stachowiak/IOTCRYPT-791-remove…

21db2a9

…-legacy-psa-key-derivation Remove legacy psa key derivation

Merge pull request Mbed-TLS#232 from Patater/psa-crypto-api-1.0b3

98d5685

Make fixes related to using Mbed Crypto as a service

Merge pull request Mbed-TLS#140 from yanesca/everest_integration

f071654

Everest integration

Merge pull request Mbed-TLS#238 from hanno-arm/hmac_drbg_entropy_inco…

822b8f6

…mpatibility_fix-crypto HMAC DRBG: Split entropy-gathering requests to reduce request sizes

Enable 64-bit limbs for all Aarch64 builds.

03d2daf

GCC and Clang do not define __ARMCC_VERSION when building for Aarch64. Yet they should also use 64-bit limbs for Aarch64 builds.

Make GNUC-compatible compilers use the right mbedtls_t_udbl again on …

05cff95

…Aarch64 builds.

Fix -O0 build for Aarch64 bignum multiplication.

cb260bb

Ko- dismissed stale reviews from yanesca and Patater via cb260bb September 2, 2019 11:50

Ko- force-pushed the development branch from 373060d to cb260bb Compare September 2, 2019 11:50

Ko- mentioned this pull request Sep 2, 2019

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 ARMmbed/mbed-crypto#240

Merged

Ko- closed this Sep 2, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Ko- commented Aug 19, 2018

RonEld commented Aug 20, 2018

Ko- commented Aug 20, 2018

RonEld commented Aug 20, 2018

RonEld commented Aug 20, 2018

Ko- commented Aug 20, 2018 •

edited

Loading

Ko- Aug 20, 2018

Patater left a comment

Patater Aug 20, 2018

Ko- Aug 20, 2018 •

edited

Loading

Patater Aug 20, 2018

Ko- Aug 20, 2018

Patater Aug 20, 2018

Patater Aug 20, 2018

Ko- Aug 20, 2018 •

edited

Loading

Ko- Aug 20, 2018

Patater left a comment

Ko- commented Sep 3, 2018

yanesca commented Sep 3, 2018

yanesca left a comment

simonbutcher commented Apr 2, 2019

lnksz commented Jul 17, 2019

RonEld commented Jul 17, 2019

Patater commented Aug 29, 2019

Ko- commented Sep 2, 2019

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Enable 64-bit bignum limbs and add optimized multiplication for Aarch64 #1964

Conversation

Ko- commented Aug 19, 2018

RonEld commented Aug 20, 2018

Ko- commented Aug 20, 2018

RonEld commented Aug 20, 2018

RonEld commented Aug 20, 2018

Ko- commented Aug 20, 2018 • edited Loading

Ko- Aug 20, 2018

Choose a reason for hiding this comment

Patater left a comment

Choose a reason for hiding this comment

Patater Aug 20, 2018

Choose a reason for hiding this comment

Ko- Aug 20, 2018 • edited Loading

Choose a reason for hiding this comment

Patater Aug 20, 2018

Choose a reason for hiding this comment

Ko- Aug 20, 2018

Choose a reason for hiding this comment

Patater Aug 20, 2018

Choose a reason for hiding this comment

Patater Aug 20, 2018

Choose a reason for hiding this comment

Ko- Aug 20, 2018 • edited Loading

Choose a reason for hiding this comment

Ko- Aug 20, 2018

Choose a reason for hiding this comment

Patater left a comment

Choose a reason for hiding this comment

Ko- commented Sep 3, 2018

yanesca commented Sep 3, 2018

yanesca left a comment

Choose a reason for hiding this comment

simonbutcher commented Apr 2, 2019

lnksz commented Jul 17, 2019

RonEld commented Jul 17, 2019

Patater commented Aug 29, 2019

Ko- commented Sep 2, 2019

Ko- commented Aug 20, 2018 •

edited

Loading

Ko- Aug 20, 2018 •

edited

Loading

Ko- Aug 20, 2018 •

edited

Loading