Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make size of output in mpi_mul_hlp() explicit #5701

Merged
merged 26 commits into from
Apr 15, 2022

Conversation

hanno-becker
Copy link

@hanno-becker hanno-becker commented Apr 6, 2022

Context: The helper mpi_mul_hlp() performs a multiply-accumulate operation d += s * b, where d,b are MPIs represented as plain integer arrays, and b is a scalar.

Previously, only the length of s was specified, while d was assumed to be 0-terminated of unspecified length.

This was leveraged at the end of the core multiplication steps computing the first limb(s) limbs of d + s*b: Namely, the routine would keep on adding the current carry to d until none was left. This can, in theory, run for an arbitrarily long time if d has a tail of 0xFFs, and hence the assumption of 0-termination.

This solution is both fragile and insecure -- the latter because the carry-loop depends on the result of the multiplication.

Changes: This PR ...

  • ... changes the signature of mpi_mul_hlp() to receive the length of the output buffer, which must be greater or equal to the length of the input buffer.
  • ... renames the function to mbedtls_mpi_core_mla(), exposed in bignum_internal.h for internal usage
  • ... adjusts the p25519 modular reduction to use mbedtls_mpi_core_mla() -- this was not originally intended to be done here, but turned out to be necessary since said reduction routine called mbedtls_mpi_mul_int() with a stack-allocated MPI which caused difficulties with re-growing.

Regarding the new signature of mpi_mul_hlp(), note that it is not assumed that the output buffer is strictly larger than the input buffer -- instead, the routine will simply return any carry that's left. This will be useful in some applications of this function. It is the responsibility of the caller to either size the output appropriately so that no carry will be left, or to handle the carry.

Note: This PR tries to be as minimal as possible. There are a number of instances where one is tempted to try to simplify call-sites of mpi_mul_hlp() further, such as mbedtls_mpi_mul_int() or the conditional subtraction after mpi_montmul(), but this is deliberately not tackled in this PR.

Hanno Becker added 5 commits April 6, 2022 06:11
The helper `mpi_mul_hlp()` performs a multiply-accumulate
operation `d += s * b`, where `d,b` are MPIs and `b` is a scalar.

Previously, only the length of `s` was specified, while `d` was
assumed to be 0-terminated of unspecified length.

This was leveraged at the end of the core multiplication steps
computingg the first `limb(s)` limbs of `d + s*b`: Namely, the
routine would keep on adding the current carry to `d` until none
was left. This can, in theory, run for an arbitrarily long time
if `d` has a tail of `0xFF`s, and hence the assumption of
`0`-termination.

This solution is both fragile and insecure -- the latter because
the carry-loop depends on the result of the multiplication.

This commit changes the signature of `mpi_mul_hlp()` to receive
the length of the output buffer, which must be greater or equal
to the length of the input buffer.

It is _not_ assumed that the output buffer is strictly larger
than the input buffer -- instead, the routine will simply return
any carry that's left. This will be useful in some applications
of this function. It is the responsibility of the caller to either
size the output appropriately so that no carry will be left, or
to handle the carry.

NOTE: The commit leaves the library in a state where it cannot
      be compiled since the call-sites of mpi_mul_hlp() have
      not yet been adjusted. This will be done in the subsequent
      commits.

Signed-off-by: Hanno Becker <[email protected]>
The previous commit has changed the signature of mpi_mul_hlp(),
making the length of the output explicit.

This commit adjusts the call-site in mbedtls_mpi_mul_mpi() to
this new signature.

A notable change to the multiplication strategy had to be made:
mbedtls_mpi_mul_mpi() performs a simple row-wise schoolbook
multiplication, which however was so far computed iterating
rows from top to bottom. This leads to the undesirable consequence
that as lower rows are calculated and added to the temporary
result, carry chains can grow. It is simpler and faster to
iterate from bottom to top instead, as it is guaranteed that
there will be no carry when adding the next row to the previous
temporary result: The length of the output in each iteration
can be fixed to len(B)+1.

Signed-off-by: Hanno Becker <[email protected]>
A previous commit has changed the signature of mpi_mul_hlp(), making
the length of the output explicit.

This commit adjusts mbedtls_mpi_mul_int() to this change.

Along the way, we make the code simpler and more secure by not calculating
the minimal limb-size of A. A previous comment indicated that this was
functionally necessary because of the implementation of mpi_mul_hlp() --
if it ever was, it isn't anymore.

Signed-off-by: Hanno Becker <[email protected]>
A previous commit has changed the signature of mpi_mul_hlp, making the length
of the output explicit. This commit adjusts mpi_montmul() accordingly.

It also fixes a comment on the required size of the temporary value
passed to mpi_montmul() (but does not change the call-sites).

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker hanno-becker added component-crypto Crypto primitives and low-level interfaces needs-ci Needs to pass CI tests needs-reviewer This PR needs someone to pick it up for review labels Apr 6, 2022
(void) mpi_mul_hlp( d, n + 2,
N->p, n,
u1 );
d++;
}

/* At this point, d is either the desired result or the desired result
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conditional subtraction can be simplified here to

    borrow = mpi_sub_hlp( n, d, d, N->p );
    borrow = ( d[n] < borrow );
    (void) mpi_mul_hlp( d, n, N->p, n, borrow );
    memcpy( A->p, d, n * ciL );

which would be an example of a call to mpi_mul_hlp() where you expect and process the carry.

(@yanesca @mpg we touched on this approach in the chat, so FYI).

Hanno Becker added 2 commits April 11, 2022 07:03
This paves the way for the helper to be used from the ECP module

Signed-off-by: Hanno Becker <[email protected]>
@hanno-becker
Copy link
Author

This PR broke the CI because it calls mbedtls_mpi_grow() in mbedtls_mpi_mul_int() which is called from within the p25519 modular reduction with a stack allocated MPI... a good example for why it will be useful to have a more transparent and leaner API for bignum internally.

To resolve, I removed the use of mbedtls_mpi_mul_int() from the x25519 modular reduction routine, using mpi_mul_hlp() directly instead. @mpg Can you please have a look? I changed the reduction routine slightly to avoid shifting, wrapping at 2^256 == 38 instead of 2^255 == 19.

(In this case, there's nothing to do anyway since we only do a
quasi-reduction to N+1 limbs)

Signed-off-by: Hanno Becker <[email protected]>
library/bignum.c Outdated
mbedtls_mpi_uint c = 0; /* carry */

/* Remember the excess of d over s for later */
d_len -= s_len;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use a separate variable for this, and name it more appropriately?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

library/bignum.c Outdated
* \param i The number of limbs of \p s.
* \param[in,out] d The bignum to add to.
* \param d_len The number of limbs of \p d. This must be
* at least \p s_len.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpg does Mbed TLS have a mechanism for assert()ing requirements like this? or is it considered not worth it, even from a documentation-in-code perspective?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From all I know, Mbed TLS does not yet have such a mechanism, though it would arguably be useful for debug builds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we don't assert(), for several reasons. One reason is that the code must be able to run in environments that don't have assert(), and we don't want to maintain yet another compilation option HAVE_ASSERT. Another is that we want to zeroize confidential data even if (especially if!) something goes wrong. So if a requirement is violated, the only recourse is to return an error status. But that is not possible in low-level bignum functions that return void.

So far our general approach in bignum is to have public functions that check requirements and return error codes (also those functions may resize objects), and low-level functions that don't check requirements and return void (and those functions may not resize objects).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason not to use assert is that it's typically disabled in production, which is where it's the most useful. If you want a debug-only assert, you can have extra code under MBEDTLS_TEST_HOOKS that should not change the execution of the function, but can mark the test as failed.

library/bignum.c Outdated
const mbedtls_mpi_uint *s,
mbedtls_mpi_uint *d,
mbedtls_mpi_uint b )
mbedtls_mpi_uint mbedtls_mpi_core_mla( mbedtls_mpi_uint *d, size_t d_len ,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: space before comma

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

if( N->n < P255_WIDTH )
/* Helper references for top part of N */
mbedtls_mpi_uint * const NT_p = N->p + P255_WIDTH;
size_t const NT_n = N->n - P255_WIDTH;
Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivial: if you're ever editing this file again, its way more common to put const first (this line and the one above)

Copy link
Author

@hanno-becker hanno-becker Apr 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in the line above, moving const to the front would change the semantics to a non-const pointer to constant memory -- to avoid this distinction, I usually put the const last.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were starting writing C in a vacuum, I'd agree that placing the const right next to the variable name is more consistent as it always means the same thing here - the variable is constant, regardless of whether it's a pointer or not. However, I agree with Tom that in the existing body of C already written, const size_t foo seems way more common. So, there's a conflict here, unfortunately, between your sensible argument, and existing usage.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, I'll change it.

The variable is a local variable for the i386 bignum assembly only;
introduce it as part of the start/finish macros.

It can be noted that the variable is initialize to 0 within MULADDC_INIT,
so there are no data dependencies across blocks of MULADDC_INIT/CORE/STOP.

Signed-off-by: Hanno Becker <[email protected]>
Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Releasing comments from a first pass. I'll make a second pass tomorrow as I'm running out of time for today, and this is non-trivial enough to deserve more attention.

@mpg mpg self-requested a review April 11, 2022 11:02
@mpg mpg removed the needs-reviewer This PR needs someone to pick it up for review label Apr 11, 2022
Hanno Becker added 2 commits April 12, 2022 11:02
The check was meant to precisely catch an underflow.

Signed-off-by: Hanno Becker <[email protected]>
Hanno Becker added 2 commits April 12, 2022 11:10
@mpg
Copy link
Contributor

mpg commented Apr 14, 2022

Ok, I tried running ecc-heap.sh (as extended by #5381) and bad news, it looks like total heap usage and number of allocations are going up after this PR, especially the number of allocations for Brainpool curves (that don't have a dedicated quasi-reduction routine), unless I messed up something in my measurements.

Before: (branch here)

  ECDSA-secp521r1          :    1184 sign/s             2776 heap bytes,    800 allocs
  ECDSA-brainpoolP512r1    :     156 sign/s             2664 heap bytes,  13550 allocs
  ECDSA-secp384r1          :    1668 sign/s             2120 heap bytes,    637 allocs
  ECDSA-brainpoolP384r1    :     394 sign/s             2168 heap bytes,  10141 allocs
  ECDSA-secp256r1          :    2906 sign/s             1560 heap bytes,    540 allocs
  ECDSA-secp256k1          :    2711 sign/s             1616 heap bytes,   2430 allocs
  ECDSA-brainpoolP256r1    :     715 sign/s             1576 heap bytes,   8281 allocs
  ECDSA-secp521r1          :     321 verify/s           8136 heap bytes,   2499 allocs
  ECDSA-brainpoolP512r1    :      40 verify/s           7648 heap bytes,  61774 allocs
  ECDSA-secp384r1          :     464 verify/s           6656 heap bytes,   2004 allocs
  ECDSA-brainpoolP384r1    :      73 verify/s           6264 heap bytes,  46585 allocs
  ECDSA-secp256r1          :     795 verify/s           5128 heap bytes,   1550 allocs
  ECDSA-secp256k1          :     751 verify/s           4744 heap bytes,   8687 allocs
  ECDSA-brainpoolP256r1    :     176 verify/s           5064 heap bytes,  32952 allocs
  ECDHE-secp521r1          :     170 full handshake/s   8736 heap bytes,   4939 allocs
  ECDHE-brainpoolP512r1    :      20 full handshake/s   8200 heap bytes, 123705 allocs
  ECDHE-secp384r1          :     257 full handshake/s   7264 heap bytes,   3937 allocs
  ECDHE-brainpoolP384r1    :      43 full handshake/s   6696 heap bytes,  93332 allocs
  ECDHE-secp256r1          :     449 full handshake/s   5792 heap bytes,   3036 allocs
  ECDHE-secp256k1          :     425 full handshake/s   5256 heap bytes,  17344 allocs
  ECDHE-brainpoolP256r1    :      93 full handshake/s   5280 heap bytes,  66053 allocs
  ECDHE-x25519             :     307 full handshake/s   2608 heap bytes,  12454 allocs
  ECDHE-x448               :     134 full handshake/s   3768 heap bytes,  21722 allocs

After: (branch here)

  ECDSA-secp521r1          :     895 sign/s             4456 heap bytes,    990 allocs
  ECDSA-brainpoolP512r1    :     104 sign/s             4136 heap bytes,  40909 allocs
  ECDSA-secp384r1          :    1365 sign/s             3408 heap bytes,    792 allocs
  ECDSA-brainpoolP384r1    :     272 sign/s             3272 heap bytes,  18772 allocs
  ECDSA-secp256r1          :    2022 sign/s             2440 heap bytes,    638 allocs
  ECDSA-secp256k1          :    2115 sign/s             2504 heap bytes,   2520 allocs
  ECDSA-brainpoolP256r1    :     472 sign/s             2592 heap bytes,  13415 allocs
  ECDSA-secp521r1          :     240 verify/s           8640 heap bytes,   2851 allocs
  ECDSA-brainpoolP512r1    :      25 verify/s           8192 heap bytes, 188211 allocs
  ECDSA-secp384r1          :     386 verify/s           6936 heap bytes,   2266 allocs
  ECDSA-brainpoolP384r1    :      66 verify/s           6624 heap bytes,  86864 allocs
  ECDSA-secp256r1          :     631 verify/s           5536 heap bytes,   1731 allocs
  ECDSA-secp256k1          :     643 verify/s           5184 heap bytes,   8839 allocs
  ECDSA-brainpoolP256r1    :     137 verify/s           5168 heap bytes,  53818 allocs
  ECDHE-secp521r1          :     128 full handshake/s   9792 heap bytes,   5574 allocs
  ECDHE-brainpoolP512r1    :      13 full handshake/s   9376 heap bytes, 377346 allocs
  ECDHE-secp384r1          :     211 full handshake/s   8048 heap bytes,   4404 allocs
  ECDHE-brainpoolP384r1    :      32 full handshake/s   7664 heap bytes, 174067 allocs
  ECDHE-secp256r1          :     343 full handshake/s   6360 heap bytes,   3356 allocs
  ECDHE-secp256k1          :     347 full handshake/s   5912 heap bytes,  17601 allocs
  ECDHE-brainpoolP256r1    :      71 full handshake/s   6072 heap bytes, 107908 allocs
  ECDHE-x25519             :     344 full handshake/s   2608 heap bytes,  12448 allocs
  ECDHE-x448               :     136 full handshake/s   3760 heap bytes,  21720 allocs

Removing the trimming has significant memory impact. While it is clearly what
we want to do eventually for constant-time'ness, it should be fixed alongside
a strategy to contain the ramifications on memory usage.

Signed-off-by: Hanno Becker <[email protected]>
Copy link
Contributor

@mpg mpg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mpg
Copy link
Contributor

mpg commented Apr 14, 2022

Updated results:

  ECDSA-secp521r1          :    1213 sign/s             2776 heap bytes,    800 allocs
  ECDSA-brainpoolP512r1    :     195 sign/s             2664 heap bytes,  13555 allocs
  ECDSA-secp384r1          :    1821 sign/s             2120 heap bytes,    633 allocs
  ECDSA-brainpoolP384r1    :     414 sign/s             2168 heap bytes,  10141 allocs
  ECDSA-secp256r1          :    2778 sign/s             1560 heap bytes,    539 allocs
  ECDSA-secp256k1          :    2705 sign/s             1616 heap bytes,   2432 allocs
  ECDSA-brainpoolP256r1    :     733 sign/s             1576 heap bytes,   8281 allocs
  ECDSA-secp521r1          :     293 verify/s           8136 heap bytes,   2498 allocs
  ECDSA-brainpoolP512r1    :      33 verify/s           7640 heap bytes,  61773 allocs
  ECDSA-secp384r1          :     388 verify/s           6688 heap bytes,   2000 allocs
  ECDSA-brainpoolP384r1    :      87 verify/s           6392 heap bytes,  46587 allocs
  ECDSA-secp256r1          :     797 verify/s           5232 heap bytes,   1552 allocs
  ECDSA-secp256k1          :     767 verify/s           4816 heap bytes,   8690 allocs
  ECDSA-brainpoolP256r1    :     183 verify/s           5024 heap bytes,  32951 allocs
  ECDHE-secp521r1          :     153 full handshake/s   8736 heap bytes,   4938 allocs
  ECDHE-brainpoolP512r1    :      21 full handshake/s   8200 heap bytes, 123698 allocs
  ECDHE-secp384r1          :     245 full handshake/s   7264 heap bytes,   3937 allocs
  ECDHE-brainpoolP384r1    :      43 full handshake/s   6696 heap bytes,  93326 allocs
  ECDHE-secp256r1          :     431 full handshake/s   5792 heap bytes,   3038 allocs
  ECDHE-secp256k1          :     406 full handshake/s   5256 heap bytes,  17341 allocs
  ECDHE-brainpoolP256r1    :      90 full handshake/s   5280 heap bytes,  66055 allocs
  ECDHE-x25519             :     307 full handshake/s   2600 heap bytes,  12451 allocs
  ECDHE-x448               :     129 full handshake/s   3768 heap bytes,  21716 allocs

This fixed the problem!

@mpg mpg added approved Design and code approved - may be waiting for CI or backports and removed needs-review Every commit must be reviewed by at least two team members, needs-ci Needs to pass CI tests labels Apr 15, 2022
@mpg mpg merged commit 63ed7cb into Mbed-TLS:development Apr 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports component-crypto Crypto primitives and low-level interfaces
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants