Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.6] Reduce performance regression in RSA public operations #9281

Merged
merged 21 commits into from
Aug 22, 2024

Conversation

mpg
Copy link
Contributor

@mpg mpg commented Jun 18, 2024

Description

Fix #9232 - partially, the low-hanging fruits (also those that don't make the code size go up by too much).

Status: work in progress.

TODO:

  • measure code size
  • measure perf on other platforms
  • currently the window size is based on the limbsize of E, should it use the actual bitsize instead when E is public?
  • decide on API
  • propagate to RSA functions

PR checklist

  • changelog provided
  • development TODO
  • 3.6 backport this is it
  • 2.28 backport not required (we are fixing a regression in 3.6)
  • tests partly provided, will be continued in [3.6] Rsapub additional tests #9493

Attempt to partially solve the performance regression in 3.6.0 without
adding too much code size.

Signed-off-by: Manuel Pégourié-Gonnard <[email protected]>
@mpg mpg self-assigned this Jun 18, 2024
@mpg mpg changed the base branch from development to mbedtls-3.6 June 18, 2024 11:03
@mpg
Copy link
Contributor Author

mpg commented Jun 18, 2024

Initial performance measurements on my laptop (64-bit Intel):

3.6.0

1024/1024-bit exp_mod, cold r: 0.961081 ms
1024/1024-bit exp_mod,  hot r: 1.050500 ms
  17/1024-bit exp_mod, cold r: 0.100309 ms
  17/1024-bit exp_mod,  hot r: 0.177438 ms

This PR

1024/1024-bit exp_mod, cold r: 1.070629 ms
1024/1024-bit exp_mod,  hot r: 1.062271 ms
  17/1024-bit exp_mod, cold r: 0.034083 ms
  17/1024-bit exp_mod,  hot r: 0.027739 ms

3.5.1

1024/1024-bit exp_mod, cold r: 1.071714 ms
1024/1024-bit exp_mod,  hot r: 1.083624 ms
  17/1024-bit exp_mod, cold r: 0.022744 ms
  17/1024-bit exp_mod,  hot r: 0.016366 ms

Edit: code used for benchmarking

Copy link
Contributor

@yanesca yanesca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me, this should eliminate all the overhead. One opinion about the API: maybe it would be worth to keep the _optionally_safe() function static and have an _unsafe() version in the header instead.

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok (with a few minor remarks) if we really have to have mixed-leakiness functions, but do we really have to do that? I find it harder to follow the code and reason about the function, and that's with a single function, I fear it won't scale to more complex algorithms.

* }
* not the other way round, in order to prevent misuse. (This is, if a value
* other than the two below is passed, default to the safe path.) */
#define MBEDTLS_MPI_IS_PUBLIC 0x2a2a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some constants can be loaded with fewer instructions than others. We should pick a constant that's short to load on Thumb. I think 0x2a2a can be improved on. This page has some guidance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0x2a2a is a 16 bit value and can be loaded with a movw. That supposed to be a single cycle, how can we improve on that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanesca That's new in v7. And on second thoughts that blog post only covers 32-bit Arm instructions, but when thinking of code size, we care about Thumb instructions. So this is more relevant:

In Thumb instructions, constant can be:

  • any constant that can be produced by shifting an 8-bit value left by any number of bits within a 32-bit word
  • any constant of the form 0x00XY00XY
  • any constant of the form 0xXY00XY00
  • any constant of the form 0xXYXYXYXY.

That covers 0x002a002a and 0x2a2a2a2a and 0x0000002a and 0x00002a00 and more but not 0x00002a2a.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm that for Thumb2 the comparisons are encoded as single instructions, e.g.

  16:   f1b1 3f2a       cmp.w   r1, #707406378  ; 0x2a2a2a2a

* \return Another negative error code on different kinds of failures.
*
*/
int mbedtls_mpi_exp_mod_optionally_safe(mbedtls_mpi *X, const mbedtls_mpi *A,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to have a single function for both cases, I would prefer a more specific naming convention than optionally_safe. I don't like the use of “safe”, which could refer to many different things. I'd prefer leaky or “public”.

* It is up to the caller to zeroize \p T when it is no
* longer needed, and before freeing it if it was dynamically
* allocated.
* \param[in] E_public Set to MBEDTLS_MPI_IS_PUBLIC to gain some performance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the complexity of having functions whose security properties depend on a runtime argument. Neither P256-m nor BearSSL do that. I would prefer to have separate code paths for leaky and non-leaky, converging only on functions that have a single implementation either way. The only argument I can think of to merge the two is code size. Is the gain really worth it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not happy about it either, but I think now that we are turning on USE_PSA_CRYPTO by default, we will be pretty desperate for code size savings for a while. I think the gain can worth it and we should try.

@mpg
Copy link
Contributor Author

mpg commented Jul 22, 2024

Ok (with a few minor remarks) if we really have to have mixed-leakiness functions, but do we really have to do that?

No, I don't think we absolutely have to do that, and that's one of the main points about which I was hoping to get feedback. I just has to make a temporary choice for prototyping.

One opinion about the API: maybe it would be worth to keep the _optionally_safe() function static and have an _unsafe() version in the header instead.

Yes, I think that would already be better.

I don't like the complexity of having functions whose security properties depend on a runtime argument.

Agreed. Do you think it's acceptable for static functions though or would you rather avoid it entirely?

The only argument I can think of to merge the two is code size. Is the gain really worth it?

That, and also there can be a maintenance cost if we end up duplicating source code. I can give it a try to do some size measurements and check how easy or awkward it is to avoid duplication.

@gilles-peskine-arm
Copy link
Contributor

I don't like the complexity of having functions whose security properties depend on a runtime argument.

Agreed. Do you think it's acceptable for static functions though or would you rather avoid it entirely?

I'd prefer to avoid it, but for smaller functions with a limited scope, it's not so bad.

@jforissier
Copy link

With this PR (and the necessary changes to make use of the E_public parameter), the OP-TEE test case runs faster although still much slower than with v3.5.2. This is because the test doesn't only run RSA signature verification and decryption, it also includes RSA signing and encryption. Assuming OP-TEE switches to the more secure algorithm for secret stuff, we may need to disable this test in the CI and run it only during the release tests.
Here are the numbers (running time xtest 4011 on the arm32 QEMU target):

@mpg
Copy link
Contributor Author

mpg commented Aug 6, 2024

Just checking: @gilles-peskine-arm @yanesca will you be volunteering to review this once I've updated it, or should I look for another reviewer? (I'd appreciate if at least one of you could give at least a design review.)

@gilles-peskine-arm
Copy link
Contributor

I intend to review if I'm around (I'll take about a week off around 15 August).

@yanesca
Copy link
Contributor

yanesca commented Aug 6, 2024

I am happy to review it as well, but I will be taking some days off too (this Friday and next Wednesday).

@mpg mpg added priority-very-high Highest priority - prioritise this over other review work bug needs-work labels Aug 9, 2024
@mpg mpg marked this pull request as ready for review August 9, 2024 10:28
@mpg mpg linked an issue Aug 9, 2024 that may be closed by this pull request
The complexity of having functions whose security properties depend on a
runtime argument can be dangerous. Limit misuse by making any such
functions local.

Signed-off-by: Janos Follath <[email protected]>
The complexity of having functions whose security properties depend on a
runtime argument can be dangerous. Limit risk by isolating such code in
small functions with limited scope.

Signed-off-by: Janos Follath <[email protected]>
These macros are not part of any public or internal API, ideally they
would be defined in the source files. The reason to put them in
bignum_core.h to avoid duplication as macros for this purpose are
needed in both bignum.c and bignum_core.c.

Signed-off-by: Janos Follath <[email protected]>
In Thumb instructions, constant can be:

- any constant that can be produced by shifting an 8-bit value left by any
  number of bits within a 32-bit word
- any constant of the form 0x00XY00XY
- any constant of the form 0xXY00XY00
- any constant of the form 0xXYXYXYXY.

Signed-off-by: Janos Follath <[email protected]>
@yanesca yanesca self-assigned this Aug 12, 2024
@yanesca
Copy link
Contributor

yanesca commented Aug 12, 2024

There is still work needed, just pushing what I have so far to get an early CI overnight.

It is easier to read if the parameter controlling constant timeness with
respect to a parameter is next to that parameter.

Signed-off-by: Janos Follath <[email protected]>
The allocated size can be significantly larger than the actual size. In
the unsafe case we can use the actual size and gain some performance.

Signed-off-by: Janos Follath <[email protected]>
The new test hooks allow to check whether there was an unsafe call of an
optionally safe function in the codepath. For the sake of simplicity the
MBEDTLS_MPI_IS_* macros are reused for signalling safe/unsafe codepaths
here too.

Signed-off-by: Janos Follath <[email protected]>
Signed-off-by: Janos Follath <[email protected]>
@yanesca
Copy link
Contributor

yanesca commented Aug 22, 2024

Reverted to 878af12.

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve the library changes for the release freeze today, after fixing the CI (plus the changelog entry while you're at it). I haven't reviewed the tests, but I'm satisfied that they're non-harmful.

This is conditional on doing the following as soon as possible, in any case before the end of the quarter:

yanesca and others added 2 commits August 22, 2024 14:49
Co-authored-by: Gilles Peskine <[email protected]>
Signed-off-by: Janos Follath <[email protected]>
To silence no previous prototype warnings. And this is the proper way to
do it anyway.

Signed-off-by: Janos Follath <[email protected]>
Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve the library changes at 5f31697 for the release freeze today. I haven't reviewed the tests, but I'm satisfied that they're non-harmful.

This is conditional on doing the following as soon as possible, in any case before the end of the quarter:

Signed-off-by: Janos Follath <[email protected]>
Copy link
Contributor

@tom-cosgrove-arm tom-cosgrove-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@gilles-peskine-arm gilles-peskine-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the reservations in #9281 (review)

@gilles-peskine-arm
Copy link
Contributor

I've pressed the merge button so that we can get on with the release. We need a forward-port to development, but that doesn't go through the release process so it can wait a few days.

@gilles-peskine-arm gilles-peskine-arm added approved Design and code approved - may be waiting for CI or backports needs-backports Backports are missing or are pending review and approval. and removed needs-review Every commit must be reviewed by at least two team members, needs-ci Needs to pass CI tests labels Aug 22, 2024
@gilles-peskine-arm gilles-peskine-arm added this pull request to the merge queue Aug 22, 2024
Merged via the queue into Mbed-TLS:mbedtls-3.6 with commit df0ef8a Aug 22, 2024
4 of 6 checks passed
@mpg mpg mentioned this pull request Sep 4, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Design and code approved - may be waiting for CI or backports bug needs-backports Backports are missing or are pending review and approval. priority-very-high Highest priority - prioritise this over other review work
Development

Successfully merging this pull request may close these issues.

Performance regression in mbedtls_mpi_exp_mod() (v3.6.0)
6 participants