Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review inline assembly for missing constraints and fix them #4943

Open
gilles-peskine-arm opened this issue Sep 14, 2021 · 13 comments
Open

Review inline assembly for missing constraints and fix them #4943

gilles-peskine-arm opened this issue Sep 14, 2021 · 13 comments
Labels
bug component-crypto Crypto primitives and low-level interfaces help-wanted This issue is not being actively worked on, but PRs welcome. size-m Estimated task size: medium (~1w)

Comments

@gilles-peskine-arm
Copy link
Contributor

gilles-peskine-arm commented Sep 14, 2021

Some of our inline assembly code is missing annotations telling the compiler what the assembly reads and writes (register, memory). On the platforms we test with, there are generally enough annotations to get correct code, but we're at the mercy of a new compiler version that optimizes better. This happened on x86_64 with Clang 12 and the bignum multiplication code (#4116, #4786, #4917), for example. 52b845b may have been a similar issue but I haven't fully investigated.

The x86 code in bn_mul.h is similar to the x86_64 and has the same issue. I haven't looked at others.

Help from the community would be appreciated, especially for platforms other than x86 and arm.

@gilles-peskine-arm gilles-peskine-arm added bug help-wanted This issue is not being actively worked on, but PRs welcome. component-crypto Crypto primitives and low-level interfaces Product Backlog size-m Estimated task size: medium (~1w) labels Sep 14, 2021
@gilles-peskine-arm
Copy link
Contributor Author

The C implementation in bn_mul.h is fairly simple. I wonder if the assembly code is really better than what compilers can do. Maybe we can remove the assembly code and stick with C? To be benchmarked.

@davidhorstmann-arm
Copy link
Contributor

Do we know what the origins of these assembly snippets? Were they community contributions? Did the authors do benchmarking? Were they hand-written from scratch, or adapted from compiled C code?

@gilles-peskine-arm
Copy link
Contributor Author

Do we know what the origins of these assembly snippets?

We know what's in the git log. For things that are older than the git history (which includes most of bn_mul.h), we don't know anymore.

@mpg
Copy link
Contributor

mpg commented Sep 16, 2021

I also suggested some time ago that we get rid of the asm code and stick with C as a first step, and then re-introduce asm for a particular platform only if the following conditions are met:

  1. We know there's interest for running Mbed TLS on that platform. (For example, to do still care about non-SSE2 i386 in 2021? I don't know, and I'd like to know.)
  2. The code can be tested, and not just at submission time (ideally as part of our CI, but if someone volunteers for regular testing, perhaps something can be arranged).
  3. A benchmark shows the asm performs significantly better than the C version.
  4. Someone volunteers to maintain the code if it's for a platform we're not familiar with (so probably everything that's not arm or x86).

Regardless of the exact conditions, I think we should make sure the cost-benefit of asm is positive before including/keeping it. Currently we have no idea about the benefits, and the cost is non-zero as this bug shows (and previous bugs, for example our Sparc64 asm was reported broken some time ago).

@hanno-becker
Copy link

IIRC, we also had a long-standing issue with our MIPS assembly which was then realized to be a bug in the clobber list.

@davidhorstmann-arm
Copy link
Contributor

davidhorstmann-arm commented Sep 22, 2021

Looking more closely at the inline assembly I've realised that the correct memory constraint depends on how many times MULADDC_CORE is used. Our current memory constraints for x86_64 are correct in some cases and far too restrictive in others. We could either:

  1. Ignore this and have overly restrictive memory constraints sometimes.
  2. Supply a parameter to the MULADDC_STOP macro, for the number of iterations of MULADDC_CORE that go before it, to use for memory constraints. So the code in bignum.h would look like:
    MULADDC_INIT
    MULADDC_CORE MULADDC_CORE
    MULADDC_CORE MULADDC_CORE
    
    MULADDC_CORE MULADDC_CORE
    MULADDC_CORE MULADDC_CORE
    MULADDC_STOP(8)

Which would generate

asm(
    ...
        : "+c" (c), "+D" (d), "+S" (s), "+m" (*(uint64_t (*)[8]) d) \
        : "b" (b), "m" (*(const uint64_t (*)[8]) s)                 \
        : "rax", "rdx", "r8" 
);

Does anyone have thoughts on this solution?

@gilles-peskine-arm
Copy link
Contributor Author

I think we should just use the largest amount. It's unlikely that this will prevent a compiler from doing a worthwhile optimization. When I did the amd64 fix, I even considered just adding a generic "memory" clobber to keep things simple.

davidhorstmann-arm added a commit to davidhorstmann-arm/mbedtls that referenced this issue Sep 22, 2021
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP.
This fixes an issue where Clang 12 and 13 were generating
non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943
for further details.

Signed-off-by: David Horstmann <[email protected]>
davidhorstmann-arm added a commit to davidhorstmann-arm/mbedtls that referenced this issue Sep 22, 2021
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP.
This fixes an issue where Clang 12 and 13 were generating
non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943
for further details.

Signed-off-by: David Horstmann <[email protected]>
davidhorstmann-arm added a commit to davidhorstmann-arm/mbedtls that referenced this issue Sep 24, 2021
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP.
This fixes an issue where Clang 12 and 13 were generating
non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943
for further details.

Signed-off-by: David Horstmann <[email protected]>
@davidhorstmann-arm
Copy link
Contributor

Current progress: amd64 and aarch64 assembly snippets in bn_mul.h are fixed. I've been unable to reproduce the Clang-12 issue with arm or thumb, although this does not necessarily mean that their constraints are correct.

@gilles-peskine-arm
Copy link
Contributor Author

Tips for local testing

For targets that can run Linux, see https://developer.trustedfirmware.org/w/mbed-tls/testing/qemu/ for setting up your system to use Qemu for testing.

If on Ubuntu 18.04 or earlier (not needed on 20.04), add a package source for a recent clang: /etc/apt/sources.list.d/clang.list

deb [arch=amd64] http://apt.llvm.org/bionic/ llvm-toolchain-bionic-12 main
deb-src [arch=amd64] http://apt.llvm.org/bionic/ llvm-toolchain-bionic-12 main

Then install the clang-12 package.

@hanno-becker
Copy link

hanno-becker commented Dec 22, 2021

I have just measured the performance of mpi_montmul() on Cortex-M55, comparing ASM to Arm Compiler 6.15 generated code (with -mcpu=cortex-m55 and-O3). For a 2048-bit Montgomery multiplication, our assembly takes ~39kCycles, while the C code takes 82kCycles. With another small optimization in the assembly, leveraging the 64-bit data path of Cortex-M55, the assembly performance got down to 25kCycles, so more than 3x faster than the C code.

I think we should keep the inline assembly for those architectures we feel we have the competency and resources to maintain, including the A-profile and esp. the M-profile of the Arm architecture. In my recent experience, optimized assembly is always much faster than C, and esp. for the most time-consuming code around RSA and ECC, I don't think we can afford dropping a 2x factor solely for easier maintenance.

For convenience of the user relying on the inline assembly we'd be removing, we can add thorough documentation for what macros to replace, and provide the old snippets as non-maintained examples (so long as we are not aware that they don't work)?

@mpg
Copy link
Contributor

mpg commented Dec 27, 2021

Thanks @hanno-arm that is very valuable information! I'm thinking we might want to collect that kind of results in some place that's more central and easier to find in the future than comments on various github issues/PR. (Not only because we want to know, but also because we often get asked externally.) How about introducing something like docs/performance.txt?

In my recent experience, optimized assembly is always much faster than C, and esp. for the most time-consuming code around RSA and ECC, I don't think we can afford dropping a 2x factor solely for easier maintenance.

To clarify: I never suggested that, just that we should make sure there's an actual measurable improvement, rather than just hope there is. Clearly, a 2x factor absolutely worth the maintenance cost, I think even smaller factors would be worth it considering this is a critical part of the code.

Coming back to the list of conditions I suggested earlier, I think it's clear the A-profile and M-profile meet criteria 0, 1, and 3, and your measurements make it clear that for cores with the DSP extension (which I think includes all A-profile cores, and about half of M-profile cores) criterion 2 is more than satisfied. Now it would be good to also know for the remaining M-profile cores, namely M0(+), M3 and M23.

Btw, I'm not sure if in 2021 (soon 2022!) we still care about pre-Cortex Arm cores?

@hanno-becker
Copy link

On Armv8-A, the difference between C and inline assembly is smaller for high-end cores: On Cortex-A78 and Cortex-X1, I get around 24kCycles for our inline assembly and around 27kCycles for the C version, for a 4096-bit Montgomery multiplication. The inline assembly can be improved to around 17kCycles with a few simple changes keeping the structure of the inline assembly. So, while the differences are not as big as for the M-profile, they are large enough to merit the maintenance of inline assembly for v8-A, in my opinion. (NB: Of course there is much more to gain by writing dedicated multiplication routines for ECC-sized "big"nums, as those fit into the register file, but that's a whole lot more intrusive than improving the inline assembly macros we have at the moment).

@gilles-peskine-arm
Copy link
Contributor Author

We've had a report (#9819) that aesni.c is missing some clobber declarations. No known miscompilation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug component-crypto Crypto primitives and low-level interfaces help-wanted This issue is not being actively worked on, but PRs welcome. size-m Estimated task size: medium (~1w)
Projects
None yet
Development

No branches or pull requests

6 participants