Review inline assembly for missing constraints and fix them #4943

gilles-peskine-arm · 2021-09-14T09:21:27Z

Some of our inline assembly code is missing annotations telling the compiler what the assembly reads and writes (register, memory). On the platforms we test with, there are generally enough annotations to get correct code, but we're at the mercy of a new compiler version that optimizes better. This happened on x86_64 with Clang 12 and the bignum multiplication code (#4116, #4786, #4917), for example. 52b845b may have been a similar issue but I haven't fully investigated.

The x86 code in bn_mul.h is similar to the x86_64 and has the same issue. I haven't looked at others.

Help from the community would be appreciated, especially for platforms other than x86 and arm.

The text was updated successfully, but these errors were encountered:

gilles-peskine-arm · 2021-09-15T16:26:26Z

The C implementation in bn_mul.h is fairly simple. I wonder if the assembly code is really better than what compilers can do. Maybe we can remove the assembly code and stick with C? To be benchmarked.

davidhorstmann-arm · 2021-09-15T17:22:51Z

Do we know what the origins of these assembly snippets? Were they community contributions? Did the authors do benchmarking? Were they hand-written from scratch, or adapted from compiled C code?

gilles-peskine-arm · 2021-09-15T17:27:16Z

Do we know what the origins of these assembly snippets?

We know what's in the git log. For things that are older than the git history (which includes most of bn_mul.h), we don't know anymore.

mpg · 2021-09-16T08:11:09Z

I also suggested some time ago that we get rid of the asm code and stick with C as a first step, and then re-introduce asm for a particular platform only if the following conditions are met:

We know there's interest for running Mbed TLS on that platform. (For example, to do still care about non-SSE2 i386 in 2021? I don't know, and I'd like to know.)
The code can be tested, and not just at submission time (ideally as part of our CI, but if someone volunteers for regular testing, perhaps something can be arranged).
A benchmark shows the asm performs significantly better than the C version.
Someone volunteers to maintain the code if it's for a platform we're not familiar with (so probably everything that's not arm or x86).

Regardless of the exact conditions, I think we should make sure the cost-benefit of asm is positive before including/keeping it. Currently we have no idea about the benefits, and the cost is non-zero as this bug shows (and previous bugs, for example our Sparc64 asm was reported broken some time ago).

hanno-becker · 2021-09-16T08:25:22Z

IIRC, we also had a long-standing issue with our MIPS assembly which was then realized to be a bug in the clobber list.

davidhorstmann-arm · 2021-09-22T16:52:17Z

Looking more closely at the inline assembly I've realised that the correct memory constraint depends on how many times MULADDC_CORE is used. Our current memory constraints for x86_64 are correct in some cases and far too restrictive in others. We could either:

Ignore this and have overly restrictive memory constraints sometimes.
Supply a parameter to the MULADDC_STOP macro, for the number of iterations of MULADDC_CORE that go before it, to use for memory constraints. So the code in bignum.h would look like:

    MULADDC_INIT
    MULADDC_CORE MULADDC_CORE
    MULADDC_CORE MULADDC_CORE
    
    MULADDC_CORE MULADDC_CORE
    MULADDC_CORE MULADDC_CORE
    MULADDC_STOP(8)

Which would generate

asm(
    ...
        : "+c" (c), "+D" (d), "+S" (s), "+m" (*(uint64_t (*)[8]) d) \
        : "b" (b), "m" (*(const uint64_t (*)[8]) s)                 \
        : "rax", "rdx", "r8" 
);

Does anyone have thoughts on this solution?

gilles-peskine-arm · 2021-09-22T17:12:34Z

I think we should just use the largest amount. It's unlikely that this will prevent a compiler from doing a worthwhile optimization. When I did the amd64 fix, I even considered just adding a generic "memory" clobber to keep things simple.

Add memory constraints to the aarch64 inline assembly in MULADDC_STOP. This fixes an issue where Clang 12 and 13 were generating non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943 for further details. Signed-off-by: David Horstmann <[email protected]>

davidhorstmann-arm · 2021-09-27T10:04:48Z

Current progress: amd64 and aarch64 assembly snippets in bn_mul.h are fixed. I've been unable to reproduce the Clang-12 issue with arm or thumb, although this does not necessarily mean that their constraints are correct.

gilles-peskine-arm · 2021-11-16T10:50:23Z

Tips for local testing

For targets that can run Linux, see https://developer.trustedfirmware.org/w/mbed-tls/testing/qemu/ for setting up your system to use Qemu for testing.

If on Ubuntu 18.04 or earlier (not needed on 20.04), add a package source for a recent clang: /etc/apt/sources.list.d/clang.list

deb [arch=amd64] http://apt.llvm.org/bionic/ llvm-toolchain-bionic-12 main
deb-src [arch=amd64] http://apt.llvm.org/bionic/ llvm-toolchain-bionic-12 main

Then install the clang-12 package.

hanno-becker · 2021-12-22T13:45:03Z

I have just measured the performance of mpi_montmul() on Cortex-M55, comparing ASM to Arm Compiler 6.15 generated code (with -mcpu=cortex-m55 and-O3). For a 2048-bit Montgomery multiplication, our assembly takes ~39kCycles, while the C code takes 82kCycles. With another small optimization in the assembly, leveraging the 64-bit data path of Cortex-M55, the assembly performance got down to 25kCycles, so more than 3x faster than the C code.

I think we should keep the inline assembly for those architectures we feel we have the competency and resources to maintain, including the A-profile and esp. the M-profile of the Arm architecture. In my recent experience, optimized assembly is always much faster than C, and esp. for the most time-consuming code around RSA and ECC, I don't think we can afford dropping a 2x factor solely for easier maintenance.

For convenience of the user relying on the inline assembly we'd be removing, we can add thorough documentation for what macros to replace, and provide the old snippets as non-maintained examples (so long as we are not aware that they don't work)?

mpg · 2021-12-27T09:16:40Z

Thanks @hanno-arm that is very valuable information! I'm thinking we might want to collect that kind of results in some place that's more central and easier to find in the future than comments on various github issues/PR. (Not only because we want to know, but also because we often get asked externally.) How about introducing something like docs/performance.txt?

In my recent experience, optimized assembly is always much faster than C, and esp. for the most time-consuming code around RSA and ECC, I don't think we can afford dropping a 2x factor solely for easier maintenance.

To clarify: I never suggested that, just that we should make sure there's an actual measurable improvement, rather than just hope there is. Clearly, a 2x factor absolutely worth the maintenance cost, I think even smaller factors would be worth it considering this is a critical part of the code.

Coming back to the list of conditions I suggested earlier, I think it's clear the A-profile and M-profile meet criteria 0, 1, and 3, and your measurements make it clear that for cores with the DSP extension (which I think includes all A-profile cores, and about half of M-profile cores) criterion 2 is more than satisfied. Now it would be good to also know for the remaining M-profile cores, namely M0(+), M3 and M23.

Btw, I'm not sure if in 2021 (soon 2022!) we still care about pre-Cortex Arm cores?

hanno-becker · 2021-12-29T07:57:43Z

On Armv8-A, the difference between C and inline assembly is smaller for high-end cores: On Cortex-A78 and Cortex-X1, I get around 24kCycles for our inline assembly and around 27kCycles for the C version, for a 4096-bit Montgomery multiplication. The inline assembly can be improved to around 17kCycles with a few simple changes keeping the structure of the inline assembly. So, while the differences are not as big as for the M-profile, they are large enough to merit the maintenance of inline assembly for v8-A, in my opinion. (NB: Of course there is much more to gain by writing dedicated multiplication routines for ECC-sized "big"nums, as those fit into the register file, but that's a whole lot more intrusive than improving the inline assembly macros we have at the moment).

gilles-peskine-arm · 2024-12-09T19:49:15Z

We've had a report (#9819) that aesni.c is missing some clobber declarations. No known miscompilation.

gilles-peskine-arm added bug help-wanted This issue is not being actively worked on, but PRs welcome. component-crypto Crypto primitives and low-level interfaces Product Backlog size-m Estimated task size: medium (~1w) labels Sep 14, 2021

gilles-peskine-arm mentioned this issue Sep 14, 2021

8e464c4 breaks clang 12 -O2 (make test failed) #4786

Closed

sfan5 mentioned this issue Sep 21, 2021

Clang 12 bignum issue also affects aarch64 #4962

Closed

davidhorstmann-arm mentioned this issue Sep 22, 2021

Fix aarch64 assembly for bignum multiplication #4968

Merged

davidhorstmann-arm mentioned this issue Sep 24, 2021

Backport 2.x: Fix aarch64 assembly for bignum multiplication #4978

Merged

gilles-peskine-arm mentioned this issue Nov 16, 2021

embedtls does not work with clang 13 optimization -O2 #5171

Closed

davidhorstmann-arm mentioned this issue Nov 23, 2021

"RSA-The PKCS#1 verification failed" on XCode13.1 #5152

Closed

lohralfen mentioned this issue Dec 6, 2021

mbedtls_mpi_self_test fails on Cortex M33 microcontroller #5266

Closed

hanno-becker mentioned this issue Dec 22, 2021

Improve inline assembly for Cortex-M + DSP #5360

Closed

daverodgman removed the Product Backlog label May 13, 2022

bensze01 added the Product Backlog label May 13, 2022

daverodgman removed the Product Backlog label May 13, 2022

the-real-adammork mentioned this issue Nov 23, 2022

Patch for MbedTLS PR#4978 - fixes bignum multiplication for newer versions of clang (Xcode 13/14) mobilecoinfoundation/rust-mbedtls#66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Review inline assembly for missing constraints and fix them #4943

Review inline assembly for missing constraints and fix them #4943

gilles-peskine-arm commented Sep 14, 2021 •

edited

Loading

gilles-peskine-arm commented Sep 15, 2021

davidhorstmann-arm commented Sep 15, 2021

gilles-peskine-arm commented Sep 15, 2021

mpg commented Sep 16, 2021

hanno-becker commented Sep 16, 2021

davidhorstmann-arm commented Sep 22, 2021 •

edited

Loading

gilles-peskine-arm commented Sep 22, 2021

davidhorstmann-arm commented Sep 27, 2021

gilles-peskine-arm commented Nov 16, 2021

hanno-becker commented Dec 22, 2021 •

edited

Loading

mpg commented Dec 27, 2021

hanno-becker commented Dec 29, 2021

gilles-peskine-arm commented Dec 9, 2024

Review inline assembly for missing constraints and fix them #4943

Review inline assembly for missing constraints and fix them #4943

Comments

gilles-peskine-arm commented Sep 14, 2021 • edited Loading

gilles-peskine-arm commented Sep 15, 2021

davidhorstmann-arm commented Sep 15, 2021

gilles-peskine-arm commented Sep 15, 2021

mpg commented Sep 16, 2021

hanno-becker commented Sep 16, 2021

davidhorstmann-arm commented Sep 22, 2021 • edited Loading

gilles-peskine-arm commented Sep 22, 2021

davidhorstmann-arm commented Sep 27, 2021

gilles-peskine-arm commented Nov 16, 2021

Tips for local testing

hanno-becker commented Dec 22, 2021 • edited Loading

mpg commented Dec 27, 2021

hanno-becker commented Dec 29, 2021

gilles-peskine-arm commented Dec 9, 2024

gilles-peskine-arm commented Sep 14, 2021 •

edited

Loading

davidhorstmann-arm commented Sep 22, 2021 •

edited

Loading

hanno-becker commented Dec 22, 2021 •

edited

Loading