-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Review inline assembly for missing constraints and fix them #4943
Comments
The C implementation in |
Do we know what the origins of these assembly snippets? Were they community contributions? Did the authors do benchmarking? Were they hand-written from scratch, or adapted from compiled C code? |
We know what's in the git log. For things that are older than the git history (which includes most of |
I also suggested some time ago that we get rid of the asm code and stick with C as a first step, and then re-introduce asm for a particular platform only if the following conditions are met:
Regardless of the exact conditions, I think we should make sure the cost-benefit of asm is positive before including/keeping it. Currently we have no idea about the benefits, and the cost is non-zero as this bug shows (and previous bugs, for example our Sparc64 asm was reported broken some time ago). |
IIRC, we also had a long-standing issue with our MIPS assembly which was then realized to be a bug in the clobber list. |
Looking more closely at the inline assembly I've realised that the correct memory constraint depends on how many times
Which would generate
Does anyone have thoughts on this solution? |
I think we should just use the largest amount. It's unlikely that this will prevent a compiler from doing a worthwhile optimization. When I did the amd64 fix, I even considered just adding a generic |
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP. This fixes an issue where Clang 12 and 13 were generating non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943 for further details. Signed-off-by: David Horstmann <[email protected]>
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP. This fixes an issue where Clang 12 and 13 were generating non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943 for further details. Signed-off-by: David Horstmann <[email protected]>
Add memory constraints to the aarch64 inline assembly in MULADDC_STOP. This fixes an issue where Clang 12 and 13 were generating non-functional code on aarch64 platforms. See Mbed-TLS#4962, Mbed-TLS#4943 for further details. Signed-off-by: David Horstmann <[email protected]>
Current progress: amd64 and aarch64 assembly snippets in |
Tips for local testingFor targets that can run Linux, see https://developer.trustedfirmware.org/w/mbed-tls/testing/qemu/ for setting up your system to use Qemu for testing. If on Ubuntu 18.04 or earlier (not needed on 20.04), add a package source for a recent clang:
Then install the |
I have just measured the performance of I think we should keep the inline assembly for those architectures we feel we have the competency and resources to maintain, including the A-profile and esp. the M-profile of the Arm architecture. In my recent experience, optimized assembly is always much faster than C, and esp. for the most time-consuming code around RSA and ECC, I don't think we can afford dropping a 2x factor solely for easier maintenance. For convenience of the user relying on the inline assembly we'd be removing, we can add thorough documentation for what macros to replace, and provide the old snippets as non-maintained examples (so long as we are not aware that they don't work)? |
Thanks @hanno-arm that is very valuable information! I'm thinking we might want to collect that kind of results in some place that's more central and easier to find in the future than comments on various github issues/PR. (Not only because we want to know, but also because we often get asked externally.) How about introducing something like
To clarify: I never suggested that, just that we should make sure there's an actual measurable improvement, rather than just hope there is. Clearly, a 2x factor absolutely worth the maintenance cost, I think even smaller factors would be worth it considering this is a critical part of the code. Coming back to the list of conditions I suggested earlier, I think it's clear the A-profile and M-profile meet criteria 0, 1, and 3, and your measurements make it clear that for cores with the DSP extension (which I think includes all A-profile cores, and about half of M-profile cores) criterion 2 is more than satisfied. Now it would be good to also know for the remaining M-profile cores, namely M0(+), M3 and M23. Btw, I'm not sure if in 2021 (soon 2022!) we still care about pre-Cortex Arm cores? |
On Armv8-A, the difference between C and inline assembly is smaller for high-end cores: On Cortex-A78 and Cortex-X1, I get around 24kCycles for our inline assembly and around 27kCycles for the C version, for a 4096-bit Montgomery multiplication. The inline assembly can be improved to around 17kCycles with a few simple changes keeping the structure of the inline assembly. So, while the differences are not as big as for the M-profile, they are large enough to merit the maintenance of inline assembly for v8-A, in my opinion. (NB: Of course there is much more to gain by writing dedicated multiplication routines for ECC-sized "big"nums, as those fit into the register file, but that's a whole lot more intrusive than improving the inline assembly macros we have at the moment). |
We've had a report (#9819) that |
Some of our inline assembly code is missing annotations telling the compiler what the assembly reads and writes (register, memory). On the platforms we test with, there are generally enough annotations to get correct code, but we're at the mercy of a new compiler version that optimizes better. This happened on x86_64 with Clang 12 and the bignum multiplication code (#4116, #4786, #4917), for example. 52b845b may have been a similar issue but I haven't fully investigated.
The x86 code in
bn_mul.h
is similar to the x86_64 and has the same issue. I haven't looked at others.Help from the community would be appreciated, especially for platforms other than x86 and arm.
The text was updated successfully, but these errors were encountered: