aarch64: Support run-time detection of FEAT_LSE2 #126

Merged: 1 commit merged into main from aarch64-lse2-outline-atomics on Oct 23, 2023

@taiki-e commented Oct 16, 2023

This adds run-time detection of FEAT_LSE2 for 128-bit atomic load/store.

This greatly improves the performance of 128-bit atomic load/store on hardware such as Graviton 3 and Apple M1[^1], even without -C target-feature=+lse2. On aarch64 macOS, FEAT_LSE2 is enabled at compile time, so the code is already optimized there and run-time detection is unnecessary and unused. However, Linux and other systems running on M1 benefit from run-time detection.
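As a rough illustration of the idea (not the code in this PR), the following minimal sketch caches the result of a feature probe and picks a code path from it. The names `probe_lse2`, `has_lse2`, `load_path`, and `LSE2_STATE` are hypothetical, and the probe uses the standard library's `is_aarch64_feature_detected!` macro as a stand-in for the crate's own detection code.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Cached detection state (hypothetical names, for illustration only).
const UNKNOWN: u8 = 0;
const ABSENT: u8 = 1;
const PRESENT: u8 = 2;

static LSE2_STATE: AtomicU8 = AtomicU8::new(UNKNOWN);

// Assumption: std's detection macro stands in for the crate's own probe.
#[cfg(target_arch = "aarch64")]
fn probe_lse2() -> bool {
    std::arch::is_aarch64_feature_detected!("lse2")
}

// On other architectures FEAT_LSE2 cannot be present.
#[cfg(not(target_arch = "aarch64"))]
fn probe_lse2() -> bool {
    false
}

/// Returns whether FEAT_LSE2 is available, probing at most once per thread
/// race (a racing thread may probe again, but it stores the same answer).
fn has_lse2() -> bool {
    match LSE2_STATE.load(Ordering::Relaxed) {
        UNKNOWN => {
            let present = probe_lse2();
            let state = if present { PRESENT } else { ABSENT };
            LSE2_STATE.store(state, Ordering::Relaxed);
            present
        }
        state => state == PRESENT,
    }
}

/// Names the path a 128-bit load/store would take in this sketch.
fn load_path() -> &'static str {
    if cfg!(target_feature = "lse2") {
        // Enabled at compile time (e.g. aarch64 macOS): no detection needed.
        "compile-time lse2"
    } else if has_lse2() {
        "run-time lse2"
    } else {
        "fallback"
    }
}

fn main() {
    println!("128-bit atomic load/store path: {}", load_path());
}
```

The compile-time check comes first, which mirrors the point above that targets with FEAT_LSE2 enabled at build time never consult the run-time probe.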

The following are the results of a microbenchmark on aarch64 Linux on Apple M1:

```text
bench_portable_atomic_arch/u128_load
                        time:   [1.5079 ns 1.5090 ns 1.5104 ns]
                        change: [-79.046% -78.953% -78.891%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
bench_portable_atomic_arch/u128_store
                        time:   [1.1308 ns 1.1317 ns 1.1326 ns]
                        change: [-80.159% -80.046% -79.970%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) high mild
  7 (7.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load
                        time:   [78.471 µs 79.294 µs 80.459 µs]
                        change: [-51.261% -50.628% -49.878%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
bench_portable_atomic_arch/u128_concurrent_load_store
                        time:   [125.96 µs 126.72 µs 127.65 µs]
                        change: [-62.113% -61.410% -60.740%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
bench_portable_atomic_arch/u128_concurrent_store
                        time:   [136.19 µs 136.87 µs 137.58 µs]
                        change: [-53.565% -52.888% -52.263%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
```
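To read the `change` lines as speedups: a relative change of c in run time corresponds to a speedup factor of 1 / (1 + c). The small sketch below applies that to the middle (point) estimates reported above; the helper name `speedup_from_change` is made up for this illustration.

```rust
/// Converts a relative change in run time (e.g. -0.78953 for "-78.953%")
/// into a speedup factor: new_time = old_time * (1 + change).
fn speedup_from_change(change: f64) -> f64 {
    1.0 / (1.0 + change)
}

fn main() {
    // Point estimates copied from the benchmark output above.
    let results = [
        ("u128_load", -0.78953),
        ("u128_store", -0.80046),
        ("u128_concurrent_load", -0.50628),
        ("u128_concurrent_load_store", -0.61410),
        ("u128_concurrent_store", -0.52888),
    ];
    for (name, change) in results {
        println!("{name}: {:.2}x faster", speedup_from_change(change));
    }
}
```

By this measure the single-threaded load and store are roughly five times faster, and the concurrent benchmarks roughly two to two and a half times faster.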

The detection reuses an implementation that already existed for testing.

cc #10

@taiki-e added the label "O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state" on Oct 16, 2023
@taiki-e force-pushed the aarch64-lse2-outline-atomics branch 3 times, most recently from f999eba to 247d31a, on October 23, 2023 13:44
@taiki-e force-pushed the aarch64-lse2-outline-atomics branch from 247d31a to f38e694 on October 23, 2023 15:00
@taiki-e merged commit dc6205b into main on Oct 23, 2023
@taiki-e deleted the aarch64-lse2-outline-atomics branch on October 23, 2023 16:43
@taiki-e added the label "O-aarch64 Target: Armv8-A, Armv8-R, or later processors in AArch64 mode" and removed the label "O-arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state" on Sep 2, 2024