Jemalloc performance on 64-bit ARM #34476

Closed

MagaTailor opened this issue Jun 25, 2016 · 9 comments
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@MagaTailor

MagaTailor commented Jun 25, 2016

I've just run the binary_trees benchmark on an ARMv8, Cortex-A53 processor, having converted an Android TV box to Linux.

I'd previously found that on a much weaker (but more power-efficient) armv7 Cortex-A5 the results were equal. On the new machine (using the latest official aarch64 rustc nightly), ./binary_trees 23 produces the following results:

sysalloc 1m28s 5m10s 0m10s
jemalloc 1m35s 5m10s 0m53s

which is actually palpably worse, even though the Cortex-A53 is a much stronger core.

I'm beginning to think jemalloc only makes sense on Intel processors with heaps of L1/L2 cache.

More benchmark ideas welcome, though.

added retroactively:
To reproduce, unpack the attachment and run:

cargo build --release && time target/release/binary_trees 23

inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
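
In case the attachment goes missing, here is a minimal sketch of what such an allocation-heavy main.rs looks like. It is not the exact attached code, just the same general shape (build and drop whole binary trees so the allocator does the work); depth 23 needs a good amount of RAM:

// Minimal sketch of an allocation-heavy binary-trees style benchmark.
// Not the exact attached code. Uncommenting the next two lines (on a
// nightly of that era) links the system allocator instead of jemalloc:
// #![feature(alloc_system)]
// extern crate alloc_system;

struct Tree {
    left: Option<Box<Tree>>,
    right: Option<Box<Tree>>,
}

fn build(depth: u32) -> Tree {
    if depth == 0 {
        Tree { left: None, right: None }
    } else {
        Tree {
            left: Some(Box::new(build(depth - 1))),
            right: Some(Box::new(build(depth - 1))),
        }
    }
}

fn count(tree: &Tree) -> u64 {
    1 + tree.left.as_ref().map_or(0, |t| count(t))
      + tree.right.as_ref().map_or(0, |t| count(t))
}

fn main() {
    let depth: u32 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(21);
    // Repeatedly build and drop whole trees to stress allocation and deallocation.
    let mut nodes = 0u64;
    for _ in 0..4 {
        nodes += count(&build(depth));
    }
    println!("{} nodes", nodes);
}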

@MagaTailor
Author

MagaTailor commented Jun 26, 2016

So, what happens if we run well-optimized armv7 binaries on that system?

sysalloc 1m9s 3m59s 0m19s
jemalloc 1m11s 3m58s 0m25s

Ouch!

EDIT:
I did another comparison like this later, running armv7 binaries on aarch64, and for certain CPU-bound workloads the native aarch64 code was 2-3 times faster (even though, all else being equal, only about a 50% improvement would be expected from the 64-bit transition alone).

@sorear
Contributor

sorear commented Jun 26, 2016

What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

(Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
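
A concrete illustration of the pointer-size effect, using a node shaped like the one in this kind of benchmark (the exact layout in the attachment may differ); the sizes assume Rust's usual null-pointer optimization for Option<Box<T>>:

use std::mem::size_of;

// Hypothetical node with two optional boxed children.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // Option<Box<T>> is pointer-sized (None reuses the null pattern),
    // so this prints 16 on a 64-bit target and 8 on a 32-bit one:
    // half the per-node footprint, so twice as many nodes fit in cache.
    println!("{} bytes per node", size_of::<Node>());
}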

@MagaTailor
Author

MagaTailor commented Jun 26, 2016 (via email reply)

> What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

Those were the timings (real, user, and sys, from time).

> (Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)

Yes, but the relative difference, as I'd mentioned in the opening comment, was very small, which means there's also a factor of LLVM backend maturity.

@sorear
Contributor

sorear commented Jun 26, 2016

What do the aarch64 timings look like if you turn off memory return to the OS using MALLOC_CONF=lg_dirty_mult:-1 ? That helped last time I saw jemalloc using excessive sys time.
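
For example, something along these lines should apply it for a single run, assuming a bash-like shell and the binary from the earlier cargo build:

time MALLOC_CONF=lg_dirty_mult:-1 target/release/binary_trees 23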

@MagaTailor
Author

MagaTailor commented Jun 26, 2016

Nice trick! Now jemalloc gives 1m19s 4m57s 0m2s, but how does that compare to the default allocator's settings? And by changing those settings, the issue stops being about any code generation comparisons.

Thanks to your tweak, the armv7 jemalloc binary, running on the Cortex-A53, was able to catch up with sysalloc :) (1m9s 3m55s 0m1s)

@brson brson added A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jun 27, 2016
@brson
Contributor

brson commented Jun 27, 2016

I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere period.

@MagaTailor
Author

MagaTailor commented Jun 28, 2016

@brson Now that I've built Rust on two different ARM architectures with --disable-jemalloc, I'd like to propose a configure switch inverting the current allocator defaults. In other words, use alloc_system by default, but also build the jemalloc crate.

The current disable switch makes it impossible to use jemalloc on a per crate basis, like this:

#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;

Or, more simply, --disable-jemalloc could start meaning just that.
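
(For completeness, the mirror-image opt-in for the system allocator, which is what the sysalloc builds above use if I remember the feature name right, is just:)

#![feature(alloc_system)]
extern crate alloc_system;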

@brson
Contributor

brson commented Jun 28, 2016

> Or, more simply, --disable-jemalloc could start meaning just that.

sgtm

@MagaTailor
Author

The following news makes this issue much less interesting. Who knows what effect DVFS has under different loads.

http://www.cnx-software.com/2016/08/28/amlogic-s905-and-s912-processors-appear-to-be-limited-to-1-5-ghz-not-2-ghz-as-advertised/
