Jemalloc performance on 64-bit ARM #34476

Closed

MagaTailor opened this issue Jun 25, 2016 · 9 comments
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@MagaTailor

MagaTailor commented Jun 25, 2016

I've just run the binary_trees benchmark on an ARMv8, Cortex-A53 processor, having converted an Android TV box to Linux.

I'd previously found that on a much weaker (but more power-efficient) armv7 Cortex-A5 the results were equal. On the new machine (using the latest official aarch64 rustc nightly), ./binary_trees 23 produces the following results:

sysalloc 1m28s 5m10s 0m10s
jemalloc 1m35s 5m10s 0m53s

which is actually palpably worse, even though the Cortex-A53 is a much stronger core.

I'm beginning to think jemalloc only makes sense on Intel processors with heaps of L1/L2 cache.

More benchmark ideas welcome, though.

added retroactively:
To reproduce, unpack the attachment and run:

cargo build --release && time target/release/binary_trees 23

inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
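
In case the attachment goes missing, here is a minimal sketch of what such an allocation-heavy main.rs looks like. It is not the exact attached code, just the same general shape (build and drop whole binary trees so the allocator does the work); depth 23 needs a good amount of RAM:

// Minimal sketch of an allocation-heavy binary-trees style benchmark.
// Not the exact attached code. Uncommenting the next two lines (on a
// nightly of that era) links the system allocator instead of jemalloc:
// #![feature(alloc_system)]
// extern crate alloc_system;

struct Tree {
    left: Option<Box<Tree>>,
    right: Option<Box<Tree>>,
}

fn build(depth: u32) -> Tree {
    if depth == 0 {
        Tree { left: None, right: None }
    } else {
        Tree {
            left: Some(Box::new(build(depth - 1))),
            right: Some(Box::new(build(depth - 1))),
        }
    }
}

fn count(tree: &Tree) -> u64 {
    1 + tree.left.as_ref().map_or(0, |t| count(t))
      + tree.right.as_ref().map_or(0, |t| count(t))
}

fn main() {
    let depth: u32 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(21);
    // Repeatedly build and drop whole trees to stress allocation and deallocation.
    let mut nodes = 0u64;
    for _ in 0..4 {
        nodes += count(&build(depth));
    }
    println!("{} nodes", nodes);
}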

@MagaTailor
Author

MagaTailor commented Jun 26, 2016

So, what happens if we run well-optimized armv7 binaries on that system?

sysalloc 1m9s 3m59s 0m19s
jemalloc 1m11s 3m58s 0m25s

Ouch!

EDIT:
I did another comparison like this later, running armv7 binaries on aarch64, and for certain CPU-bound workloads the native aarch64 code was 2-3 times faster (even though, all else being equal, only about a 50% improvement would be expected from the 64-bit transition alone).

@sorear
Contributor

sorear commented Jun 26, 2016

What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

(Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
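
A concrete illustration of the pointer-size effect, using a node shaped like the one in this kind of benchmark (the exact layout in the attachment may differ); the sizes assume Rust's usual null-pointer optimization for Option<Box<T>>:

use std::mem::size_of;

// Hypothetical node with two optional boxed children.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // Option<Box<T>> is pointer-sized (None reuses the null pattern),
    // so this prints 16 on a 64-bit target and 8 on a 32-bit one:
    // half the per-node footprint, so twice as many nodes fit in cache.
    println!("{} bytes per node", size_of::<Node>());
}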

@MagaTailor
Author

MagaTailor commented Jun 26, 2016 (via email reply)

> What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours.

Those were the timings (real, user, and sys, from time).

> (Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)

Yes, but the relative difference, as I'd mentioned in the opening comment, was very small, which means there's also a factor of LLVM backend maturity.

@sorear
Contributor

sorear commented Jun 26, 2016

What do the aarch64 timings look like if you turn off memory return to the OS using MALLOC_CONF=lg_dirty_mult:-1 ? That helped last time I saw jemalloc using excessive sys time.
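
For example, something along these lines should apply it for a single run, assuming a bash-like shell and the binary from the earlier cargo build:

time MALLOC_CONF=lg_dirty_mult:-1 target/release/binary_trees 23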

@MagaTailor
Author

MagaTailor commented Jun 26, 2016

Nice trick! Now jemalloc gives 1m19s 4m57s 0m2s, but how does that compare to the default allocator's settings? And by changing those settings, the issue stops being about any code generation comparisons.

Thanks to your tweak, the armv7 jemalloc binary, running on the Cortex-A53, was able to catch up with sysalloc :) (1m9s 3m55s 0m1s)

@brson brson added A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jun 27, 2016
@brson
Contributor

brson commented Jun 27, 2016

I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere period.

@MagaTailor
Author

MagaTailor commented Jun 28, 2016

@brson Now that I've built Rust on two different ARM architectures with --disable-jemalloc, I'd like to propose a configure switch inverting the current allocator defaults. In other words, use alloc_system by default, but also build the jemalloc crate.

The current disable switch makes it impossible to use jemalloc on a per crate basis, like this:

#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;

Or, more simply, --disable-jemalloc could start meaning just that.
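
(For completeness, the mirror-image opt-in for the system allocator, which is what the sysalloc builds above use if I remember the feature name right, is just:)

#![feature(alloc_system)]
extern crate alloc_system;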

@brson
Contributor

brson commented Jun 28, 2016

> Or, more simply, --disable-jemalloc could start meaning just that.

sgtm

@MagaTailor
Author

The following news makes this issue much less interesting. Who knows what effect DVFS has under different loads.

http://www.cnx-software.com/2016/08/28/amlogic-s905-and-s912-processors-appear-to-be-limited-to-1-5-ghz-not-2-ghz-as-advertised/
