Jemalloc performance on 64-bit ARM #34476
So, what happens if we run well optimized
Ouch! EDIT:
What precisely are you running, and what do the three numbers represent? All I can find is https://benchmarksgame.alioth.debian.org/u64q/program.php?test=binarytrees&lang=rust&id=1 , but the output is not similar to yours. (Regarding the armv7 case … it's actually not unheard of for a 32-bit version of a program to be faster than the 64-bit version on 64-bit hardware. The reason is that the pointers are smaller -> data structures are smaller -> more of them fit in cache. Obviously this is highly workload-dependent.)
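The pointer-size point can be made concrete with a short Rust sketch (the `Node` type here is illustrative, not taken from the benchmark source): a tree node holding two child pointers is twice as large on a 64-bit target as on a 32-bit one, so fewer nodes fit in each cache line.

```rust
use std::mem::size_of;

// A binary-tree node of the shape binary_trees allocates: two child pointers.
// Option<Box<Node>> is niche-optimized, so each field is exactly pointer-sized.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

fn main() {
    // 16 bytes on a 64-bit target (aarch64), 8 bytes on a 32-bit one (armv7):
    // the 32-bit build packs twice as many nodes into the same cache.
    println!("Node size: {} bytes", size_of::<Node>());
}
```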
On Sun, 26 Jun 2016 01:09:27 -0700
Those were the timings.
Yes, but the relative difference, as I'd mentioned in the opening comment, was very small, which means there's also a factor of LLVM backend maturity.
What do the
Nice trick! Now it's Thanks to your tweak, the
I'd be in favor of turning jemalloc off everywhere except where it's already proven to be a win. Or everywhere, period.
@brson Now that I've built rust on two different ARM architectures with … The current disable switch makes it impossible to use jemalloc on a per-crate basis, like this:

```rust
#![feature(alloc_jemalloc)]
extern crate alloc_jemalloc;
```

Or more simply
sgtm |
The following news makes this issue much less interesting. Who knows what effect DVFS (dynamic voltage and frequency scaling) has under different loads.
I've just run the `binary_trees` benchmark on an ARMv8, Cortex-A53 processor, having converted an Android TV box to Linux. I'd found previously, on a much weaker (but more power-efficient) `armv7` Cortex-A5, the results were equal. On the new machine (using the latest official `aarch64` rustc nightly), `./binary_trees 23` produces the following results:

```
sysalloc  1m28s  5m10s  0m10s
jemalloc  1m35s  5m10s  0m53s
```

which is palpably worse, actually, even though the Cortex-A53 is a much stronger core.

I'm beginning to think `jemalloc` only makes sense on Intel processors with heaps of L1/L2 cache. More benchmark ideas welcome, though.

added retroactively:
To reproduce, unpack the attachment and run:

inside the binary_trees directory. Uncomment the first 2 lines in main.rs to produce a sysalloc version.
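For context on why the allocator dominates here, a minimal sketch (not the attached benchmark source) of the pattern `binary_trees` stresses: build a perfect tree of individually boxed nodes, traverse it, and drop it, so nearly all of the runtime is spent in the allocator.

```rust
// Sketch of the binary_trees allocation pattern: one heap allocation per node.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
}

// Build a perfect tree of the given depth; every node is a separate Box.
fn build(depth: u32) -> Box<Node> {
    if depth == 0 {
        Box::new(Node { left: None, right: None })
    } else {
        Box::new(Node {
            left: Some(build(depth - 1)),
            right: Some(build(depth - 1)),
        })
    }
}

// Walk the tree and count nodes, touching every allocation once.
fn count(node: &Node) -> u64 {
    1 + node.left.as_ref().map_or(0, |n| count(n))
        + node.right.as_ref().map_or(0, |n| count(n))
}

fn main() {
    let tree = build(10);
    // A perfect tree of depth d has 2^(d+1) - 1 nodes.
    println!("{}", count(&tree)); // 2047 for depth 10
    // Dropping `tree` here frees all 2047 boxes, again through the allocator.
}
```

The real benchmark repeats this for many depths in parallel, which is why allocator throughput (and, per the `sys` column above, the time jemalloc spends in the kernel) shows up so directly in the totals.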