
[AOT/perf] classes holding an int can be more efficient than plain integers? #53594

Closed
modulovalue opened this issue Sep 23, 2023 · 7 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. closed-as-intended Closed as the reported issue is expected behavior type-performance Issue relates to performance or code size

Comments

@modulovalue
Contributor

(cf. #53572 (comment))

It can be shown that a program (running on the VM AOT) using a class that holds an integer can perform better than the same program using an integer without a wrapper class.

This can lead to counterintuitive results where a class can perform better than an extension type (cf. #53572)

I'm wondering, could "integers" perform as well as a "class holding an integer"? Or is there a reason why that might not be possible, desired, or worth the effort?

Micro-benchmark as a repro:

dart compile exe --enable-experiment=inline-class dispatch_bench4.dart

./dispatch_bench4.exe

via class • on class with instantiated mixin: 16ms
via extension type • on class with instantiated mixin: 20ms
via integer • on class with instantiated mixin: 20ms
void main() {
  const size = 20000000;
  final datasetClass = List.generate(size, (final a) => SomeClass(foo: a));
  final datasetExtensionType = List.generate(size, (final a) => SomeExtensionType(a));
  final datasetInt = List.generate(size, (final a) => a);
  final sw = Stopwatch();
  void measure(
    final String name,
    final void Function() fn,
  ) {
    sw.reset();
    sw.start();
    fn();
    sw.stop();
    print("$name: ${sw.elapsedMilliseconds}ms");
  }
  const viaClass = RunClass();
  const viaExtensionType = RunExtensionType();
  const viaInt = RunInteger();
  measure(
    "via class • on class with instantiated mixin",
    () => viaClass.execute(datasetClass),
  );
  measure(
    "via extension type • on class with instantiated mixin",
    () => viaExtensionType.execute(datasetExtensionType),
  );
  measure(
    "via integer • on class with instantiated mixin",
    () => viaInt.execute(datasetInt),
  );
}

class RunExtensionType with Run<SomeExtensionType> {
  const RunExtensionType();
  
  @override
  int sum(final SomeExtensionType v) => v.foo;
}

class RunClass with Run<SomeClass> {
  const RunClass();
  
  @override
  int sum(final SomeClass v) => v.foo;
}

class RunInteger with Run<int> {
  const RunInteger();
  
  @override
  int sum(final int v) => v;
}

extension type SomeExtensionType(int foo) {
  static int fooSelector(
    final SomeExtensionType a,
  ) => a.foo;
}

class SomeClass {
  static int fooSelector(
    final SomeClass a,
  ) => a.foo;
  
  final int foo;

  const SomeClass({
    required this.foo,
  });
}

mixin Run<T> {
  int sum(
    final T v,
  );

  int execute(
    final List<T> tree,
  ) {
    int total = 0;
    for (final a in tree) {
      total += sum(a);
    }
    return total;
  }
}
@lrhn
Member

lrhn commented Sep 24, 2023

It seems unlikely that this benchmark can measure a significant difference in the access of the integer. The overhead of multiple function calls is so much larger than the thing being measured, that any difference in inlining or optimization will completely drown out the possible signal.

When I compile this code (only multiplying size by 4 to get larger times), I get

via class • on class with instantiated mixin: 95ms
via extension type • on class with instantiated mixin: 118ms
via integer • on class with instantiated mixin: 117ms

It's consistent that via class is smaller than the others, but also consistent that the difference is very, very small.

If I add the following two lines at the end of main:

  Object o = datasetClass.first;
  if (o is! SomeClass) throw "Bad";

(as an attempt to check that the instances of SomeClass are not allocation-sunk),
then the timing changes to:

via class • on class with instantiated mixin: 96ms
via extension type • on class with instantiated mixin: 99ms
via integer • on class with instantiated mixin: 99ms

If I change those lines to

  T confuse<T>(T v1, T v2) =>
      DateTime.now().millisecondsSinceEpoch > 0 ? v1 : v2;
  Object o = confuse(datasetClass.first, datasetExtensionType.first);
  if (o is! SomeClass) throw "Bad";

(to really check that SomeClass instances exist at runtime)
the timing changes to:

via class • on class with instantiated mixin: 100ms
via extension type • on class with instantiated mixin: 62ms
via integer • on class with instantiated mixin: 63ms

Whatever this thing is timing, it has about a factor of two of uncertainty.

It's as if the more I look at SomeClass dynamically, the faster everything else gets. (Maybe it's because preventing SomeClass from being allocation sunk to an int makes the mixin for the actual int case more specialized. But I'm guessing blindly.)

In any case, this doesn't so much show that class access is faster than extension type access; rather, it shows that also doing class member access may slow down direct access.

That's also interesting.
Suggests that there are things the compiler could choose differently to optimize for speed.
But also that it's mostly black magic from the user side.

@modulovalue
Contributor Author

more that also doing class member access may slow down direct access.

This gave me the idea to try the following:

When I run the code in the issue description repeatedly, I get 15ms for the on class case. However, if I remove the extension and int portion of the benchmark, it never achieves 15ms. The presence of the extension and integer cases appears to speed up the class case.

I'm not sure what to make of this (is this a bug? is this considered expected behavior?), but it is confusing and counterintuitive to me from the perspective of a user. I expected AOT to be more consistent when it comes to performance, or at least not to have later code slow down (or speed up?) code that comes before it.

This is similar to #53571 where the use of some feature appears to slow down unrelated parts of the program.

@lrhn
Member

lrhn commented Sep 24, 2023

Micro-benchmarks are generally not particularly good at measuring optimizing compilers.
If they do show something, there is no guarantee that it applies to real programs too. And they risk the optimizer falling into a pathological case that would never happen in real programs.

If 99% of your program's runtime is spent on the same piece of code, a 5% deviation seems significant.

@modulovalue
Contributor Author

Micro-benchmarks are generally not particularly good at measuring optimizing compilers.

@lrhn Is there any other strategy that you can recommend that would allow one to develop a better intuition for how to write performant Dart programs running on the VM?

Even if micro-benchmarks are imperfect, I'm hoping that they're better than relying entirely on my intuition. I agree that they're suboptimal, but they can be useful!

@lrhn
Member

lrhn commented Sep 25, 2023

I'd write real programs, then see where their bottlenecks are.
Check which operations are actually left in the program after the compiler has optimized it.
Then I'd try different approaches to optimizing, and see what works best.

I'll always go for avoiding allocation, and avoiding copying. And avoiding unnecessary work or overhead in general.
But I don't want to take extra steps, above the normal "don't do unnecessary work", outside of the code that actually matters.
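One way to reduce some of the noise discussed in this thread is to warm up and repeat each measurement, reporting the best round rather than a single run. A minimal sketch (the `bestOfRounds` helper is hypothetical, not a standard API; for serious work the benchmark_harness package is the usual tool):

```dart
import 'dart:math' show min;

/// Hypothetical helper: runs [fn] a few warmup rounds, then [rounds]
/// timed rounds, and returns the best elapsed time in microseconds.
/// Taking the minimum discards rounds perturbed by GC or OS noise.
int bestOfRounds(void Function() fn, {int warmup = 3, int rounds = 10}) {
  for (var i = 0; i < warmup; i++) {
    fn();
  }
  var best = 1 << 62;
  final sw = Stopwatch();
  for (var i = 0; i < rounds; i++) {
    sw
      ..reset()
      ..start();
    fn();
    sw.stop();
    best = min(best, sw.elapsedMicroseconds);
  }
  return best;
}

void main() {
  var total = 0;
  final micros = bestOfRounds(() {
    for (var i = 0; i < 1000000; i++) {
      total += i;
    }
  });
  print('best round: ${micros}us (total=$total)');
}
```

Reporting the minimum assumes the workload is deterministic; for noisier workloads the median is a safer summary.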

@lrhn lrhn added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. type-performance Issue relates to performance or code size labels Sep 25, 2023
@mkustermann
Member

/cc @alexmarkov

@alexmarkov alexmarkov added the closed-as-intended Closed as the reported issue is expected behavior label Oct 12, 2023
@alexmarkov
Contributor

I'm wondering, could "integers" perform as well as a "class holding an integer"? Or is there a reason why that might not be possible, desired, or worth the effort?

The Dart VM (both JIT and AOT) uses two different representations for integers: tagged (when an int value can potentially be mixed with other objects, e.g. put in a container such as a List) and unboxed. The tagged representation is compatible with other instances. It also avoids instance (box) creation when the value is small (a Smi), at the cost of an extra bit test when loading the value. That is not measured in your micro-benchmark, but it is equally important. What you see in this micro-benchmark is the small overhead of loading a value from the tagged representation compared to loading a field with an unboxed representation. It works as intended.

If you would like to achieve maximum performance, avoid using integers as objects and use specialized lists from dart:typed_data instead of a general-purpose List.
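As a sketch of that advice, summing over an `Int64List` from dart:typed_data keeps the integers in a flat unboxed buffer, whereas a plain `List<int>` stores tagged references (actual speedups depend on the VM and the workload):

```dart
import 'dart:typed_data';

int sumList(List<int> xs) {
  var total = 0;
  for (final x in xs) {
    total += x;
  }
  return total;
}

void main() {
  const size = 1000;
  // General-purpose list: elements are tagged integer references.
  final boxed = List<int>.generate(size, (i) => i);
  // Typed list: elements are stored unboxed as 64-bit ints.
  final unboxed = Int64List(size);
  for (var i = 0; i < size; i++) {
    unboxed[i] = i;
  }
  // Both produce the same sum; the typed list avoids tagged loads
  // when the compiler specializes the loop.
  print(sumList(boxed) == sumList(unboxed)); // prints: true
}
```

Since `Int64List` implements `List<int>`, it can usually be dropped into existing code that takes a `List<int>` without other changes.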
