Optimize repeated field decoding #911

ignatz · 2024-02-01T13:29:37Z

Avoid allocating tear-off closures when decoding repeated but non-packed fields.

Also refactors BuilderInfo._decodeEnum.

Benchmarks

repeated_int64, repeated_string, and repeated_enum are the new benchmarks added with this PR.

fromBuffer is an internal benchmark that decodes a large buffer.

fromBufferUnmodifiableInput is the same as fromBuffer, but the input Uint8List is made unmodifiable.

AOT

	Before	After	Diff
repeated_int64	84,549 us	61,820 us	-22,729 us, -25.8%
repeated_string	67,183 us	67,984 us	+801 us, +1.19%
repeated_enum	52,335 us	44,772 us	-7,563 us, -14.4%
fromBuffer	25,661 us	25,299 us	-362 us, -1.4%
fromBufferUnmodifiableInput	26,135 us	25,689 us	-446 us, -1.7%

JS

	Before	After	Diff
repeated_int64	112,050 us	109,850 us	-2,200 us, -1.9%
repeated_string	123,050 us	126,650 us	+3,600 us, +2.9%
repeated_enum	43,920 us	39,400 us	-4,520 us, -10.2%
fromBuffer	211,888 us	168,250 us	-43,638 us, -20.5%
fromBufferUnmodifiableInput	227,444 us	185,454 us	-41,990 us, -18.4%

cl/609281182

osa1 · 2024-02-20T10:16:41Z

Thanks, this is great. I've also confirmed in some internal benchmarks (not micro benchmarks) that this makes quite a big difference. I will try to review and merge this this week.

osa1

Thanks again, this is great.

While reviewing this I noticed that the wins here are not coming from the cached list, but from avoiding the tear-off closure creating when reading repeated fields, for every element.

Indeed, fs._ensureRepeatedField(...) is just a list access in the common case:

  PbList<T> _ensureRepeatedField<T>(BuilderInfo meta, FieldInfo<T> fi) {
    assert(!_isReadOnly);
    assert(fi.isRepeated);
    if (fi.index == null) {                             // <----- branch not taken in the common case
      return _ensureExtensions()._ensureRepeatedField(fi as Extension<T>);
    }
    final value = _getFieldOrNull(fi);                  // <----- returns non-null after the first element of a list
    if (value != null) return value;
    ...
  }

  dynamic _getFieldOrNull(FieldInfo fi) {
    if (fi.index != null) return _values[fi.index!];    // <----- common case
    return _extensions?._getFieldOrNull(fi as Extension<dynamic>);
  }

So it makes sense that it doesn't make too much difference. (actually it makes my benchmark slower when compiled to JS, but that may be noise, see below for numbers)

The wins here are coming from avoiding the closures in e.g.

      case PbFieldType._REPEATED_BOOL:
        _readPackable(meta, fs, input, wireType, fi, input.readBool);

Here for an unpacked list, we create one tear-off closure (input.readBool) for every element. That's expensive compared to the cost of reading one element. The version in this PR avoids it when the field is not packed:

      case PbFieldType._REPEATED_BOOL:
        final list = cachedList ?? fs._ensureRepeatedField(meta, fi);
        if (wireType == WIRETYPE_LENGTH_DELIMITED) {
          _readPacked(input, () => list.add(input.readBool()));
        } else {
          list.add(input.readBool());           // <----- no closure allocation when not packed
        }

(Allocating a closure when reading packed is fine, as closure allocation is probably a small part of the whole operation, though we could even avoid that if we wanted, similar to enum code in this PR, which doesn't allocate any closures)

We should update the commit message to reflect this. I can do this if you prefer.

Benchmark results

From an internal benchmark.

Master branch

AOT

fromBuffer(RunTime): 26,048 us.
fromBufferUnmodifiableInput(RunTime): 26,596 us.

JS

fromBuffer(RunTime): 214,111 us.
fromBufferUnmodifiableInput(RunTime): 230,777 us.

With cached list

AOT

fromBuffer(RunTime): 24,877 us.
fromBufferUnmodifiableInput(RunTime): 24,864us.

JS

fromBuffer(RunTime): 193,333 us.
fromBufferUnmodifiableInput(RunTime): 192,500 us.

Same patch, but without the cached list

AOT

fromBuffer(RunTime): 25,388 us.
fromBufferUnmodifiableInput(RunTime): 25,073 us.

JS

fromBuffer(RunTime): 171,272 us.
fromBufferUnmodifiableInput(RunTime): 189,000 us.

Notes

Cached list is not that important (actually makes dart2js output slower), the important part is avoiding creating a tear-off closure for every element when reading non-packed repeated fields.

Similar differences can be seen with the benchmark in this PR.

benchmarks/bin/repeated_field.dart

protobuf/lib/src/protobuf/coded_buffer.dart

ignatz · 2024-02-21T13:14:59Z

We should update the commit message to reflect this. I can do this if you prefer.

Your findings match my findings. I never intended to claim that the caching has a big impact. My commit message specifically calls out "Boxing overhead", which is my, maybe inadequate lingo, to say I removed a closure.

I'm not sure you wanted to say that it's the allocation itself that is solely responsible for the overhead because I'm not sure that's true.

I left the cache from my experiments because it seemed fairly simple and not entirely pointless. It's not clear to me how it would slow things down unless this is some memory layout weirdness. We can remove it

Lastly, you should always feel free to edit my commit message

…ing overhead. Also memoize storage lookup from previously deserialized field, though that seems to have only a minor impact since the lookup is fast already.

It doesn't improve much in AOT, and slows down JS.

osa1 · 2024-02-22T09:13:21Z

I'm testing this internally in cl/609281182.

ignatz · 2024-02-23T09:02:24Z

Awesome. Appreciate your changes. Thanks!

Revisions updated by `dart tools/rev_sdk_deps.dart`. http (https://github.com/dart-lang/http/compare/ce0de37..6e0a46f): 6e0a46f 2024-02-24 Alex Li [cronet_http] 🏗️ Use dart-define to determine dependency (dart-lang/http#1111) d5cab74 2024-02-23 Alex Li 👷 Update configuration related files (dart-lang/http#1091) 6873731 2024-02-22 Brian Quinlan Make it possible for `CronetClient` to own the lifetime of an existing `CronetEngine` (dart-lang/http#1137) protobuf (https://github.com/dart-lang/protobuf/compare/f085bfd..ef0ab7d): ef0ab7d 2024-02-23 Sebastian Optimize repeated field decoding (google/protobuf.dart#911) tools (https://github.com/dart-lang/tools/compare/9f4e6a4..c7fbf26): c7fbf26 2024-02-21 Elias Yishak Fallback to current datetime when failing on parse session json file (dart-lang/tools#235) web (https://github.com/dart-lang/web/compare/975e55c..d96c01d): d96c01d 2024-02-23 Srujan Gaddam Use extension types and generics internally (dart-lang/web#184) yaml_edit (https://github.com/dart-lang/yaml_edit/compare/82ab64d..54884db): 54884db 2024-02-23 Devon Carew update to the latest sdk; update lints (dart-lang/yaml_edit#68) Change-Id: Ib2cb7971df05f4501f81fe5c04b7eaf88d0d93a3 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/354381 Commit-Queue: Devon Carew <[email protected]> Reviewed-by: Konstantin Shcheglov <[email protected]>

ignatz force-pushed the faster_repeated_fields branch 2 times, most recently from d557d5d to b41cc30 Compare February 1, 2024 13:49

osa1 changed the title ~~~30 Faster deserialization for packable but unpacked repeated fields by reducing boxing overhead.~~ ~30% Faster deserialization for packable but unpacked repeated fields by reducing boxing overhead. Feb 20, 2024

osa1 added the publish-ignore-warnings label Feb 20, 2024

osa1 reviewed Feb 21, 2024

View reviewed changes

benchmarks/bin/repeated_field.dart Outdated Show resolved Hide resolved

protobuf/lib/src/protobuf/coded_buffer.dart Outdated Show resolved Hide resolved

protobuf/lib/src/protobuf/coded_buffer.dart Outdated Show resolved Hide resolved

ignatz added 3 commits February 21, 2024 14:15

~30% deserialization speedup for long repeated fields by reducing box…

31452f2

…ing overhead. Also memoize storage lookup from previously deserialized field, though that seems to have only a minor impact since the lookup is fast already.

Add a dedicated benchmark for protos with long repeated fields.

d5001a9

Drive-by nullability fix in BuilderInfo._decoceEnum.

68bb70d

ignatz force-pushed the faster_repeated_fields branch from e6ce1b6 to 68bb70d Compare February 21, 2024 13:16

osa1 added 3 commits February 22, 2024 09:50

Fix sint64 case

1a95c01

Remove duplication in enum code

7ac5864

Remove cached list

c5ecd3d

It doesn't improve much in AOT, and slows down JS.

osa1 changed the title ~~~30% Faster deserialization for packable but unpacked repeated fields by reducing boxing overhead.~~ Optimize repeated field decoding Feb 23, 2024

osa1 approved these changes Feb 23, 2024

View reviewed changes

osa1 merged commit ef0ab7d into google:master Feb 23, 2024
17 of 18 checks passed

This was referenced Feb 25, 2024

Merge mono_pkg steps to avoid blocking tests with analyze issues #914

Open

[dart2wasm] Implement inlining functions with function expressions in the body dart-lang/sdk#55027

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize repeated field decoding #911

Optimize repeated field decoding #911

ignatz commented Feb 1, 2024 •

edited by osa1

Loading

osa1 commented Feb 20, 2024

osa1 left a comment

ignatz commented Feb 21, 2024

osa1 commented Feb 22, 2024

ignatz commented Feb 23, 2024

Optimize repeated field decoding #911

Optimize repeated field decoding #911

Conversation

ignatz commented Feb 1, 2024 • edited by osa1 Loading

Benchmarks

AOT

JS

osa1 commented Feb 20, 2024

osa1 left a comment

Choose a reason for hiding this comment

Benchmark results

Master branch

AOT

JS

With cached list

AOT

JS

Same patch, but without the cached list

AOT

JS

Notes

ignatz commented Feb 21, 2024

osa1 commented Feb 22, 2024

ignatz commented Feb 23, 2024

ignatz commented Feb 1, 2024 •

edited by osa1

Loading