perf: indexing: Introduce a bulk getValuesInto function to read values #12105

jasonk000 · 2021-12-30T00:33:26Z

Description

If a large number of values are required from DimensionDictionary during indexing, fetch them all in a single lock/unlock call instead of locking and unlocking for each individual item.

During indexing, there are repeated lock/unlock boundary crossings. In a sample application (57 fields to index), these consume ~9% of the taskrunner CPU.

Depending on the specifics of the indexing row configuration, the indexer's use of DimensionDictionary can consume anywhere from 1-20% of the CPU time during processing. This PR addresses one aspect of the processing, specifically getValues. (I'll introduce another PR for the add/size change.)

Introduce a `getValuesInto()` call to the `DimensionDictionary`, and use it

This introduces a getValuesInto call, which accepts an array of IDs to fetch, and performs the equivalent of many getValue() calls.
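To make the difference concrete, here is a minimal sketch of the idea, not the actual Druid code. `BulkDictionarySketch` is a hypothetical stand-in for `DimensionDictionary`, reduced to an id-to-value list guarded by a `ReentrantReadWriteLock`; null handling and value de-duplication are omitted, and the explicit length check is an assumption added for safety.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for DimensionDictionary (read path only).
class BulkDictionarySketch<T>
{
  private final List<T> idToValue = new ArrayList<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Appends a value and returns its id (simplified: no de-duplication).
  int add(T value)
  {
    lock.writeLock().lock();
    try {
      idToValue.add(value);
      return idToValue.size() - 1;
    }
    finally {
      lock.writeLock().unlock();
    }
  }

  // Single lookup: one lock/unlock pair per id.
  T getValue(int id)
  {
    lock.readLock().lock();
    try {
      return idToValue.get(id);
    }
    finally {
      lock.readLock().unlock();
    }
  }

  // Bulk lookup: one lock/unlock pair for the whole batch.
  // Fills the caller-supplied array, so no generic array creation is needed.
  void getValuesInto(int[] ids, T[] out)
  {
    if (ids.length != out.length) {
      throw new IllegalArgumentException("ids and out must have the same length");
    }
    lock.readLock().lock();
    try {
      for (int i = 0; i < ids.length; i++) {
        out[i] = idToValue.get(ids[i]);
      }
    }
    finally {
      lock.readLock().unlock();
    }
  }
}
```

With this shape, a caller holding an `int[]` of N ids pays for one read-lock acquisition instead of N, which is where the saving in the description is expected to come from.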

Expand benchmarks to cover wider rows and some concurrency

Design option

There is one other design option, that is in fact much faster again, but is a bit less intuitive.

Here are the benchmark results across all three:

single getValue()
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.046 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.215 ±  0.007  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  3.484 ±  0.032  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  8.638 ±  0.514  us/op

bulk getValuesInto() <----- THIS SOLUTION
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.039 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.120 ±  0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.383 ±  0.052  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.386 ±  0.018  us/op

using doInsideReadLock() <----- ALTERNATIVE, FASTER BUT LESS CLEAN
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.018 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.077 ±  0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.241 ±  0.004  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.486 ±  0.002  us/op

Alternatives (faster!)

The alternative, doInsideReadLock(), is to pass a lambda/closure across the DimensionDictionary boundary and have the dictionary perform the locking and execute the closure on the other side. This is likely to be faster in most cases; however, it is a bit less clear in the API (I'll leave an extra comment).

The solution involves similar changes, and passing a closure across the boundary. However, it does leak implementation concerns...

/************ DimensionDictionary.java **************/
public void doInsideReadLock(BiConsumer<List<T>, Integer> fn)
{
  lock.readLock().lock();
  try {
    fn.accept(idToValue, idForNull);
  }
  finally {
    lock.readLock().unlock();
  }
}


/************ StringDimensionIndexer.java **************/
@Override
public long estimateEncodedKeyComponentSize(int[] key)
{
  int[] estimatedSize = new int[]{key.length * Integer.BYTES};
  dimLookup.doInsideReadLock((List<String> idToValue, Integer idForNull) -> {
    for (int id : key) {
      if (id != idForNull) {
        String val = idToValue.get(id);
        int sizeOfString = 28 + 16 + (2 * val.length());
        estimatedSize[0] += sizeOfString;
      }
    }
  });
  return estimatedSize[0];
}

Otherwise, I think the changes are straightforward!

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster. (as part of other changes)

FrankChen021 · 2021-12-30T01:15:44Z

processing/src/main/java/org/apache/druid/segment/DimensionDictionary.java

+  {
+    lock.readLock().lock();
+    try {
+      for (int i = 0; i < ids.length; i++) {


Since you're copying values for the whole ids array into the input array without checking that the input array's length matches the length of ids, I suggest that this new method accept only ids as a parameter and return T[].

Good point on checking array lengths match.

I did explore returning value only (that's my internal impl actually), but it requires wider changes. Specifically, since Java erases generics, it's impossible to do new T[]. So, we have to either do something like:

StringDimensionIndexer() constructor { super(String.class); } // and then propagate this Class<T> to DimensionDictionary, and in DimensionDictionary, do a: T[] out = (T[]) Array.newInstance(typeFromConstructor, keys.length);

To me, both are viable and work fine, this approach seems less invasive.

So, we could either: (1) check array length explicitly or (2) create new array inside getValues() call.

What do you suggest?
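For reference, the `Class<T>`-propagation option described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `TypedDictionarySketch` is a hypothetical class, and locking is left out to keep the focus on how `java.lang.reflect.Array.newInstance` works around generic erasure.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.List;

// Hypothetical dictionary that remembers its element type so it can build a T[].
class TypedDictionarySketch<T>
{
  private final Class<T> cls;
  private final List<T> idToValue = new ArrayList<>();

  // The subclass or caller passes the concrete type, e.g. String.class.
  TypedDictionarySketch(Class<T> cls)
  {
    this.cls = cls;
  }

  // Appends a value and returns its id (simplified: no de-duplication, no locking).
  int add(T value)
  {
    idToValue.add(value);
    return idToValue.size() - 1;
  }

  // "new T[n]" does not compile because of erasure, so the array is created reflectively.
  @SuppressWarnings("unchecked")
  T[] getValues(int[] ids)
  {
    T[] out = (T[]) Array.newInstance(cls, ids.length);
    for (int i = 0; i < ids.length; i++) {
      out[i] = idToValue.get(ids[i]);
    }
    return out;
  }
}
```

Because the array is created from the runtime class, the result of `getValues` on a `TypedDictionarySketch<String>` really is a `String[]`, so it can be assigned without a `ClassCastException`. The cost is the extra constructor argument that has to be threaded through every subclass.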

Yes, handling generics in Java is a little complicated. How about using List<T> as the return type? This won't introduce any magic code.

fixed in f4e4f9e and f91f709

#12073 is adding a StringDimensionDictionary, so this method might be able to be a bit cleaner in the future, https://github.com/apache/druid/pull/12073/files#diff-f8b1f0a09275a697d95c1d716e7c9f7650cdd2d7021497018df75e4635e4377cR28

FrankChen021 · 2021-12-30T01:16:47Z

processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java

@@ -132,8 +132,10 @@ public long estimateEncodedKeyComponentSize(int[] key)
    // even though they are stored just once. It may overestimate the size by a bit, but we wanted to leave
    // more buffer to be safe
    long estimatedSize = key.length * Integer.BYTES;
-    for (int element : key) {


nit: it's better to rename the original variable key to keys to make it more intuitive.

fixed in d166364

jasonk000 · 2021-12-30T19:32:56Z

Hey @FrankChen021, I've addressed the feedback from your previous review.

jasonk000 · 2021-12-30T19:33:10Z

while we are here, could you share an opinion on this implementation, it's faster again but with a different API:
jasonk000@f4354bb -> https://github.com/jasonk000/druid/pull/8/files

FrankChen021 · 2021-12-31T02:19:45Z

> while we are here, could you share an opinion on this implementation, it's faster again but with a different API: jasonk000@f4354bb -> https://github.com/jasonk000/druid/pull/8/files

As a public API that accepts a callback, I'm worried that if a user provides a time-consuming callback, the read lock will be held for a long time and the write lock will be blocked. So I think the current solution in this PR might be better.
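The concern can be illustrated with a small, self-contained sketch. It is not taken from the PR; `ReadLockBlocksWriterSketch` is a hypothetical class that only demonstrates the standard `ReentrantReadWriteLock` behaviour: while any thread holds the read lock (as a long-running callback would), another thread cannot acquire the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo: a writer cannot get the write lock while a reader holds the read lock.
class ReadLockBlocksWriterSketch
{
  // Returns true if the writer thread managed to acquire the write lock
  // while the calling thread was still holding the read lock.
  static boolean writerAcquiredWhileReading()
  {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    boolean[] acquired = new boolean[1];

    lock.readLock().lock(); // stands in for "inside the user-supplied callback"
    try {
      Thread writer = new Thread(() -> {
        // tryLock() returns immediately instead of blocking, so the demo cannot hang.
        acquired[0] = lock.writeLock().tryLock();
        if (acquired[0]) {
          lock.writeLock().unlock();
        }
      });
      writer.start();
      try {
        writer.join(); // join() makes the writer's result visible to this thread
      }
      catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    finally {
      lock.readLock().unlock();
    }
    return acquired[0];
  }
}
```

In a real indexer the writer would call `lock()` rather than `tryLock()`, so it would stall for as long as the callback runs. A bulk method that copies values out and returns keeps the time spent under the read lock bounded by the size of the batch.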

FrankChen021

LGTM. +1 after CI.

jasonk000 · 2022-01-06T17:56:49Z

> LGTM. +1 after CI.

@FrankChen021 It seems the CI failure on this PR is in an unrelated area of the stage 2 tests. Any suggestions? Should I close and reopen the PR to trigger CI?

clintropolis

👍

clintropolis · 2022-01-25T12:05:21Z

processing/src/main/java/org/apache/druid/segment/DimensionDictionary.java

+  {
+    lock.readLock().lock();
+    try {
+      for (int i = 0; i < ids.length; i++) {


#12073 is adding a StringDimensionDictionary, so this method might be able to be a bit cleaner in the future, https://github.com/apache/druid/pull/12073/files#diff-f8b1f0a09275a697d95c1d716e7c9f7650cdd2d7021497018df75e4635e4377cR28

suneet-s · 2022-02-15T18:11:54Z

@jasonk000 Thanks for another optimization! Do you think you could clean up the conflicts in this PR so that it can be merged? Sorry it fell off the radar

jasonk000 · 2022-02-25T02:49:32Z

@suneet-s rebased!

> @jasonk000 Thanks for another optimization! Do you think you could clean up the conflicts in this PR so that it can be merged? Sorry it fell off the radar

Rebased! Thanks.

suneet-s · 2022-02-25T20:19:15Z

Thanks @jasonk000 !

FrankChen021 reviewed Dec 30, 2021

FrankChen021 added the Performance label Dec 30, 2021

This was referenced Dec 31, 2021

perf: indexing: improve DimensionDictionary index locking on add path #12109

Closed

severe performance issue due to lock in StringDimensionIndexer.DimensionDictionary #6322

Open

FrankChen021 approved these changes Jan 6, 2022

clintropolis added the Area - Ingestion label Jan 25, 2022

clintropolis approved these changes Jan 25, 2022

jasonk000 added 4 commits February 22, 2022 15:31

perf: indexing: Introduce a bulk getValuesInto function to read value…

977fe5d

…s in bulk If large number of values are required from DimensionDictionary during indexing, fetch them all in a single lock/unlock instead of lock/unlock each individual item.

refactor: rename key to keys in function args

e1649ca

fix: check explicitly that argument length on arrays match

14dd093

refactor: getValuesInto renamed to getValues, now creates and returns…

2a1d051

… a new T[] rather than filling

jasonk000 force-pushed the dimension-dict-bulk-get branch from f91f709 to 2a1d051 Compare February 22, 2022 23:48

suneet-s merged commit eb1b53b into apache:master Feb 25, 2022

jasonk000 deleted the dimension-dict-bulk-get branch February 28, 2022 19:26

abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022

jasonk000 mentioned this pull request Jan 25, 2023

Dimension dictionary reduce locking #13710

Merged

3 tasks
