Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: indexing: Introduce a bulk getValuesInto function to read values #12105

Merged
merged 4 commits into from
Feb 25, 2022

Conversation

jasonk000
Copy link
Contributor

Description

If large number of values are required from DimensionDictionary during indexing, fetch them all in a single lock/unlock call instead of lock/unlock each individual item.

During indexing there are repeated lock/unlock boundary crossing. In a sample application (57 fields to index), this consumes ~9% of the taskrunner CPU.

Depending on indexing row configuration specifics, the indexer usage of DimensionDictionary can consume anywhere from 1-20% of the CPU time during processing. This PR addresses one aspect of the processing, specifically the getValues. (I'll introduce another PR on the add/size change).

Introduce a getValuesInto() call to the DimensionDictionary, and use it

This introduces a getValuesInto call, which accepts an array of IDs to fetch, and performs the equivalent of many getValue() calls.

Expand benchmarks to cover wider rows and some concurrency

Design option

There is one other design option, that is in fact much faster again, but is a bit less intuitive.

Here are the benchmark results across all three:

single getValue()
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.046 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.215 ±  0.007  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  3.484 ±  0.032  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  8.638 ±  0.514  us/op

bulk getValuesInto() <----- THIS SOLUTION
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.039 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.120 ±  0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.383 ±  0.052  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.386 ±  0.018  us/op

using doInsideReadLock() <----- ALTERNATIVE, FASTER BUT LESS CLEAN
Benchmark                                                                  (cardinality)  (rowSize)  Mode  Cnt  Score    Error  Units
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000          8  avgt   10  0.018 ±  0.001  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSize                    10000         40  avgt   10  0.077 ±  0.002  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000          8  avgt   10  0.241 ±  0.004  us/op
StringDimensionIndexerBenchmark.estimateEncodedKeyComponentSizeTwoThreads          10000         40  avgt   10  0.486 ±  0.002  us/op

Alternatives (faster!)

The alternative, doInsideReadLoop() is to pass a lambda / closure into the DimensionDictionary boundary and have it perform locking and execute on the other side. This is likely to be faster for most cases, however, it's a bit less clear in the API (I'll leave an extra comment).

The solution involves similar changes, and passing a closure across the boundary. However, it does leak implementation concerns...

/************ DimensionDictionary.java **************/
public void doInsideReadLock(BiConsumer<List<T>, Integer> fn)
{
  lock.readLock().lock();
  try {
    fn.accept(idToValue, idForNull);
  }
  finally {
    lock.readLock().unlock();
  }
}


/************ StringDimensionIndexer.java **************/
@Override
public long estimateEncodedKeyComponentSize(int[] key)
{
  int[] estimatedSize = new int[]{key.length * Integer.BYTES};
  dimLookup.doInsideReadLock((List<String> idToValue, Integer idForNull) -> {
    for (int id : key) {
      if (id != idForNull) {
          String val = idToValue.get(id);
          int sizeOfString = 28 + 16 + (2 * val.length());
          estimatedSize[0] += sizeOfString;
      }
    }
  });
  return estimatedSize[0];
}

Otherwise, I think the changes are straightforward!


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster. (as part of other changes)

{
lock.readLock().lock();
try {
for (int i = 0; i < ids.length; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since here you're copying the whole ids array to the input array without checking if the input values length matches the length of ids, I'm suggesting that this new method only accepts ids as parameter, and returns T[]

Copy link
Contributor Author

@jasonk000 jasonk000 Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point on checking array lengths match.

I did explore returning value only (that's my internal impl actually), but it requires wider changes. Specifically, since Java erases generics, it's impossible to do new T[]. So, we have to either do something like:

StringDimensionIndexer() constructor {
super(String.class);
}

// and then propagate this Class<T> to DimensionDictionary, and in DimensionDictionary, do a:

T[] out = (T[]) Array.newInstance(typeFromConstructor, keys.length);

To me, both are viable and work fine, this approach seems less invasive.

So, we could either: (1) check array length explicitly or (2) create new array inside getValues() call.

What do you suggest?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, handling generic in java is a little complicated. How about using List<T> as the returning type? This won't introduce any magic code.

Copy link
Contributor Author

@jasonk000 jasonk000 Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in f4e4f9e and f91f709

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#12073 is adding a StringDimensionDictionary, so this method might be able to be a bit cleaner in the future, https://github.com/apache/druid/pull/12073/files#diff-f8b1f0a09275a697d95c1d716e7c9f7650cdd2d7021497018df75e4635e4377cR28

@@ -132,8 +132,10 @@ public long estimateEncodedKeyComponentSize(int[] key)
// even though they are stored just once. It may overestimate the size by a bit, but we wanted to leave
// more buffer to be safe
long estimatedSize = key.length * Integer.BYTES;
for (int element : key) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's better to rename the original variable name key to keys to make it intuitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in d166364

@jasonk000
Copy link
Contributor Author

hey @FrankChen021 , feedback covered from your previous review

@jasonk000
Copy link
Contributor Author

jasonk000 commented Dec 30, 2021

while we are here, could you share an opinion on this implementation, it's faster again but with a different API:
jasonk000@f4354bb -> https://github.com/jasonk000/druid/pull/8/files

@FrankChen021
Copy link
Member

while we are here, could you share an opinion on this implementation, it's faster again but with a different API: jasonk000@f4354bb -> https://github.com/jasonk000/druid/pull/8/files

As a public API that accepts a callback, I'm worried about if user provides a time-consuming callback, which means the read lock will be held for a long time, and the write lock will be blocked. So I think current solution in this PR might be better.

Copy link
Member

@FrankChen021 FrankChen021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 after CI.

@jasonk000
Copy link
Contributor Author

LGTM. +1 after CI.

@FrankChen021 It seems the CI failure on this PR is unrelated area in stage 2 tests. Any suggestions? I can close/open to trigger CI?

Copy link
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{
lock.readLock().lock();
try {
for (int i = 0; i < ids.length; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#12073 is adding a StringDimensionDictionary, so this method might be able to be a bit cleaner in the future, https://github.com/apache/druid/pull/12073/files#diff-f8b1f0a09275a697d95c1d716e7c9f7650cdd2d7021497018df75e4635e4377cR28

@suneet-s
Copy link
Contributor

@jasonk000 Thanks for another optimization! Do you think you could clean up the conflicts in this PR so that it can be merged? Sorry it fell off the radar

…s in bulk

If large number of values are required from DimensionDictionary
during indexing, fetch them all in a single lock/unlock instead of
lock/unlock each individual item.
@jasonk000 jasonk000 force-pushed the dimension-dict-bulk-get branch from f91f709 to 2a1d051 Compare February 22, 2022 23:48
@jasonk000
Copy link
Contributor Author

@suneet-s rebased!

@jasonk000 Thanks for another optimization! Do you think you could clean up the conflicts in this PR so that it can be merged? Sorry it fell off the radar

Rebased! Thanks.

@suneet-s suneet-s merged commit eb1b53b into apache:master Feb 25, 2022
@suneet-s
Copy link
Contributor

Thanks @jasonk000 !

@jasonk000 jasonk000 deleted the dimension-dict-bulk-get branch February 28, 2022 19:26
@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants