Add ImmutableLookupMap for static lookups. #15675

gianm · 2024-01-12T08:35:59Z

This patch adds a new ImmutableLookupMap, which comes with an ImmutableLookupExtractor. It uses a fastutil open hashmap plus two lists to store its data in such a way that forward and reverse lookups can both be done quickly. I also observed its footprint to be somewhat smaller than that of Java HashMap + MapLookupExtractor for a 1 million row lookup.

The main advantage, though, is that reverse lookups can be done much more quickly than with MapLookupExtractor (which iterates the entire map for each call to unapplyAll). This speeds up the recently added ReverseLookupRule (#15626) during SQL planning with very large lookups.
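For illustration, here is a minimal sketch of how a hashmap plus two lists can serve both directions. It is not the code in this patch. The class name `ReversibleLookupSketch` is hypothetical, `java.util.HashMap` stands in for the fastutil map so the example has no dependencies, and sorting entries by value with a binary search for reverse lookups is an assumption about how the layout could avoid a full scan. Values are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual ImmutableLookupMap from this patch.
final class ReversibleLookupSketch
{
  // Entry i is the pair (keys.get(i), values.get(i)); entries are ordered by value.
  private final List<String> keys;
  private final List<String> values;
  // Stand-in for the fastutil open hashmap: key -> entry position.
  private final Map<String, Integer> keyToEntry = new HashMap<>();

  ReversibleLookupSketch(Map<String, String> source)
  {
    final List<Map.Entry<String, String>> entries = new ArrayList<>(source.entrySet());
    entries.sort(Map.Entry.comparingByValue()); // assumes non-null values
    keys = new ArrayList<>(entries.size());
    values = new ArrayList<>(entries.size());
    for (Map.Entry<String, String> entry : entries) {
      keys.add(entry.getKey());
      values.add(entry.getValue());
    }
    for (int i = 0; i < keys.size(); i++) {
      keyToEntry.put(keys.get(i), i);
    }
  }

  // Forward lookup: one hash probe, then one list access.
  String apply(String key)
  {
    final Integer position = keyToEntry.get(key);
    return position == null ? null : values.get(position);
  }

  // Reverse lookup: binary search over the sorted values, then widen to the run of equal values.
  List<String> unapply(String value)
  {
    final int hit = Collections.binarySearch(values, value);
    if (hit < 0) {
      return Collections.emptyList();
    }
    int start = hit;
    while (start > 0 && values.get(start - 1).equals(value)) {
      start--;
    }
    int end = hit + 1;
    while (end < values.size() && values.get(end).equals(value)) {
      end++;
    }
    return new ArrayList<>(keys.subList(start, end));
  }
}
```

With this layout, the cost of a reverse lookup depends on the logarithm of the map size plus the number of matching keys, rather than on the size of the whole map.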

More details on benchmarks and size analysis below.

In the benchmark report, keysPerValue is the number of keys that map to each value. lookupType is either hashmap (MapLookupExtractor backed by Java HashMap, as we use today) or immutable (the new one). The test cases are lookupApply (forward lookup), lookupUnapplyOne (reverse lookup of a single value), and lookupUnapplyOneThousand (reverse lookup of one thousand values).

I also took heap dumps from the benchmark datasets, to measure how much space the two implementations take.

Findings:

- apply is about the same speed in both implementations.
- All the reverse lookup calls are substantially faster with immutable. Performance is closest for the keysPerValue: 1000, lookupUnapplyOneThousand case, because there the reverse lookup returns the entire key space: 1,000 values * 1,000 keys per value = 1,000,000 keys.
- MapLookupExtractor backed by HashMap had a table with 2097152 entries. It measured 168,388,688 bytes total retained.
- ImmutableLookupMap had a keyToEntry table with 2097153 entries, and keys and values lists with 1000000 entries each. It measured 154,501,488 bytes total retained, about 8% less. It's smaller because it ends up being three string arrays and one int array, which compares favorably to the array of hashtable bucket objects that HashMap uses.

Given these findings, I think it's a good idea to use the immutable version wherever we can. So, the patch also updates the main static lookups (URI and JDBC) to use it always.

In a future patch, to get faster reverse lookups for Kafka lookups, it's probably a good idea to implement a mutable lookup that supports fast reverse lookups, and run it only on the Broker. (Assuming it ends up with a larger footprint than the ConcurrentHashMap we use today, we wouldn't want to run it on every data server. Only the Broker needs fast reverse lookups.)

Benchmark                                          (keysPerValue)  (lookupType)  (numKeys)   Mode  Cnt          Score         Error  Units
LookupExtractorBenchmark.lookupApply                            1       hashmap    1000000  thrpt    5  299273576.631 ± 4669084.716  ops/s
LookupExtractorBenchmark.lookupApply                            1     immutable    1000000  thrpt    5  306273517.396 ± 4563560.722  ops/s
LookupExtractorBenchmark.lookupApply                         1000       hashmap    1000000  thrpt    5  296967596.838 ± 8820926.069  ops/s
LookupExtractorBenchmark.lookupApply                         1000     immutable    1000000  thrpt    5  306457023.086 ± 2712603.499  ops/s
LookupExtractorBenchmark.lookupUnapplyOne                       1       hashmap    1000000   avgt    5         33.949 ±       0.916  ms/op
LookupExtractorBenchmark.lookupUnapplyOne                       1     immutable    1000000   avgt    5         ≈ 10⁻⁴                ms/op
LookupExtractorBenchmark.lookupUnapplyOne                    1000       hashmap    1000000   avgt    5         35.255 ±       1.230  ms/op
LookupExtractorBenchmark.lookupUnapplyOne                    1000     immutable    1000000   avgt    5          0.011 ±       0.001  ms/op
LookupExtractorBenchmark.lookupUnapplyOneThousand               1       hashmap    1000000   avgt    5         32.402 ±       0.558  ms/op
LookupExtractorBenchmark.lookupUnapplyOneThousand               1     immutable    1000000   avgt    5          0.140 ±       0.006  ms/op
LookupExtractorBenchmark.lookupUnapplyOneThousand            1000       hashmap    1000000   avgt    5         33.245 ±       0.916  ms/op
LookupExtractorBenchmark.lookupUnapplyOneThousand            1000     immutable    1000000   avgt    5         27.472 ±       2.916  ms/op

processing/src/main/java/org/apache/druid/query/lookup/ImmutableLookupMap.java

benchmarks/src/test/java/org/apache/druid/benchmark/lookup/LookupExtractorBenchmark.java

clintropolis

🤘 🚀

clintropolis · 2024-01-12T09:18:48Z

...hed-global/src/main/java/org/apache/druid/query/lookup/NamespaceLookupIntrospectHandler.java

@@ -95,6 +95,13 @@ public Response getMap()

  private Map<String, String> getLatest()
  {
-    return ((MapLookupExtractor) factory.get()).getMap();
+    final LookupExtractor lookup = factory.get();


should we just bake this into the LookupExtractor interface with a default impl that throws the unsupported exception?

Yeah, that seems like a good idea. I removed iterable and keySet from the interface, and added asMap.
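To make the suggestion concrete, here is a minimal sketch of an `asMap` method with a throwing default. The types `LookupExtractorSketch` and `MapBackedSketch` are hypothetical and much smaller than the real LookupExtractor, and the thread does not show whether the merged method is a throwing default or something else, so treat the shape below as an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for LookupExtractor; the real type has more methods.
interface LookupExtractorSketch
{
  String apply(String key);

  // Default for implementations that are not backed by a map.
  default Map<String, String> asMap()
  {
    throw new UnsupportedOperationException("Cannot get map view");
  }
}

// A map-backed implementation overrides asMap to expose a read-only view of its data.
final class MapBackedSketch implements LookupExtractorSketch
{
  private final Map<String, String> map;

  MapBackedSketch(Map<String, String> source)
  {
    this.map = Collections.unmodifiableMap(new HashMap<>(source));
  }

  @Override
  public String apply(String key)
  {
    return map.get(key);
  }

  @Override
  public Map<String, String> asMap()
  {
    return map;
  }
}
```

A caller such as the introspection handler can then call `asMap()` directly instead of casting to a concrete extractor class.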

clintropolis · 2024-01-12T09:40:32Z

processing/src/main/java/org/apache/druid/query/lookup/ImmutableLookupMap.java

+
+      entries = null; // save memory
+
+      final List<String> keys = new ArrayList<>();


should these also be supplied a size argument?
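For context on the question: `ArrayList` accepts an initial capacity, which avoids repeated growth of its backing array when the final size is known up front. A minimal sketch with a hypothetical helper (`PresizedListsSketch.keysOf`); the thread does not show whether the patch adopted this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not code from this patch.
final class PresizedListsSketch
{
  // Copies the keys of the given entries into a list whose capacity is set up front.
  static List<String> keysOf(List<Map.Entry<String, String>> entries)
  {
    // The initial capacity equals the known final size, so the backing array never has to grow.
    final List<String> keys = new ArrayList<>(entries.size());
    for (Map.Entry<String, String> entry : entries) {
      keys.add(entry.getKey());
    }
    return keys;
  }
}
```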

clintropolis · 2024-01-12T09:43:30Z

processing/src/main/java/org/apache/druid/query/lookup/ImmutableLookupMap.java

+      for (int i = 0; i < keys.size(); i++) {
+        keyToEntry.put(keys.get(i), i);
+      }


any reason not to do this while looping to build keys and values? worried about memory during loading perhaps?

Yeah, mainly a memory thing, so we don't have entriesList and keyToEntry at the same time.

clintropolis · 2024-01-12T21:57:37Z

processing/src/main/java/org/apache/druid/query/lookup/LookupSegment.java

          }

-          return extractor.iterable().iterator();
+          return extractor.asMap().entrySet().iterator();


hmm, does this punish hypothetical things that are not stored in a map? (should we leave the iterable() method?)

well, i guess if we wanted to support other things maybe in the future we should add back an iterator later that spits out something other than map entry (result row or something perhaps?) to use for the direct queries

In theory, I suppose, but there aren't any examples in the code base where it would be useful. I thought about keeping the flexibility in case some extensions found it useful, but decided against it, because I'm not aware of any and I didn't want to keep complexity for unclear benefit.

gianm · 2024-01-13T21:13:58Z

Only failure is this one that's been having problems: standard-its / integration-query-tests-middleManager (security) / security integration test (Compile=jdk8, Run=jdk8, Indexer=middleManager, Mysql=com.mysql.jdbc.Driver)

gianm added the Performance and Area - Lookups labels Jan 12, 2024

Use in one more test.

f479e33

github-actions bot added the Area - Querying label Jan 12, 2024

gianm added 2 commits January 12, 2024 00:46

Fix benchmark.

327c25d

Object2ObjectOpenHashMap

2c30e7f

github-advanced-security bot found potential problems Jan 12, 2024

processing/src/main/java/org/apache/druid/query/lookup/ImmutableLookupMap.java (fixed)

benchmarks/src/test/java/org/apache/druid/benchmark/lookup/LookupExtractorBenchmark.java (fixed)

clintropolis reviewed Jan 12, 2024

gianm added 2 commits January 12, 2024 09:55

Fixes, and LookupExtractor interface update to have asMap.

16ced2a

Merge branch 'master' into rmlex

defa902

github-actions bot added the Area - Segment Format and Ser/De label Jan 12, 2024

gianm added 4 commits January 12, 2024 10:02

Remove commented-out code.

2264d76

Fix style.

12261c2

Fix import order.

1510660

Add fastutil.

fd117c2

github-actions bot added the Area - Dependencies label Jan 12, 2024

clintropolis reviewed Jan 12, 2024

Avoid storing Map entries.

f89acee

clintropolis approved these changes Jan 12, 2024

gianm merged commit 500681d into apache:master Jan 13, 2024
82 of 83 checks passed

gianm deleted the rmlex branch January 13, 2024 21:14

LakshSingla added this to the 29.0.0 milestone Jan 29, 2024

gianm mentioned this pull request Feb 6, 2024

Add sqlReverseLookupThreshold for ReverseLookupRule. #15832

Merged
