-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ImmutableLookupMap for static lookups. #15675
Conversation
This patch adds a new ImmutableLookupMap, which comes with an ImmutableLookupExtractor. It uses a fastutil open hashmap plus two lists to store its data in such a way that forward and reverse lookups can both be done quickly. I also observed footprint to be somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million row lookup. The main advantage, though, is that reverse lookups can be done much more quickly than MapLookupExtractor (which iterates the entire map for each call to unapplyAll). This speeds up the recently added ReverseLookupRule (apache#15626) during SQL planning with very large lookups.
processing/src/main/java/org/apache/druid/query/lookup/ImmutableLookupMap.java
Fixed
Show fixed
Hide fixed
benchmarks/src/test/java/org/apache/druid/benchmark/lookup/LookupExtractorBenchmark.java
Fixed
Show fixed
Hide fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤘 🚀
@@ -95,6 +95,13 @@ public Response getMap() | |||
|
|||
private Map<String, String> getLatest() | |||
{ | |||
return ((MapLookupExtractor) factory.get()).getMap(); | |||
final LookupExtractor lookup = factory.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just bake this into the LookupExtractor
interface with a default impl that throws the unsupported exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems like a good idea. I removed iterable
and keySet
from the interface, and added asMap
.
|
||
entries = null; // save memory | ||
|
||
final List<String> keys = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these also be supplied a size argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, yes.
for (int i = 0; i < keys.size(); i++) { | ||
keyToEntry.put(keys.get(i), i); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to do this while looping to build keys
and values
? worried about memory during loading perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, mainly a memory thing, so we don't have entriesList
and keyToEntry
at the same time.
} | ||
|
||
return extractor.iterable().iterator(); | ||
return extractor.asMap().entrySet().iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, does this punish hypothetical things that are not stored in a map? (should we leave the iterable() method?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, i guess if we wanted to support other things maybe in the future we should add back an iterator later that spits out something other than map entry (result row or something perhaps?) to use for the direct queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory, I suppose, but there aren't any examples in the code base where it would be useful. I thought about keeping the flexibility in case some extensions found it useful, but decided against it, because I'm not aware of any and I didn't want to keep complexity for unclear benefit.
Only failure is this one that's been having problems: standard-its / integration-query-tests-middleManager (security) / security integration test (Compile=jdk8, Run=jdk8, Indexer=middleManager, Mysql=com.mysql.jdbc.Driver) |
This patch adds a new ImmutableLookupMap, which comes with an ImmutableLookupExtractor. It uses a fastutil open hashmap plus two lists to store its data in such a way that forward and reverse lookups can both be done quickly. I also observed footprint to be somewhat smaller than Java HashMap + MapLookupExtractor for a 1 million row lookup.
The main advantage, though, is that reverse lookups can be done much more quickly than MapLookupExtractor (which iterates the entire map for each call to unapplyAll). This speeds up the recently added ReverseLookupRule (#15626) during SQL planning with very large lookups.
More details on benchmarks and size analysis below.
In the benchmark report,
keysPerValue
is the number of keys that map to each value.lookupType
is eitherhashmap
(MapLookupExtractor backed by Java HashMap, as we use today) orimmutable
(the new one). The test cases arelookupApply
(forward lookup),lookupUnapplyOne
(reverse lookup of a single value), andlookupUnapplyOneThousand
(reverse lookup of one thousand values).I also took heap dumps from the benchmark datasets, to measure how much space the two implementations take.
Findings:
apply
is about the same speed in both implementations.immutable
. Performance is closest for thekeysPerValue: 1000
,lookupUnapplyOneThousand
case. In this case, the reverse lookup is returning the entire key space, because it's 1,000 values * 1,000 keys per value = 1,000,000 keys.keysToEntry
table with 2097153 entries, andkeys
andvalues
lists with 1000000 entries. It measured 154,501,488 bytes total retained. It's smaller because it ends up being three string arrays and one int array, which compares favorably to the array of hashtable bucket objects that HashMap uses.Given these findings, I think it's a good idea to use the immutable version wherever we can. So, the patch also updates the main static lookups (URI and JDBC) to use it always.
In a future patch, for getting faster reverse Kafka lookups, it's probably a good idea to implement a mutable lookup that supports faster reverse lookups, and run that only on the Broker. (Assuming it will end up being larger footprint than the
ConcurrentHashMap
we use today, we wouldn't want to run it on every data server. Only the Broker needs fast reverse lookups.)