Posting fetch optimizations #6416
Problem Statement
When processing a series query on the store gateway, for each block we need to query we fetch the posting list of each matcher, intersect the lists, and then fetch the series for the resulting references.
For example, take a query like `up{cluster="us", env="prod"}`. It contains 3 matchers, and let's assume we have a block with 1M time series. Today each query fetches postings for all 3 matchers, so it fetches (500K + 100K + 100) postings first and then applies the intersection. Since the size of the expanded postings is bounded by the most selective matcher, the result might contain only 50 series, yet we end up fetching 600K+ postings.
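To make the cost concrete, here is a minimal sketch of that flow; `fetchPostings` is a hypothetical stand-in for the store gateway's bucket-backed posting fetch, and the intersection is a standard merge over sorted series references:

```go
package sketch

// fetchPostings is a hypothetical stand-in for reading one matcher's full
// posting list (sorted series refs) from the bucket or cache.
func fetchPostings(matcher string) []uint64 { return nil }

// intersect keeps only the refs present in every sorted list.
func intersect(lists [][]uint64) []uint64 {
	if len(lists) == 0 {
		return nil
	}
	out := lists[0]
	for _, l := range lists[1:] {
		var next []uint64
		i, j := 0, 0
		for i < len(out) && j < len(l) {
			switch {
			case out[i] < l[j]:
				i++
			case out[i] > l[j]:
				j++
			default:
				next = append(next, out[i])
				i, j = i+1, j+1
			}
		}
		out = next
	}
	return out
}

// expandedPostings mirrors the current behaviour: it fetches every posting
// list in full before intersecting, so for the example above it downloads
// 600K+ refs to produce a result of ~50 series.
func expandedPostings(matchers []string) []uint64 {
	lists := make([][]uint64, 0, len(matchers))
	for _, m := range matchers {
		lists = append(lists, fetchPostings(m))
	}
	return intersect(lists)
}
```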
A better way in this case might be to fetch the 100 series matching `__name__="up"` first, then decode the series labels and check whether they match `{cluster="us"}` and `{env="prod"}`. Another situation might be the same query, but where the matchers match 100, 200, and 500K series respectively. In this situation, maybe we can intersect the postings of the first 2 matchers, fetch those series, and apply the last label matcher to the fetched series directly.
In the last situation, the matchers might match 100, 200, and 300 series respectively, in which case we keep our current posting fetching logic.
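A hedged sketch of how these three situations could be told apart once per-posting cardinalities are known; the strategy names and the 100x threshold below are illustrative assumptions, not a settled design:

```go
package sketch

import "sort"

type strategy int

const (
	intersectAll            strategy = iota // current logic: fetch all postings, intersect
	fetchSeriesAndFilter                    // fetch series of the smallest list, match the rest lazily
	intersectSomeThenFilter                 // intersect the small lists, apply the huge matcher on series
)

// pickStrategy classifies matchers by posting-list cardinality. The 100x
// ratio is an invented placeholder for whatever the real cost model decides.
func pickStrategy(cards []int64) strategy {
	if len(cards) == 0 {
		return intersectAll
	}
	sorted := append([]int64(nil), cards...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	huge := 0
	for _, c := range sorted[1:] {
		if c > 100*sorted[0] {
			huge++
		}
	}
	switch {
	case huge == 0:
		return intersectAll // e.g. 100 / 200 / 300
	case huge == len(sorted)-1:
		return fetchSeriesAndFilter // e.g. 100 / 100K / 500K
	default:
		return intersectSomeThenFilter // e.g. 100 / 200 / 500K
	}
}
```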
Proposal
To address this problem, the first thing we need is a cheap way to get the cardinality of each posting list. The number of entries is already part of the index file, so maybe we can make it part of the index header as well.
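As an illustration of the lookup shape (this interface is hypothetical, not Thanos's actual `indexheader.Reader`):

```go
package sketch

// Reader is a hypothetical index-header interface extended with a
// cardinality lookup. The entry count is already stored at the start of
// each posting list in the TSDB index, so carrying it in the index header
// only duplicates an existing field per label pair.
type Reader interface {
	// PostingsOffset returns the byte range of the posting list for a
	// label pair (existing functionality, signature simplified here).
	PostingsOffset(name, value string) (start, end int64, err error)
	// PostingCardinality returns the number of series refs in that
	// posting list (the proposed addition).
	PostingCardinality(name, value string) (int64, error)
}
```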
Once we have the cardinality info, we need a cost model to calculate the cost of each strategy and decide which one to pick. The cost calculation itself must be cheap, which is why the number of entries per posting list should live in the index header.
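For illustration, a back-of-the-envelope model could compare estimated bytes fetched per strategy; the per-item byte sizes below are invented placeholders that a real model would have to measure:

```go
package sketch

const (
	postingRefBytes = 4   // assumed size of one series ref in a posting list
	avgSeriesBytes  = 200 // assumed average size of one decoded series entry
)

// costIntersectAll estimates the current approach: every posting list in
// full, plus the series of the final (intersected) result.
func costIntersectAll(cards []int64, resultEstimate int64) int64 {
	var postings int64
	for _, c := range cards {
		postings += c
	}
	return postings*postingRefBytes + resultEstimate*avgSeriesBytes
}

// costFetchSmallest estimates the alternative: fetch only the smallest
// posting list, then fetch all of its series and filter them in memory.
func costFetchSmallest(smallest int64) int64 {
	return smallest*postingRefBytes + smallest*avgSeriesBytes
}
```

Plugging in the first example, `costIntersectAll([]int64{500_000, 100_000, 100}, 50)` comes to roughly 600,100 × 4 + 50 × 200 ≈ 2.4 MB, while `costFetchSmallest(100)` is about 100 × 4 + 100 × 200 ≈ 20 KB; that gap is exactly what the cost model should detect.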
The strategy implementation is still TBD. We can start simple, perhaps by hardcoding a single strategy or making one the default.
Comments

If we have the cardinality per posting, there is an easy way to expand postings. This sounds like less work when calculating intersections, as described in https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf. Edit: this is probably unnecessary; the existing intersection algorithm is good enough. Benchmarks are needed.

Is this related to #6357?

#6357 is tackling the problem in a different way, IIUC.