Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Posting fetch optimizations #6416

Closed
yeya24 opened this issue Jun 6, 2023 · 3 comments · Fixed by #6465
Closed

Posting fetch optimizations #6416

yeya24 opened this issue Jun 6, 2023 · 3 comments · Fixed by #6465

Comments

@yeya24
Copy link
Contributor

yeya24 commented Jun 6, 2023

Problem Statement

When processing a series query on store gateway, the process looks like below for each block we need to query:

  1. Fetch postings per label matcher, either from index cache or from objstore
  2. Doing boolean operations for the postings we fetched, mainly AND (intersection), OR (merge). This is usually called posting expansion
  3. After expanding postings, we fetch corresponding series and chunks.
  4. ... (skip as not important) for this issue

For example, we have a query like up{cluster="us", env="prod"}. It contains 3 matchers and let's assume we have a block with 1M time series.

  • name = "up", matches 100 series
  • cluster = "us", matches 100K series
  • env = "prod", matches 500K series

In this situation, each query we are trying to fetch postings for the 3 matchers so it is fetching (500K + 100K + 100) postings first and then we apply intersection. Since the number of expanded posting is bound by the highest selective matcher, the expanded posting might contain only 50 series but we end up fetching 600K+ postings.

A better way to do in this case might be fetching 100 series matching __name__ = "up" first then we decode the series labels and see if it matches {cluster="us"} and {env="prod"}.

Another situation might be the same query, but each matcher matches 100, 200 and 500K series respectively. In this situation, maybe we can do intersection on the first 2 matchers first, fetch series and do series matching on the last label matcher.

Last situation, each matcher might match 100, 200 and 300 series respectively and then we apply our current posting fetching logic.

Proposal

To address this problem, the first thing is to have an easy way to get cardinality per posting. Number of entries is already part of the index file so maybe we can make it part of the index header.

image

Once we have the cardinality info, we need a cost model to calculate what's the cost of each strategy and decide which we should pick. We need a cheap way to calculate the cost so that's why we need number of entries per posting in index header.

Regarding strategy implementation, it is still TBA. We can start simple by maybe hardcoding or default to one strategy.

@yeya24 yeya24 changed the title Posting fetching optimization strategy Posting fetching optimizations Jun 6, 2023
@yeya24
Copy link
Contributor Author

yeya24 commented Jun 6, 2023

If we have the cardinality per posting, we can use an easy way to expand postings:

  1. Start from the two smallest posting lists and calculate intersection
  2. Continue doing intersection with next smallest posting list until done.

This sounds less work to do when calculating intersections as mentioned in https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Edited: probably this is unnecessary and existing intersection algorithm is good enough. Need benchmarks.

@fpetkovski
Copy link
Contributor

Is this related to #6357?

@yeya24
Copy link
Contributor Author

yeya24 commented Jun 6, 2023

#6357 is tackling the problem from a different way IIUC.
It doesn't change how we fetch postings so we are still fetching postings from all matchers.

@yeya24 yeya24 changed the title Posting fetching optimizations Posting fetch optimizations Jun 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants