Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression in MatchedQueriesPhase #120130

Closed
Kimahriman opened this issue Jan 14, 2025 · 2 comments · Fixed by #120133
Closed

Performance regression in MatchedQueriesPhase #120130

Kimahriman opened this issue Jan 14, 2025 · 2 comments · Fixed by #120133
Labels
>bug :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch

Comments

@Kimahriman
Copy link

Elasticsearch Version

8.17.0

Installed Plugins

No response

Java Version

bundled

OS Version

Public docker image

Problem Description

We started noticing some search tasks hanging around for several hours after we upgraded from 8.6.1 to 8.15.1. They were marked as cancelled but wouldn't get cleaned up for a long time. When inspecting the queries, we noticed they were very heavy geospatial queries, with hundreds to thousands of polygon queries on a geo_point field. We assumed this was just complex queries, but couldn't really explain why the cancelled tasks hung around for so long. We recently upgraded to 8.17.0 and still continued to see this behavior, and started noticing more and more of this happening. We finally started digging in more and checked the hot threads on the nodes these tasks were running and saw this part of a stack trace every time this was happening:

...
app/[email protected]/org.apache.lucene.index.SingletonSortedNumericDocValues.nextValue(SingletonSortedNumericDocValues.java:72)
app/[email protected]/org.apache.lucene.document.LatLonDocValuesQuery$2.matches(LatLonDocValuesQuery.java:178)
app/[email protected]/org.apache.lucene.search.TwoPhaseIterator$TwoPhaseIteratorAsDocIdSetIterator.doNext(TwoPhaseIterator.java:85)
app/[email protected]/org.apache.lucene.search.TwoPhaseIterator$TwoPhaseIteratorAsDocIdSetIterator.advance(TwoPhaseIterator.java:78)
app/[email protected]/org.elasticsearch.search.fetch.subphase.MatchedQueriesPhase$1.process(MatchedQueriesPhase.java:79)
app/[email protected]/org.elasticsearch.search.FetchPhase$1.nextDoc(FetchPhase.java:182)
...

This led us to do some testing and confirmed that the issue is fully around the fact that we are doing named queries on each of these polygons so we can find out which polygon the point was in. When removing the named queries, this issue goes away.

Looking at changes to the MatchedQueriesPhase, I believe this may be caused by #94564.

Before this MR, matches were found using Bits matchingBits = Lucene.asSequentialAccessBits(readerContext.reader().maxDoc(), ss);, which under the hood operates roughly like:

        final Scorer scorer = scorerSupplier.get(estimatedGetCount); // this never returns null
        final TwoPhaseIterator twoPhase = scorer.twoPhaseIterator();
        final DocIdSetIterator iterator;
        if (twoPhase == null) {
            iterator = scorer.iterator();
        } else {
            iterator = twoPhase.approximation();
        }

        return new Bits() {
            ...
            @Override
            public boolean get(int index) {
                ...
                int doc = iterator.docID();
                if (doc < index) {
                    try {
                        doc = iterator.advance(index);
                    } catch (IOException e) {
                        throw new IllegalStateException("Cannot advance iterator", e);
                    }
                }
                if (index == doc) {
                    try {
                        return previousMatched = twoPhase == null || twoPhase.matches();
                    } catch (IOException e) {
                        throw new IllegalStateException("Cannot validate match", e);
                    }
                }

Here, advance is always run on the inner iterator in the case of the twoPhaseIterator, never on the twoPhaseIterator itself.

The new code simply does:

            @Override
            public void setNextReader(LeafReaderContext readerContext) throws IOException {
                matchingIterators.clear();
                for (Map.Entry<String, Weight> entry : weights.entrySet()) {
                    ScorerSupplier ss = entry.getValue().scorerSupplier(readerContext);
                    if (ss != null) {
                        Scorer scorer = ss.get(0L);
                        if (scorer != null) {
                            matchingIterators.put(entry.getKey(), scorer);
                        }
                    }
                }
            }

            @Override
            public void process(HitContext hitContext) throws IOException {
                Map<String, Float> matches = new LinkedHashMap<>();
                int doc = hitContext.docId();
                for (Map.Entry<String, Scorer> entry : matchingIterators.entrySet()) {
                    Scorer scorer = entry.getValue();
                    if (scorer.iterator().docID() < doc) {
                        scorer.iterator().advance(doc);
                    }
                    if (scorer.iterator().docID() == doc) {
                        matches.put(entry.getKey(), scorer.score());
                    }
                }
                hitContext.hit().matchedQueries(matches);
            }

in this case, ScorerSupplier ss is a IndexOrDocValuesQuery, which is wrapping an index query, LatLonPointQuery, and a docValues query, LatLonDocValuesQuery. When the scorer returned by the scorer supplier ends up being the doc values query instead of the index query due to the cost, scorer.iterator is a TwoPhaseIteratorAsDocIdSetIterator. Looking at that implementation:

    @Override
    public int advance(int target) throws IOException {
      return doNext(approximation.advance(target));
    }

    private int doNext(int doc) throws IOException {
      for (; ; doc = approximation.nextDoc()) {
        if (doc == NO_MORE_DOCS) {
          return NO_MORE_DOCS;
        } else if (twoPhaseIterator.matches()) {
          return doc;
        }
      }
    }

Which looks like after advancing approximation (which in this case is LatLonDocValuesQuery) to target, it iterates over every record in approximation until it finds the next match, meaning the polygon query will run on every record in the shard (or segment?) even if it's not part of the results being fetched.

The original implementation didn't have this issue because it effectively was doing twoPhaseIterator.approximation.advance(target) directly instead.

Steps to Reproduce

Very hard to reproduce, we had to do hundreds of small polygons that matched very few records to finally recreate it on our production data. I can modify the MatchedQueriesIT tests to show that the TwoPhaseIterator is used when the estimated cost of the index query is greater than 8, which means it estimates at least 8 points match the query on the shard (or segment?).

Logs (if relevant)

No response

@Kimahriman Kimahriman added >bug needs:triage Requires assignment of a team area label labels Jan 14, 2025
@jimczi jimczi added :Search Relevance/Search Catch all for Search Relevance and removed needs:triage Requires assignment of a team area label labels Jan 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jan 14, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in elastic#94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes elastic#120130
@jimczi
Copy link
Contributor

jimczi commented Jan 14, 2025

Thanks for the detailed explanation @Kimahriman , I agree that this is a performance regression.
I opened #120133 to restore the original behaviour/intent.

@jimczi jimczi closed this as completed in 77392c3 Jan 14, 2025
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in elastic#94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes elastic#120130
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in elastic#94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes elastic#120130
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in elastic#94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes elastic#120130
elasticsearchmachine pushed a commit that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in #94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes #120130
elasticsearchmachine pushed a commit that referenced this issue Jan 14, 2025
This PR resolves a regression introduced in #94564 by ensuring that the approximation is used when advancing matched query clauses.
Utilizing the two-phase iterator to validate matches guarantees that we do not attempt to find the next document fulfilling the two-phase criteria beyond the current document.
This fix prevents scenarios where matching a document in the second phase significantly increases query complexity, especially in cases involving restrictive second-pass filters.

Closes #120130
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants