-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Aggregations and Hybrid query #604
Comments
After doing more testing and deep dives I had to update RFC on Option 1. I'll put updates in RFC description, but main changes are listed below for visibility:
I'm working on one more option, that needs more POCing before it can be included to the RFC. That may be a new |
I'm adding Option 3 that should be our preferred option for number of reasons: it's extensible, it supports all aggregations and it's compatible with concurrent search (that went GA in 2.12). Many thanks to @reta who reviewed and provided guidance for this option Solution DetailsOption 3: New Collector Manager on pre QueryPhaseSearcher stage [Preferred]Pros:
Cons:
This Option is similar to Option 1, main difference is that new collector manager should be added to the context before other core collector managers. In such case we can rely on existing core code to execute collectors in correct order and reduce results at the end of search. In contrast for Option 1 that has to be done in custom QueryPhaseSearcher for hybrid query. Following classes will be added with this option Below is main sequence of flow for this approach for scenario when concurrent search is enabled and disabled. Main differences are:
Simpler case of concurrent search, newCollector and reduce methods are called inside the Searcher.search Sequence diagram for case when Concurrent Search is enabled More complex case of non-concurrent search, only newCollector called inside the Searcher.search and reduce must be called manually Sequence diagram for case when Concurrent Search is disabled All tested aggregations are supported by this Option. That includes potential new aggregations added with default support mode for only default search. Short Term/Long Term implementationFor preferred option there are short and long term implementations. Main differentiator is extension point for adding custom collector manager in a query phase. We need at least two extension points that are executed one before and one after the searcher.search() Short term
Aggregator Processor is returned by the plugin specific QueryPhaseSearcher. Plugins can register QueryPhaseSearcher, only on is allowed from all plugins. Long term Switching from Short to Long term shouldn’t cause massive changes in code of Collector Manager and Collector and can be done quickly. Main dependency is on adding proper extension point to core. |
Thanks a lot @martin-gaievski , just to flash out a bit of context here:
We probably should not touch the existing Aggegator Processor flow (even if its lifecycle fits well) but have a new kind of hook / extension to allow adding to the |
Introduction
This document describes details of design for Aggregations in Hybrid Query. This feature has been requested by customer through GitHub issues #509 and #422.
Background
Initial implementation of Hybrid Query has minimum functionality and by design does not work with any type of existing aggregations. Queries with both “aggs” and hybrid were not blocked but results were not guaranteed.
Aggregations were planned for be added in future releases using existing architecture of Hybrid query using custom implementation of QueryPhaseSearcher. With time code pieces in core OpenSearch related to QueryPhaseSearcher were evolved and concurrent search became GA in 2.12. We need to take into account those changes when doing this design.
In context of this design we understand aggregation as a alternative approach for getting data stats. Aggregations can be used as a standalone request or together with search queries. In first case data is collected for all documents in the index, similar to case if “match_all” query is passed. When it’s bundled with the search query then results are based on the documents returned by the query. There is a special mixed case, for aggregation type global when the query results are collected but aggregation is done for all data set.
Functional Requirements
Non Functional Requirements
Document Scope
In this document we propose a solution for the questions below:
Out of Document Scope
Solution Overview
Solution is based on calling collectors that are created by code logic when executing Query phase. Different options for solution are around of how to call those collectors and compile final query results.
Certain changes in current behavior of Hybrid search query are required regardless of the option:
For example, for 1 shard we have following results:
with current logic this will be reduced to:
{ doc1, doc3, doc2 }
with proposed change it will be:
{ doc1, doc3, doc2, doc4 }
note that final result list has size of max of all sub queries (3) and one of the docs (doc4) is dropped from the final collection.
Such change is required to keep results of aggregations and hybrid query in sync. For non-global aggregation results are based on sub-set of data that is returned by the query. If some of the query results are dropped then two resulting collections will be inconsistent.
Below is continuation of the example with 2 sub-queries.
If our docs look like:
and “terms” aggregation is part of the query:
then with current logic the final result will look like:
after the change the final result will look like:
Risks / Known limitations
Must be compatible with concurrent segment search in core OpenSearch
Depending on the option preferred change may not be fully backward compatible with existing hybrid query.
At the moment we're planning to relax the limit of max hits per shard.
Some aggregations may not be supported depending on the chosen option. If fix is complex and aggregation is question is not widely used it can be blocked, query that contains such aggregation type will throw an exception.
Option 1:
This is the list of aggregations we are checking to verify design options solution
Metric aggregations
Bucket aggregations
Pipeline sibling aggregations
Pipeline, parent
Solution Details
Option 1: New Collector Manager
Pros:
Cons:
We can create a new custom CollectorManager and pass it to search method that takes CollectorManager. That should replace the call to a similar search method. New method will:
Advantage of such approach is that such collection manager can be stacked with other manager objects into multi collector manager object. And aggregations do have their own collector manager that when executed collects all the data needed to form final aggregation results.
Below is diagram for classes that will be affected if we go with this option.
High level flow will look like below
Another zoomed in flow diagram to show how scores are collected for hybrid sub-queries and aggregations.
For the new implementation of the AggregationProcessor we need to have following logic for both pre and post processing methods:
Reason for this is additional CollectorManager added for Hybrid Query. With that we need to do search with a Multi Collector Manager, in the same way as it’s currently done for the parallel segment search. For all other cases default logic will be used.
POC has been done to check feasibility of this option, repo with the POC code: https://github.com/martin-gaievski/neural-search/tree/add_aggregations_to_hybrid_query
We've found that approach isn't compatible with following aggregations for scenarios when concurrent segment search feature is enabled
That's because these aggregations are explicitly set internal flag
should use concurrent search
tofalse
, or use default implementation that does the same. As per core logic system must switch to a no concurrent search mode for such aggs, that must happen transparently to user.For Hybrid search for these aggregations system return identical runtime exception response:
Possible mitigation:
We can add a check and block queries with such aggregations if concurrent segment search is enabled.
For
geohex grid
this can be fixed easily by addingshould use concurrent search
flag.Option 2: Adding existing collector to Query Collector Context
We can add aggregation collectors to the SearchContext manually and then run the post search processing that should collect results for aggregations.
Pros:
Cons:
Methods that we need to change from “default” to “public” are:
Class QueryCollectorContext:
Event flow will be following
get Collector Context → create new collector based on context and next collector passed to QueryPhaseSearcher → do search with searcher.search(query, collector) → for each collector context call manually postProcess
Risk is in skipping CollectorManager from the loop, in such case functionality of concurrent search is ignored.
Special cases
Case 1: Doc appears in result list for multiple sub-queries
Concern: Doc can be counted more than once by aggregation logic
Actual result: Aggregation doesn’t multiple result for one doc based on number of sub-queries with that doc.
Example:
Index has following docs
hybrid query with sum and avg aggregations
actual result
if concern would be valid the result looks like below, where numbers for doc1 counted twice as it appears in results for both bool and term sub-queries
Case 2: Aggregation ignores all hits in all shards and is based only on a limited sub set
Concern: If size value is passed with query then all docs that are above that size and are cut from query results are also ignored by the aggregation
Actual result: All results are taking into account by aggregation processing before they are limited based on the size param
Example:
Index has following docs
hybrid query with sum and avg aggregations
actual result
sum is calculated as:
20 (doc2) + 15 (doc3) + 11 (doc6) + 25 (doc4) + 17 (doc7) = 88.0
avg is calculated as:
sum (doc2 + doc3 + doc6 + doc4 + doc7) / 5 = 88.0/5 = 17.6
if concern would be valid the result looks like below, where numbers for doc1 counted twice as it appears in results for both bool and term sub-queries
Open Questions
Currently filtering is added as one more collector and passed to QueryPhaseSearcher. In core the flow is following:
query searcher adds TopDocsCollector as a first collector →
collectors are added into one collector manager →
search is executed with collector manager, query results are collected first, then filter collector is applied on top of those results
Currently context methods are closed in core for both calling and extending
Can we extend approach by adding support for sorting
TBD
What should be behavior for aggregations that are not compatible with hybrid query?
If such a list is limited and does not include the most adopted aggregations, we can block the requests.
References
Feedback Required
As solution may drop support for certain aggregation types depending on the preferred approach, we mainly want to get some feedback if those unsupported aggregations are widely used. This will help us to understand if additional effort is required to support them, and if it's not possible then how system should behave.
The text was updated successfully, but these errors were encountered: