Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Sorting in Hybrid Search #866

Closed
vibrantvarun opened this issue Aug 14, 2024 · 0 comments
Closed

[RFC] Sorting in Hybrid Search #866

vibrantvarun opened this issue Aug 14, 2024 · 0 comments
Assignees
Labels
enhancement RFC Roadmap:Search Project-wide roadmap label

Comments

@vibrantvarun
Copy link
Member

Introduction

This document proposes a solution for enabling the sorting feature in Hybrid Search. The proposal aims to use existing extensions from OpenSearch as much as possible.

Problem Statement

The problem statement is that Hybrid search results should be in sorted form as per the sort criteria provided in the search request. Since, Hybrid search went GA in 2.11 there are multiple asks from the customers via Github issues to enable different type of features like Aggregations, Post Filter. One such ask is to enable the Sorting in Hybrid Search. Therefore, we are proposing this feature to be released in the upcoming release 2.15.

Scope

The scope of this document is to enable the support all the scenarios through which sorting is supported in OpenSearch. However, there are certain limitations wrt to Hybrid Search which we will discuss in the later sections.

Current State

How Sort operates in OpenSearch?

Overview

In the Query Phase of traditional search query like match, term etc individual shard results are retrieved in a sorted form as per the sort criteria given in the search request. The results are then sent to the Fetch Phase.

Low level Functioning

Since sorting is performed at the shard level, the collector adds the query result (docId) to the priority queue, which operates based on the sort field comparator.

Note

  1. When sort is applied then Query Phase results will not contain scores as scoring is not done in the Query Phase. The results will be purely based of sort criteria whether it be any field, docId, _id etc. The scores are calculated in the Fetch Phase only if track_scores=true is sent in the search request.
  2. If track_scores=true then the field relevance score is calculated which says that how relevant that field is to the search result.
  3. If track_scores=false then scoring will not be done at all in the entire search process. Consequently, the hits will contain null scores.
  4. As per the OpenSearch architecture, Fetch phase will only set scores in the search hits in the following cases
    1. sort criteria is by _score
    2. sort criteria is by any field (not _score) and track_scores=true
    3. If no sorting is applied ( in hybrid query scenario)

Internal Functioning of Hybrid Search

In the Query Phase of Hybrid Search, results from multiple subqueries are retrieved, and the top documents from those individual query results form the individual shard result. After the Query Phase, the normalization processor normalizes and combines the scores to the query results on each shard and removes the duplicate documents from the shard results. These final results from all shards are then sent to the Fetch Phase.

Note

  1. When hybrid search is executed then scores are calculated in the Query Phase. The reason behind it is in hybrid search the top documents of an individual query are computed on basis of the score.
  2. By default, Hybrid Search results are sorted in decreasing order by scores.

track_scores

If track_scores=false is sent in the search request it means that scoring will not be done in the entire search process. Whereas, when track_scores=true, the score would be calculated and then set into the search hits.

Limitations

There are certain limitations of applying sort in the hybrid query.

Logical limitation

As per the operation of sort, during search execution, we need results based on the sort criteria from the Query Phase. However, in hybrid search, results are typically based on scores. Therefore, according to OpenSearch's sort logic, enabling sorting in hybrid search would mean trying to obtain results from the collector based on both scores and sort criteria by any field. This creates a logically contradictory situation.
ScoringandSorting (1)

Functional limitation

As we discussed in the section above, hybrid search calculates the scores in the Query Phase and while sorting the scores are calculated in the Fetch Phase. At any given point, the scores should only be calculated once in the entire search process whether in the Query Phase or in the Fetch Phase. Therefore, applying sort with track_scores=true in Hybrid Search breaks the functioning of sort in OpenSearch.

  • As per the OpenSearch architecture, phase results processor (Normalization processor) always applies after the Query Phase in multiple shards scenario. If any of the conditions is met then in the Hybrid Search the scores will be calculated twice first in the Query Phase because of hybrid search functioning and second in the Fetch Phase due to sort functioning. Moreover, As we discussed earlier in the notes of low level functioning the scores which are normalized by the normalization processor will get updated in the fetch phase after recalulation of scores. Therefore, we need to block the case when sort is applied and track_scores=true to stop the recalculation of the scores.

Conclusion from the above limitations

  1. Hybrid Search results when sorted by field and track_scores=false cannot have scores due to functional limitation and the way fetch Phase works.
  2. If track_scores=true and sort is applied with Hybrid Search, then the request should be blocked with an exception.
  3. If the sort criteria is _score, perform a normal hybrid search without additional sorting due to the functional limitation mentioned above. Hybrid Search will automatically sort the results in decreasing order by scores by default.
  4. If the sort criteria is by _score and a field together like in the example below, then block the request with error message as the results can either be sorted on the basis of score or a field due to logical as well as functional limitation.
{
  "sort":[
     {
      "foo":{
         "order":"desc"
      }
     },
     {
      "_score":{
        "order":"asc"
     }       
  ]
}

Solution Overview

The solution proposes that during the Query phase for every shard, get the individual sub-query results in the sorted form based on sort criteria. Later, once the query phase gets completed on all shards and the normalization processor starts executing, combine the sorted results from the multiple queries to form one final result with no duplicates. To understand the solution overview with an example please refer in the appendix section.

High level design

Based on the solution overview we need to address couple of challenges

Challenges

  1. On each shard the individual sub-query result should be in the sorted form as per the sort criteria sent in the search request
  2. For every shard combine the multiple sub-query results to form one final result which is sorted and duplicate-free.

Addressing the challenges

  1. As discussed earlier, for sorting to work, we need a priority queue with a comparator based on the sort criteria specified in the search request. To address challenge 1, we will add a new collector (HybridTopDocSortCollector), which will add the subquery results to the priority queue and later pop the results based on the sort criteria to obtain the top documents of each individual sub-query in sorted form.
  2. To address the second challenge in the normalization process, for every shard, after we combine the normalized score with the individual query result, we will merge the individual subquery results based on the sorting criteria to create a list of results that is sorted and duplicate-free. Once this process is completed on every shard, we will have a sorted list of results for each shard.
    In the diagram below, the collector addresses the first challenge. At step 3, we see the query results sorted individually. Later, in the normalization phase, we tackle the second challenge. At step 5, after the normalization process, we obtain search results where the individual shard results are sorted.
    SortingHLD (1)

Low Level Design

QueryPhase

In the neural search plugin, we will add a HybridTopDocSortCollector class that contains priority queues with comparators. The number of priority queues will be equal to the number of subqueries. These individual priority queues are responsible for sorting the sub-query results they represent.

The matching result (docId) for a query is captured by the leaf collector (HybridTopDocSortLeafCollector). The leaf collector determines whether the result found is also a result for another subquery by calculating hybrid scores.

Let’s understand this with an example: Suppose we send a search request with a sort query on the field "foo" in descending order. Consider docId 20, which has a foo value of 100, being sent by the docIdIterator to the leaf collector. If there are 3 subqueries, the leaf collector calculates the hybrid scores. The hybrid scores array would look like [0.8, 0, 1.0], where 0.8 is the score of the first subquery, 0 is the score from the second subquery (indicating no match), and 1.0 is the score from the third subquery. Therefore, the leaf collector will add docId 20 to the first priority queue and the third priority queue. The comparators in the queue will also include the foo value of 100, which will act as a reference for comparing query results.

Later, in the HybridCollectorManager's reduce method, when we determine the top field docs of an individual shard, these priority queues will pop the results in sorted order.
SortingQueryPhaseLLD

Normalization Processor

When the normalization processor receives input from the Query Phase in the form of a SearchPhaseResult object, it first retrieves the list of top field docsfrom individual shards within the SearchPhaseResult in the NormalizationProcessor class. The size of this list equals the number of shards. For example, if the list size is 3, it represents the top field docs from 3 shards.
Later, in the NormalizationProcessorWorkFlow class, after calculating normalized scores for every document ID in the top field docs, the normalization processor performs the following steps during score combination in the ScoreCombiner class:

  1. Sort the document IDs according to the sort criteria using the merge method of TopDocs. The sorted result will contain duplicate document IDs.
  2. Remove the duplicate document IDs from the top docs using a Set.
  3. Assign the normalized scores to those document IDs.
    NormalizationLLD

Extra Milestone

The above design will also support the search_after feature with sorting. The search_after parameter provides a live cursor that uses the previous page’s results to obtain the next page’s results. It works by using the sort values of the last document as a reference point, specifying the sort values from which to start the next page. OpenSearch then efficiently retrieves the next set of results based on these provided sort values. The search_after feature is also implemented at the shard level, where the collector only collects results that come after the specified value in the search_after clause. In order to achieve this, we will need to add after parameter in the collector and handle it in the collect method and cater hits accordingly. For more information, please refer to the documentation.

Example

{
  "query": {
    "hybrid": [
      {
        "match": {
          "foo": "opensearch"
        }
      }
    ]
  },
  "sort": [
    {"zoo": "asc"}
  ],
  "search_after": [
    70
  ]
}

Overall, the query will:

  1. Search for documents where the "foo" field matches "opensearch".
  2. Sort the matching documents in ascending order by the "zoo" field.
  3. Retrieve results starting from the document with a "zoo" field value of 70, allowing for efficient pagination.

Testability

  • Integration tests that covers the scenarios mentioned below
    • Sanity of the Hybrid query results in the sorted form when concurrent search is enabled and disabled (Multiple shards)
    • Sanity of the Hybrid query results in the sorted form when concurrent search is enabled and disabled (Single shard)
    • Sanity of the results when sorting is applied with post_filter and aggregations
    • Different type of Sorting scenarios are discussed in the appendix section below.
  • BWC tests

Appendix

Query Phase

Search in OpenSearch executes in multiple phase. The first phase is called Query Phase which execute on every shard get the query result in the form docId’s.

Fetch Phase

The last phase in the search is called the Fetch Phase. The fetch phase is responsible for getting the source of the docId’s from the shard and create a search hit array which will be returned to the user. It will also calculate the score if sort is applied and track_scores=true.

Collector

Collector is responsible to collect the query results in the form of docId’s in the Query Phase.

Types of Sorting

  1. Single Field Sort: Sorting based on one type of field comparator. When the docId’s are fetched from the shard against the query then on the basis of one field comparator the docId’s would be arranged in the Priority Queue to fetch the result.
 "sort": [
        {
            "foo": {
                 "order": "asc"
            }
        }
]
  1. Multi Field Sort: Sorting based on more than one type of field comparator. When the docId’s are fetched from the shard against the query then on the basis of multi field comparator the docId’s would be arranged in the result.
"sort": [
        {
             "foo": {
                 "order": "asc"
             }
        },
        {
            "_id": {
                "order": "asc"
            }
        }
]

Normalization Processor

The normalization-processor is a search phase results processor that runs between the query and fetch phases of search execution. It intercepts the query phase results and then normalizes and combines the document scores from different query clauses before passing the documents to the fetch phase. For more details, Please refer documentation

Normalization

Normalization is a data transformation process that aligns data values to a common scale or distribution of values.
Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

ScoreCombination

When the normalized score is assigned to the query results i.e list of score docs then that process is called score combination.

Score Doc

When a collector collects a docId from a shard as a query result it is converted to a ScoreDoc object. The ScoreDoc class object contains an int object docId, an int object shardIndex, and a float object score.

FieldDoc

When sort is applied then collector collects a docId from a shard as a query result it is converted to a FieldDoc object. The FieldDoc class object contains an int object docId, an int object shardIndex, a float object score, and an of Objects fields. fields object contains the value of field on which sorting has been applied. FieldDoc extends class ScoreDoc.

Top Docs

The TopDocs contains all the query results found in the shard in the form of list of ScoreDocs. It is also contains the count totalHits found in that shard.

TopFieldDocs

When sort is applied the TopFieldDocs contains all the query results found in the shard in the form of list of ScoreDocs. It also contains the count totalHits found in that shard. The sort fields information is present in the SortField[] fields array. TopFieldDocs extends the class TopDocs.

Duplicates

When a query result is part of more than one query, it will be present in the top docs of all the queries it belongs to. This query result is referred to as a duplicate.

Duplicate-free

After calculating the normalized scores, we remove the duplicates from the top docs of individual shards. Consequently, those shard results are duplicate-free.

Example to elaborate on the overview of the solution

Let’s consider we have 2 shards and we are executing a hybrid query which has 2 subqueries in it. The sort criteria we applied is by an integer field Foo to order by in descending order.

Query Phase

The results(Top Field Docs) from shard 1 after the query phase is completed will look like below.

docId shardIndex score field  
0 -1 -9549511920.4881596047f 1 Delimiter to mark the start of hybrid query result
0 -1 -4422440593.9791198149f 1 Delimiter to mark the start of subquery 1 result
9 -1 0.8 10  
8 -1 0.8 7  
10 -1 0.8 14  
0 -1 -4422440593.9791198149f 1 Delimiter to mark the start of subquery 2 result
9 -1 0.9 10  
0 -1 -9549511920.4881596047f 1 Delimiter to mark the end of hybrid query result

As we can clearly see above the docId 9 is the result of query 1 and query 2 result and have different scores respectively. Therefore, it is duplicate.

The results(Top Field Docs) from shard 2 after the query phase is completed will look like below.

docId shardIndex score field  
0 -1 -9549511920.4881596047f 1 Delimiter to mark the start of hybrid query result
0 -1 -4422440593.9791198149f 1 Delimiter to mark the start of subquery 1 result
1 -1 0.8 13  
0 -1 0.8 12  
0 -1 -4422440593.9791198149f 1 Delimiter to mark the start of subquery 2 result
7 -1 0.9 18  
0 -1 -9549511920.4881596047f 1 Delimiter to mark the end of hybrid query result

Normalization Phase

The results (Top Field Docs) from shard 1 after the normalization is completed will look like below. The duplicates and delimiters are removed.

docId shardIndex score field
10 -1 0.3 14
9 -1 0.6 10
8 -1 0.3 7

Fetch Phase

Fetch phase will assign shard Index to the results coming from normalization phase. Moreover, It will fetch the source of the docId’s from the shards and create the search hit array.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement RFC Roadmap:Search Project-wide roadmap label
Projects
Status: New
Development

No branches or pull requests

2 participants