[RFC] Low Level Design for Normalization and Score Combination Query Phase Searcher #193

martin-gaievski · 2023-06-03T00:16:08Z

Introduction

This document describes the low-level design of the Query Phase Searcher for the Score Normalization and Combination feature. It is one of several LLDs in scope of that feature. Reading the following documents first is highly recommended: the high-level design, [RFC] High Level Approach and Design For Normalization and Score Combination, and the preceding LLD, [RFC] Low Level Design for Normalization and Score Combination Query.

Background

As per the HLD, the Normalization feature needs the ability to collect the results of multiple sub-queries and send them unmerged to the coordinator node for post-processing. The part that does not currently exist in OpenSearch is storage for such multiple results. Current logic assumes that a search at shard level returns a single list of results (doc ids and scores), and that the merge happens at shard level.

OpenSearch provides an extension mechanism for this type of use case. The goal can be achieved through a custom QueryPhaseSearcher that calls a custom DocCollector. Both are executed at the shard level as part of the Query phase of a search request (the query phase is the caller of the query phase searcher). These abstractions, along with a new DTO that can hold the results of multiple sub-queries, are the focus of this design.

The new QueryPhaseSearcher and related classes will be added to the Neural Search plugin, and the code changes will be made in the plugin repo.

Requirements

Results with doc ids from multiple sub-queries need to be collected and passed to the coordinator node as part of the query execution results. At the later Fetch phase we need to be able to identify which result belongs to which sub-query.
The new query should keep added latency (for functions like query parsing) to a minimum and should not degrade latency or resource utilization compared to a similar query that does the combination at shard level. We will add exact latency numbers after the benchmark is done; the initial expectation is added latency within 15%.
The Fetch phase of query execution (“reduce” in OpenSearch terms, executed at the coordinator node) should work without changes.

Scope

In this document we propose a solution for the questions below:

How do we collect scores for each sub-query and form the final resulting DTO?
How do we use the existing OpenSearch extension point to implement a custom QueryPhaseSearcher?

Solution Overview

A new custom query phase searcher will be created as an implementation of the core OpenSearch interface QueryPhaseSearcher. A new custom DocCollector will be responsible for collecting search results from a single shard. We will use a new DTO that holds a collection of top-scored docs for each sub-query.

Risks / Known limitations

Currently core OpenSearch supports only a single custom QueryPhaseSearcher, and one has recently been registered by the “Concurrent Search” feature. That feature is currently implemented behind a feature flag (Moving concurrent-search out of the sandbox plugin to core behind the feature flag). There is an open issue in core to provide the ability to register multiple query phase searchers.

In the first phase we’ll create a plugin setting that allows users to disable our query phase searcher in case they need the concurrent search feature. We prefer a setting over a feature flag because a feature flag relies on command line arguments that are set during the distribution build, while with a setting the user can change the value and restart the cluster. The hybrid query searcher will be enabled by default, and the setting will allow users to disable it.

The new DTO for top docs will implement two views of the results: a collection of doc ids for a single query, and a collection of collections of matching doc ids for the new hybrid query. The first collection is required to comply with existing core API contracts and to keep the number of changes to a minimum. The actual results per sub-query will be used in later implementations, in a transformer that is part of the search pipeline; they will be processed after the Query phase and before the Fetch phase.
The new doc collector will implement the minimal set of features required to collect the results of a hybrid query. The collector will be used for only one query type, and some features supported by the core doc collector (such as pagination or a max score threshold) are not supported in the initial release (see [RFC] High Level Approach and Design For Normalization and Score Combination for details).

Future extensions

Resolve the limitation on multiple query phase searchers. This can be a feature in core and is currently under discussion (Enabling Multiple QueryPhaseSearcher in OpenSearch OpenSearch#7020). There are two possible ways here:
- use a single query searcher, choose one of multiple implementations in runtime
- create new query phase (or sub-phase), plugins can register multiple custom implementations
Pagination for query results. The assumption is that this should work if we pass “to” and “from” to the doc collector, but it needs testing (a small POC using the first implementation version or previous POCs).

Solution Details

We’re going to use the existing plugin class NeuralSearch as the entry point to register the new HybridQueryPhaseSearcher. The searcher will call the new doc collector only when the query is of type HybridQuery. This check is required because the query searcher is global at plugin level and will be invoked for all queries (currently including the NeuralSearch query).
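The type check described above can be sketched as follows. This is a minimal, language-neutral illustration in Python (the plugin itself is Java); `TermQuery`, `HybridQuery`, `default_search`, `hybrid_search`, and `search_with` are hypothetical stand-ins, not the plugin's or core's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermQuery:
    """Hypothetical stand-in for any non-hybrid query."""
    text: str


@dataclass
class HybridQuery:
    """Hypothetical stand-in for the plugin's HybridQuery."""
    sub_queries: List[object] = field(default_factory=list)


def default_search(query) -> str:
    # Placeholder for the unchanged core query-phase logic.
    return "default"


def hybrid_search(query: HybridQuery) -> str:
    # Placeholder for the path that uses the custom doc collector.
    return "hybrid"


def search_with(query) -> str:
    """Use the custom collector only for HybridQuery; delegate everything else."""
    if isinstance(query, HybridQuery):
        return hybrid_search(query)
    return default_search(query)
```

Because the searcher is registered once for the whole plugin, the delegation branch is what keeps all other query types on their existing code path.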

Figure 1: Class diagram for HybridQueryPhaseSearcher implementation

Below is the general data flow for collecting query results with the custom doc collector for a Hybrid query. The QueryPhaseSearcher executes at shard level, after the coordinator node sends the Query request to each shard. The DocCollector gets the top scores for each sub-query using a doc id iterator and a priority queue, then forms a collection of all results. This collection is set as the shard query result and sent back to the coordinator node for the fetch phase.
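The per-sub-query collection step can be illustrated with a bounded min-heap per sub-query. This is a rough sketch under stated assumptions, not the collector's actual code: the input is modelled as one `{doc_id: score}` mapping per sub-query, and `collect_top_docs` and `size` are hypothetical names.

```python
import heapq
from typing import Dict, List, Tuple

ScoreDoc = Tuple[int, float]  # (doc_id, score)


def collect_top_docs(sub_query_scores: List[Dict[int, float]], size: int) -> List[List[ScoreDoc]]:
    """Keep the `size` best-scoring docs for each sub-query, unmerged."""
    results: List[List[ScoreDoc]] = []
    for scores in sub_query_scores:
        heap: List[Tuple[float, int]] = []  # min-heap ordered by score
        for doc_id, score in scores.items():
            if len(heap) < size:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:
                # Evict the current lowest score in favour of the better one.
                heapq.heapreplace(heap, (score, doc_id))
        # Highest score first; ties broken by doc id.
        ranked = sorted(heap, key=lambda entry: (-entry[0], entry[1]))
        results.append([(doc_id, score) for score, doc_id in ranked])
    return results
```

Each sub-query keeps its own queue, so the lists stay separate and no cross-query score comparison happens at shard level.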

Figure 2: General sequence diagram for collecting query results from shards

Final query results will be set to the QueryResult object as a single instance of CompoundTopDocs object.

For example, we are sending Hybrid Query search request with 3 sub-queries:

POST <index-name>/_search
{
    "query": {
        "hybrid": {
            "queries": [
                { /* standard term query 1 */ },
                { /* standard term query 2 */ },
                { /* neural query */ }
            ]
        }
    }
}

Our DTO with query results will look something like this:

CompoundTopDocs: //new class
    docs:
        [0] TopDocs: //existing class
            totalHits
            scoreDocs:
                [0] ScoreDoc: //existing class
                    docId
                    score
                [1] ScoreDoc:
                    docId
                    score
        [1] TopDocs:
            totalHits
            scoreDocs:
                [0] ScoreDoc:
                    docId
                    score
        [2] TopDocs:
            totalHits
            scoreDocs:
                [0] ScoreDoc:
                    docId
                    score
                [1] ScoreDoc:
                    docId
                    score
                [2] ScoreDoc:
                    docId
                    score

The main difference between the new CompoundTopDocs and the existing core implementation is that the new object holds a collection of results, one TopDocs per sub-query. The standard core object always holds a single result, for instance:

TopDocs: 
    totalHits
    scoreDocs:
         [0] ScoreDoc:
               docId
               score
         [1] ScoreDoc:
               docId
               score
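The shape of the DTO, with the two views mentioned under Risks / Known limitations, could look roughly like the sketch below. The class and field names mirror the outline above, but this is an illustration only. In particular, how the single-list view is derived is not specified in this design; taking the union of doc ids with the highest score per doc is purely an assumption made here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoreDoc:
    doc_id: int
    score: float


@dataclass
class TopDocs:
    total_hits: int
    score_docs: List[ScoreDoc] = field(default_factory=list)


@dataclass
class CompoundTopDocs:
    # View 2: one TopDocs per sub-query, kept unmerged.
    docs: List[TopDocs] = field(default_factory=list)

    def flat_view(self) -> TopDocs:
        """View 1: a single list, as existing contracts expect.

        ASSUMPTION: union of doc ids, keeping the best score seen per doc.
        Scores from different sub-queries are not comparable before
        normalization, so the ordering here carries no ranking meaning.
        """
        best: Dict[int, float] = {}
        for top_docs in self.docs:
            for sd in top_docs.score_docs:
                if sd.doc_id not in best or sd.score > best[sd.doc_id]:
                    best[sd.doc_id] = sd.score
        ordered = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
        return TopDocs(len(ordered), [ScoreDoc(d, s) for d, s in ordered])
```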

Testability

The new query phase searcher is testable via the existing /search REST API and lower-level direct API calls. Most testing will be done via unit and integration tests. We don’t need backward compatibility tests, as Neural-search is in experimental mode and there is no commitment to support previous versions.

Tests will focus on overall query stability and on the results that are collected. Explicit testing of result correctness is not possible at this stage, because score normalization and combination are done at a later stage by a future extension of the text processor (or a similar alternative implementation). The planned tests are:

collect result doc ids for hybrid query that has no sub-queries, has one sub-query and has multiple sub-queries
test on cluster with multiple shards/nodes
check that query result object is set and accessible from SearchQueryThenFetchAsyncAction part of workflow

These tests are part of the plugin repo CI and the main OpenSearch build CI, and can also be executed on demand from a development environment.

Tests for metrics such as normalized score correctness and performance will be added in later implementations, once the end-to-end solution is available.

Reference Links

Meta Issue for Feature: [META] Score Combination and Normalization for Semantics Search. Score Normalization for k-NN and BM25 #123
[RFC] High Level Approach and Design For Normalization and Score Combination #126
[RFC] Low Level Design for Normalization and Score Combination Query #174

martin-gaievski · 2023-09-01T16:35:27Z

During testing we found that in the multi-node scenario, which is typical in production, a custom implementation of TopDocs doesn't work well. We chose to adjust our format to the existing logic and switch to the existing TopDocs with a single list of scores. We explained the details of our findings in a core OpenSearch issue: opensearch-project/OpenSearch#9697. TL;DR: the coordinator node receives only a single list of scores from data nodes.

Updated approach for DTO

We're using the TopDocs class as the DTO for sending results from data nodes to the coordinator node. The scores of all sub-queries are placed in a single list of scores, and each sub-query's results are preceded by a special delimiter score. A special start/stop score is also the first and last element of the score list; this marks the TopDocs as belonging to a Hybrid Query and simplifies parsing of the score list.

High level protocol

     *  doc_id | magic_number_1    //start/stop
     *  doc_id | magic_number_2   //delimiter for sub-query 1
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 2
     *  ...
     *  doc_id | magic_number_2   //delimiter for sub-query 3
     *  ...
     *  doc_id | magic_number_1  //start/stop

Example

TopDocs: 
    totalHits
    scoreDocs:
         [0] ScoreDoc:
               0, start/stop_score
         [1] ScoreDoc:
               0, delimiter_score
         [2] ScoreDoc:
               0, 0.95
         [3] ScoreDoc:
               2, 0.9
         [4] ScoreDoc:
               3, 0.75
         [5] ScoreDoc:
               0, delimiter_score
         [6] ScoreDoc:
               0, 12.1
         [7] ScoreDoc:
               1, 8.9
         [8] ScoreDoc:
               0, start/stop_score
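
Producing this layout from per-sub-query results can be sketched as below. The two sentinel values and the placeholder doc id are hypothetical placeholders chosen for illustration; the actual magic numbers used by the plugin are not stated in this thread. `encode_hybrid_score_docs` is likewise a hypothetical name.

```python
from typing import List, Tuple

ScoreDoc = Tuple[int, float]  # (doc_id, score)

# ASSUMPTION: illustrative sentinel values only, not the plugin's real constants.
START_STOP_SCORE = -9549.5   # stands in for magic_number_1
DELIMITER_SCORE = -4422.25   # stands in for magic_number_2


def encode_hybrid_score_docs(
    sub_query_results: List[List[ScoreDoc]], placeholder_doc_id: int = 0
) -> List[ScoreDoc]:
    """Flatten per-sub-query results into one list framed by sentinel scores."""
    encoded: List[ScoreDoc] = [(placeholder_doc_id, START_STOP_SCORE)]
    for results in sub_query_results:
        # Every sub-query, even an empty one, gets its own delimiter entry.
        encoded.append((placeholder_doc_id, DELIMITER_SCORE))
        encoded.extend(results)
    encoded.append((placeholder_doc_id, START_STOP_SCORE))
    return encoded
```

Called with the two result lists from the example, `[[(0, 0.95), (2, 0.9), (3, 0.75)], [(0, 12.1), (1, 8.9)]]`, this yields the nine entries shown above in the same order.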

We can use only the score field of ScoreDoc to carry these markers; this is due to a limitation in how pipelines and processors are executed. With only one shard, it is not guaranteed that the processor is called before the FETCH phase runs, even when the processor is registered between the QUERY and FETCH phases. In that case the Fetch phase code fails if a docId is not valid, so the marker entries must carry valid doc ids.

The corresponding code in the normalization processor needs to support this new format. The main score processing logic remains the same; only the logic that parses TopDocs objects from shards/nodes changes.
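
The parsing side could look roughly like this. It is a sketch of the protocol described above, not the normalization processor's actual code; the sentinel values and the names `is_hybrid_score_docs` and `decode_hybrid_score_docs` are hypothetical.

```python
from typing import List, Tuple

ScoreDoc = Tuple[int, float]  # (doc_id, score)

# ASSUMPTION: illustrative sentinel values only, not the plugin's real constants.
START_STOP_SCORE = -9549.5   # stands in for magic_number_1
DELIMITER_SCORE = -4422.25   # stands in for magic_number_2


def is_hybrid_score_docs(score_docs: List[ScoreDoc]) -> bool:
    """A hybrid result starts and ends with the start/stop sentinel."""
    return (
        len(score_docs) >= 2
        and score_docs[0][1] == START_STOP_SCORE
        and score_docs[-1][1] == START_STOP_SCORE
    )


def decode_hybrid_score_docs(score_docs: List[ScoreDoc]) -> List[List[ScoreDoc]]:
    """Split the framed list back into one result list per sub-query."""
    if not is_hybrid_score_docs(score_docs):
        raise ValueError("score list is not framed by start/stop scores")
    groups: List[List[ScoreDoc]] = []
    for doc_id, score in score_docs[1:-1]:
        if score == DELIMITER_SCORE:
            groups.append([])  # a delimiter opens the next sub-query
        elif not groups:
            raise ValueError("score found before the first delimiter")
        else:
            groups[-1].append((doc_id, score))
    return groups
```

Only the score field is inspected to find the markers, which matches the constraint that doc ids in marker entries must stay valid for the Fetch phase.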

navneet1v · 2023-09-22T21:08:39Z

Resolving this GitHub issue as the changes for the RC of 2.10 are finalized and merged. Please create a GitHub issue if there are any further questions.

martin-gaievski added the Enhancements, neural-search, Features, RFC, and v2.9.0 labels Jun 3, 2023

github-actions bot added the untriaged label Jun 3, 2023

martin-gaievski mentioned this issue Jun 3, 2023

[FEATURE] New Doc Collector for Normalization and Score Combination Query #194

Closed

3 tasks

navneet1v removed the untriaged label Jun 22, 2023

navneet1v assigned martin-gaievski Jun 22, 2023

navneet1v added the v2.10.0 label and removed the v2.9.0 label Jul 15, 2023

martin-gaievski mentioned this issue Jul 19, 2023

[FEATURE] Provide way of defining methods for score normalization and combination in scope of Hybrid search #228

Closed

2 tasks

navneet1v mentioned this issue Aug 7, 2023

[RFC] Improved Hybrid Search relevancy by Normalization and Score Combination Feature API Design LLD #244

Closed

navneet1v closed this as completed Sep 22, 2023

martin-gaievski mentioned this issue Feb 14, 2024

[RFC] Aggregations and Hybrid query #604

Closed
