[FEATURE] Score mode support other than max with KNN nested field #1743

Closed
heemin32 opened this issue Jun 11, 2024 · 3 comments

@heemin32
Collaborator

heemin32 commented Jun 11, 2024

Currently, a KNN query on a nested field effectively works only with the max score mode, which uses the highest score among a parent's child documents (nested field documents) as the parent document's score. I would like to use other score modes, such as avg or sum over all child document scores.

How score mode works with nested field

During a query, each matched child document is returned with its score, and the child documents are joined back to their parent documents. Based on the score mode, each parent document's score is calculated from its returned child documents and their scores.
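
The join step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the OpenSearch implementation: `join_to_parents`, `SCORE_MODES`, and the toy scores are hypothetical names and values made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reducers, one per nested score_mode (illustration only).
SCORE_MODES = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": mean,
}

def join_to_parents(child_hits, score_mode):
    """Group (parent_id, child_score) pairs by parent and reduce the scores."""
    by_parent = defaultdict(list)
    for parent_id, score in child_hits:
        by_parent[parent_id].append(score)
    reduce_scores = SCORE_MODES[score_mode]
    return {pid: reduce_scores(scores) for pid, scores in by_parent.items()}

# Toy child hits: parent A has one strong and one weak child, B has two medium ones.
child_hits = [("A", 0.9), ("A", 0.1), ("B", 0.6), ("B", 0.5)]
max_scores = join_to_parents(child_hits, "max")  # A ranks above B
avg_scores = join_to_parents(child_hits, "avg")  # B ranks above A
```

With these toy scores the ranking flips between max and avg, which is why the choice of score mode matters.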

Challenge

A KNN query does not return all child documents of a parent document, only the one with the max score. Therefore, regardless of the score mode (min, max, avg, or sum), the parent score is always the max score as of now.
Even if we could return all child documents of the selected parent documents with their scores, the final result would not be guaranteed to be correct, because there could be a parent document that would have a higher score but none of whose child documents were returned during the search phase.
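
To make the first point concrete, here is a small Python sketch with made-up scores. It assumes, as described above, that the KNN query hands back only the best child per parent; the variable names are hypothetical.

```python
from statistics import mean

# Toy data: every child score per parent, as an exact search would see them.
all_children = {"A": [0.9, 0.1], "B": [0.6, 0.5]}

# What the KNN query returns today (per the description above):
# only the best-scoring child of each parent.
knn_returned = {pid: [max(scores)] for pid, scores in all_children.items()}

# Applying avg to a single child per parent just reproduces the max score.
avg_from_knn = {pid: mean(scores) for pid, scores in knn_returned.items()}
avg_exact = {pid: mean(scores) for pid, scores in all_children.items()}

top_from_knn = max(avg_from_knn, key=avg_from_knn.get)  # parent ranked first today
top_exact = max(avg_exact, key=avg_exact.get)           # parent a true avg would rank first
```

In this toy case the two approaches disagree on the top parent, because avg over one returned child is indistinguishable from max.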

Solution

Option 1

After querying the KNN field, we retrieve all sibling documents of the matched child documents, calculate their scores, and add them to the search result. The rest of the work is handled by OpenSearch core. Still, the end result might not exactly match what an exact search would return, because we compare the final scores of only a subset of parent documents, not all of them.
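
A rough Python sketch of Option 1, including its limitation, might look like the following. Everything here is hypothetical: `index`, `knn_parents`, and `option1` are stand-ins, the KNN search is simulated by brute force, and the score transform `1 / (1 + squared distance)` is an assumption for the l2 space.

```python
from statistics import mean

# Hypothetical in-memory index: parent id -> child vectors.
index = {
    "A": [[1.0, 1.0], [5.0, 5.0]],
    "B": [[1.2, 1.2], [1.5, 1.5]],
}

def l2_score(query, vector):
    # Assumed score transform for the l2 space: 1 / (1 + squared distance).
    dist_sq = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + dist_sq)

def knn_parents(query, k):
    """Stand-in for the deduplicated KNN search: top-k parents by best child."""
    best = {pid: max(l2_score(query, v) for v in vecs) for pid, vecs in index.items()}
    return sorted(best, key=best.get, reverse=True)[:k]

def option1(query, k, reduce_scores=mean):
    """Expand each retrieved parent to all of its children, then reduce their scores."""
    parents = knn_parents(query, k)
    return {pid: reduce_scores([l2_score(query, v) for v in index[pid]]) for pid in parents}

query = [1.0, 1.0]
scores_k1 = option1(query, k=1)  # only parent A is retrieved and scored
scores_k2 = option1(query, k=2)  # both parents are scored; B has the higher avg
```

With `k=1`, parent B never enters the candidate set even though its avg score is higher than A's, which is the subset limitation mentioned above.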

Option 2

When the score mode is not max, we run the KNN search at the nested document level without any deduplication. The score mode is then applied only to the returned nested documents. This might not guarantee that the query returns k results.
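
The following Python sketch illustrates Option 2 and why it may return fewer than k parents. The function `option2` and the scores are hypothetical, and the child-level KNN search is simulated by sorting a precomputed list.

```python
from collections import defaultdict
from statistics import mean

# Toy (parent_id, child_score) pairs; parent A owns the three best children.
scored_children = [("A", 0.95), ("A", 0.90), ("A", 0.85), ("B", 0.40), ("C", 0.30)]

def option2(scored_children, k, reduce_scores=mean):
    """Take the top-k children with no per-parent deduplication, then reduce per parent."""
    top_children = sorted(scored_children, key=lambda hit: hit[1], reverse=True)[:k]
    by_parent = defaultdict(list)
    for parent_id, score in top_children:
        by_parent[parent_id].append(score)
    return {pid: reduce_scores(scores) for pid, scores in by_parent.items()}

result = option2(scored_children, k=3)  # all three top children belong to parent A
```

Here `k=3` yields a single parent, because the k nearest children all collapse into the same parent document.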

Option 3

Introduce an index setting that lets you execute the KNN search at the nested field level (the old behavior, where there is no deduplication per parent document).

Alternative

1. Rescoring (Incorrect)

Retrieve parent documents based on the max child document score. Then rescore each document by calculating the score over its child documents.

{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [
              1,
              1,
              1
            ],
            "k": 2
          }
        }
      }
    }
  },
  "rescore": {
    "window_size": 2,
    "query": {
      "query_weight": 0.0,
      "rescore_query_weight": 1.0,
      "rescore_query": {
        "nested": {
          "path": "nested_field",
          "score_mode": "avg",
          "query": {
            "function_score": {
              "script_score": {
                "script": {
                  "lang": "knn",
                  "source": "knn_score",
                  "params": {
                    "field": "nested_field.my_vector",
                    "query_value": [
                      1,
                      1,
                      1
                    ],
                    "space_type": "l2"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

2. Exact search (Not efficient)

We can run an exact search to compute the score of every child document.

{
  "query": {
    "nested": {
      "path": "nested_field",
      "score_mode": "avg",
      "query": {
        "function_score": {
          "script_score": {
            "script": {
              "lang": "knn",
              "source": "knn_score",
              "params": {
                "field": "nested_field.my_vector",
                "query_value": [
                  1,
                  1,
                  1
                ],
                "space_type": "l2"
              }
            }
          }
        }
      }
    }
  }
}
@yuye-aws
Member

There is a use case from the text_chunking processor. Although we recommend the max score mode for searching the chunked embedding field (https://opensearch.org/docs/latest/search-plugins/text-chunking/#step-4-search-the-index-using-neural-search), it also makes sense to search with the avg score mode.

Supporting this in the knn query is a good idea, since the neural and neural sparse queries will inherit the behavior.

@yuye-aws
Member

Option 1

After querying the KNN field, we retrieve all sibling documents of the matched child documents, calculate their scores, and add them to the search result. The rest of the work is handled by OpenSearch core. Still, the end result might not exactly match what an exact search would return, because we compare the final scores of only a subset of parent documents, not all of them.

Option 2

When the score mode is not max, we run the KNN search at the nested document level without any deduplication. The score mode is then applied only to the returned nested documents. This might not guarantee that the query returns k results.

Option 1 makes more sense to me. Personally, I do not like the special-case logic in Option 2 that depends on whether the score_mode is max.

@vamshin vamshin moved this from Backlog (Hot) to 2.19.0 in Vector Search RoadMap Oct 4, 2024
@vamshin vamshin added the Roadmap:Vector Database/GenAI Project-wide roadmap label label Oct 4, 2024
@heemin32 heemin32 moved this from 2.19.0 to Backlog (Hot) in Vector Search RoadMap Oct 30, 2024
@heemin32
Collaborator Author

Closing in favor of #2283
Feel free to reopen if there is a requirement that the mentioned PR cannot cover.

@github-project-automation github-project-automation bot moved this from Backlog (Hot) to ✅ Done in Vector Search RoadMap Dec 11, 2024