Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS #2283

Merged 1 commit on Dec 11, 2024

Collaborator

@heemin32 heemin32 commented Nov 20, 2024

Description

This PR introduces support for returning all nested fields with their scores inside innerHit for nested k-NN fields, applicable to both Lucene and FAISS engines.

The implementation involves executing a search request across all segments and collecting results at the shard level, similar to the approach used in disk-based k-NN searches. After reducing the results to the top k, we retrieve all sibling documents associated with these results. Using the IDs of the retrieved sibling documents as filtered document IDs, we perform another exact search to score them comprehensively.
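
The flow described above can be sketched in plain Java as a toy model. This is only an illustration under stated assumptions, not the plugin's code: the class and method names (ExpandNestedDocsSketch, searchTopK, expandAndRescore) are hypothetical, segments are modeled as in-memory lists, the first phase is assumed to keep the best child per parent, and the score assumes the l2 space, 1 / (1 + squared distance).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of the two-phase flow. Every name here is hypothetical, not plugin code.
final class ExpandNestedDocsSketch {

    // A nested (child) document: its doc ID, its parent's ID, and its vector.
    static final class Child {
        final int docId;
        final int parentId;
        final float[] vector;

        Child(int docId, int parentId, float[] vector) {
            this.docId = docId;
            this.parentId = parentId;
            this.vector = vector;
        }
    }

    // Score for the l2 space: 1 / (1 + squared Euclidean distance).
    static float score(float[] query, float[] vector) {
        float squared = 0f;
        for (int i = 0; i < query.length; i++) {
            float diff = query[i] - vector[i];
            squared += diff * diff;
        }
        return 1f / (1f + squared);
    }

    // Phase 1: search every segment, merge at shard level, reduce to the top k.
    // Assumption: only the best-scoring child of each parent is kept.
    static List<Child> searchTopK(List<List<Child>> segments, float[] query, int k) {
        Map<Integer, Child> bestPerParent = new HashMap<>();
        for (List<Child> segment : segments) {
            for (Child child : segment) {
                bestPerParent.merge(child.parentId, child,
                    (a, b) -> score(query, a.vector) >= score(query, b.vector) ? a : b);
            }
        }
        List<Child> merged = new ArrayList<>(bestPerParent.values());
        merged.sort(Comparator.comparingDouble((Child c) -> -score(query, c.vector)));
        return merged.subList(0, Math.min(k, merged.size()));
    }

    // Phase 2: collect every sibling of the top-k hits and exact-score each one.
    // Returns a map from child doc ID to its exact score.
    static Map<Integer, Float> expandAndRescore(List<List<Child>> segments, List<Child> topK, float[] query) {
        Set<Integer> parents = new HashSet<>();
        for (Child hit : topK) {
            parents.add(hit.parentId);
        }
        Map<Integer, Float> scores = new HashMap<>();
        for (List<Child> segment : segments) {
            for (Child child : segment) {
                // The parent check plays the role of the filtered doc ID set.
                if (parents.contains(child.parentId)) {
                    scores.put(child.docId, score(query, child.vector));
                }
            }
        }
        return scores;
    }
}
```

The second phase only touches children of parents that survived the top-k reduction, so its cost is bounded by k parents times their nested document count.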

Here are additional explanations for the changes made:

  1. Added JsonPath as a dependency exclusively for integration testing, pinned to version 2.8.0 because 2.9.0 has a dependency conflict with SLF4J.
  2. Adopted a composite approach in NestedKnnVectorInnerHitQuery.java to enable code reuse between byte vectors and float vectors.
  3. Replaced the use of BitSet with DocIdSetIterator for filteredDocId to eliminate the overhead of converting from an iterator to a BitSet and back to an iterator.
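
As an illustration of the composite approach in item 2, here is a minimal sketch in which the shared logic is written once and the element type sits behind a small interface. All names (VectorScorer, FloatVectorScorer, ByteVectorScorer, SharedInnerHitLogic) are hypothetical and do not reflect the actual layout of NestedKnnVectorInnerHitQuery.java; the l2-style score is also an assumption made for the example.

```java
// All names are hypothetical; this only illustrates composition over duplication.
interface VectorScorer {
    // Exact l2-style score of the stored vector at the given ordinal.
    float score(int ordinal);
}

final class FloatVectorScorer implements VectorScorer {
    private final float[][] vectors;
    private final float[] query;

    FloatVectorScorer(float[][] vectors, float[] query) {
        this.vectors = vectors;
        this.query = query;
    }

    @Override
    public float score(int ordinal) {
        float squared = 0f;
        for (int i = 0; i < query.length; i++) {
            float diff = query[i] - vectors[ordinal][i];
            squared += diff * diff;
        }
        return 1f / (1f + squared);
    }
}

final class ByteVectorScorer implements VectorScorer {
    private final byte[][] vectors;
    private final byte[] query;

    ByteVectorScorer(byte[][] vectors, byte[] query) {
        this.vectors = vectors;
        this.query = query;
    }

    @Override
    public float score(int ordinal) {
        int squared = 0;
        for (int i = 0; i < query.length; i++) {
            int diff = query[i] - vectors[ordinal][i];
            squared += diff * diff;
        }
        return 1f / (1f + squared);
    }
}

// Shared logic, written once; it never needs to know the element type.
final class SharedInnerHitLogic {
    private final VectorScorer scorer;

    SharedInnerHitLogic(VectorScorer scorer) {
        this.scorer = scorer;
    }

    float[] scoreAll(int[] ordinals) {
        float[] scores = new float[ordinals.length];
        for (int i = 0; i < ordinals.length; i++) {
            scores[i] = scorer.score(ordinals[i]);
        }
        return scores;
    }
}
```

With this shape, adding another element type means adding one scorer class rather than copying the query logic.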

Related Issues

Resolves #2249

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@heemin32 heemin32 changed the title Multiple innerHit in nested fields Add Support for Multi Values in innerHit for Nested k-NN Fields in Lucene and FAISS Nov 20, 2024
@heemin32 heemin32 force-pushed the innerhit branch 2 times, most recently from 36cab2b to 80998eb Compare November 20, 2024 18:12
Collaborator

@navneet1v navneet1v left a comment

I am reviewing the Lucene lib classes right now, and it will take some time for me to get through them, but I am publishing these comments now to kick-start the discussion on some of the other comments in the code.

Comment on lines 106 to 118
if (createQueryRequest.getRescoreContext().isPresent()) {
    return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery);
} else if (ENGINES_SUPPORTING_MULTI_VECTORS.contains(knnEngine) && isInnerHitQuery) {
    return new NativeEngineKnnVectorQuery(knnQuery, QUERY_UTILS, isInnerHitQuery);
} else {
    return knnQuery;
}

Collaborator

I think we can simplify this logic. We can always use NativeEngineKnnVectorQuery, since we are not running the query rewrites.

@shatejas, @jmazanec15, was there a reason we kept the logic that sends the query down different paths, NativeEngineKnnVectorQuery versus knnQuery?

Collaborator

@shatejas shatejas Nov 25, 2024

It was to isolate disk-based vector search as a precaution; we can always call NativeQuery. If we do that, we should consider whether we still need the KNNQuery and KNNWeight classes, as they make the code convoluted, with NativeEngine again delegating to another query.

Collaborator

@shatejas, at least for this PR, I would like us to track and simplify this logic, and then maybe take up another PR for removing KNNQuery. At least for now, I think removing KNNQuery would be a big refactor that is completely out of scope for this PR. Open to suggestions here.

Collaborator Author

I can create a separate PR for the change if needed, making it easier to revert in case of any unforeseen issues.

Collaborator

When you say a separate PR, do you mean one for simplifying this condition?

Collaborator Author

Yes

) throws IOException {
    // Construct query
    List<Callable<TopDocs>> nestedQueryTasks = new ArrayList<>(leafReaderContexts.size());
    Weight filterWeight = getFilterWeight(indexSearcher);

Collaborator

The filter query seems to execute twice in this flow (once in rewrite and again here). It's redundant and might add latency.

Is there an alternative solution where support for inner hits is added to the existing Lucene queries instead? There might be optimizations that could be leveraged, such as executing the filter query only once and not creating the Doc and score query multiple times.

Collaborator Author

The filter weight in the AbstractKnnVectorQuery class needs to be stored in a variable that is accessible from the child class. Then we can reuse it.
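
The idea in this reply, building the weight once in the parent class and letting the child class reuse it, might look roughly like the sketch below. It is a generic caching pattern with hypothetical names (CachingBaseQuery, InnerHitQuerySketch), and a String stands in for the Lucene Weight; it does not show how AbstractKnnVectorQuery is actually structured.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a base query class; not Lucene's AbstractKnnVectorQuery.
abstract class CachingBaseQuery {
    // Stands in for the costly creation of the filter weight.
    private final Supplier<String> weightFactory;
    // Cached result; null until the first request.
    private String filterWeight;

    CachingBaseQuery(Supplier<String> weightFactory) {
        this.weightFactory = weightFactory;
    }

    // Builds the weight on the first call and returns the cached value afterwards.
    protected final String filterWeight() {
        if (filterWeight == null) {
            filterWeight = weightFactory.get();
        }
        return filterWeight;
    }
}

// Hypothetical child class that reuses the parent's cached weight in both steps.
final class InnerHitQuerySketch extends CachingBaseQuery {
    InnerHitQuerySketch(Supplier<String> weightFactory) {
        super(weightFactory);
    }

    String rewriteStep() {
        return "rewrite using " + filterWeight();
    }

    String expandStep() {
        return "expand using " + filterWeight();
    }
}
```

This sketch is not thread-safe; a real query object shared across search threads would need to account for that.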

@heemin32 heemin32 force-pushed the innerhit branch 6 times, most recently from 661cff7 to 2b1c552 Compare November 26, 2024 06:58
navneet1v previously approved these changes Dec 9, 2024
Collaborator

@navneet1v navneet1v left a comment

Code looks good to me. Just check one thing: since we are not returning all the child documents of the parent docs, will this result in the same behavior? That is, if one parent's child docs score better than another parent's child docs, will OpenSearch return just 1 parent doc to the customer, or will it return 2 parent docs?

jmazanec15 previously approved these changes Dec 10, 2024
@heemin32
Collaborator Author

Code looks good to me. Just check one thing: since we are not returning all the child documents of the parent docs, will this result in the same behavior? That is, if one parent's child docs score better than another parent's child docs, will OpenSearch return just 1 parent doc to the customer, or will it return 2 parent docs?

Even when multiple nested documents are returned per parent document, they are joined back to the parent document, ensuring that the final parent document count remains unaffected. It has been confirmed that, in such cases, the result will still include 2 parent documents.

Create Index With 2 shards

PUT /my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 2
    }
  },
  "mappings": {
    "properties": {
      "nested_field": {
        "type": "nested",
        "properties": {
          "my_vector": {
            "type": "knn_vector",
            "dimension": 3,
            "space_type": "l2",
            "method": {
              "name": "hnsw",
              "engine": "faiss"
            }
          }
        }
      }
    }
  }
}

Ingest 4 documents with 10 nested docs each

PUT /_bulk?refresh=true
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{"nested_field":[{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]},{"my_vector":[1,1,1]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{"nested_field":[{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]},{"my_vector":[10,10,10]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{"nested_field":[{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]},{"my_vector":[100,100,100]}]}
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{"nested_field":[{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]},{"my_vector":[1000,1000,1000]}]}

Check that documents are distributed across 2 shards

GET /_cat/shards/my-knn-index-1
my-knn-index-1 0 p STARTED    33 4.8kb 127.0.0.1 integTest-0
my-knn-index-1 0 r UNASSIGNED                    
my-knn-index-1 1 p STARTED    11 4.5kb 127.0.0.1 integTest-0
my-knn-index-1 1 r UNASSIGNED                    

Search

GET /my-knn-index-1/_search
{
  "_source": false,
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector": {
            "vector": [
              10,
              10,
              10
            ],
            "k": 4,
            "expand_nested_docs": true
          }
        }
      },
      "score_mode": "max"
    }
  }
}

Result

Confirmed that 4 results are returned as expected.

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my-knn-index-1",
        "_id": "2",
        "_score": 1.0
      },
      {
        "_index": "my-knn-index-1",
        "_id": "1",
        "_score": 0.0040983604
      },
      {
        "_index": "my-knn-index-1",
        "_id": "3",
        "_score": 4.115057E-5
      },
      {
        "_index": "my-knn-index-1",
        "_id": "4",
        "_score": 3.4010122E-7
      }
    ]
  }
}
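
As a sanity check on the scores above, the sketch below recomputes them, assuming the score the k-NN plugin documents for the l2 space: 1 / (1 + d), where d is the squared Euclidean distance. For _id 1, d = 3 × (10 − 1)² = 243, so the score is 1 / 244 ≈ 0.0040984, which matches the response. The class name L2ScoreCheck is hypothetical.

```java
// Recomputes the l2-space scores for the four parent documents in the example.
final class L2ScoreCheck {

    // Score for the l2 space: 1 / (1 + squared Euclidean distance).
    static double l2Score(double[] query, double[] vector) {
        double squared = 0.0;
        for (int i = 0; i < query.length; i++) {
            double diff = query[i] - vector[i];
            squared += diff * diff;
        }
        return 1.0 / (1.0 + squared);
    }

    public static void main(String[] args) {
        double[] query = {10, 10, 10};
        // One representative nested vector per parent, in _id order 1..4.
        double[][] vectors = {
            {1, 1, 1},
            {10, 10, 10},
            {100, 100, 100},
            {1000, 1000, 1000}
        };
        for (int i = 0; i < vectors.length; i++) {
            System.out.printf("_id %d -> %.8e%n", i + 1, l2Score(query, vectors[i]));
        }
    }
}
```

Because all 10 nested vectors of a parent are identical in this data set, "score_mode": "max" yields the same value as any single nested document's score.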

@navneet1v
Collaborator

@heemin32 thanks for confirming. Do we have a similar IT added? Also, were you able to figure out where in the code this translation of child to parent docs happens, and how it ensures that we pick up parent docs == size only?

@heemin32
Collaborator Author

heemin32 commented Dec 10, 2024

@heemin32 thanks for confirming. Do we have a similar IT added? Also, were you able to figure out where in the code this translation of child to parent docs happens, and how it ensures that we pick up parent docs == size only?

Let me add one. There was a hidden bug that I missed as well. The translation of child to parent docs is happening in NestedQueryBuilder -> OpenSearchToParentBlockJoinQuery -> ToParentBlockJoinQuery.
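
To illustrate why extra nested hits do not change the parent count, here is a plain-Java sketch of the join step: scored child hits are collapsed to one entry per parent (keeping the best child score, as with "score_mode": "max"), and only then is the size limit applied. This is a conceptual model with hypothetical names (ParentJoinSketch, joinToParents, topParents), not the code of ToParentBlockJoinQuery.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; not ToParentBlockJoinQuery itself.
final class ParentJoinSketch {

    // Collapses scored child hits into one entry per parent,
    // keeping the best child score (score_mode "max").
    static Map<Integer, Float> joinToParents(int[] parentOfChild, float[] childScores) {
        Map<Integer, Float> parentScores = new LinkedHashMap<>();
        for (int i = 0; i < parentOfChild.length; i++) {
            parentScores.merge(parentOfChild[i], childScores[i], Float::max);
        }
        return parentScores;
    }

    // Returns up to `size` parent IDs, best score first.
    static List<Integer> topParents(Map<Integer, Float> parentScores, int size) {
        List<Map.Entry<Integer, Float>> entries = new ArrayList<>(parentScores.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : entries.subList(0, Math.min(size, entries.size()))) {
            ids.add(entry.getKey());
        }
        return ids;
    }
}
```

However many children a parent contributes, it occupies a single slot after the join, so the number of parents returned depends only on how many distinct parents were hit and on size.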

@heemin32 heemin32 merged commit 88792e4 into opensearch-project:main Dec 11, 2024
37 of 39 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2283-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 88792e42f121b050f2fc9cf32b039052aab62128
# Push it to GitHub
git push --set-upstream origin backport/backport-2283-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2283-to-2.x.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 11, 2024
Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 88792e4)
heemin32 added a commit that referenced this pull request Dec 11, 2024
Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 88792e4)
navneet1v pushed a commit that referenced this pull request Dec 11, 2024
Signed-off-by: Heemin Kim <[email protected]>
(cherry picked from commit 88792e4)

Co-authored-by: Heemin Kim <[email protected]>
owenhalpert pushed a commit to owenhalpert/k-NN that referenced this pull request Dec 19, 2024
@heemin32
Collaborator Author

heemin32 commented Jan 8, 2025

The benchmark results do not indicate a significant difference in latency when "expand_nested_docs" is enabled.

Setup

[Image: Screenshot 2025-01-08 at 9.48.52 AM]

Result

[Image: Screenshot 2025-01-08 at 9.53.57 AM]

Development

Successfully merging this pull request may close these issues.

[RFC] Multiple inner hits for nested field