Add ParentJoin KNN support #12434
From a quick look, this lower-level KNN collection API looks interesting. It currently has a large API surface, presumably because extending the queue made it easier to get a working prototype, which is cool. I'm curious how much leaner it can be made. It feels like we'd need at least |
@alessandrobenedetti I took some of your ideas on deduplicating vector IDs based on some other id for this PR. If this work continues, I think some of it can transfer to the native multi-vector support in Lucene.
I will dig a bit more on making this cleaner. My biggest performance concerns are around keeping track of the heap-index -> ID and shuffling those around so often and resolving the docId by vector ordinal on every push.
@jpountz I took another shot at the KnnResults interface. I restricted the abstract and |
@jpountz my original benchmarks were flawed. There was a bug in my testing. Nested is actually 80% slower (1.8x the current search time). I am investigating the possible causes.
@msokolov let me know if there are further changes required.
Thanks, I think you addressed my comments and I don't have anything else. My only outstanding question is whether we have any approach to performance testing this. We don't have any sample documents structured like this or test queries in luceneutil today, but that would be a nice follow-up.
A `join` within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.

However, when searching for the closest `k`, it is still the k nearest children vectors with no way to join back to the parent.

This commit adds this ability through some significant changes:
- New leaf reader function that allows a collector for knn results
- The knn results can then utilize bit-sets to join back to the parent id

This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.

This does not replace the need for multi-valued vectors, which is important for other ranking methods (e.g. colbert token embeddings). But, it could be used in the case when metadata about the passage embedding must be stored (e.g. the related passage).
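To make the bit-set join concrete, here is a minimal sketch of diversifying child hits by parent. It is not the code in this PR: it uses `java.util.BitSet` as a stand-in for Lucene's parent bit set, and the class `DiversifyByParentSketch`, the `Hit` holder, and the method `bestChildPerParent` are hypothetical names chosen for illustration. It assumes the usual block layout in which children are indexed immediately before their parent.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
class DiversifyByParentSketch {

  /** A child hit: the child's doc ID and its similarity score. */
  static final class Hit {
    final int childDoc;
    final float score;

    Hit(int childDoc, float score) {
      this.childDoc = childDoc;
      this.score = score;
    }
  }

  /**
   * Keeps only the best-scoring child for each parent.
   * Assumes children precede their parent in doc ID order, so the parent of a
   * child is the first set bit at or after the child's doc ID.
   */
  static Map<Integer, Hit> bestChildPerParent(BitSet parentBits, int[] childDocs, float[] scores) {
    Map<Integer, Hit> best = new HashMap<>();
    for (int i = 0; i < childDocs.length; i++) {
      int parent = parentBits.nextSetBit(childDocs[i]);
      Hit current = best.get(parent);
      if (current == null || scores[i] > current.score) {
        best.put(parent, new Hit(childDocs[i], scores[i]));
      }
    }
    return best;
  }
}
```

A real collector would do this incrementally inside a bounded heap while the graph is being searched, rather than over a finished array of hits. That is where the heap-index bookkeeping cost discussed above comes from.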
…rn highest score child doc ID by parent id (#12510) The current query returns parent IDs based on the nearest child-id score. However, it's difficult to invert that relationship (meaning determining exactly which child was nearest during search). So, I renamed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery`, and it now returns the nearest child-id instead of just that child's parent id. The results are still diversified by parent-id. Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set. Related to: #12434
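As a rough sketch of that last step, resolving a returned child ID back to its parent (and to the start of its block) only needs two bit-set lookups. This again uses `java.util.BitSet` as a stand-in for the parent bit set. The class `ParentLookupSketch` and its methods are hypothetical, and the sketch assumes children are indexed immediately before their parent.

```java
import java.util.BitSet;

// Hypothetical illustration only; none of these names exist in Lucene.
class ParentLookupSketch {

  /** Parent of a child: the first parent bit at or after the child's doc ID. */
  static int parentOf(BitSet parentBits, int childDoc) {
    return parentBits.nextSetBit(childDoc);
  }

  /**
   * First child of a parent's block: one past the previous parent bit,
   * or doc 0 when there is no earlier parent (previousSetBit returns -1).
   */
  static int firstChildOf(BitSet parentBits, int parentDoc) {
    return parentBits.previousSetBit(parentDoc - 1) + 1;
  }
}
```

With parents at doc IDs 3 and 6, child 4 resolves to parent 6, and that parent's block starts at doc 4.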
Thanks @benwtrent for this work! I finally had the chance to take a look. As a side note, do you happen to have any performance benchmarks? I am quite curious, as I always label nested-doc approaches in Lucene as 'slow', but having some facts (that potentially contradict my statement) would be super cool!
Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector) and it could possibly copy/paste the deduplicating nearest neighbor min-heap implementation.
The following test was completed over 139004 documents with 768 float32 dimensions. The statistics for the nested value distributions:
GASP! Nested seems 2x to 4x slower! But, keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors, and consequently doing 3-5x more vector comparisons and deeper exploration of the graph. This lines up really nicely with the performance difference. Since HNSW is … I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).
No worries at all! My work is still paused, looking for sponsors, so no harm! When I resume it, as you said, I may find benefits in (and make improvements to) the new data structures added. (I admit I got lost in the number of KnnCollectors and similar classes added, but I'm super curious to explore each of them thoroughly.)
@benwtrent - did this really make it into 9.8.0? I downloaded the 9.8.0 release and ToParentBlockJoinFloatKnnVectorQuery does not seem to be present.
@david-sitsky sorry for the confusion, it was renamed |
Ah, no worries, thanks. We should update the changelog https://lucene.apache.org/core/9_8_0/changes/Changes.html#v9.8.0.new_features since it still refers to the old class names.