Add ParentJoin KNN support #12434

Merged
merged 49 commits into from
Aug 7, 2023

Conversation

benwtrent
Member

@benwtrent benwtrent commented Jul 11, 2023

A join within Lucene is built by adding child-docs and parent-docs in order. Since our vector field already supports sparse indexing, it should be able to support parent join indexing.

However, a search for the closest k still returns the k nearest child vectors, with no way to join back to the parent.

This commit adds this ability through some significant changes:

  • New leaf reader function that allows a collector for knn results
  • The knn results can then utilize bit-sets to join back to the parent id

This type of support is critical for nearest passage retrieval over larger documents. Generally, you want the top-k documents and knowledge of the nearest passages over each top-k document. Lucene's join functionality is a nice fit for this.

This does not replace the need for multi-valued vectors, which are important for other ranking methods (e.g. ColBERT token embeddings). But it could be used when metadata about the passage embedding must be stored (e.g. the related passage).
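To make the block layout concrete, here is a minimal sketch of how a parent bit set can map a matching child back to its parent. It assumes each block is indexed as children first and the parent last, and it uses `java.util.BitSet` as a stand-in for Lucene's own bit set classes. The class and method names are hypothetical and are not part of this PR.

```java
import java.util.BitSet;

/** Illustrative only: java.util.BitSet stands in for Lucene's parent bit set. */
final class ParentLookupSketch {

  /**
   * Children are indexed first and their parent last, so the parent of a child
   * is the first set bit at or after the child's doc ID.
   * Returns -1 when no parent follows the child.
   */
  static int parentOf(BitSet parentBits, int childDoc) {
    return parentBits.nextSetBit(childDoc);
  }

  public static void main(String[] args) {
    // Two blocks: docs 0-2 are children of parent 3; docs 4-5 are children of parent 6.
    BitSet parents = new BitSet();
    parents.set(3);
    parents.set(6);
    for (int child : new int[] {0, 2, 4, 5}) {
      System.out.println("child " + child + " -> parent " + parentOf(parents, child));
    }
  }
}
```

Note that passing a doc ID that is itself a parent returns that same ID, so callers that need to tell the two apart would have to check the bit first.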

@jpountz
Contributor

jpountz commented Jul 11, 2023

From a quick look, this lower-level KNN collection API looks interesting. It currently has a large API surface, presumably because extending the queue made it easier to get a working prototype, which is cool. I'm curious how much leaner it can be made. It feels like we'd need at least collect(int docID, float similarity), float minSimilarity() and TopDocs topDocs(). Would that be enough, or is there more?
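As a rough illustration of how small such an API could be, the sketch below implements the three suggested methods on top of a bounded min-heap. It is not the interface that was merged: the class name `TopKCollectorSketch` and the `Hit` type are hypothetical, and a plain list of hits stands in for Lucene's `TopDocs`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical minimal collector mirroring the three methods suggested above. */
final class TopKCollectorSketch {

  /** A (docID, similarity) pair. */
  static final class Hit {
    final int docID;
    final float similarity;

    Hit(int docID, float similarity) {
      this.docID = docID;
      this.similarity = similarity;
    }
  }

  private final int k;
  // Min-heap on similarity: the head is the weakest hit currently kept.
  private final PriorityQueue<Hit> heap =
      new PriorityQueue<>((a, b) -> Float.compare(a.similarity, b.similarity));

  TopKCollectorSketch(int k) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    this.k = k;
  }

  /** Offers a hit; returns true if it was kept. */
  boolean collect(int docID, float similarity) {
    if (heap.size() < k) {
      heap.add(new Hit(docID, similarity));
      return true;
    }
    if (similarity > heap.peek().similarity) {
      heap.poll(); // evict the current weakest hit
      heap.add(new Hit(docID, similarity));
      return true;
    }
    return false;
  }

  /** Similarity a new hit must beat to be kept; -Infinity until k hits are held. */
  float minSimilarity() {
    return heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek().similarity;
  }

  /** Kept hits, best first (a stand-in for a TopDocs result). */
  List<Hit> topDocs() {
    List<Hit> hits = new ArrayList<>(heap);
    hits.sort((a, b) -> Float.compare(b.similarity, a.similarity));
    return hits;
  }
}
```

A graph search could consult `minSimilarity()` to stop exploring candidates that cannot enter the result set, which is the main reason to expose it alongside `collect`.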

@benwtrent
Member Author

@alessandrobenedetti I took some of your ideas on deduplicating vector IDs based on some other id for this PR. If this work continues, I think some of it can transfer to the native multi-vector support in Lucene.

@benwtrent
Member Author

would it be enough or is there more?

I will dig a bit more on making this cleaner.

My biggest performance concerns are around keeping track of the heap-index -> ID and shuffling those around so often and resolving the docId by vector ordinal on every push.

@benwtrent
Member Author

@jpountz I took another shot at the KnnResults interface. I restricted the abstract and @Override methods to narrow the API. Additionally, I disconnected it from the queue, but it still has a queue object internally that sub-classes can utilize.

@benwtrent
Member Author

@jpountz my original benchmarks were flawed. There was a bug in my testing. Nested is actually 80% slower than the current search times (that is, a search takes 1.8x as long).

I am investigating the possible causes.

@benwtrent benwtrent requested a review from msokolov August 1, 2023 14:09
@benwtrent
Member Author

@msokolov let me know if there are further changes required.

Contributor

@msokolov msokolov left a comment

thanks, I think you addressed my comments and I don't have anything else. I guess my only outstanding question is whether we have any approach to performance testing this -- we don't have any sample documents structured like this or test queries today in luceneutil, but that would be a nice followup

@benwtrent benwtrent merged commit a65cf89 into apache:main Aug 7, 2023
@benwtrent benwtrent deleted the feature/add-join-support-knn branch August 7, 2023 18:46
benwtrent added a commit that referenced this pull request Aug 7, 2023
benwtrent added a commit that referenced this pull request Aug 14, 2023
…when parents are missing (#12504)

This is a follow up to: #12434

Adds a test for when parents are missing in the index and verifies we return no hits. Previously this would have thrown an NPE
benwtrent added a commit that referenced this pull request Aug 16, 2023
…rn highest score child doc ID by parent id (#12510)

The current query returns parent IDs based on the nearest child's score. However, it's difficult to invert that relationship (that is, to determine exactly which child was the nearest during search).

So, I renamed the new `ToParentBlockJoin[Byte|Float]KnnVectorQuery` to `DiversifyingChildren[Byte|Float]KnnVectorQuery`, and it now returns the nearest child ID instead of just that child's parent ID. The results are still diversified by parent ID.

Now it's easy to determine the nearest child vector, as that is what the query returns. Determining its parent is as simple as using the previously provided parent bit set.

Related to: #12434
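The idea of returning child IDs while diversifying by parent can be sketched as follows. This is a simplified, after-the-fact deduplication over an already scored list, not the heap-based collection the query performs during graph search. All names are hypothetical, and `java.util.BitSet` stands in for Lucene's parent bit set.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: keeps the best-scoring child per parent, then takes the top k. */
final class DiversifyByParentSketch {

  /** Returns up to k child doc IDs, at most one per parent, best score first. */
  static List<Integer> topChildren(int[] childDocs, float[] scores, BitSet parentBits, int k) {
    // parent doc ID -> index (into childDocs/scores) of its best child so far
    Map<Integer, Integer> bestIndexByParent = new HashMap<>();
    for (int i = 0; i < childDocs.length; i++) {
      int parent = parentBits.nextSetBit(childDocs[i]);
      if (parent < 0) {
        continue; // no parent after this child: skip it rather than fail
      }
      Integer best = bestIndexByParent.get(parent);
      if (best == null || scores[i] > scores[best]) {
        bestIndexByParent.put(parent, i);
      }
    }
    // Order the per-parent winners by descending score.
    List<Integer> winners = new ArrayList<>(bestIndexByParent.values());
    winners.sort((a, b) -> Float.compare(scores[b], scores[a]));
    List<Integer> result = new ArrayList<>();
    for (int idx : winners) {
      if (result.size() == k) {
        break;
      }
      result.add(childDocs[idx]);
    }
    return result;
  }
}
```

Because the result holds child IDs, a caller can recover each parent with the same `nextSetBit` lookup on the parent bit set.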
@zhaih zhaih added this to the 9.8.0 milestone Sep 20, 2023
@alessandrobenedetti
Contributor

Thanks @benwtrent for this work! I finally had the chance to take a look.
It's a lot and I see it's already merged, so I don't have any meaningful comments at the moment, but if I have time I'll dive into it in the future (mostly when and if I resume the work on multi-valued vectors, for which I am still waiting for funding).
The work here also drastically changes what my pull request should look like now.

As a side note, do you happen to have any performance benchmarks? I am quite curious, as I always label nested-doc approaches in Lucene as 'slow', but having some facts (that potentially contradict my statement) would be super cool!

@benwtrent
Member Author

The work here drastically changes the way also my Pull Request should look like right now.

Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector) and it could possibly copy/paste the deduplicating nearest neighbor min-heap implementation.

As a side note, do you happen to have any performance benchmark?

The following test was run over 139004 documents, each with a 768-dimensional float32 vector.

The statistics for the nested value distributions:

1944 total unique documents
62.0 median number of nested values
71.50411522633745 mean number of nested values
309 max number of nested values
1 min number of nested values
2156.9469722481676 variance of nested values

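|                                                         Metric |                       Task |         Value |   Unit |
|---------------------------------------------------------------:|---------------------------:|--------------:|-------:|
|                                        50th percentile latency |          knn-search-10-100 |   3.10031     |     ms |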
|                                        90th percentile latency |          knn-search-10-100 |   3.5629      |     ms |
|                                        99th percentile latency |          knn-search-10-100 |   4.60912     |     ms |
|                                      99.9th percentile latency |          knn-search-10-100 |  14.322       |     ms |
|                                       100th percentile latency |          knn-search-10-100 |  72.6463      |     ms |
|                                        50th percentile latency |   knn-nested-search-10-100 |   6.2615      |     ms |
|                                        90th percentile latency |   knn-nested-search-10-100 |   6.95849     |     ms |
|                                        99th percentile latency |   knn-nested-search-10-100 |   7.8881      |     ms |
|                                      99.9th percentile latency |   knn-nested-search-10-100 |  12.0871      |     ms |
|                                       100th percentile latency |   knn-nested-search-10-100 |  57.9238      |     ms |
|                                        50th percentile latency |        knn-search-100-1000 |   7.30288     |     ms |
|                                        90th percentile latency |        knn-search-100-1000 |   8.18694     |     ms |
|                                        99th percentile latency |        knn-search-100-1000 |   9.23673     |     ms |
|                                      99.9th percentile latency |        knn-search-100-1000 |  18.7072      |     ms |
|                                       100th percentile latency |        knn-search-100-1000 |  23.8712      |     ms |
|                                        50th percentile latency | knn-search-nested-100-1000 |  26.6446      |     ms |
|                                        90th percentile latency | knn-search-nested-100-1000 |  38.2561      |     ms |
|                                        99th percentile latency | knn-search-nested-100-1000 |  44.3627      |     ms |
|                                      99.9th percentile latency | knn-search-nested-100-1000 |  51.1843      |     ms |
|                                       100th percentile latency | knn-search-nested-100-1000 |  52.0864      |     ms |

GASP! Nested seems 2x to 4x slower!
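As a quick check of the "2x to 4x" figure, the slowdown can be computed directly from the median (50th percentile) latencies in the table. The snippet below is only arithmetic on those reported numbers; the class and method names are made up for illustration.

```java
/** Illustrative arithmetic on the p50 latencies reported in the table above. */
final class LatencyRatioSketch {

  /** Ratio of nested latency to flat latency (values in milliseconds). */
  static double slowdown(double flatMs, double nestedMs) {
    return nestedMs / flatMs;
  }

  public static void main(String[] args) {
    // knn-search-10-100 vs knn-nested-search-10-100
    System.out.printf("10-100:   %.2fx%n", slowdown(3.10031, 6.2615));
    // knn-search-100-1000 vs knn-search-nested-100-1000
    System.out.printf("100-1000: %.2fx%n", slowdown(7.30288, 26.6446));
  }
}
```

At the median this gives roughly 2.0x for the smaller search and roughly 3.6x for the larger one, consistent with the range quoted above.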

But keep in mind, we are eagerly joining! When I dug into the difference, I discovered that eagerly joining on this dataset meant we were visiting 3x to 5x more vectors. Consequently, we were doing 3x to 5x more vector comparisons and exploring the graph more deeply. This lines up really nicely with the performance difference.

Since HNSW search scales roughly as log(n), the constant overhead of nested seems rather minor compared to the cost of gathering more nearest vectors.

I am not sure these numbers are reflective of other nested/block-joining operations (like a term search).

@alessandrobenedetti
Contributor

The work here drastically changes the way also my Pull Request should look like right now.

Yes, I am sorry about that. But the good news is that the integration for multi-value vectors has some nicer APIs to take advantage of (e.g. KnnCollector) and it could possibly copy/paste the deduplicating nearest neighbor min-heap implementation.

No worries at all! My work is still paused while I look for sponsors, so no harm done! When I resume it, as you said, I may benefit from (and improve) the new data structures added. (I admit I got lost in the number of KnnCollectors and similar classes added, but I'm super curious to explore each of them thoroughly.)

Interesting and thanks for the heads up, I hope to investigate this further as well in the future!

@david-sitsky

@benwtrent - did this really make it into 9.8.0? I downloaded the 9.8.0 release and ToParentBlockJoinFloatKnnVectorQuery does not seem to be present.

lucene-9.8.0/modules$ ls
lucene-analysis-common-9.8.0.jar      lucene-codecs-9.8.0.jar       lucene-queries-9.8.0.jar
lucene-analysis-icu-9.8.0.jar         lucene-core-9.8.0.jar         lucene-queryparser-9.8.0.jar
lucene-analysis-kuromoji-9.8.0.jar    lucene-demo-9.8.0.jar         lucene-replicator-9.8.0.jar
lucene-analysis-morfologik-9.8.0.jar  lucene-expressions-9.8.0.jar  lucene-sandbox-9.8.0.jar
lucene-analysis-nori-9.8.0.jar        lucene-facet-9.8.0.jar        lucene-spatial3d-9.8.0.jar
lucene-analysis-opennlp-9.8.0.jar     lucene-grouping-9.8.0.jar     lucene-spatial-extras-9.8.0.jar
lucene-analysis-phonetic-9.8.0.jar    lucene-highlighter-9.8.0.jar  lucene-suggest-9.8.0.jar
lucene-analysis-smartcn-9.8.0.jar     lucene-join-9.8.0.jar         META-INF
lucene-analysis-stempel-9.8.0.jar     lucene-luke-9.8.0.jar         module-info.class
lucene-backward-codecs-9.8.0.jar      lucene-memory-9.8.0.jar       org
lucene-benchmark-9.8.0.jar            lucene-misc-9.8.0.jar
lucene-classification-9.8.0.jar       lucene-monitor-9.8.0.jar
lucene-9.8.0/modules$ for file in *.jar; do unzip -v $file | grep ToParentBlockJoinFloatKnnVectorQuery; done
lucene-9.8.0/modules$ for file in *.jar; do unzip -v $file | grep ToParentBlockJoin; done
     948  Defl:N      541  43% 2023-09-21 21:59 3e2c2007  org/apache/lucene/search/join/ToParentBlockJoinQuery$1.class
    7806  Defl:N     3472  56% 2023-09-21 21:59 8e8db572  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinScorer.class
    1814  Defl:N      709  61% 2023-09-21 21:59 d028b861  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinWeight$1.class
    4114  Defl:N     1462  65% 2023-09-21 21:59 1e07024a  org/apache/lucene/search/join/ToParentBlockJoinQuery$BlockJoinWeight.class
    1589  Defl:N      838  47% 2023-09-21 21:59 cf93f84a  org/apache/lucene/search/join/ToParentBlockJoinQuery$ParentApproximation.class
    1869  Defl:N      836  55% 2023-09-21 21:59 1c0824bd  org/apache/lucene/search/join/ToParentBlockJoinQuery$ParentTwoPhase.class
    4723  Defl:N     1900  60% 2023-09-21 21:59 53b7ab29  org/apache/lucene/search/join/ToParentBlockJoinQuery.class
    2824  Defl:N     1085  62% 2023-09-21 21:59 323f6562  org/apache/lucene/search/join/ToParentBlockJoinSortField$1.class
    3191  Defl:N     1106  65% 2023-09-21 21:59 c9d8a699  org/apache/lucene/search/join/ToParentBlockJoinSortField$2$1.class
    1491  Defl:N      607  59% 2023-09-21 21:59 3a82677f  org/apache/lucene/search/join/ToParentBlockJoinSortField$2.class
    3196  Defl:N     1105  65% 2023-09-21 21:59 ec8017c1  org/apache/lucene/search/join/ToParentBlockJoinSortField$3$1.class
    1484  Defl:N      600  60% 2023-09-21 21:59 5bd0b2df  org/apache/lucene/search/join/ToParentBlockJoinSortField$3.class
    1368  Defl:N      576  58% 2023-09-21 21:59 29e61acb  org/apache/lucene/search/join/ToParentBlockJoinSortField$4$1$1.class
    3413  Defl:N     1152  66% 2023-09-21 21:59 4f73794f  org/apache/lucene/search/join/ToParentBlockJoinSortField$4$1.class
    1489  Defl:N      606  59% 2023-09-21 21:59 5132747c  org/apache/lucene/search/join/ToParentBlockJoinSortField$4.class
    1367  Defl:N      568  58% 2023-09-21 21:59 f6deee3a  org/apache/lucene/search/join/ToParentBlockJoinSortField$5$1$1.class
    3418  Defl:N     1151  66% 2023-09-21 21:59 d09c4733  org/apache/lucene/search/join/ToParentBlockJoinSortField$5$1.class
    1494  Defl:N      607  59% 2023-09-21 21:59 1e11e4cf  org/apache/lucene/search/join/ToParentBlockJoinSortField$5.class
    1266  Defl:N      685  46% 2023-09-21 21:59 18b34568  org/apache/lucene/search/join/ToParentBlockJoinSortField$6.class
    5837  Defl:N     2120  64% 2023-09-21 21:59 bfc259c3  org/apache/lucene/search/join/ToParentBlockJoinSortField.class

@benwtrent
Member Author

@david-sitsky sorry for the confusion, it was renamed DiversifyingChildren*KnnVectorQuery

@david-sitsky

@david-sitsky sorry for the confusion, it was renamed DiversifyingChildren*KnnVectorQuery

Ah.. no worries, thanks. We should update the changelog https://lucene.apache.org/core/9_8_0/changes/Changes.html#v9.8.0.new_features since it is still referring to the old classnames.
