Multi-value support for KnnVectorField #12314

Draft: wants to merge 51 commits into main
Conversation

@alessandrobenedetti (Contributor) commented May 19, 2023

Description

This pull request aims to introduce support for multiple values in a single Knn vector field.
The adopted solution relies on:
Index time
A sparse vector values approach, where an ordinal (vectorId) to docId map keeps the relation between a docId and all its vectors.
In the current sparse vector approach, there is just one vectorId per docId.
In this proposed contribution, multiple vectorIds are mapped to the same docId.
Query time
A multi-valued strategy choice is offered to the user: MAX or SUM.
In exact nearest-neighbor search, for each document accepted by the query/filter:
MAX = the similarity score between the query and each vector is computed, and the maximum score becomes the document's score in the search results.
SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.

In approximate nearest-neighbor search, for each document accepted by the query/filter:
MAX = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the maximum of the existing and the new score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.
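For illustration, here is a minimal sketch of this per-document score collapse (the class and enum names are illustrative only, not part of this PR):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.function.BinaryOperator;

  enum MultiValuedStrategy { MAX, SUM }

  // Folds per-vector similarity scores into one running score per document.
  final class MultiValueScoreAggregator {
    private final Map<Integer, Float> docToScore = new HashMap<>();
    private final BinaryOperator<Float> combiner;

    MultiValueScoreAggregator(MultiValuedStrategy strategy) {
      this.combiner = strategy == MultiValuedStrategy.MAX ? Math::max : Float::sum;
    }

    /** Called once per (document, vector) candidate pair. */
    void collect(int docId, float vectorScore) {
      docToScore.merge(docId, vectorScore, combiner);
    }

    /** Final per-document scores after all candidate vectors have been seen. */
    Map<Integer, Float> scores() {
      return docToScore;
    }
  }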

N.B. This pull request is not meant to be merge-ready at this stage.
I can identify at least the following activities before this draft can move to a 'production-ready' version:

  1. validate the overall idea and approach
  2. validate index time usage of sparse vector values for the multi-valued use case
  3. validate merge policy for the multi-valued use case
  4. validate query time MAX/SUM approach
  5. validate query time modifiable heap and neighborQueue usage
  6. validate regressions
  7. introduce more tests

It's a big contribution and it will take time and effort to complete.
It's the result of many months of work/fun (on and off). I'll be happy if something good comes from it; it doesn't have to be the adoption of the whole idea. Even if just some bits and pieces end up being beneficial, that's cool.
Any help is welcome.

@alessandrobenedetti (Contributor, Author)

I have been working a bit on removing the need for "vectorMultiValued" as a special fieldInfo and index-time information.
I think the results are encouraging: the current tests are relatively stable and the diff went down to 'only' 27 files (we started with 85!).
I'll keep cleaning it up and refining it. The need for an additional pair of eyes on the review is still there, as I am not convinced some variables are in the best place (multiValued in the vector values and graph), but it's progressing.

I may also have introduced regressions with these simplifications, but everything is in the git history, so it should be doable to fix them.

@mayya-sharipova (Contributor) commented May 25, 2023

@alessandrobenedetti Thanks for starting this effort. We also found a need for multi-valued vector fields, and here is the list of questions we thought about for its usage. It would be nice to get your and others' ideas on them:

  1. Is the main usage for breaking a long text into several paragraphs? Or is it also to search across several different fields (e.g. [embedding_of_title, embedding_of_content]) where we create a single graph for several fields?
  2. Would it be possible to retrieve which vector was the closest match? For example, if we break a long text into paragraphs and want to highlight which paragraph was the closest match. This could be crucial for some use cases.
  3. As an extension to question 2, can we support some metadata attached to vector values?

@alessandrobenedetti (Contributor, Author)

It would be nice to get your and others' ideas on them:

  1. Is the main usage for breaking a long text into several paragraphs? Or is it also to search across several different fields (e.g. [embedding_of_title, embedding_of_content]) where we create a single graph for several fields?
    I would say the first: I have seen examples of that, especially when the large language model chosen by the customer has an input-token limit smaller than the customer's typical document length.
    A single graph over several fields would be the equivalent of a catch-all field in lexical search; it was not my primary focus, but it should be doable once we have multi-valued fields.
  2. Would it be possible to retrieve which vector was the closest match? For example, if we break a long text into paragraphs and want to highlight which paragraph was the closest match. This could be crucial for some use cases.
    I agree, that can be a nice addition to the explain output!

I guess attaching metadata to vectors is a different story, but I agree it could be a good idea!

@jimczi (Contributor) commented May 25, 2023

Thanks for sharing and working on a prototype @alessandrobenedetti!

I have additional questions and comments ;)
Starting as devil's advocate: I'd like to understand once more what the use case is. One possible approach today, if the use case is passage retrieval, is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages into a single document? I am not saying there are none, but they must be compelling if they require all these changes.

Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature, I'd argue that we need to follow the doc values model, where single- and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader:

  public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException;
  public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException;
  public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;

The writer side is a bit more tricky, but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted into the existing knn field.
I know that it's easy to detect whether a field is single- or multi-valued under the hood, so we could handle this transparently.
We can do that with the explicit approach too: searchMulti can use search under the hood if we detect that the field contains exactly one vector per document. So my point is that we can share code as much as possible internally, but we don't need to hide the difference externally.

Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multi-valued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represent these ordinals, which matches the max-doc limit of Lucene. Maybe not a big deal at small scale, but that's something to keep in mind when implementing things like merge or even multi-segment search.

Anyway, quite exciting discussions. Ben and Mayya are reviewing too, so I'm just adding my 2 cents; happy to help if/when we reach a consensus.

@@ -22,6 +22,8 @@
 import org.apache.lucene.util.FixedBitSet;
 import org.apache.lucene.util.RamUsageEstimator;

+import java.util.Stack;
@mayya-sharipova (Contributor), May 25, 2023

I think we should keep this class unchanged; it is used in other places besides vectors and is only about documents.

For keeping track of the number of values per doc, maybe we can use another data structure (PackedLongValues.Builder, similar to how SortedSetDocValuesWriter does it)?
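For illustration, a minimal sketch of that suggestion, assuming Lucene's PackedLongValues API (the variable names are hypothetical):

  import org.apache.lucene.util.packed.PackedInts;
  import org.apache.lucene.util.packed.PackedLongValues;

  // Buffer one count per document that has vectors, in docID order, the way
  // SortedSetDocValuesWriter buffers per-document ordinal counts.
  PackedLongValues.Builder vectorCounts =
      PackedLongValues.packedBuilder(PackedInts.COMPACT);

  // While indexing each document: vectorCounts.add(numVectorsForDoc);

  // At flush time, freeze and replay the compact representation:
  PackedLongValues counts = vectorCounts.build();
  PackedLongValues.Iterator it = counts.iterator();
  // long countForNextDocWithVectors = it.next();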

Contributor (Author):
I agree, for the time being I reverted the class and extended it. Feel free to have a look and apply/propose any change you see fit.

private int currentDocVectorsCount;
private int vectorsCount;

private boolean multiValued = false;
Contributor:
It is also strange to have this multiValued field; we should probably just signal multiple values by the presence/absence of another data structure that keeps the number of values per doc.

Contributor (Author):
I don't have a strong opinion about this; it's a simple change, though, if necessary.

DocIdSetIterator iterator = docsWithField.iterator();
int[] valuesPerDocument = docsWithField.getValuesPerDocument();
Contributor:
This seems to use a lot of memory; maybe, for inspiration, we can look at how the SortedSetDocValuesWriter class writes multiple values.

Contributor (Author):
I'll check later on; happy to use a different implementation.

@alessandrobenedetti (Contributor, Author) commented May 26, 2023

I proceeded with some additional refactoring and refinements that you can find in the latest commits.
The diff is down to 25 classes, query time has been simplified (only MAX is supported right now; it will be easy to bring back the strategy mechanism in the future), and explicit multi-valued has been moved to transparent multi-valued.
Tests are more stable, but not all green yet.
I am close to exhausting the funds (fully self-funded at the moment) my company has allocated to the project, but I'll be happy to continue with occasional discussions and reviews.
If we get any external funds to continue the project, I'll let you know here.
I'll follow up with a response to @jimczi in a subsequent comment.

@alessandrobenedetti (Contributor, Author)

Thanks for sharing and working on a prototype @alessandrobenedetti!

I have additional questions and comments ;) Starting as devil's advocate: I'd like to understand once more what the use case is. One possible approach today, if the use case is passage retrieval, is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages into a single document? I am not saying there are none, but they must be compelling if they require all these changes.

Ale: I guess the use case for multi-valued vector fields is not much different from that of any other multi-valued field:
you may want to avoid normalisation and the complexity of maintaining multiple duplicated documents that differ only in the vector field.
If you have simple documents with just the vector field, splitting them into passages and then aggregating them at the end of your search pipeline is not that annoying; but imagine more complex situations where you already have aggregations, nested documents, or simply documents with many more metadata fields.
On top of that, we got curiosity from some customers about the functionality, so I felt it was a nice addition.

Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature, I'd argue that we need to follow the doc values model, where single- and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader:

  public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException;
  public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException;
  public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;

The writer side is a bit more tricky, but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted into the existing knn field. I know that it's easy to detect whether a field is single- or multi-valued under the hood, so we could handle this transparently. We can do that with the explicit approach too: searchMulti can use search under the hood if we detect that the field contains exactly one vector per document. So my point is that we can share code as much as possible internally, but we don't need to hide the difference externally.

That was the initial approach: it was explicit at index and query time, with code paths separate from the single-valued use case, so it was not affecting the single-valued scenario at all.
Following some of the feedback here I moved to the transparent approach, which proved extremely beneficial in reducing the complexity of the pull request (at the cost of potentially impacting some single-valued scenarios).
I iterated a bit, so I should have already reduced the impact on the single-valued case, but not to 100%, I guess.

Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multi-valued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represent these ordinals, which matches the max-doc limit of Lucene. Maybe not a big deal at small scale, but that's something to keep in mind when implementing things like merge or even multi-segment search.

Yes, with one vector per document, the maximum number of supported vectors was in line with the number of docs. This is still the case, but I agree that right now we potentially support fewer docs; happy to change that.

Anyway, quite exciting discussions. Ben and Mayya are reviewing too, so I'm just adding my 2 cents; happy to help if/when we reach a consensus.

Thanks for the feedback!

@jimczi (Contributor) commented May 26, 2023

That was the initial approach: it was explicit at index and query time, with code paths separate from the single-valued use case, so it was not affecting the single-valued scenario at all.
Following some of the feedback here I moved to the transparent approach, which proved extremely beneficial in reducing the complexity of the pull request (at the cost of potentially impacting some single-valued scenarios).
I iterated a bit, so I should have already reduced the impact on the single-valued case, but not to 100%, I guess.

My main worry is the change to FloatVectorValues: moving to a multi-valued iterator changes the access pattern, so I don't find it right to change the interface and the meaning of the returned ordinals based on whether the field is multi-valued or not.
If only search were exposed in the format, that would be OK, I think, but we're exposing direct access to the document's vectors, so the parallel with doc values is important, in my opinion.

Yes, with one vector per document, the maximum number of supported vectors was in line with the number of docs. This is still the case, but I agree that right now we potentially support fewer docs; happy to change that.

Well, I don't have a good idea of how to change that if we expect to have an ordinal per value. Switching the ordinals in the HNSW graph to int64 is not really practical. Adding a sub-id to each ordinal also seems quite invasive. Again, I'm not saying that this should block the change, but we need to be clear on the downsides of this approach and also ensure that we have the right guards for overflows.
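For illustration, the arithmetic behind that concern, with the kind of overflow guard mentioned above (a sketch; the variable names are illustrative):

  // A segment holds at most Integer.MAX_VALUE (~2.1B) documents. With int32
  // ordinals assigned per *vector*, an average of just 2 vectors per doc
  // halves the effective capacity to ~1.07B documents.
  long totalVectors = (long) docCount * avgVectorsPerDoc;
  if (totalVectors > Integer.MAX_VALUE) {
    throw new IllegalArgumentException(
        "too many vectors for int32 graph ordinals: " + totalVectors);
  }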

@alessandrobenedetti (Contributor, Author)

My main worry is the change to FloatVectorValues: moving to a multi-valued iterator changes the access pattern, so I don't find it right to change the interface and the meaning of the returned ordinals based on whether the field is multi-valued or not.
If only search were exposed in the format, that would be OK, I think, but we're exposing direct access to the document's vectors, so the parallel with doc values is important

Hi @jimczi, nothing in this PR is final, nor do I have any strong opinion about it.
My main intention is to keep the PR as small and as valuable as possible, to build a common ground (and tests) for the functionality (if it's nice to have; if not, it was a cool exercise and that's equally fine).

Regarding your main worry, can you point me to the specific areas of code you don't like, so I can think about how to modify them?

@jimczi (Contributor) commented May 26, 2023

nothing in this PR is final, nor do I have any strong opinion about it.

Sure, we're just discussing the approach, no worries.

Regarding your main worry, can you point me to the specific areas of code you don't like, so I can think about how to modify them?

This change in [Float|Byte]VectorValues:

 /** The maximum length of a vector */
  public static final int MAX_DIMENSIONS = 1024;

  protected boolean multiValued = false;

  public boolean isMultiValued() {
    return multiValued;
  }

  /** Sole constructor */
  protected FloatVectorValues() {}

@@ -57,4 +63,13 @@ public final long cost() {
   * @return the vector value
   */
  public abstract float[] vectorValue() throws IOException;

  /**
   * Return the document ID corresponding to the input vector id (ordinal)
   *
   * @param ord vector ID (ordinal)
   * @return the document ID
   */
  public int ordToDoc(int ord) {
    return ord;
  }
}

Today the expectation is that it iterates over doc IDs. This change adds an indirection that needs to be checked (isMultiValued), and if the field is multi-valued, the returned ID is actually an ordinal that needs to be transformed with ordToDoc.
That's trappy: I am not even sure how you would advance to a doc rather than an ordinal, and the same code would mean different things depending on whether you're working on a multi-valued field or not. Considering how the original API was designed for single-valued fields, I don't think we should try to sneak multi-valued into the model. That's why I propose that we add it separately, like you originally did. It's a doc values + search API, so it needs to be usable for both purposes in a more predictable way.
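For illustration, one possible shape of the separated abstraction described above, in the doc values style (hypothetical class and method names, not part of this PR):

  import java.io.IOException;
  import org.apache.lucene.search.DocIdSetIterator;

  // Iterates documents (not vector ordinals), so the access pattern stays the
  // familiar one; all of the current document's vectors are exposed together.
  public abstract class MultiFloatVectorValues extends DocIdSetIterator {

    /** Vector dimension. */
    public abstract int dimension();

    /** Number of vectors for the current document. */
    public abstract int valueCount() throws IOException;

    /** The i-th vector of the current document, with 0 <= i < valueCount(). */
    public abstract float[] vectorValue(int i) throws IOException;
  }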

@joshdevins commented May 30, 2023

SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.

Just a note on the aggregation functions max and sum. Most commonly, max is used, as it is length-independent. With sum, the longer the original text of a document field, and thus the more passages it has, the higher the sum of all matching passages will be, since all passages will "match" to some degree, biasing scoring towards documents with longer text. I'm not sure it will matter in the end, but my suggestion would be that if sum is used, one could optionally apply a radius/similarity threshold to limit the advantage of longer texts, and/or sum over only a limited top-k of a document's passages.

@alessandrobenedetti Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems the art was abandoned a couple of years ago, with most approaches settling on max passage (which I see is the only method implemented for now).

@alessandrobenedetti (Contributor, Author)

SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.

Just a note on the aggregation functions max and sum. Most commonly, max is used, as it is length-independent. With sum, the longer the original text of a document field, and thus the more passages it has, the higher the sum of all matching passages will be, since all passages will "match" to some degree, biasing scoring towards documents with longer text. I'm not sure it will matter in the end, but my suggestion would be that if sum is used, one could optionally apply a radius/similarity threshold to limit the advantage of longer texts, and/or sum over only a limited top-k of a document's passages.

@alessandrobenedetti Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems the art was abandoned a couple of years ago, with most approaches settling on max passage (which I see is the only method implemented for now).

Hi @joshdevins,
the dual strategy (MAX/SUM) is implemented in an old commit; I removed it to build an initial, smaller and cleaner pull request.
Some of the feedback was to introduce strategies later on, and I agree with that.

I didn't have the time to dive deeper into the aggregation strategies, so I don't have references yet.
My main focus was to reach a working prototype and then iterate on the various components to improve them.

Your observation regarding 'SUM' is correct.
In my 'naive' first implementation I used an approach where only the closest vectors encountered are considered in the SUM when running an approximate search.
You can take a look at the commits before the simplification if you are curious, but I believe it would be better to address this discussion when we reintroduce 'strategies' in a separate future PR.

@benwtrent (Member) left a comment:

Thinking more on this implementation, it seems we will need at a minimum a new NeighborQueue.

I am not sure the existing one needs to be updated; instead, we should probably have a MultiValueNeighborQueue.

The reason is that this queue does not only contain information about result sets: it also keeps track of how many nodes are visited, and the returned top hits utilize that number. Consequently, the visited count should track documents visited, not vectors visited. All these changes indicate a new queueing mechanism for multi-valued vector fields.

Another thought: Lucene already has the concept of index-time join, effectively creating child document IDs under a single parent. This allows for even greater flexibility by indexing the passage the vector represents, and potentially even greater scoring flexibility.

The issue I could see here is ensuring the TopDocs searching has the ability to deduplicate (if desired) based on the parent document ID.

Did you consider utilizing this when digging into this implementation?
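For illustration, a minimal sketch of the document-level bookkeeping such a MultiValueNeighborQueue could perform (hypothetical, not this PR's code):

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  // Collapses candidate vectors onto their parent documents, keeps the MAX
  // score per document, and counts visited documents rather than vectors.
  final class MultiValueNeighborQueue {
    private final Map<Integer, Float> bestScorePerDoc = new HashMap<>();
    private final Set<Integer> visitedDocs = new HashSet<>();

    void visit(int docId, float score) {
      visitedDocs.add(docId);                          // document-level count
      bestScorePerDoc.merge(docId, score, Math::max);  // MAX aggregation
    }

    int visitedCount() {
      return visitedDocs.size();
    }
  }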

@alessandrobenedetti (Contributor, Author)

Thinking more on this implementation, it seems we will need at a minimum a new NeighborQueue.

I am not sure the existing one needs to be updated; instead, we should probably have a MultiValueNeighborQueue.

The reason is that this queue does not only contain information about result sets: it also keeps track of how many nodes are visited, and the returned top hits utilize that number. Consequently, the visited count should track documents visited, not vectors visited. All these changes indicate a new queueing mechanism for multi-valued vector fields.

Another thought: Lucene already has the concept of index-time join, effectively creating child document IDs under a single parent. This allows for even greater flexibility by indexing the passage the vector represents, and potentially even greater scoring flexibility.

The issue I could see here is ensuring the TopDocs searching has the ability to deduplicate (if desired) based on the parent document ID.

Did you consider utilizing this when digging into this implementation?

I think it's a good idea to create a new dedicated MultiValueNeighborQueue; I'll do it when I have time, but feel free to do it if you like!

Regarding index-time join, I am not sure it's relevant here (are you talking about block join?):
isn't it a different concept from multi-valued?
i.e. we already have that mechanism in Lucene alongside multi-valued fields for pretty much all the other field types, haven't we?

Comment on lines +99 to +112

  @Override
  public Map<Integer, Float> scoreMultiValued(BitSet acceptedDocs) throws IOException {
    Map<Integer, Float> docToScore = new HashMap<>();
    for (int vectorId = values.nextDoc(); vectorId != NO_MORE_DOCS; vectorId = values.nextDoc()) {
      int docID = values.ordToDoc(vectorId);
      if (acceptedDocs.get(docID)) {
        float currentScore = similarity.compare(query, values.vectorValue());
        docToScore.putIfAbsent(docID, currentScore);
        docToScore.computeIfPresent(
            docID, (key, previousScore) -> Math.max(previousScore, currentScore));
      }
    }
    return docToScore;
  }
Member:

I don't see how we can do this here. Instead of iterating the BitSet and advancing to each accepted document directly, we are iterating every vector in the scorer and checking its membership in acceptedDocs.

Even multi-valued *VectorValues classes need to be able to iterate to an exact document; otherwise, searching with a filter like this is untenable.

Contributor (Author):

Yep, I didn't spend much time on this; I drafted a few lines of code just to make the exact-search tests green.
Happy to get suggestions/other people's implementations here!
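For illustration, the filter-first access pattern being asked for could look like this, building on the hypothetical MultiFloatVectorValues sketch earlier in this thread (acceptedDocs, values, similarity, query, maxDoc, and docToScore are assumed from the surrounding code):

  // Iterate the accepted docs and advance the doc-keyed values to each one,
  // instead of scanning every vector and testing membership in the filter.
  for (int docId = acceptedDocs.nextSetBit(0);
      docId != DocIdSetIterator.NO_MORE_DOCS;
      docId = docId + 1 < maxDoc
          ? acceptedDocs.nextSetBit(docId + 1)
          : DocIdSetIterator.NO_MORE_DOCS) {
    if (values.docID() < docId) {
      values.advance(docId); // seek forward; may land past docId
    }
    if (values.docID() == docId) { // the doc has vectors
      for (int i = 0; i < values.valueCount(); i++) {
        float score = similarity.compare(query, values.vectorValue(i));
        docToScore.merge(docId, score, Math::max);
      }
    }
  }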

Comment on lines +839 to +849

  private static DocsWithVectorsSet writeByteVectorData(
      IndexOutput output, ByteVectorValues mergedVectorValues) throws IOException {
    DocsWithVectorsSet docsWithVectors = new DocsWithVectorsSet();
    for (int vectorId = mergedVectorValues.nextDoc();
        vectorId != NO_MORE_DOCS;
        vectorId = mergedVectorValues.nextDoc()) {
      int docID = mergedVectorValues.ordToDoc(vectorId);
      byte[] binaryValue = mergedVectorValues.vectorValue();
      assert binaryValue.length == mergedVectorValues.dimension() * VectorEncoding.BYTE.byteSize;
      output.writeBytes(binaryValue, binaryValue.length);
-     docsWithField.add(docV);
+     docsWithVectors.add(docID);
    }
-   return docsWithField;
+   return docsWithVectors;
Member:

From what I can tell, this simply writes each vector value in order, regardless of whether the vectors belong to the same document, correct?

I suppose we get document order for free, since all of a document's vectors have to be supplied at the same time, correct?

The reason I ask is that if we want to be able to iterate by document, being able to skip directly to a document and read its vectors is important, which wouldn't be easily possible unless all of a document's vectors were written right next to each other.

Contributor (Author):

As intended (and hopefully implemented), we get document order for free, since all of a document's vectors have to be supplied at the same time.
As with everything in this pull request, it's open to discussion, as my focus was to bring a working end-to-end starting milestone and then refine each step.

@uschindler (Contributor) left a comment:

I left a comment about the outdated Stack class. But my major concern is integer ordinals: like for SortedSetDocValues, we need long vector ordinals.

 import org.apache.lucene.util.FixedBitSet;
 import org.apache.lucene.util.RamUsageEstimator;

+import java.util.Stack;
Contributor:

Please don't use the outdated Stack class, which is also synchronized. The modern replacement is ArrayDeque, which implements the Deque interface.
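For example, the suggested swap, as a sketch:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Unsynchronized LIFO stack; generally faster than java.util.Stack.
  Deque<Integer> stack = new ArrayDeque<>();
  stack.push(42);         // addFirst
  int top = stack.pop();  // removeFirst: same LIFO semantics, no monitor locking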

Contributor:

I opened #12404

   * @param ord vector ID (ordinal)
   * @return the document ID
   */
  public int ordToDoc(int ord) {
Contributor:

The problem with the new implementation is that ordinal numbers need to become longs (like in SortedSetDocValues), because each document of a 2B-doc index segment may have more than one value, so the total number of ordinals may go beyond 2B.

Of course this needs more work.
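For illustration, a hypothetical long-ordinal variant in the style of SortedSetDocValues (the names are not from this PR):

  import java.io.IOException;

  // With long ordinals, the total number of vectors per segment may exceed
  // Integer.MAX_VALUE even though document IDs themselves stay int.
  abstract class LongOrdVectorValues {

    /** Total number of vectors in the segment; may exceed Integer.MAX_VALUE. */
    abstract long size();

    /** The vector for the given long ordinal. */
    abstract float[] vectorValue(long ord) throws IOException;

    /** Maps a long vector ordinal back to its (int) document ID. */
    abstract int ordToDoc(long ord);
  }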

Successfully merging this pull request may close these issues.

Multi-value Support for KnnVectorField
8 participants