Multi-value support for KnnVectorField #12314

Draft: wants to merge 51 commits into main
Conversation

@alessandrobenedetti (Contributor) commented May 19, 2023

Description

This pull request aims to introduce support for multiple values in a single Knn vector field.
The adopted solution relies on:
Index time
A sparse vector values approach, where an ordinal (vectorId) to docId map keeps the relation between a docId and all its vectors.
In the current sparse vector approach, there is just one vectorId per docId.
In this proposed contribution, multiple vectorIds are mapped to the same docId.
Query time
A multi-valued strategy choice is offered to the user: MAX or SUM.
In exact nearest-neighbor search, for each document accepted by the query/filter:
MAX = the similarity score between the query and each vector is computed, and the maximum score becomes the document's score in the search results.
SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.

In approximate nearest-neighbor search, for each document accepted by the query/filter:
MAX = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the maximum of the existing and the new score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.
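For illustration, here is a minimal sketch of this per-document score collapse (the class and enum names are illustrative only, not part of this PR):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.function.BinaryOperator;

  enum MultiValuedStrategy { MAX, SUM }

  // Folds per-vector similarity scores into one running score per document.
  final class MultiValueScoreAggregator {
    private final Map<Integer, Float> docToScore = new HashMap<>();
    private final BinaryOperator<Float> combiner;

    MultiValueScoreAggregator(MultiValuedStrategy strategy) {
      this.combiner = strategy == MultiValuedStrategy.MAX ? Math::max : Float::sum;
    }

    /** Called once per (document, vector) candidate pair. */
    void collect(int docId, float vectorScore) {
      docToScore.merge(docId, vectorScore, combiner);
    }

    /** Final per-document scores after all candidate vectors have been seen. */
    Map<Integer, Float> scores() {
      return docToScore;
    }
  }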

N.B. This pull request is not meant to be merge-ready at this stage.
I can identify at least the following activities before this draft can move to a 'production-ready' version:

  1. validate the overall idea and approach
  2. validate index time usage of sparse vector values for the multi-valued use case
  3. validate merge policy for the multi-valued use case
  4. validate query time MAX/SUM approach
  5. validate query time modifiable heap and neighborQueue usage
  6. validate regressions
  7. introduce more tests

It's a big contribution and it will take time and effort to complete.
It's the result of many months of work/fun (on and off). I'll be happy if something good comes from it; it doesn't have to be the adoption of the whole idea. Even if just some bits and pieces end up being beneficial, that's cool.
Any help is welcome.

@alessandrobenedetti (Contributor, Author)

I have been working a bit on removing the need for "vectorMultiValued" as a special fieldInfo and index-time information.
I think the results are encouraging: the current tests are relatively stable and the diff went down to 'only' 27 files (we started with 85!).
I'll keep cleaning it up and refining it. The need for an additional pair of eyes on the review is still there, as I am not convinced some variables are in the best place (multiValued in the vector values and graph), but it's progressing.

I may also have introduced regressions with these simplifications, but everything is in the git history, so it should be doable to fix them.

@mayya-sharipova (Contributor) commented May 25, 2023

@alessandrobenedetti Thanks for starting this effort. We also found a need for multi-valued vector fields, and here is the list of questions we thought about for its usage. It would be nice to get your and others' ideas on them:

  1. Is the main usage for breaking a long text into several paragraphs? Or is it also to search across several different fields (e.g. [embedding_of_title, embedding_of_content]) where we create a single graph for several fields?
  2. Would it be possible to retrieve which vector was the closest match? For example, if we break a long text into paragraphs and want to highlight which paragraph was the closest match. This could be crucial for some use cases.
  3. As an extension to question 2, can we support some metadata attached to vector values?

@alessandrobenedetti (Contributor, Author)

It would be nice to get your and others' ideas on them:

  1. Is the main usage for breaking a long text into several paragraphs? Or is it also to search across several different fields (e.g. [embedding_of_title, embedding_of_content]) where we create a single graph for several fields?
    I would say the first: I have seen examples of that, especially when the large language model chosen by the customer has an input-token limit smaller than the customer's typical document length.
    A single graph over several fields would be the equivalent of a catch-all field in lexical search; it was not my primary focus, but it should be doable once we have multi-valued fields.
  2. Would it be possible to retrieve which vector was the closest match? For example, if we break a long text into paragraphs and want to highlight which paragraph was the closest match. This could be crucial for some use cases.
    I agree, that can be a nice addition to the explain output!

I guess attaching metadata to vectors is a different story, but I agree it could be a good idea!

@jimczi (Contributor) commented May 25, 2023

Thanks for sharing and working on a prototype @alessandrobenedetti!

I have additional questions and comments ;)
Starting as devil's advocate: I'd like to understand once more what the use case is. One possible approach today, if the use case is passage retrieval, is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages into a single document? I am not saying there are none, but they must be compelling if they require all these changes.

Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature, I'd argue that we need to follow the doc values model, where single- and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader:

  public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException;
  public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException;
  public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;

The writer side is a bit more tricky, but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted into the existing knn field.
I know that it's easy to detect whether a field is single- or multi-valued under the hood, so we could handle this transparently.
We can do that with the explicit approach too: searchMulti can use search under the hood if we detect that the field contains exactly one vector per document. So my point is that we can share code as much as possible internally, but we don't need to hide the difference externally.

Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multi-valued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represent these ordinals, which matches the max-doc limit of Lucene. Maybe not a big deal at small scale, but that's something to keep in mind when implementing things like merge or even multi-segment search.

Anyway, quite exciting discussions. Ben and Mayya are reviewing too, so I'm just adding my 2 cents; happy to help if/when we reach a consensus.

@@ -22,6 +22,8 @@
 import org.apache.lucene.util.FixedBitSet;
 import org.apache.lucene.util.RamUsageEstimator;

+import java.util.Stack;
@mayya-sharipova (Contributor), May 25, 2023

I think we should keep this class unchanged; it is used in other places besides vectors and is only about documents.

For keeping track of the number of values per doc, maybe we can use another data structure (PackedLongValues.Builder, similar to how SortedSetDocValuesWriter does it)?
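For illustration, a minimal sketch of that suggestion, assuming Lucene's PackedLongValues API (the variable names are hypothetical):

  import org.apache.lucene.util.packed.PackedInts;
  import org.apache.lucene.util.packed.PackedLongValues;

  // Buffer one count per document that has vectors, in docID order, the way
  // SortedSetDocValuesWriter buffers per-document ordinal counts.
  PackedLongValues.Builder vectorCounts =
      PackedLongValues.packedBuilder(PackedInts.COMPACT);

  // While indexing each document: vectorCounts.add(numVectorsForDoc);

  // At flush time, freeze and replay the compact representation:
  PackedLongValues counts = vectorCounts.build();
  PackedLongValues.Iterator it = counts.iterator();
  // long countForNextDocWithVectors = it.next();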

Contributor (Author):
I agree, for the time being I reverted the class and extended it. Feel free to have a look and apply/propose any change you see fit.

private int currentDocVectorsCount;
private int vectorsCount;

private boolean multiValued = false;
Contributor:
It is also strange to have this multiValued field; we should probably just signal multiple values by the presence/absence of another data structure that keeps the number of values per doc.

Contributor (Author):
I don't have a strong opinion about this; it's a simple change, though, if necessary.

DocIdSetIterator iterator = docsWithField.iterator();
int[] valuesPerDocument = docsWithField.getValuesPerDocument();
Contributor:
This seems to use a lot of memory; maybe, for inspiration, we can look at how the SortedSetDocValuesWriter class writes multiple values.

Contributor (Author):
I'll check later on; happy to use a different implementation.

@alessandrobenedetti (Contributor, Author) commented May 26, 2023

I proceeded with some additional refactoring and refinements that you can find in the latest commits.
The diff is down to 25 classes, query time has been simplified (only MAX is supported right now; it will be easy to bring back the strategy mechanism in the future), and explicit multi-valued has been moved to transparent multi-valued.
Tests are more stable, but not all green yet.
I am close to exhausting the funds (fully self-funded at the moment) my company has allocated to the project, but I'll be happy to continue with occasional discussions and reviews.
If we get any external funds to continue the project, I'll let you know here.
I'll follow up with a response to @jimczi in a subsequent comment.

@alessandrobenedetti (Contributor, Author)

Thanks for sharing and working on a prototype @alessandrobenedetti!

I have additional questions and comments ;) Starting as devil's advocate: I'd like to understand once more what the use case is. One possible approach today, if the use case is passage retrieval, is to index one document per passage/sentence. It works fine and you have access to the exact context of the match directly. What are the benefits of moving all passages into a single document? I am not saying there are none, but they must be compelling if they require all these changes.

Ale: I guess the use case for multi-valued vector fields is not much different from that of any other multi-valued field:
you may want to avoid normalisation and the complexity of maintaining multiple duplicated documents that differ only in the vector field.
If you have simple documents with just the vector field, splitting them into passages and then aggregating them at the end of your search pipeline is not that annoying; but imagine more complex situations where you already have aggregations, nested documents, or simply documents with many more metadata fields.
On top of that, we got curiosity from some customers about the functionality, so I felt it was a nice addition.

Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature, I'd argue that we need to follow the doc values model, where single- and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader:

  public abstract MultiFloatVectorValues getMultiFloatVectorValues(String field) throws IOException;
  public abstract MultiByteVectorValues getMultiByteVectorValues(String field) throws IOException;
  public abstract TopDocs searchMulti(String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException;

The writer side is a bit more tricky, but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if the change is directly inserted into the existing knn field. I know that it's easy to detect whether a field is single- or multi-valued under the hood, so we could handle this transparently. We can do that with the explicit approach too: searchMulti can use search under the hood if we detect that the field contains exactly one vector per document. So my point is that we can share code as much as possible internally, but we don't need to hide the difference externally.

That was the initial approach: it was explicit at index and query time, with code paths separate from the single-valued use case, so it was not affecting the single-valued scenario at all.
Following some of the feedback here I moved to the transparent approach, which proved extremely beneficial in reducing the complexity of the pull request (at the cost of potentially impacting some single-valued scenarios).
I iterated a bit, so I should have already reduced the impact on the single-valued case, but not to 100%, I guess.

Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multi-valued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represent these ordinals, which matches the max-doc limit of Lucene. Maybe not a big deal at small scale, but that's something to keep in mind when implementing things like merge or even multi-segment search.

Yes, with one vector per document, the maximum number of supported vectors was in line with the number of docs. This is still the case, but I agree that right now we potentially support fewer docs; happy to change that.

Anyway, quite exciting discussions. Ben and Mayya are reviewing too, so I'm just adding my 2 cents; happy to help if/when we reach a consensus.

Thanks for the feedback!

@jimczi (Contributor) commented May 26, 2023

That was the initial approach: it was explicit at index and query time, with code paths separate from the single-valued use case, so it was not affecting the single-valued scenario at all.
Following some of the feedback here I moved to the transparent approach, which proved extremely beneficial in reducing the complexity of the pull request (at the cost of potentially impacting some single-valued scenarios).
I iterated a bit, so I should have already reduced the impact on the single-valued case, but not to 100%, I guess.

My main worry is the change to FloatVectorValues: moving to a multi-valued iterator changes the access pattern, so I don't find it right to change the interface and the meaning of the returned ordinals based on whether the field is multi-valued or not.
If only search were exposed in the format, that would be OK, I think, but we're exposing direct access to the document's vectors, so the parallel with doc values is important, in my opinion.

Yes, with one vector per document, the maximum number of supported vectors was in line with the number of docs. This is still the case, but I agree that right now we potentially support fewer docs; happy to change that.

Well, I don't have a good idea of how to change that if we expect to have an ordinal per value. Switching the ordinals in the HNSW graph to int64 is not really practical. Adding a sub-id to each ordinal also seems quite invasive. Again, I'm not saying that this should block the change, but we need to be clear on the downsides of this approach and also ensure that we have the right guards for overflows.
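For illustration, the arithmetic behind that concern, with the kind of overflow guard mentioned above (a sketch; the variable names are illustrative):

  // A segment holds at most Integer.MAX_VALUE (~2.1B) documents. With int32
  // ordinals assigned per *vector*, an average of just 2 vectors per doc
  // halves the effective capacity to ~1.07B documents.
  long totalVectors = (long) docCount * avgVectorsPerDoc;
  if (totalVectors > Integer.MAX_VALUE) {
    throw new IllegalArgumentException(
        "too many vectors for int32 graph ordinals: " + totalVectors);
  }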

@alessandrobenedetti (Contributor, Author)

My main worry is the change to FloatVectorValues: moving to a multi-valued iterator changes the access pattern, so I don't find it right to change the interface and the meaning of the returned ordinals based on whether the field is multi-valued or not.
If only search were exposed in the format, that would be OK, I think, but we're exposing direct access to the document's vectors, so the parallel with doc values is important

Hi @jimczi, nothing in this PR is final, nor do I have any strong opinion about it.
My main intention is to keep the PR as small and as valuable as possible, to build a common ground (and tests) for the functionality (if it's nice to have; if not, it was a cool exercise and that's equally fine).

Regarding your main worry, can you point me to the specific areas of code you don't like, so I can think about how to modify them?

@jimczi (Contributor) commented May 26, 2023

nothing in this PR is final, nor do I have any strong opinion about it.

Sure, we're just discussing the approach, no worries.

Regarding your main worry, can you point me to the specific areas of code you don't like, so I can think about how to modify them?

This change in [Float|Byte]VectorValues:

 /** The maximum length of a vector */
  public static final int MAX_DIMENSIONS = 1024;

  protected boolean multiValued = false;

  public boolean isMultiValued() {
    return multiValued;
  }

  /** Sole constructor */
  protected FloatVectorValues() {}

@@ -57,4 +63,13 @@ public final long cost() {
   * @return the vector value
   */
  public abstract float[] vectorValue() throws IOException;

  /**
   * Return the document ID corresponding to the input vector id (ordinal)
   *
   * @param ord vector ID (ordinal)
   * @return the document ID
   */
  public int ordToDoc(int ord) {
    return ord;
  }
}

Today the expectation is that it iterates over doc IDs. This change adds an indirection that needs to be checked (isMultiValued), and if the field is multi-valued, the returned ID is actually an ordinal that needs to be transformed with ordToDoc.
That's trappy: I am not even sure how you would advance to a doc rather than an ordinal, and the same code would mean different things depending on whether you're working on a multi-valued field or not. Considering how the original API was designed for single-valued fields, I don't think we should try to sneak multi-valued into the model. That's why I propose that we add it separately, like you originally did. It's a doc values + search API, so it needs to be usable for both purposes in a more predictable way.
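For illustration, one possible shape of the separated abstraction described above, in the doc values style (hypothetical class and method names, not part of this PR):

  import java.io.IOException;
  import org.apache.lucene.search.DocIdSetIterator;

  // Iterates documents (not vector ordinals), so the access pattern stays the
  // familiar one; all of the current document's vectors are exposed together.
  public abstract class MultiFloatVectorValues extends DocIdSetIterator {

    /** Vector dimension. */
    public abstract int dimension();

    /** Number of vectors for the current document. */
    public abstract int valueCount() throws IOException;

    /** The i-th vector of the current document, with 0 <= i < valueCount(). */
    public abstract float[] vectorValue(int i) throws IOException;
  }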

@joshdevins commented May 30, 2023

SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.

Just a note on the aggregation functions max and sum. Most commonly, max is used, as it is length-independent. With sum, the longer the original text of a document field, and thus the more passages it has, the higher the sum of all matching passages will be, since all passages will "match" to some degree, biasing scoring towards documents with longer text. I'm not sure it will matter in the end, but my suggestion would be that if sum is used, one could optionally apply a radius/similarity threshold to limit the advantage of longer texts, and/or sum over only a limited top-k of a document's passages.

@alessandrobenedetti Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems the art was abandoned a couple of years ago, with most approaches settling on max passage (which I see is the only method implemented for now).

@alessandrobenedetti (Contributor, Author)

SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.
SUM = every time a nearest-neighbor vector is about to be added to the top-k, if the document is already there, its score is updated to the sum of the existing and the new score.

Just a note on the aggregation functions max and sum. Most commonly, max is used, as it is length-independent. With sum, the longer the original text of a document field, and thus the more passages it has, the higher the sum of all matching passages will be, since all passages will "match" to some degree, biasing scoring towards documents with longer text. I'm not sure it will matter in the end, but my suggestion would be that if sum is used, one could optionally apply a radius/similarity threshold to limit the advantage of longer texts, and/or sum over only a limited top-k of a document's passages.

@alessandrobenedetti Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems the art was abandoned a couple of years ago, with most approaches settling on max passage (which I see is the only method implemented for now).

Hi @joshdevins,
the dual strategy (MAX/SUM) is implemented in an old commit; I removed it to build an initial, smaller and cleaner pull request.
Some of the feedback was to introduce strategies later on, and I agree with that.

I didn't have the time to dive deeper into the aggregation strategies, so I don't have references yet.
My main focus was to reach a working prototype and then iterate on the various components to improve them.

Your observation regarding 'SUM' is correct.
In my 'naive' first implementation I used an approach where only the closest vectors encountered are considered in the SUM when running an approximate search.
You can take a look at the commits before the simplification if you are curious, but I believe it would be better to address this discussion when we reintroduce 'strategies' in a separate future PR.

@benwtrent (Member) left a comment:

Thinking more on this implementation, it seems we will need at a minimum a new NeighborQueue.

I am not sure the existing one needs to be updated; instead, we should probably have a MultiValueNeighborQueue.

The reason is that this queue does not only contain information about result sets: it also keeps track of how many nodes are visited, and the returned top hits utilize that number. Consequently, the visited count should track documents visited, not vectors visited. All these changes indicate a new queueing mechanism for multi-valued vector fields.

Another thought: Lucene already has the concept of index-time join, effectively creating child document IDs under a single parent. This allows for even greater flexibility by indexing the passage the vector represents, and potentially even greater scoring flexibility.

The issue I could see here is ensuring the TopDocs searching has the ability to deduplicate (if desired) based on the parent document ID.

Did you consider utilizing this when digging into this implementation?
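For illustration, a minimal sketch of the document-level bookkeeping such a MultiValueNeighborQueue could perform (hypothetical, not this PR's code):

  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.Map;
  import java.util.Set;

  // Collapses candidate vectors onto their parent documents, keeps the MAX
  // score per document, and counts visited documents rather than vectors.
  final class MultiValueNeighborQueue {
    private final Map<Integer, Float> bestScorePerDoc = new HashMap<>();
    private final Set<Integer> visitedDocs = new HashSet<>();

    void visit(int docId, float score) {
      visitedDocs.add(docId);                          // document-level count
      bestScorePerDoc.merge(docId, score, Math::max);  // MAX aggregation
    }

    int visitedCount() {
      return visitedDocs.size();
    }
  }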

@alessandrobenedetti (Contributor, Author)

Thinking more on this implementation, it seems we will need at a minimum a new NeighborQueue.

I am not sure the existing one needs to be updated; instead, we should probably have a MultiValueNeighborQueue.

The reason is that this queue does not only contain information about result sets: it also keeps track of how many nodes are visited, and the returned top hits utilize that number. Consequently, the visited count should track documents visited, not vectors visited. All these changes indicate a new queueing mechanism for multi-valued vector fields.

Another thought: Lucene already has the concept of index-time join, effectively creating child document IDs under a single parent. This allows for even greater flexibility by indexing the passage the vector represents, and potentially even greater scoring flexibility.

The issue I could see here is ensuring the TopDocs searching has the ability to deduplicate (if desired) based on the parent document ID.

Did you consider utilizing this when digging into this implementation?

I think it's a good idea to create a new dedicated MultiValueNeighborQueue; I'll do it when I have time, but feel free to do it if you like!

Regarding index-time join, I am not sure it's relevant here (are you talking about block join?):
isn't it a different concept from multi-valued?
i.e. we already have that mechanism in Lucene alongside multi-valued fields for pretty much all the other field types, haven't we?

Comment on lines +99 to +112

  @Override
  public Map<Integer, Float> scoreMultiValued(BitSet acceptedDocs) throws IOException {
    Map<Integer, Float> docToScore = new HashMap<>();
    for (int vectorId = values.nextDoc(); vectorId != NO_MORE_DOCS; vectorId = values.nextDoc()) {
      int docID = values.ordToDoc(vectorId);
      if (acceptedDocs.get(docID)) {
        float currentScore = similarity.compare(query, values.vectorValue());
        docToScore.putIfAbsent(docID, currentScore);
        docToScore.computeIfPresent(
            docID, (key, previousScore) -> Math.max(previousScore, currentScore));
      }
    }
    return docToScore;
  }
Member:

I don't see how we can do this here. Instead of iterating the BitSet and advancing to each accepted document directly, we are iterating every vector in the scorer and checking its membership in acceptedDocs.

Even multi-valued *VectorValues classes need to be able to iterate to an exact document; otherwise, searching with a filter like this is untenable.

Contributor (Author):

Yep, I didn't spend much time on this; I drafted a few lines of code just to make the exact-search tests green.
Happy to get suggestions/other people's implementations here!
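For illustration, the filter-first access pattern being asked for could look like this, building on the hypothetical MultiFloatVectorValues sketch earlier in this thread (acceptedDocs, values, similarity, query, maxDoc, and docToScore are assumed from the surrounding code):

  // Iterate the accepted docs and advance the doc-keyed values to each one,
  // instead of scanning every vector and testing membership in the filter.
  for (int docId = acceptedDocs.nextSetBit(0);
      docId != DocIdSetIterator.NO_MORE_DOCS;
      docId = docId + 1 < maxDoc
          ? acceptedDocs.nextSetBit(docId + 1)
          : DocIdSetIterator.NO_MORE_DOCS) {
    if (values.docID() < docId) {
      values.advance(docId); // seek forward; may land past docId
    }
    if (values.docID() == docId) { // the doc has vectors
      for (int i = 0; i < values.valueCount(); i++) {
        float score = similarity.compare(query, values.vectorValue(i));
        docToScore.merge(docId, score, Math::max);
      }
    }
  }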

Comment on lines +839 to +849

  private static DocsWithVectorsSet writeByteVectorData(
      IndexOutput output, ByteVectorValues mergedVectorValues) throws IOException {
    DocsWithVectorsSet docsWithVectors = new DocsWithVectorsSet();
    for (int vectorId = mergedVectorValues.nextDoc();
        vectorId != NO_MORE_DOCS;
        vectorId = mergedVectorValues.nextDoc()) {
      int docID = mergedVectorValues.ordToDoc(vectorId);
      byte[] binaryValue = mergedVectorValues.vectorValue();
      assert binaryValue.length == mergedVectorValues.dimension() * VectorEncoding.BYTE.byteSize;
      output.writeBytes(binaryValue, binaryValue.length);
-     docsWithField.add(docV);
+     docsWithVectors.add(docID);
    }
-   return docsWithField;
+   return docsWithVectors;
Member:

From what I can tell, this simply writes each vector value in order, regardless of whether the vectors belong to the same document, correct?

I suppose we get document order for free, since all of a document's vectors have to be supplied at the same time, correct?

The reason I ask is that if we want to be able to iterate by document, being able to skip directly to a document and read its vectors is important, which wouldn't be easily possible unless all of a document's vectors were written right next to each other.

Contributor (Author):

As intended (and hopefully implemented), we get document order for free, since all of a document's vectors have to be supplied at the same time.
As with everything in this pull request, it's open to discussion, as my focus was to bring a working end-to-end starting milestone and then refine each step.

@uschindler (Contributor) left a comment:

I left a comment about the outdated Stack class. But my major concern is integer ordinals: like for SortedSetDocValues, we need long vector ordinals.

 import org.apache.lucene.util.FixedBitSet;
 import org.apache.lucene.util.RamUsageEstimator;

+import java.util.Stack;
Contributor:

Please don't use the outdated Stack class, which is also synchronized. The modern replacement is ArrayDeque, which implements the Deque interface.
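For example, the suggested swap, as a sketch:

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Unsynchronized LIFO stack; generally faster than java.util.Stack.
  Deque<Integer> stack = new ArrayDeque<>();
  stack.push(42);         // addFirst
  int top = stack.pop();  // removeFirst: same LIFO semantics, no monitor locking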

Contributor:

I opened #12404

   * @param ord vector ID (ordinal)
   * @return the document ID
   */
  public int ordToDoc(int ord) {
Contributor:

The problem with the new implementation is that ordinal numbers need to become longs (like in SortedSetDocValues), because each document of a 2B-doc index segment may have more than one value, so the total number of ordinals may go beyond 2B.

Of course this needs more work.
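For illustration, a hypothetical long-ordinal variant in the style of SortedSetDocValues (the names are not from this PR):

  import java.io.IOException;

  // With long ordinals, the total number of vectors per segment may exceed
  // Integer.MAX_VALUE even though document IDs themselves stay int.
  abstract class LongOrdVectorValues {

    /** Total number of vectors in the segment; may exceed Integer.MAX_VALUE. */
    abstract long size();

    /** The vector for the given long ordinal. */
    abstract float[] vectorValue(long ord) throws IOException;

    /** Maps a long vector ordinal back to its (int) document ID. */
    abstract int ordToDoc(long ord);
  }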

Successfully merging this pull request may close these issues.

Multi-value Support for KnnVectorField
8 participants