Multi-value support for KnnVectorField #12314
Conversation
I have been working a bit on removing the need for "vectorMultiValued" as a special fieldInfo and index-time information. I may also have introduced regressions with these simplifications, but everything is in the git history, so it should be doable to fix them.
@alessandrobenedetti Thanks for starting this effort. We also found a need for multi-valued vector fields, and here is the list of questions we thought about for its usage. It would be nice to get your and others' ideas on them:
I guess attaching metadata to vectors is a different story, but I agree it could be a good idea!
Thanks for sharing and working on a prototype @alessandrobenedetti! I have additional questions and comments ;) Regarding the change itself, I wonder if the approach could be more explicit. Instead of hiding the new feature, I'd argue that we need to follow the doc values model, where single- and multi-valued are separated. In practice that would mean adding this to the KnnVectorsReader:
The writer side is a bit more tricky, but you see what I mean. It's a big change that impacts the implementation and expectations of many existing functions if it is inserted directly into the existing knn field. Another challenge for this change is the possible overflow in the graph representation. If we change the ordinal value to be the offset of the multi-valued position, the maximum allowed is not per doc anymore but per vector. We use int32 in the graph to represent these ordinals, which matches the max-doc limit of Lucene. Maybe not a big deal at small scale, but that's something to keep in mind when implementing things like merge or even multi-segment search. Anyway, quite exciting discussions; Ben and Mayya are reviewing too, so I'm just adding my 2 cents and happy to help if/when we reach a consensus.
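To make the int32 ordinal concern above concrete, here is a small, self-contained arithmetic sketch (not Lucene code; the 4-vectors-per-doc figure is an arbitrary assumption for illustration):

```java
// Illustration of the int32 ordinal overflow concern raised in review.
// Lucene caps a segment at Integer.MAX_VALUE documents; if graph ordinals
// count vectors instead of docs, even a modest number of vectors per
// document pushes the ordinal space past the int range.
public class OrdinalOverflowDemo {
  static long totalOrdinals(long maxDoc, long avgVectorsPerDoc) {
    return maxDoc * avgVectorsPerDoc;
  }

  public static void main(String[] args) {
    long maxDoc = Integer.MAX_VALUE;       // per-segment doc limit (~2.1B)
    long total = totalOrdinals(maxDoc, 4); // assume 4 vectors per doc
    System.out.println(total > Integer.MAX_VALUE); // prints "true": int ordinals overflow
  }
}
```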
@@ -22,6 +22,8 @@
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.RamUsageEstimator;

import java.util.Stack;
I think we should keep this class unchanged; it is used in other places besides vectors and is only about documents.
For keeping track of the number of values in docs, maybe we can use another data structure (PackedLongValues.Builder, similar to how SortedSetDocValuesWriter does it)?
I agree; for the time being I reverted the class and extended it. Feel free to have a look and apply/propose any change you see fit.
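As a rough illustration of the suggestion above, the per-document value counts could be accumulated by a small builder while indexing. This plain-Java sketch only mimics the pattern (the real code would presumably use PackedLongValues.Builder; all names here are hypothetical, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for tracking per-document value counts at index time,
// in the spirit of how SortedSetDocValuesWriter records value counts.
// Hypothetical names; not Lucene code.
public class PerDocValueCounts {
  // counts.get(i) = number of vectors of the i-th doc that has vectors
  private final List<Integer> counts = new ArrayList<>();

  // record that the doc just finished had `valueCount` vectors
  public void finishDoc(int valueCount) {
    counts.add(valueCount);
  }

  public int valueCount(int denseIndex) {
    return counts.get(denseIndex);
  }

  public int totalValues() {
    int total = 0;
    for (int c : counts) total += c;
    return total;
  }
}
```

The presence of such a structure could itself signal that a field is multi-valued, avoiding a separate flag.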
private int currentDocVectorsCount;
private int vectorsCount;

private boolean multiValued = false;
It is also strange to have this multiValued field; we probably should just signal multi-values by the presence/absence of another data structure that keeps the number of values per doc.
I don't have a strong opinion about this; it's a simple change if necessary.
DocIdSetIterator iterator = docsWithField.iterator();
int[] valuesPerDocument = docsWithField.getValuesPerDocument();
This seems to use a lot of memory; maybe, for inspiration, we can look at how the SortedSetDocValuesWriter class writes multiple values.
I'll check later on, happy to use a different implementation
I proceeded with some additional refactoring and refinements that you can find in the latest commits.
Ale: I guess the use case for multi-valued vector fields is not much different from any other multi-valued field:
That was the initial approach: it was explicit at index and query time, and the code paths were separate from the single-valued use case, so it did not affect the single-valued scenario at all.
Yes, with one vector per document the maximum number of supported vectors was in line with the number of docs. This is still the case, but I agree that right now we potentially support fewer docs; happy to change that.
Thanks for the feedback!
My main worry is the change to
Well, I don't have a good idea for changing that if we expect to have an ordinal per value. Switching the ordinal in the HNSW graph to int64 is not really practical, and adding a sub-id to each ordinal also seems quite invasive. Again, not saying that it should block the change, but we need to be clear on the downside of this approach and also ensure that we have the right guards for overflows.
Hi @jimczi, nothing in this PR is final, nor do I have any strong opinion about it. In regards to your main worry, can you point me to the specific areas of code you don't like, and I can think about how to modify them?
Sure, we're just discussing the approach, no worries.
This change in [Float|Byte]VectorValues: today the expectation is that it iterates over doc ids. This change adds an indirection that needs to be checked (
Just a note on the aggregation functions, @alessandrobenedetti. Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems that the art was abandoned a couple of years ago, with most approaches settling on
Hi @joshdevins, I didn't have time to dive deeper into the aggregation strategies, so I don't have references yet. Your observation regarding 'SUM' is correct.
Thinking more on this implementation: it seems like we will need, at a minimum, a new NeighborQueue. I am not sure the existing one needs to be updated; instead we should have a MultiValueNeighborQueue.
The reason for this is that this queue not only contains information about result sets, it also keeps track of how many nodes are visited, and the TopHits returned utilize that number. Consequently, the visited count should track documents visited, not vectors visited. All these changes indicate a new queueing mechanism for multi-valued vector fields.
Another thought is that Lucene already has the concept of index-time join, effectively creating child document IDs under a single parent. This allows for even greater flexibility by indexing the passage the vector represents, and potentially even greater scoring flexibility.
The issue I could see here is ensuring the top-docs search has the ability to deduplicate (if desired) based on parent document ID.
Did you consider utilizing this when digging into this implementation?
I think it's a good idea to create a new dedicated MultiValueNeighborQueue; I'll do it when I have time, but feel free to do it yourself if you like! In regards to index-time join, I am not sure it's relevant here (are you talking about block join?):
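A possible shape for such a dedicated queue, sketched in plain Java: collapse candidate (vector ordinal, score) pairs onto their parent doc, keeping the best score per doc. The ord-to-doc mapping is modeled as a plain array and the MAX strategy is hard-coded; nothing here is actual Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of what a MultiValueNeighborQueue could do.
// Not Lucene code; ordToDoc is modeled as a simple lookup array.
public class MultiValueNeighborQueueSketch {
  private final int[] ordToDoc;                     // vector ordinal -> doc id
  private final Map<Integer, Float> best = new HashMap<>();

  public MultiValueNeighborQueueSketch(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  // add a scored vector; keep only the best score seen per document
  public void add(int vectorOrd, float score) {
    best.merge(ordToDoc[vectorOrd], score, Math::max); // MAX strategy
  }

  public Map<Integer, Float> results() {
    return best;
  }
}
```

A real queue would also need to track visited documents (rather than visited vectors), as noted in the review comment above.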
@Override
public Map<Integer, Float> scoreMultiValued(BitSet acceptedDocs) throws IOException {
  Map<Integer, Float> docToScore = new HashMap<>();
  for (int vectorId = values.nextDoc(); vectorId != NO_MORE_DOCS; vectorId = values.nextDoc()) {
    int docID = values.ordToDoc(vectorId);
    if (acceptedDocs.get(docID)) {
      float currentScore = similarity.compare(query, values.vectorValue());
      docToScore.putIfAbsent(docID, currentScore);
      docToScore.computeIfPresent(
          docID, (key, previousScore) -> Math.max(previousScore, currentScore));
    }
  }
  return docToScore;
}
I don't see how we can do this here. Instead of randomly iterating the BitSet and going to the exact document, we are iterating every vector in the scorer and checking its membership in acceptedDocs.
Even multi-valued *VectorValues classes need to be able to iterate to the exact document; otherwise searching with a filter like this is untenable.
Yep, I didn't spend much time on this; I drafted a few lines of code just to make the exact-search tests green.
Happy to get suggestions/other people's implementations here!
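One possible direction for the concern above is a doc-ordered exact scorer that advances only to the accepted documents and reads their vectors via a contiguous per-doc layout. This is a hedged plain-Java sketch, not the PR's code; the storage model (docIds/starts/vectors arrays) and the dot-product similarity are invented for illustration:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical doc-ordered exact scorer: instead of scanning every vector and
// testing membership in a filter, advance through the accepted docs and read
// only their vectors. Assumes all vectors of a doc are stored contiguously,
// with starts[i]..starts[i+1] delimiting the i-th doc's range.
public class DocOrderedExactScorer {
  static Map<Integer, Float> score(
      int[] docIds,        // sorted doc ids that have vectors
      int[] starts,        // starts[i]..starts[i+1]: vector range of docIds[i]
      float[][] vectors,   // flat vector storage
      float[] query,
      int[] acceptedDocs) {
    Map<Integer, Float> docToScore = new LinkedHashMap<>();
    for (int accepted : acceptedDocs) {
      int idx = Arrays.binarySearch(docIds, accepted);
      if (idx < 0) continue; // accepted doc has no vectors
      for (int v = starts[idx]; v < starts[idx + 1]; v++) {
        docToScore.merge(accepted, dotProduct(query, vectors[v]), Math::max);
      }
    }
    return docToScore;
  }

  static float dotProduct(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }
}
```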
private static DocsWithVectorsSet writeByteVectorData(
    IndexOutput output, ByteVectorValues mergedVectorValues) throws IOException {
  DocsWithVectorsSet docsWithVectors = new DocsWithVectorsSet();
  for (int vectorId = mergedVectorValues.nextDoc();
      vectorId != NO_MORE_DOCS;
      vectorId = mergedVectorValues.nextDoc()) {
    int docID = mergedVectorValues.ordToDoc(vectorId);
    byte[] binaryValue = mergedVectorValues.vectorValue();
    assert binaryValue.length == mergedVectorValues.dimension() * VectorEncoding.BYTE.byteSize;
    output.writeBytes(binaryValue, binaryValue.length);
    docsWithVectors.add(docID);
  }
  return docsWithVectors;
}
From what I can tell, this simply writes each vector value in order, regardless of whether the vectors are part of the same document, correct?
I suppose we get the document order for free since they all have to be supplied at the same time, correct?
The reason I ask is that if we want to be able to iterate by document, being able to skip directly to a document and read its vectors is important, which wouldn't be easily possible unless all the vectors in a document were written right next to each other.
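If vectors of a document are indeed written contiguously, a prefix sum over per-doc counts is enough to skip straight to any document's vectors with a single seek. A minimal sketch with illustrative names, not Lucene API:

```java
// Sketch of why contiguous per-document vector storage matters: with a
// prefix sum of per-doc value counts, all vectors of a document can be
// located in O(1) once the doc's dense index is known.
public class ContiguousLayout {
  // counts[i] = number of vectors of the i-th doc (in doc order);
  // returns starts where doc i's vectors occupy flat positions
  // [starts[i], starts[i + 1])
  static int[] startOffsets(int[] counts) {
    int[] starts = new int[counts.length + 1];
    for (int i = 0; i < counts.length; i++) {
      starts[i + 1] = starts[i] + counts[i];
    }
    return starts;
  }

  public static void main(String[] args) {
    int[] starts = startOffsets(new int[] {2, 1, 3});
    // doc 2's vectors live at flat positions [3, 6): one seek, then a sequential read
    System.out.println(starts[2] + ".." + starts[3]); // prints "3..6"
  }
}
```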
As far as I intended and hopefully implemented, we get the document order for free since all vectors have to be supplied at the same time.
Like everything in this pull request, it's open to discussion, as my focus was to bring a first milestone working end to end and then refine each step.
I left a comment about the outdated Stack class, but my major concern is integer ordinals. As for SortedSetDocValues, we need long vector ordinals.
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.RamUsageEstimator;

import java.util.Stack;
Please don't use the outdated Stack class, which is also synchronized. The modern replacement is ArrayDeque, implementing the Deque interface.
I opened #12404
 * @param ord vector ID (ordinal)
 * @return the document ID
 */
public int ordToDoc(int ord) {
The problem with the new implementation is that ordinal numbers need to become longs (like in SortedSetDocValues), because each document of a 2B-docs index segment may have more than one value, so the total number of ordinals may go beyond 2B.
Of course this needs more work.
Description
This pull request aims to introduce support for multiple values in a single KnnVectorField.
The adopted solution relies on:
Index time
A sparse vector values approach, where an ordinal (vectorId) to docId map is used to keep the relation between a docId and all its vectors.
In the current sparse vector approach, we have just one vectorId per docId.
In this proposed contribution, multiple vectorIds are mapped to the same docId.
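One way the ordinal-to-docId map described above can be realized is with cumulative per-doc vector counts plus a binary search. This is a hedged sketch, not necessarily how the PR implements it; long ordinals are used here, anticipating the overflow concern raised in review:

```java
import java.util.Arrays;

// Illustrative ord -> doc mapping over cumulative per-doc vector counts.
// cumulative[i] = total number of vectors in docs 0..i, so the vectors of
// doc i are the ordinals in [cumulative[i-1], cumulative[i]).
public class OrdToDocSketch {
  static int ordToDoc(long[] cumulative, long ord) {
    // find the first doc whose cumulative count exceeds ord
    int idx = Arrays.binarySearch(cumulative, ord + 1);
    return idx >= 0 ? idx : -idx - 1;
  }
}
```

For docs with 2, 1, and 3 vectors (cumulative = {2, 3, 6}), ordinals 0-1 map to doc 0, ordinal 2 to doc 1, and ordinals 3-5 to doc 2.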
Query time
A multi-valued strategy choice is offered to the user: MAX/SUM.
In exact nearest-neighbor search, for each document accepted by the query/filter:
MAX = the similarity score between the query and each vector is computed, and the max score is chosen for the search result.
SUM = the similarity score between the query and each vector is computed, and all scores are summed to get the final score.
In approximate nearest-neighbor search, for each document accepted by the query/filter:
MAX = every time we find a nearest-neighbor vector to be added to the topK, if the document is already there, its score is updated, keeping the maximum of the previous and new scores.
SUM = every time we find a nearest-neighbor vector to be added to the topK, if the document is already there, its score is updated by summing the old and new scores.
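The MAX/SUM semantics above can be sketched over (docId, score) pairs produced by per-vector scoring. This is an illustration of the aggregation step only, not the PR's implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the MAX/SUM multi-valued aggregation strategies:
// collapse per-vector scores onto their parent document.
public class MultiValueAggregation {
  enum Strategy { MAX, SUM }

  static Map<Integer, Float> aggregate(int[] docIds, float[] scores, Strategy strategy) {
    Map<Integer, Float> result = new HashMap<>();
    for (int i = 0; i < docIds.length; i++) {
      if (strategy == Strategy.MAX) {
        result.merge(docIds[i], scores[i], Math::max);   // keep the best vector's score
      } else {
        result.merge(docIds[i], scores[i], Float::sum);  // accumulate all vectors' scores
      }
    }
    return result;
  }
}
```

Note that SUM rewards documents with many moderately similar vectors, while MAX ranks purely by the single best-matching vector, which matches the observation about 'SUM' earlier in the thread.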
N.B. This pull request is not meant to be ready to merge at this stage.
I can identify at least this set of activities before this draft can move to a 'production ready' version:
It's a big contribution, and it will take time and effort to complete.
It's the result of many months of work/fun (on and off). I am happy if something good comes from it; not necessarily the adoption of the whole idea, even if just some bits and pieces end up being beneficial, that's cool.
Any help is welcome.