[Enhancement] Optimize the de-serialization of vector when reading from Doc Values #1050
Comments
Caching could potentially be tricky. I wonder if we should first evaluate Lucene's .vec approach and potentially move towards that instead of implementing caching first. In the long run, I think we should convert our codec to KnnVectorsFormat anyway.
@jmazanec15 Yes, caching can be tricky. I added some ideas that came to mind while thinking about how to reduce the latency. We can very well ignore them and move to a better solution.
Agree we should move to KnnVectorsFormat in the long term and piggyback on the improvements from Lucene. Should we create a dedicated GitHub issue for merging our codec with the vector classes used by Lucene?
@vamshin we can do that. But we need to make sure that we are not losing focus on what we want to achieve. Plus, BWC can be a challenge here.
@navneet1v I think merging to KnnVectorsFormat is one of the solutions to get optimizations. If we think it's a long-term solution, I don't see a problem with having a dedicated GitHub issue. I am fine if we track it as part of the same issue as well.
@vamshin I am not saying we shouldn't create a separate issue. I just want to make sure that we are not losing focus on why we want to merge the format. Also, merging formats is a big change. Given that de-serialization is the long pole, should we consider other alternatives too? Also, will having the same vector format solve the core problem? What if Lucene also has the same issue? There are unknowns here, so I first want to make sure that we have exhausted all the options.
@navneet1v @bugmakerrrrrr I'd be interested to see if, with #1573, the deserialization of vector values is faster or slower than our BinaryDocValues. I think the above experiments would be hard to retry with vector values, but I'm wondering if we could just run with script scoring instead of with Faiss and check the latency diff.
+1, we can do that. We have added a micro-benchmarks module in k-NN; can we leverage that? @jmazanec15 what are your thoughts? Or should we go with a script score query and do this on a cluster?
@navneet1v I think micro-benchmarks would be the right way to see the diff. I started trying this comparison here: https://github.com/jmazanec15/k-NN-1/blob/micro-benchmarks/micro-benchmarks/src/main/java/org/opensearch/knn/VectorSerdeBenchmarks.java, but I'm not sure it is quite an apples-to-apples comparison, so I'm not pursuing it any further. I think we would need to figure out how to properly set up this class to run the test: https://github.com/opensearch-project/k-NN/blob/main/src/main/java/org/opensearch/knn/index/KNNVectorDVLeafFieldData.java#L24-L28. It might get a little tricky initializing the LeafReader and everything, though.
We could potentially re-use logic from the tests: https://github.com/opensearch-project/k-NN/blob/main/src/test/java/org/opensearch/knn/index/KNNVectorScriptDocValuesTests.java#L54
@jmazanec15 I think we can do something like this: https://github.com/opensearch-project/OpenSearch/blob/main/benchmarks/src/main/java/org/opensearch/benchmark/index/mapper/CustomBinaryDocValuesFieldBenchmark.java#L72-L78 to create the doc values for the micro-benchmarks.
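As context for what such a micro-benchmark would time, here is a minimal sketch of the serialize/deserialize round trip in plain Java. The class and method names are hypothetical; the plugin's real path goes through its own serializer and BinaryDocValues rather than this helper.

```java
import java.nio.ByteBuffer;

// Sketch of the round trip a micro-benchmark would measure: float[] -> byte[]
// (the form stored in binary doc values) and back. Illustrative only.
class VectorSerde {

    // Serialize a float vector into a byte array.
    public static byte[] toBytes(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Deserialize back to float[] -- the step the experiments found dominant.
    public static float[] toFloats(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }
}
```

A JMH benchmark built on the linked doc-values setup would essentially loop over `toFloats` for a batch of stored vectors.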
@navneet1v Ah nice, that looks good.
@navneet1v Additionally, I think we should see if we can integrate with the IndexFieldDataCache for caching purposes and benchmark this. It seems index field data can possibly be auto-cached: https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/index/fielddata/IndexFieldData.java#L85-L88. I'll come up with a set of benchmarks for this if you haven't already started.
FieldDataCache, in my understanding, is an old Lucene concept that was used before DocValues came in. You can do the benchmarks, but to my knowledge it's not used now. I would say let's come up with another cache layer where we can cache the vectors deserialized from DocValues.
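A minimal sketch of what such a cache layer could look like, assuming vectors are keyed by doc id within a segment. This is hypothetical: eviction, memory accounting via circuit breakers, and invalidation on segment close are deliberately omitted and would all be required in a real implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Hypothetical per-segment cache of deserialized vectors. The loader
// represents the expensive step: reading bytes from doc values and
// deserializing them into a float[].
class DeserializedVectorCache {
    private final Map<Integer, float[]> cache = new ConcurrentHashMap<>();
    private final IntFunction<float[]> loader;

    DeserializedVectorCache(IntFunction<float[]> loader) {
        this.loader = loader;
    }

    // Deserialize on first access; return the cached float[] afterwards.
    public float[] get(int docId) {
        return cache.computeIfAbsent(docId, loader::apply);
    }

    public int size() {
        return cache.size();
    }
}
```

The design question from the thread remains open: whether this should live behind IndexFieldDataCache or as a separate layer owned by the plugin.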
I decided to start by running some end-to-end benchmarks to get an initial sense of the perf difference between the two vector representations. I created a repo for running some of these experiments from Docker: https://github.com/jmazanec15/opensearch-knn-rescore-experiments. The repo is very basic right now, but I ran some preliminary tests and the initial results suggest that the KnnVectorsFormat may perform better for script scoring:
(ref) I will iterate on these experiments and have some higher-fidelity results in the next week or two.
@jmazanec15 thanks for sharing the initial results. Can we capture the flame graphs for these runs?
Description
In reference to the experiments done here: #1049, I found out that when we flip to exact search and use the doc values to do the exact search, we spend roughly 60-70% of the time on de-serialization of vectors from the binary format to a float array. A few solutions were proposed as part of that work; I am capturing them in this issue. Experiment details:
Experiment 1 (Baseline to Find latency for doing distance calculation on the different size dataset with different dimensions)
Below are the steps that are performed to run this experiment:
Experiment 2 (Doing de-serialization on vectors and then doing distance calculation)
Below are the steps that are performed to run this experiment:
So, the query latency includes:
Combining the Results of Experiment 1 and Experiment 2
Time Spent in De-Serialization
Experiment Observations
Time Spent in De-Serialization (%)
Experiment Conclusion
Possible solutions:
Benefits
This feature will not only boost filtered search but will also help reduce the latency of the k-NN script score query, where we read these vectors to perform the exact search.
This will also give us some gains during segment merge operations, as we read the binary doc values there too.
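To make the cost described above concrete, the exact-search scoring loop can be sketched as: for each candidate document, deserialize the stored bytes into a float[] (the step measured at roughly 60-70% of query time) and then compute the distance to the query vector. Names here are illustrative, not the plugin's actual classes, and squared L2 stands in for whatever space type is configured.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Illustrative exact-search path: deserialize, then score.
class ExactSearchSketch {

    // Squared Euclidean distance, as used for L2 scoring.
    public static float squaredL2(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // The dominant cost: byte[] from doc values -> float[].
    public static float[] deserialize(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] v = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < v.length; i++) {
            v[i] = buf.getFloat();
        }
        return v;
    }

    // Score every stored vector against the query.
    public static float[] scoreAll(List<byte[]> stored, float[] query) {
        float[] scores = new float[stored.size()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = squaredL2(deserialize(stored.get(i)), query);
        }
        return scores;
    }
}
```

Any of the proposed solutions (caching, or moving to KnnVectorsFormat) aims to take the `deserialize` call out of this per-document loop.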