Integrating GPU based Vector Search using cuVS #14131

Open
wants to merge 5 commits into main
Conversation

@chatman chatman commented Jan 10, 2025

Description

  • NVIDIA's cuVS library (https://github.com/rapidsai/cuvs) is a state-of-the-art vector search library. It supports fast indexing and search using GPUs.
  • This pull request integrates the library into the Lucene sandbox via a custom KnnVectorsFormat and Codec (a usage sketch follows below).
  • cuVS is a C library. There is an in-progress Java API (to be released soon): "Initial cut for a cuVS Java API" (rapidsai/cuvs#450), which uses Project Panama for the integration.
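To make this concrete, here is a hedged usage sketch (not part of this PR) of how a GPU-backed format could be wired in through a per-field codec override; the Lucene101Codec base and the way the format instance is obtained are assumptions for illustration:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.index.IndexWriterConfig;

public final class CuVSUsageSketch {
  // Build an IndexWriterConfig that routes all vector fields to the GPU-backed format.
  static IndexWriterConfig configWith(KnnVectorsFormat cuvsFormat) {
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setCodec(
        new Lucene101Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return cuvsFormat; // the format this PR adds to the sandbox
          }
        });
    return iwc;
  }
}
```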

This is an in-progress PR at the moment. Here is a way to test it out:

  • Clone the cuvs repository from the PR branch.
  • ./build.sh libcuvs && ./build.sh java
  • (The above installs the cuvs-java artifacts into the local Maven repository.)
  • Compile and use this branch in an IDE.

TODO:

  • TestCuVS works via an IDE, but not via Gradle (some native-access security issues).
  • Make this branch work with the released version of cuvs-java, once it is available.
  • Add more tests.
  • Publish benchmarks.

This work was mainly done by @narangvivek10, @punAhuja, and me, with help from @cjnolet.


@@ -22,6 +22,7 @@ plugins {
}

repositories {
mavenLocal()
Contributor:
Remove mavenLocal before merging, if the merge happens. It causes issues that can be very cryptic and hard to diagnose (like different artifact hashes); it's going to be a major headache if it's left in.

Contributor Author:
Sure, Dawid, will add a nocommit comment there. I had added nocommits earlier, but had to remove them to make "./gradlew check" work. Is there a way to have it check everything other than the nocommits?

Contributor:
I think you can skip the entire task that looks for nocommits (and other things) by running ./gradlew check -x validateSourcePatterns

@navneet1v (Contributor) commented:
@chatman thanks for creating the PR. This looks very interesting. Is the idea here that Lucene will run on a GPU machine and use cuVS for vector search?

.withNumWriterThreads(cuvsWriterThreads)
.withIntermediateGraphDegree(intGraphDegree)
.withGraphDegree(graphDegree)
.withCagraGraphBuildAlgo(CagraGraphBuildAlgo.NN_DESCENT)
Contributor:
I have some experience with building the CAGRA index: NN_DESCENT is faster at index creation, but it has a high GPU memory footprint. Should we use IVF_PQ here? Or we could have a hybrid approach where, if the doc count is below some threshold, we use NN_DESCENT, and otherwise IVF_PQ; a sketch of that selection follows below.
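For illustration, the hybrid could look roughly like this; the cutoff value is a made-up placeholder that would need benchmarking, and the CagraGraphBuildAlgo enum follows the cuvs-java API visible in this PR's diff (IVF_PQ as an enum value is an assumption):

```java
import com.nvidia.cuvs.CagraIndexParams.CagraGraphBuildAlgo; // package per the cuvs-java module; an assumption

final class BuildAlgoSelectionSketch {
  // NN_DESCENT builds faster but needs more GPU memory; IVF_PQ scales to larger datasets.
  static CagraGraphBuildAlgo chooseBuildAlgo(int docCount) {
    final int NN_DESCENT_MAX_DOCS = 1_000_000; // hypothetical cutoff, not benchmarked
    return docCount < NN_DESCENT_MAX_DOCS
        ? CagraGraphBuildAlgo.NN_DESCENT
        : CagraGraphBuildAlgo.IVF_PQ;
  }
}
```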

Contributor Author:
Unfortunately, we encountered some crashes while trying IVF_PQ, hence NN_DESCENT as the default. Maybe we can make it configurable once those crashes are thoroughly investigated.

@benwtrent benwtrent (Member) left a comment:

This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:

  • tests for queries
  • tests for the format
  • preventing bad behavior (e.g. using byte[])

I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

Comment on lines +204 to +213
float vectors[][] = new float[field.vectors.size()][field.vectors.get(0).length];
for (int i = 0; i < vectors.length; i++) {
  for (int j = 0; j < vectors[i].length; j++) {
    vectors[i][j] = field.vectors.get(i)[j];
  }
}

cagraIndexBytes = createCagraIndex(vectors, new ArrayList<Integer>(field.vectors.keySet()));
bruteForceIndexBytes = createBruteForceIndex(vectors);
hnswIndexBytes = createHnswIndex(vectors);
Member:
From what I can tell, you load all the vectors onto heap. This is frankly untenable in production.

The amount of GC and JVM heap required would be enormous on medium-size indices (10M+ vectors).

Comment on lines +166 to +167
cagraIndexForHnsw =
    new CagraIndex.Builder(resources).withDataset(vectors).withIndexParams(indexParams).build();
Member:
The CAGRA index builder should instead accept a file handle or something similar; having to feed it on-heap vectors is not good.

Comment on lines +128 to +130
File tmpFile =
    File.createTempFile(
        "tmpindex", "cag"); // TODO: Should we make this a file with random names?
Member:
This fixed tmp-file name basically means you cannot have different fields with different settings building at the same time? Seems like a very bad idea; one possible fix is sketched below.
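A hedged sketch (nothing here comes from the PR): derive a unique temp file per segment and field so concurrent builds cannot collide. Files.createTempFile already appends a random component to the prefix; the naming scheme is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSketch {
  static Path newCagraTempFile(String segmentName, String fieldName) throws IOException {
    // Unique per segment and field; Files.createTempFile adds a random suffix component.
    Path tmp = Files.createTempFile("cagra-" + segmentName + "-" + fieldName + "-", ".cag");
    tmp.toFile().deleteOnExit(); // best-effort cleanup if the writer dies early
    return tmp;
  }
}
```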

this.segmentWriteState.segmentInfo.getId(),
this.segmentWriteState.segmentSuffix);

CuVSSegmentFile cuVSFile = new CuVSSegmentFile(new SegmentOutputStream(cuVSIndex, 100000));
Member:
Why not just use the built-in Lucene inputs/outputs? It seems weird to have a bespoke buffered output; a sketch of the built-in route follows below.
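As a hedged illustration of the built-in route (not the PR's code): the serialized index bytes could be written through Lucene's own Directory/IndexOutput machinery, which buffers internally and adds checksummed headers/footers. The "cuvs" extension and "CuVSData" codec name below are made up for the sketch:

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.store.IndexOutput;

final class BuiltInOutputSketch {
  static void writeIndexBytes(SegmentWriteState state, byte[] indexBytes) throws IOException {
    String name =
        IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, "cuvs");
    try (IndexOutput out = state.directory.createOutput(name, state.context)) {
      CodecUtil.writeIndexHeader(out, "CuVSData", 0, state.segmentInfo.getId(), state.segmentSuffix);
      out.writeBytes(indexBytes, indexBytes.length); // IndexOutput does its own buffering
      CodecUtil.writeFooter(out); // checksum, verifiable in checkIntegrity()
    }
  }
}
```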

Comment on lines +97 to +101
for (StackFrame s : stackTrace) {
  if (s.toString().startsWith("org.apache.lucene.index.IndexWriter.merge")) {
    isMergeCase = true;
    // log.info("Reader opening on merge call");
    break;
Member:
You should instead use getMergeInstance, which lets you set up merge-specific behavior for a given reader; see the sketch below.
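A hedged sketch of that approach against Lucene's KnnVectorsReader API; the class name is made up and everything cuVS-specific is stubbed out for brevity:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

final class MergeAwareReaderSketch extends KnnVectorsReader {
  private final boolean mergeMode; // true for the instance Lucene uses during merges

  MergeAwareReaderSketch(boolean mergeMode) {
    this.mergeMode = mergeMode;
  }

  @Override
  public KnnVectorsReader getMergeInstance() {
    // Lucene calls this when the reader is about to be used for a merge,
    // so no stack-trace inspection is needed.
    return new MergeAwareReaderSketch(true);
  }

  // Remaining abstract methods stubbed for the sketch.
  @Override public void checkIntegrity() throws IOException {}
  @Override public FloatVectorValues getFloatVectorValues(String field) throws IOException { return null; }
  @Override public ByteVectorValues getByteVectorValues(String field) throws IOException { return null; }
  @Override public void search(String field, float[] target, KnnCollector c, Bits acceptDocs) throws IOException {}
  @Override public void search(String field, byte[] target, KnnCollector c, Bits acceptDocs) throws IOException {}
  @Override public void close() throws IOException {}
}
```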

Comment on lines +116 to +123
private Map<String, List<CuVSIndex>> loadCuVSIndex(ZipInputStream zis, boolean isMergeCase)
    throws Throwable {
  Map<String, List<CuVSIndex>> ret = new HashMap<String, List<CuVSIndex>>();
  Map<String, CagraIndex> cagraIndexes = new HashMap<String, CagraIndex>();
  Map<String, BruteForceIndex> bruteforceIndexes = new HashMap<String, BruteForceIndex>();
  Map<String, HnswIndex> hnswIndexes = new HashMap<String, HnswIndex>();
  Map<String, List<Integer>> mappings = new HashMap<String, List<Integer>>();
  Map<String, List<float[]>> vectors = new HashMap<String, List<float[]>>();
Member:
From what I can tell, every time this reader is opened, you load everything onto the heap and THEN build the cuVS index from scratch?

Why not serialize the cuVS index itself? Doesn't it have a file format or something?

chatman commented Jan 17, 2025

> This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:
>
> • tests for queries
> • tests for the format
> • preventing bad behavior (e.g. using `byte[]`)
>
> I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

Indeed, Ben. This is WIP at the moment, and more tests are in progress. As for loading the entire index into a byte[], we are working with the NVIDIA/cuVS team to see whether streaming can be supported (right now it is not).

chatman commented Jan 17, 2025

> @chatman thanks for creating the PR. This looks very interesting. Is the idea here that Lucene will run on a GPU machine and use cuVS for vector search?

Yes, exactly.

chatman commented Jan 17, 2025

> I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

This is something we still need to work out. Here are my thoughts at the moment, and the things we need consensus on:

  • Right now, the cuvs-java dependency comes from the local Maven repository; it should come from Maven Central once the artifacts are available.
  • TODO: If a system doesn't have CUDA or GPUs, these codepaths should fail gracefully and indicate that support is not available.
  • Continuous testing can be enabled on GPU-enabled Jenkins instances (we can have a discussion around that).
  • To validate the integration at the API level, some mock tests (that simulate the same functionality using the CPU) can be added to the cuvs-java API.
  • We can discuss whether shipping this by default with release artifacts is a problem or not.

@uschindler uschindler (Contributor) left a comment:

In addition to Dawid's comments about the dependencies and the code, I have some major problems with the following:

  • Clean up the API and do not make everything public. Only the Codec and format classes need to be public; everything else can be pkg-private.
  • Remove public fields that are modifiable.
  • Make all fields final, if possible.
  • There is an issue with a static field which gets initialized in the ctor. This looks wrong!
  • Don't swallow exceptions and log errors. If you have no cuVS, fail hard. It makes no sense to proceed, because without the native library and a graphics adapter the whole codec won't work.

Anyway, I am happy to see this, and that the underlying library was moved to Panama. It was really a pleasure to talk to your colleagues last summer at Berlin Buzzwords!

dependencies {
  moduleApi project(':lucene:core')
  moduleApi project(':lucene:queries')
  moduleApi project(':lucene:facet')
  moduleTestImplementation project(':lucene:test-framework')
  moduleImplementation deps.commons.lang3
@uschindler uschindler (Contributor) commented Jan 18, 2025:
It would really be great to remove the commons-lang3 (bullshit) library. Sorry for this, but it is mostly useless: sometimes some of its rarely used functions are worth replicating (if they are of good use for other parts), but in general its broken null handling (it silently ignores nulls) and the lots of legacy code that can be written much more easily with modern Java 21 APIs do not justify the dependency.

My suggestion: the code here only uses SerializationUtils.deserialize() and SerializationUtils.serialize(). Maybe just copy those into the package's Util class, as they seem of little use elsewhere in Lucene, and remove the dependency; a minimal sketch follows below.
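A minimal sketch of such helpers, assuming plain java.io object serialization matching what the two commons-lang3 calls do:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationSketch {
  // Serialize an object to a byte[] using standard Java object serialization.
  static byte[] serialize(Serializable obj) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(obj);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return baos.toByteArray();
  }

  // Deserialize a byte[] produced by serialize().
  @SuppressWarnings("unchecked")
  static <T> T deserialize(byte[] bytes) {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (T) ois.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```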

public final String fieldName;
public final ConcurrentHashMap<Integer, float[]> vectors =
    new ConcurrentHashMap<Integer, float[]>();
public int fieldVectorDimension = -1;
Contributor:
This should be final and only initialized in the ctor.

try {
  format = new CuVSVectorsFormat(1, 128, 64, MergeStrategy.NON_TRIVIAL_MERGE);
  setKnnFormat(format);
} catch (LibraryNotFoundException ex) {
Contributor:
This should fail hard. If cuVS is not available, it should throw an exception and not just log a severe error; as written, this throws an NPE later when the knn format is retrieved.

return knnFormat;
}

public void setKnnFormat(KnnVectorsFormat format) {
Contributor:
No setters please: codecs should be unmodifiable! It should initialize in the ctor, or fail if the library cannot be loaded; see the sketch below.
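A hedged sketch of ctor-only initialization; the class name, the FilterCodec delegate, and loadCuVSFormatOrFail() are illustrative stand-ins, not the PR's code:

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;

public final class CuVSCodecSketch extends FilterCodec {
  private final KnnVectorsFormat knnFormat; // final: fixed at construction, no setter

  public CuVSCodecSketch() {
    super("CuVSCodecSketch", new Lucene101Codec());
    // Throws immediately if cuVS cannot be loaded, instead of logging a
    // severe error and hitting an NPE when the format is first retrieved.
    this.knnFormat = loadCuVSFormatOrFail();
  }

  @Override
  public KnnVectorsFormat knnVectorsFormat() {
    return knnFormat;
  }

  private static KnnVectorsFormat loadCuVSFormatOrFail() {
    // Placeholder for the actual cuVS format construction done in this PR.
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```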

public class CuVSSegmentFile implements AutoCloseable {
  private final ZipOutputStream zos;

  private Set<String> filesAdded = new HashSet<String>();
Contributor:
this must be final.

public final int cuvsWriterThreads;
public final int intGraphDegree;
public final int graphDegree;
public MergeStrategy mergeStrategy;
Contributor:
This should also be final. And why are this field and the previous ones public?

// protected Logger log = Logger.getLogger(getClass().getName());

IndexInput vectorDataReader = null;
public String fileName = null;
Contributor:
None of these fields should be public, and if they need to be public they should at least be final. I assume they can be package-private or private.


// protected Logger log = Logger.getLogger(getClass().getName());

private List<CagraFieldVectorsWriter> fieldVectorWriters = new ArrayList<>();
Contributor:
check which fields can be final.

*/
public class PerLeafCuVSKnnCollector implements KnnCollector {

public List<ScoreDoc> scoreDocs;
Contributor:
Again, why is all of this public?

/**
* Some Utils used in CuVS integration
*/
public class Util {
Contributor:
This class is internal; make it pkg-private.

@@ -20,6 +20,9 @@
requires org.apache.lucene.core;
requires org.apache.lucene.queries;
requires org.apache.lucene.facet;
requires java.logging;
requires com.nvidia.cuvs;
Contributor:
This makes cuVS a non-optional dependency. I would have expected this feature to be optional, i.e., if cuVS is not present you get a nice error message or something. I guess it somewhat depends on how the cuVS Java API and the native implementation are tied together? That is, can one use or even instantiate types from the cuVS Java API without the native library being present?

@uschindler uschindler (Contributor) commented Jan 20, 2025:

I don't see this as a problem in the sandbox module. The code itself should be written so that the codec can only be used if the native library is there. Currently the code is a bit broken, as it does not fail hard; I'd like to change this (see my review). But if the dependency is on Maven Central, it should work.

Of course we could also make it optional in the module system, but then you would get a CNFE when loading codecs or related classes, which would happen on SPI lookup.

Contributor:

@uschindler Agreed: this feature should be optional. If the Java API is present and loadable without the native library, and it has at least one callable method that we can call and catch, then that is fine. I'll try it out.

Contributor:

The current code catches this, but just logs a warning and then fails later (with an NPE). I argued that the Codec should fail hard, or at least fail when it is actually used.

This must be refactored before release:

  • Allow the Codec to load, but delay initialization of the native library until the codec is actually used to create components.
  • Fail hard when an index with the custom codec is loaded and there's no native access.

Contributor:

Looking at the code, I think the best approach would be the following: create a small static "holder" class (inside the Codec impl) which has a static initializer that loads and sets up the cuVS code. This should fail hard with some LinkageError if the library is not there. The holder class should have getters for the cuVS entry points.

All getter methods that create formats and readers in the Codec delegate to the getters in the holder and pass through any exceptions. In addition, all cuVS classes should be package-private (I mentioned this already); only the codec and essential config classes stay public. A sketch of the holder idea follows below.
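A hedged sketch of the holder idea; the names are illustrative, and ExceptionInInitializerError (a LinkageError subclass) stands in as the concrete error:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;

final class CuVSHolderSketch {
  static final KnnVectorsFormat KNN_FORMAT;

  static {
    try {
      // Load and set up the native cuVS entry points here (placeholder call).
      KNN_FORMAT = createCuVSFormat();
    } catch (Throwable t) {
      // Any failure becomes a hard LinkageError on first use of the holder,
      // so merely loading the Codec class via SPI stays safe and cheap.
      throw new ExceptionInInitializerError(t);
    }
  }

  private CuVSHolderSketch() {}

  // Codec getters delegate here and pass through any initialization error.
  static KnnVectorsFormat knnFormat() {
    return KNN_FORMAT; // class initialization already failed hard if cuVS is missing
  }

  private static KnnVectorsFormat createCuVSFormat() {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```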

Contributor:

The same for the postings formats and other public components loaded via SPI.
