Integrating GPU based Vector Search using cuVS #14131

Open
wants to merge 5 commits into main
Conversation

@chatman chatman commented Jan 10, 2025

Description

  • NVIDIA's cuVS library (https://github.com/rapidsai/cuvs) is a state-of-the-art vector search library. It supports fast indexing and search using GPUs.
  • This pull request integrates the library into the Lucene sandbox via a custom KnnVectorsFormat and Codec (a usage sketch follows below).
  • cuVS is a C library. There is an in-progress Java API (to be released soon): "Initial cut for a cuVS Java API" (rapidsai/cuvs#450), which uses Project Panama for the integration.
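To make this concrete, here is a hedged usage sketch (not part of this PR) of how a GPU-backed format could be wired in through a per-field codec override; the Lucene101Codec base and the way the format instance is obtained are assumptions for illustration:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;
import org.apache.lucene.index.IndexWriterConfig;

public final class CuVSUsageSketch {
  // Build an IndexWriterConfig that routes all vector fields to the GPU-backed format.
  static IndexWriterConfig configWith(KnnVectorsFormat cuvsFormat) {
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setCodec(
        new Lucene101Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return cuvsFormat; // the format this PR adds to the sandbox
          }
        });
    return iwc;
  }
}
```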

This is an in-progress PR at the moment. Here is a way to test it out:

  • Clone the cuvs repository from the PR branch.
  • ./build.sh libcuvs && ./build.sh java
  • (The above installs the cuvs-java artifacts into the local Maven repository.)
  • Compile and use this branch in an IDE.

TODO:

  • TestCuVS works via an IDE, but not via Gradle (some native-access security issues).
  • Make this branch work with the released version of cuvs-java, once it is available.
  • Add more tests.
  • Publish benchmarks.

This work was mainly done by @narangvivek10, @punAhuja, and me, with help from @cjnolet.


@@ -22,6 +22,7 @@ plugins {
}

repositories {
mavenLocal()
Contributor:
Remove mavenLocal before merging, if the merge happens. It causes issues that can be very cryptic and hard to diagnose (like different artifact hashes); it's going to be a major headache if it's left in.

Contributor Author:
Sure, Dawid, will add a nocommit comment there. I had added nocommits earlier, but had to remove them to make "./gradlew check" work. Is there a way to have it check everything other than the nocommits?

Contributor:
I think you can skip the entire task that looks for nocommits (and other things) by running ./gradlew check -x validateSourcePatterns

@navneet1v (Contributor) commented:
@chatman thanks for creating the PR. This looks very interesting. Is the idea here that Lucene will run on a GPU machine and use cuVS for vector search?

.withNumWriterThreads(cuvsWriterThreads)
.withIntermediateGraphDegree(intGraphDegree)
.withGraphDegree(graphDegree)
.withCagraGraphBuildAlgo(CagraGraphBuildAlgo.NN_DESCENT)
Contributor:
I have some experience with building the CAGRA index: NN_DESCENT is faster at index creation, but it has a high GPU memory footprint. Should we use IVF_PQ here? Or we could have a hybrid approach where, if the doc count is below some threshold, we use NN_DESCENT, and otherwise IVF_PQ; a sketch of that selection follows below.
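For illustration, the hybrid could look roughly like this; the cutoff value is a made-up placeholder that would need benchmarking, and the CagraGraphBuildAlgo enum follows the cuvs-java API visible in this PR's diff (IVF_PQ as an enum value is an assumption):

```java
import com.nvidia.cuvs.CagraIndexParams.CagraGraphBuildAlgo; // package per the cuvs-java module; an assumption

final class BuildAlgoSelectionSketch {
  // NN_DESCENT builds faster but needs more GPU memory; IVF_PQ scales to larger datasets.
  static CagraGraphBuildAlgo chooseBuildAlgo(int docCount) {
    final int NN_DESCENT_MAX_DOCS = 1_000_000; // hypothetical cutoff, not benchmarked
    return docCount < NN_DESCENT_MAX_DOCS
        ? CagraGraphBuildAlgo.NN_DESCENT
        : CagraGraphBuildAlgo.IVF_PQ;
  }
}
```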

Contributor Author:
Unfortunately, we encountered some crashes while trying IVF_PQ, hence NN_DESCENT as the default. Maybe we can make it configurable once those crashes are thoroughly investigated.

@benwtrent benwtrent (Member) left a comment:

This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:

  • tests for queries
  • tests for the format
  • preventing bad behavior (e.g. using byte[])

I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

Comment on lines +204 to +213
float vectors[][] = new float[field.vectors.size()][field.vectors.get(0).length];
for (int i = 0; i < vectors.length; i++) {
  for (int j = 0; j < vectors[i].length; j++) {
    vectors[i][j] = field.vectors.get(i)[j];
  }
}

cagraIndexBytes = createCagraIndex(vectors, new ArrayList<Integer>(field.vectors.keySet()));
bruteForceIndexBytes = createBruteForceIndex(vectors);
hnswIndexBytes = createHnswIndex(vectors);
Member:
From what I can tell, you load all the vectors onto heap. This is frankly untenable in production.

The amount of GC and JVM heap required would be enormous on medium-size indices (10M+ vectors).

Comment on lines +166 to +167
cagraIndexForHnsw =
    new CagraIndex.Builder(resources).withDataset(vectors).withIndexParams(indexParams).build();
Member:
The CAGRA index builder should instead accept a file handle or something similar; having to feed it on-heap vectors is not good.

Comment on lines +128 to +130
File tmpFile =
    File.createTempFile(
        "tmpindex", "cag"); // TODO: Should we make this a file with random names?
Member:
This fixed tmp-file name basically means you cannot have different fields with different settings building at the same time? Seems like a very bad idea; one possible fix is sketched below.
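A hedged sketch (nothing here comes from the PR): derive a unique temp file per segment and field so concurrent builds cannot collide. Files.createTempFile already appends a random component to the prefix; the naming scheme is illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class TempFileSketch {
  static Path newCagraTempFile(String segmentName, String fieldName) throws IOException {
    // Unique per segment and field; Files.createTempFile adds a random suffix component.
    Path tmp = Files.createTempFile("cagra-" + segmentName + "-" + fieldName + "-", ".cag");
    tmp.toFile().deleteOnExit(); // best-effort cleanup if the writer dies early
    return tmp;
  }
}
```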

this.segmentWriteState.segmentInfo.getId(),
this.segmentWriteState.segmentSuffix);

CuVSSegmentFile cuVSFile = new CuVSSegmentFile(new SegmentOutputStream(cuVSIndex, 100000));
Member:
Why not just use the built-in Lucene inputs/outputs? It seems weird to have a bespoke buffered output; a sketch of the built-in route follows below.
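As a hedged illustration of the built-in route (not the PR's code): the serialized index bytes could be written through Lucene's own Directory/IndexOutput machinery, which buffers internally and adds checksummed headers/footers. The "cuvs" extension and "CuVSData" codec name below are made up for the sketch:

```java
import java.io.IOException;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.index.SegmentWriteState;
import org.apache.lucene.store.IndexOutput;

final class BuiltInOutputSketch {
  static void writeIndexBytes(SegmentWriteState state, byte[] indexBytes) throws IOException {
    String name =
        IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, "cuvs");
    try (IndexOutput out = state.directory.createOutput(name, state.context)) {
      CodecUtil.writeIndexHeader(out, "CuVSData", 0, state.segmentInfo.getId(), state.segmentSuffix);
      out.writeBytes(indexBytes, indexBytes.length); // IndexOutput does its own buffering
      CodecUtil.writeFooter(out); // checksum, verifiable in checkIntegrity()
    }
  }
}
```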

Comment on lines +97 to +101
for (StackFrame s : stackTrace) {
  if (s.toString().startsWith("org.apache.lucene.index.IndexWriter.merge")) {
    isMergeCase = true;
    // log.info("Reader opening on merge call");
    break;
Member:
You should instead use getMergeInstance, which lets you set up merge-specific behavior for a given reader; see the sketch below.
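A hedged sketch of that approach against Lucene's KnnVectorsReader API; the class name is made up and everything cuVS-specific is stubbed out for brevity:

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.search.KnnCollector;
import org.apache.lucene.util.Bits;

final class MergeAwareReaderSketch extends KnnVectorsReader {
  private final boolean mergeMode; // true for the instance Lucene uses during merges

  MergeAwareReaderSketch(boolean mergeMode) {
    this.mergeMode = mergeMode;
  }

  @Override
  public KnnVectorsReader getMergeInstance() {
    // Lucene calls this when the reader is about to be used for a merge,
    // so no stack-trace inspection is needed.
    return new MergeAwareReaderSketch(true);
  }

  // Remaining abstract methods stubbed for the sketch.
  @Override public void checkIntegrity() throws IOException {}
  @Override public FloatVectorValues getFloatVectorValues(String field) throws IOException { return null; }
  @Override public ByteVectorValues getByteVectorValues(String field) throws IOException { return null; }
  @Override public void search(String field, float[] target, KnnCollector c, Bits acceptDocs) throws IOException {}
  @Override public void search(String field, byte[] target, KnnCollector c, Bits acceptDocs) throws IOException {}
  @Override public void close() throws IOException {}
}
```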

Comment on lines +116 to +123
private Map<String, List<CuVSIndex>> loadCuVSIndex(ZipInputStream zis, boolean isMergeCase)
    throws Throwable {
  Map<String, List<CuVSIndex>> ret = new HashMap<String, List<CuVSIndex>>();
  Map<String, CagraIndex> cagraIndexes = new HashMap<String, CagraIndex>();
  Map<String, BruteForceIndex> bruteforceIndexes = new HashMap<String, BruteForceIndex>();
  Map<String, HnswIndex> hnswIndexes = new HashMap<String, HnswIndex>();
  Map<String, List<Integer>> mappings = new HashMap<String, List<Integer>>();
  Map<String, List<float[]>> vectors = new HashMap<String, List<float[]>>();
Member:
From what I can tell, every time this reader is opened, you load everything onto the heap and THEN build the cuVS index from scratch?

Why not serialize the cuVS index itself? Doesn't it have a file format or something?

chatman commented Jan 17, 2025

> This seems pretty far from ready yet. I left some comments on some glaring issues. However, there are other things like:
>
> • tests for queries
> • tests for the format
> • preventing bad behavior (e.g. using `byte[]`)
>
> I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

Indeed, Ben. This is WIP at the moment, and more tests are in progress. As for loading the entire index into a byte[], we are working with the NVIDIA/cuVS team to see whether streaming can be supported (right now it is not).

chatman commented Jan 17, 2025

> @chatman thanks for creating the PR. This looks very interesting. Is the idea here that Lucene will run on a GPU machine and use cuVS for vector search?

Yes, exactly.

chatman commented Jan 17, 2025

> I haven't touched on the validity of having an NVIDIA-only, GPU-backed index directly in the Lucene sandbox. The new dependencies are huge. Does whoever downloads and builds Lucene now have to download and build with these? I am unsure how the sandbox works.

This is something we still need to work out. Here are my thoughts at the moment, and the things we need consensus on:

  • Right now, the cuvs-java dependency comes from the local Maven repository; it should come from Maven Central once the artifacts are available.
  • TODO: If a system doesn't have CUDA or GPUs, these codepaths should fail gracefully and indicate that support is not available.
  • Continuous testing can be enabled on GPU-enabled Jenkins instances (we can have a discussion around that).
  • To validate the integration at the API level, some mock tests (that simulate the same functionality using the CPU) can be added to the cuvs-java API.
  • We can discuss whether shipping this by default with release artifacts is a problem or not.

@uschindler uschindler (Contributor) left a comment:

In addition to Dawid's comments about the dependencies and the code, I have some major problems with the following:

  • Clean up the API and do not make everything public. Only the Codec and format classes need to be public; everything else can be pkg-private.
  • Remove public fields that are modifiable.
  • Make all fields final, if possible.
  • There is an issue with a static field which gets initialized in the ctor. This looks wrong!
  • Don't swallow exceptions and log errors. If you have no cuVS, fail hard. It makes no sense to proceed, because without the native library and a graphics adapter the whole codec won't work.

Anyway, I am happy to see this, and that the underlying library was moved to Panama. It was really a pleasure to talk to your colleagues last summer at Berlin Buzzwords!

dependencies {
  moduleApi project(':lucene:core')
  moduleApi project(':lucene:queries')
  moduleApi project(':lucene:facet')
  moduleTestImplementation project(':lucene:test-framework')
  moduleImplementation deps.commons.lang3
@uschindler uschindler (Contributor) commented Jan 18, 2025:
It would really be great to remove the commons-lang3 (bullshit) library. Sorry for this, but it is mostly useless: sometimes some of its rarely used functions are worth replicating (if they are of good use for other parts), but in general its broken null handling (it silently ignores nulls) and the lots of legacy code that can be written much more easily with modern Java 21 APIs do not justify the dependency.

My suggestion: the code here only uses SerializationUtils.deserialize() and SerializationUtils.serialize(). Maybe just copy those into the package's Util class, as they seem of little use elsewhere in Lucene, and remove the dependency; a minimal sketch follows below.
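A minimal sketch of such helpers, assuming plain java.io object serialization matching what the two commons-lang3 calls do:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class SerializationSketch {
  // Serialize an object to a byte[] using standard Java object serialization.
  static byte[] serialize(Serializable obj) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
      oos.writeObject(obj);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return baos.toByteArray();
  }

  // Deserialize a byte[] produced by serialize().
  @SuppressWarnings("unchecked")
  static <T> T deserialize(byte[] bytes) {
    try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return (T) ois.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }
}
```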

public final String fieldName;
public final ConcurrentHashMap<Integer, float[]> vectors =
    new ConcurrentHashMap<Integer, float[]>();
public int fieldVectorDimension = -1;
Contributor:
This should be final and only initialized in the ctor.

try {
  format = new CuVSVectorsFormat(1, 128, 64, MergeStrategy.NON_TRIVIAL_MERGE);
  setKnnFormat(format);
} catch (LibraryNotFoundException ex) {
Contributor:
This should fail hard. If cuVS is not available, it should throw an exception and not just log a severe error; as written, this throws an NPE later when the knn format is retrieved.

return knnFormat;
}

public void setKnnFormat(KnnVectorsFormat format) {
Contributor:
No setters please: codecs should be unmodifiable! It should initialize in the ctor, or fail if the library cannot be loaded; see the sketch below.
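A hedged sketch of ctor-only initialization; the class name, the FilterCodec delegate, and loadCuVSFormatOrFail() are illustrative stand-ins, not the PR's code:

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene101.Lucene101Codec;

public final class CuVSCodecSketch extends FilterCodec {
  private final KnnVectorsFormat knnFormat; // final: fixed at construction, no setter

  public CuVSCodecSketch() {
    super("CuVSCodecSketch", new Lucene101Codec());
    // Throws immediately if cuVS cannot be loaded, instead of logging a
    // severe error and hitting an NPE when the format is first retrieved.
    this.knnFormat = loadCuVSFormatOrFail();
  }

  @Override
  public KnnVectorsFormat knnVectorsFormat() {
    return knnFormat;
  }

  private static KnnVectorsFormat loadCuVSFormatOrFail() {
    // Placeholder for the actual cuVS format construction done in this PR.
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```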

public class CuVSSegmentFile implements AutoCloseable {
  private final ZipOutputStream zos;

  private Set<String> filesAdded = new HashSet<String>();
Contributor:
this must be final.

public final int cuvsWriterThreads;
public final int intGraphDegree;
public final int graphDegree;
public MergeStrategy mergeStrategy;
Contributor:
This should also be final. And why are this field and the previous ones public?

// protected Logger log = Logger.getLogger(getClass().getName());

IndexInput vectorDataReader = null;
public String fileName = null;
Contributor:
None of these fields should be public, and if they need to be public they should at least be final. I assume they can be package-private or private.


// protected Logger log = Logger.getLogger(getClass().getName());

private List<CagraFieldVectorsWriter> fieldVectorWriters = new ArrayList<>();
Contributor:
check which fields can be final.

*/
public class PerLeafCuVSKnnCollector implements KnnCollector {

public List<ScoreDoc> scoreDocs;
Contributor:
Again, why is all of this public?

/**
* Some Utils used in CuVS integration
*/
public class Util {
Contributor:
This class is internal; make it pkg-private.

@@ -20,6 +20,9 @@
requires org.apache.lucene.core;
requires org.apache.lucene.queries;
requires org.apache.lucene.facet;
requires java.logging;
requires com.nvidia.cuvs;
Contributor:
This makes cuVS a non-optional dependency. I would have expected this feature to be optional, i.e., if cuVS is not present you get a nice error message or something. I guess it somewhat depends on how the cuVS Java API and the native implementation are tied together? That is, can one use or even instantiate types from the cuVS Java API without the native library being present?

@uschindler uschindler (Contributor) commented Jan 20, 2025:

I don't see this as a problem in the sandbox module. The code itself should be written so that the codec can only be used if the native library is there. Currently the code is a bit broken, as it does not fail hard; I'd like to change this (see my review). But if the dependency is on Maven Central, it should work.

Of course we could also make it optional in the module system, but then you would get a CNFE when loading codecs or related classes, which would happen on SPI lookup.

Contributor:

@uschindler Agreed: this feature should be optional. If the Java API is present and loadable without the native library, and it has at least one callable method that we can call and catch, then that is fine. I'll try it out.

Contributor:

The current code catches this, but just logs a warning and then fails later (with an NPE). I argued that the Codec should fail hard, or at least fail when it is actually used.

This must be refactored before release:

  • Allow the Codec to load, but delay initialization of the native library until the codec is actually used to create components.
  • Fail hard when an index with the custom codec is loaded and there's no native access.

Contributor:

Looking at the code, I think the best approach would be the following: create a small static "holder" class (inside the Codec impl) which has a static initializer that loads and sets up the cuVS code. This should fail hard with some LinkageError if the library is not there. The holder class should have getters for the cuVS entry points.

All getter methods that create formats and readers in the Codec delegate to the getters in the holder and pass through any exceptions. In addition, all cuVS classes should be package-private (I mentioned this already); only the codec and essential config classes stay public. A sketch of the holder idea follows below.
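A hedged sketch of the holder idea; the names are illustrative, and ExceptionInInitializerError (a LinkageError subclass) stands in as the concrete error:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;

final class CuVSHolderSketch {
  static final KnnVectorsFormat KNN_FORMAT;

  static {
    try {
      // Load and set up the native cuVS entry points here (placeholder call).
      KNN_FORMAT = createCuVSFormat();
    } catch (Throwable t) {
      // Any failure becomes a hard LinkageError on first use of the holder,
      // so merely loading the Codec class via SPI stays safe and cheap.
      throw new ExceptionInInitializerError(t);
    }
  }

  private CuVSHolderSketch() {}

  // Codec getters delegate here and pass through any initialization error.
  static KnnVectorsFormat knnFormat() {
    return KNN_FORMAT; // class initialization already failed hard if cuVS is missing
  }

  private static KnnVectorsFormat createCuVSFormat() {
    throw new UnsupportedOperationException("illustrative stub");
  }
}
```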

Contributor:

The same for the postings formats and other public components loaded via SPI.
