faiss::gpu::runMatrixMult failure #34

Closed
hellolovetiger opened this issue Mar 10, 2017 · 37 comments

@hellolovetiger

The full log:
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted (core dumped)

I have successfully run demo_ivfpq_indexing_gpu, so I believe Faiss was installed correctly.

@wickedfoo
Contributor

Is it possible that you ran out of GPU memory?

@wickedfoo
Contributor

What were you trying to run?

@hellolovetiger
Author

hellolovetiger commented Mar 10, 2017

train data shape: (2000000, 1000)
base data shape: (20000000, 1000)
query data shape: (1000000, 1000)
data type: float32

my code:

index = faiss.index_factory(d, "OPQ16_512,IVF4096,PQ16")
co = faiss.GpuClonerOptions()
co.usePrecomputed = False
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happens here

My GPU has 8 GB of memory. I also tried the benchmark bench_gpu_sift1m.py and got the same error.

@wickedfoo
Contributor

index.add(xb) # error happens here

Instead of giving it all of the (20000000, 1000) array at once, try giving it in chunks of (10000, 1000) or so.
This is an issue that will be fixed at some point. The GPU side is less friendly unless you chunk the input beforehand, but eventually we'll handle that automatically.
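
A minimal sketch of the chunked add suggested above. The helper names iter_chunks and add_in_chunks and the default chunk size are illustrative choices, not part of Faiss; the only assumption about the index is that it has an add method that accepts a slice of rows.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) row ranges that cover [0, n_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def add_in_chunks(index, xb, chunk_size=10000):
    """Call index.add on consecutive row slices of xb.

    add_in_chunks is a hypothetical helper: `index` is anything with an
    add(rows) method, and `xb` is anything that supports len() and slicing.
    """
    for start, stop in iter_chunks(len(xb), chunk_size):
        # Only this slice is handed to the index at a time.
        index.add(xb[start:stop])
```

With a numpy.memmap input, each xb[start:stop] slice covers only that range of rows. Depending on how the file was written, converting the slice with np.ascontiguousarray(chunk, dtype='float32') before calling add may also be needed.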

@wickedfoo
Contributor

Only GpuIndexFlat* handles passing large amounts of data all at once for add or search at present.

@hellolovetiger
Author

I see. Actually, I used numpy.memmap to load the data. Could you give me some guidance on how to chunk the input data so that it can be passed to index.add?

@hellolovetiger
Author

hellolovetiger commented Mar 10, 2017

Also, I notice that my GPU memory occupation in training is always about 20%. That's strange.

@hellolovetiger
Author

hellolovetiger commented Mar 10, 2017

I just made some changes to the benchmark script bench_gpu_sift1m.py and still get the same error. Adding only the first 10000 vectors does not work either, so it does not seem to be a memory issue. Maybe something is wrong with cuBLAS. By the way, do you plan to publish an official Docker image to avoid problems caused by installation?

#################################################################
#  Approximate search experiment
#################################################################

print "============ Approximate search"

index = faiss.index_factory(d, "IVF4096,PQ64")

# faster, uses more memory
# index = faiss.index_factory(d, "IVF16384,Flat")

co = faiss.GpuClonerOptions()

# here we are using a 64-byte PQ, so we must set the lookup tables to
# 16 bit float (this is due to the limited temporary memory).
co.useFloat16 = True

index = faiss.index_cpu_to_gpu(res, 0, index, co)

print "train"

index.train(xt)

print "add vectors to index"

index.add(xb[:10000])

@mdouze
Contributor

mdouze commented Mar 10, 2017

Hi
Note that the code above will not work for 1000-dim data (because 1000 is not a multiple of 64).
We do not have plans for a Docker image.
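
The divisibility constraint mentioned above can be checked before building an index. A small sketch, where pq_subvector_dim is a hypothetical helper name and the rule encoded is just the one stated in this comment (the dimension must be a multiple of the number of PQ sub-quantizers):

```python
def pq_subvector_dim(d, m):
    """Return the per-sub-quantizer dimension d // m.

    Raises ValueError if m does not divide d evenly.
    """
    if d % m != 0:
        raise ValueError("dimension %d is not a multiple of %d" % (d, m))
    return d // m
```

For the 128-dimensional SIFT1M vectors, pq_subvector_dim(128, 64) succeeds, while pq_subvector_dim(1000, 64) raises. In the earlier "OPQ16_512,IVF4096,PQ16" factory string, the OPQ step first reduces the vectors to 512 dimensions, which 16 divides evenly.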

@hellolovetiger
Author

hellolovetiger commented Mar 10, 2017

Hi mdouze. The code above is from bench_gpu_sift1m.py. I used the data from http://corpus-texmex.irisa.fr/, following the instructions in https://github.com/facebookresearch/faiss/tree/master/benchs. I just wanted to check whether the benchmark code works. It turns out to fail with the same error as my own code.

@mdouze
Contributor

mdouze commented Mar 10, 2017

Ok, so this is the exact script bench_gpu_sift1m.py applied to the SIFT1M dataset and not your 20M*1000-dim dataset, correct?
On which type of GPU are you running this?

@hellolovetiger
Author

Yes for your first question.
My GPU is GeForce GTX 1080

@mdouze
Contributor

mdouze commented Mar 10, 2017

It could be the same bug as issue #8. Unfortunately we do not have the hardware to reproduce it, so we would be grateful if you could narrow down the error for us:

  • Does it still crash in the add?
  • If yes, could you add fewer vectors until it does not crash any more?
  • Could you set co.usePrecomputed = False and test again?
  • Could you reduce the two numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?

@wickedfoo
Contributor

You can also try running cuda-memcheck on the bench_gpu_sift1m.py to see if anything gets printed out that does not look like the following:

========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaPointerGetAttributes. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x126239]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e44]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]
=========
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x2eea03]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x11de53]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x16e65]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d066]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1d1e2]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x1889f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x194e5]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xb504c]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x2332f]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0x260d0]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf8cb]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b35]
=========     Host Frame:test/demo_ivfpq_indexing_gpu [0xf415]

Another thing is to try resetting the GPU via nvidia-smi and trying again.

Also, you could try and investigate which CUDA shared libraries it is trying to load, to see if there is a mismatch if you have multiple CUDA SDK versions installed.

@wickedfoo
Contributor

Also, I notice that my GPU memory occupation in training is always about 20%. That's strange.

Faiss GPU reserves about 18% of available GPU memory up front for scratch space. This amount is controllable via StandardGpuResources, but it will run slower if you decrease it by a lot (due to cudaMalloc/cudaFree overhead). 1-2 GB of scratch space seems to be appropriate for most workloads.
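
For a sense of scale, the figures quoted above can be worked out directly. This is only arithmetic on the 18% figure from this comment, and scratch_bytes is a hypothetical helper. To change the reservation, StandardGpuResources has a setTempMemory method that takes a size in bytes; check that the version you have installed provides it.

```python
GIB = 1024 ** 3


def scratch_bytes(total_gpu_bytes, fraction=0.18):
    """Bytes reserved up front if `fraction` of GPU memory is used as scratch space."""
    return int(total_gpu_bytes * fraction)


# An 8 GiB card at roughly 18% gives about 1.4 GiB of scratch space,
# which falls inside the 1-2 GB range mentioned above.
reserved = scratch_bytes(8 * GIB)
print("%.2f GiB" % (float(reserved) / GIB))
```

This is consistent with the roughly 20% memory occupation observed during training.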

@hellolovetiger
Author

For your questions:

  • could you add fewer vectors until it does not crash any more?
    It always crashes, no matter how small the number of vectors is.
  • could you set co.usePrecomputed = False and test again?
    It works. But it doesn't work for my own code. I will keep trying.
  • could you reduce the 2 numbers in "IVF4096,PQ64" by powers of two until it does not crash any more?
    It still fails whenever co.usePrecomputed = True is set.

Some other information:
ldd gpu/test/demo_ivfpq_indexing_gpu ==>

linux-vdso.so.1 =>  (0x00007ffcc0066000)
libopenblas.so.0 => /usr/lib/libopenblas.so.0 (0x00007fa709dfd000)
liblapack.so.3 => /usr/lib/liblapack.so.3 (0x00007fa709661000)
libcublas.so.8.0 => /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcublas.so.8.0 (0x00007fa706cb1000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fa706aa9000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa70688b000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa706687000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa706383000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa70607d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fa705e6e000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa705c58000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa705893000)
/lib64/ld-linux-x86-64.so.2 (0x00007fa70b606000)
libblas.so.3 => /usr/lib/libblas.so.3 (0x00007fa70408a000)
libgfortran.so.3 => /usr/lib/x86_64-linux-gnu/libgfortran.so.3 (0x00007fa703d70000)
libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007fa703b34000)

@hellolovetiger
Author

@wickedfoo

  • here is cuda-memcheck result (setting co.usePrecomputed = True):
============ Approximate search
train
add vectors to index
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted========= Error: process didn't terminate successfully
========= Internal error (7)
========= No CUDA-MEMCHECK results found
  • Resetting the GPU via nvidia-smi did not help.
  • Only one CUDA SDK is installed: V8.0.44

@wickedfoo
Contributor

Are you compiling with clang or gcc?

@hellolovetiger
Author

gcc

@mdouze mdouze added the bug label Mar 12, 2017
@mdouze
Contributor

mdouze commented Mar 12, 2017

I believe this is related to the GPU, which is similar to issue #8

@yhpku

yhpku commented Mar 13, 2017

I hit the same problem. My GPU is a TITAN X. I want to index 1000000 512-dimensional vectors using faiss.GpuIndexFlatL2, and this triggers the error. If I cut the number from 1000000 to 500000, it works. The maximum number of vectors seems to be about 500000, because 60*0000 vectors also cause this problem. The following is my code:
d = 1000000 # dimension
nb = 512 # database size
nq = 1000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02

res = faiss.StandardGpuResources()

index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
# index = faiss.IndexFlatL2(d)   # build the index
print index.is_trained
index.add(xb)

print index.ntotal
print (' build the index time %f ms' % ((time.time() - time_1) * 1000))

time_1 = time.time()
k = 1                       # we want to see the single nearest neighbor
D, I = index.search(xc[:1], k)
print (' search time %f ms' % ((time.time() - time_1) * 1000))

@hellolovetiger
Author

@yhpku, thanks. I tried both a GTX 1080 and a Titan X, and both failed. Yours seems to be caused by running out of memory (OOM). IndexFlatL2 loads all the data at once for add or search, so 500000 may be the upper limit for a Titan X. You can try IndexIVFPQ, which stores the vectors with lossy compression.

@mdouze
Contributor

mdouze commented Mar 13, 2017

Hi @yhpku, in the code above you use 512 vectors in 1M dimensions. Is this what you want?

@yhpku

yhpku commented Mar 13, 2017

@mdouze, no, it is not. I mean 1M vectors in 512 dimensions.

@mdouze
Contributor

mdouze commented Mar 13, 2017

@hellolovetiger, Titan X should work. Does bench_gpu_sift1m.py crash on Titan X? What error?

@mdouze
Contributor

mdouze commented Mar 13, 2017

@yhpku, please fix your code then.

@hellolovetiger
Author

On Titan X,
For demo_ivfpq_indexing_gpu, the error is:

Adding the vectors to the index
Segmentation fault (core dumped)

For bench_gpu_sift1m.py,

============ Approximate search
train
WARNING clustering 100000 points to 4096 centroids: please provide at least 159744 training points
add vectors to index
Segmentation fault (core dumped)

The error goes away when co.usePrecomputed = False is set.

For my own code:

#train data shape: (2000000, 1000)
#base data shape: (20000000, 1000)
#query data shape: (1000000, 1000)
#data type: float32

index = faiss.index_factory(d, "OPQ16_512,IVF1024,PQ16")
co = faiss.GpuClonerOptions()
co.useFloat16 = False
co.usePrecomputed = False
co.indicesOptions = faiss.INDICES_CPU
index = faiss.index_cpu_to_gpu(res, 0, index, co)

index.train(xt)
del xt

index.add(xb)  # error happens here

The error is:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 5669326848 B, highwater 5669326848 B)
Faiss assertion err == CUBLAS_STATUS_SUCCESS failed in void faiss::gpu::runMatrixMult(faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, faiss::gpu::Tensor<T, 2, true>&, bool, float, float, cublasHandle_t, cudaStream_t) [with T = float; cublasHandle_t = cublasContext*; cudaStream_t = CUstream_st*] at utils/MatrixMult.cu:141Aborted (core dumped)

When I cut the base data from 20M to 3M, the error becomes:

WARN: increase temp memory to avoid cudaMalloc, or decrease query/add size (alloc 6144000000 B, highwater 6144000000 B)
Faiss assertion err == cudaSuccess failed in char* faiss::gpu::StackDeviceMemory::Stack::getAlloc(size_t, cudaStream_t) at utils/StackDeviceMemory.cpp:71Aborted (core dumped)

So it seems to have become a memory issue.
By the way, will GpuIndexIVFPQ still run into memory issues if the set of base vectors is too big?

@wickedfoo
Contributor

@hellolovetiger,

You are running out of GPU memory. Do not try to add so many vectors at once: 3M * 1000 * sizeof(float) is 12 GB.

Try adding the vectors in chunks of 10000 to 50000 instead.

@wickedfoo
Contributor

After adding to the index, the vectors will then be compressed via PQ, and then you can add more. But, before compression, each vector takes 4000 bytes of memory ( = 1000 * sizeof(float)), not 16 bytes (PQ16).
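
The arithmetic in the comments above can be reproduced as a quick check. The helper names below are made up for illustration; the sizes assume float32 input (4 bytes per component) and count only the PQ codes, ignoring IDs and other index overhead.

```python
FLOAT32_BYTES = 4


def raw_bytes(n, d):
    """Bytes needed to hold n uncompressed float32 vectors of dimension d."""
    return n * d * FLOAT32_BYTES


def pq_code_bytes(n, code_size):
    """Bytes needed for n PQ codes of code_size bytes each (codes only)."""
    return n * code_size


# Compare the uncompressed size with the PQ16 code size for several batch sizes.
for n in (10000, 50000, 3000000, 20000000):
    print("%9d vectors: raw %6.2f GB, PQ16 codes %7.2f MB"
          % (n, raw_bytes(n, 1000) / 1e9, pq_code_bytes(n, 16) / 1e6))
```

A 10000-vector chunk of 1000-dimensional data is 40 MB before compression and a 50000-vector chunk is 200 MB, whereas 3M vectors need 12 GB, more than an 8 GB card holds.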

@wickedfoo
Contributor

Problems with attempting to add large CPU resident vectors all at once will be fixed internally at some point. But in the meantime you will have to incrementally add them.

@hellolovetiger
Author

Got it. Thanks, @wickedfoo. It would be good to add this information to the wiki. 😃

@yhpku

yhpku commented Mar 14, 2017

@mdouze, I am sorry, that was a typing error. The actual code is as follows. The error output is: "Faiss assertion err == cudaSuccess failed in faiss::gpu::StackDeviceMemory::Stack::~Stack() at utils/StackDeviceMemory.cpp:54Aborted (core dumped)".

time_1 = time.time()
d = 512                           # dimension
nb = 700000                      # database size
nq = 1000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.
xc = xb[0:1000, :].copy()
xc[:, 0] += 0.02
res = faiss.StandardGpuResources()
index = faiss.GpuIndexFlatL2(res, 0, d, False)   # build the GPU index
print index.is_trained
index.add(xb)
print index.ntotal
print (' build the index time %f ms' % ((time.time() - time_1) * 1000))
time_1 = time.time()
D, I = index.search(xc[:2], 1)
print (D)
print (I)
print (' search time %f ms' % ((time.time() - time_1) * 1000))

@mdouze
Contributor

mdouze commented Mar 29, 2017

Closing this issue now, because the discussion has drifted. Please open a new one if this is still blocking you.

@mdouze mdouze closed this as completed Mar 29, 2017
@anty-zhang

anty-zhang commented Nov 29, 2019

Recently, I started to use Faiss and ran into the same problem. I read many issues and tried almost all the solutions mentioned above, but none of them worked.

Finally, I found that nvcc and nvidia-smi reported different CUDA versions, so I changed the nvcc version to match nvidia-smi, and it worked. So note that the nvcc version must be consistent with the nvidia-smi version.

my mismatched nvcc and nvidia-smi

If you hit the same problem when compiling Faiss, this may help you.

choose the best CUDA Toolkit version is here.
the difference between nvcc and nvidia-smi is here.

my env
my makefile
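
A minimal sketch of how the two versions mentioned above could be compared. It assumes that `nvcc --version` prints a line containing `release X.Y` and that the `nvidia-smi` header contains `CUDA Version: X.Y` (older drivers do not print this field); the function names are made up for illustration. Note that nvidia-smi reports the highest CUDA version the driver supports rather than the installed toolkit.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def parse_smi_cuda_version(text):
    """Extract 'X.Y' from the 'CUDA Version: X.Y' field of `nvidia-smi` output."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", text)
    return match.group(1) if match else None


def run_tool(argv):
    """Return the tool's combined output, or an empty string if it is not on PATH."""
    if shutil.which(argv[0]) is None:
        return ""
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return result.stdout


def report():
    """Print both versions and flag a mismatch (hypothetical convenience wrapper)."""
    toolkit = parse_nvcc_release(run_tool(["nvcc", "--version"]))
    driver = parse_smi_cuda_version(run_tool(["nvidia-smi"]))
    print("nvcc toolkit release:", toolkit)
    print("nvidia-smi CUDA version:", driver)
    if toolkit and driver and toolkit != driver:
        print("Versions differ; this is the situation described above.")
```

Calling report() on the build machine prints both values, or None for a tool that is missing or does not print the expected field.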

@zhangxinyu-xyz

Recently, I started to use Faiss and ran into the same problem. I read many issues and tried almost all the solutions mentioned above, but none of them worked.

Finally, I found that nvcc and nvidia-smi reported different CUDA versions, so I changed the nvcc version to match nvidia-smi, and it worked. So note that the nvcc version must be consistent with the nvidia-smi version.

my mismatched nvcc and nvidia-smi

If you hit the same problem when compiling Faiss, this may help you.

choose the best CUDA Toolkit version is here. the difference between nvcc and nvidia-smi is here.

my env my makefile

You are lucky. Unfortunately, it did not work when I tried to use faiss-gpu on CUDA 11.1.

@tigert1998

conda install -c conda-forge faiss-gpu
This fixed it for me.

@sayfulloh11

conda install -c conda-forge faiss-gpu

Hi,
I tried the same command and it resolved my problem. Thank you.

mqnfred pushed a commit to mqnfred/faiss that referenced this issue Oct 23, 2023
…ex-macro

Use impl_concurrent_index macro