# Faiss on the GPU
Some of the most useful indexes are implemented on the GPU. The speedup is 5x-10x on a single GPU, and multiple GPUs are supported with near-linear speedup.
GPU support is via CUDA. The machine should contain at least one CUDA-capable device of minimum compute capability 3.5 (Kepler and up, K40 included). Warp shuffles (CC 3.0+) and read-only texture caching via `ld.nc`/`__ldg` (CC 3.5+) are the more exotic hardware features used. float16 support requires compiling with CUDA 7.5+.
The GPU `Index`-es can accommodate both host and device pointers as input to `add()` and `search()`. If the inputs to `add()` and `search()` are already on the same GPU as the index, then no copies are performed and execution is fastest. Otherwise, a CPU-to-GPU copy (or a cross-device copy if the input resides on a different GPU than the index) is performed, with a copy back for any results when done (e.g., for `search()`).
The GPU `Index`-es should also be usable as drop-in replacements for anything that expects a CPU `Index`; copies are handled as needed. For example, a `GpuIndexFlatL2` can be used in place of an `IndexFlatL2` and will provide significant acceleration. The `GpuIndexIVFFlat` and `GpuIndexIVFPQ` deliberately do not support the full APIs of their CPU counterparts `IndexIVF`, `IndexIVFFlat` and `IndexIVFPQ` (e.g., direct manipulation of the inverted lists), but most features of the base `Index` API are supported, as well as all of the important indexing features.
Converting from/to GPU is enabled with `index_gpu_to_cpu`, `index_cpu_to_gpu` and `index_cpu_to_gpu_multiple`. The latter two take an optional `GpuClonerOptions` object that can be used to adjust how the GPU stores the objects. The defaults are appropriate when memory is not a constraint; the fields can be adjusted when memory is scarce.
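For example, a round trip between CPU and GPU in Python might look like the following sketch (dimension and dataset are made up for illustration):

```python
import numpy as np
import faiss

d = 64                                  # vector dimension
xb = np.random.rand(100000, d).astype('float32')

# Build a CPU index, then move it to GPU 0.
index_cpu = faiss.IndexFlatL2(d)
res = faiss.StandardGpuResources()      # resources for one device
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu)

index_gpu.add(xb)                       # executes on the GPU
D, I = index_gpu.search(xb[:5], 10)     # results return as numpy arrays

# Move the populated index back to the CPU, e.g. for serialization.
index_cpu = faiss.index_gpu_to_cpu(index_gpu)
```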
The index types `IndexFlat`, `IndexIVFFlat` and `IndexIVFPQ` are implemented on the GPU as `GpuIndexFlat`, `GpuIndexIVFFlat` and `GpuIndexIVFPQ`. In addition to their normal arguments, they take a resource object as input, along with index storage configuration options and float16/float32 configuration parameters.
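For example, a GPU index can be constructed directly with a resource object and a configuration object; a minimal sketch (the device number is arbitrary):

```python
import faiss

d = 128
res = faiss.StandardGpuResources()

cfg = faiss.GpuIndexFlatConfig()
cfg.device = 0                       # which GPU holds the index
index = faiss.GpuIndexFlatL2(res, d, cfg)
```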
The resource object contains the needed resources for each GPU in use, including an allocation of temporary scratch space (by default, about 2 GB on a 12 GB GPU), cuBLAS handles and CUDA streams. The temporary scratch space provided via the `GpuResources` object is important for speed and for avoiding unnecessary GPU/GPU and GPU/CPU synchronizations via `cudaFree`. All faiss GPU code strives to be allocation-free on the GPU, assuming temporary state (intermediate results of calculations and the like) fits into the scratch space. The temporary space reservation can be set to a fraction or an absolute amount of GPU memory, even to 0 bytes, though if it is too small you may notice slowdowns due to `cudaMalloc` and `cudaFree`. The high-water mark of scratch space usage can be queried from the resources object, so the reservation can be adjusted to actual needs. All GPU work is ordered with respect to the stream specified in the `GpuResources` object, not necessarily the default (aka null) stream.
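For instance, the reservation can be adjusted on a `StandardGpuResources` object; a sketch (the 512 MiB figure is only an example):

```python
import faiss

res = faiss.StandardGpuResources()

# Cap the temporary scratch space at 512 MiB instead of the default.
res.setTempMemory(512 * 1024 * 1024)

# Or give up the reservation entirely; expect cudaMalloc/cudaFree slowdowns.
# res.noTempMemory()
```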
There are 4 different options for storing user indices, the integer values associated with vectors in the database:
- `INDICES_64_BIT` stores the user-provided indices (for indexed vectors) as 64-bit signed integers on the GPU itself.
- `INDICES_32_BIT` stores the indices as 32-bit signed integers on the GPU, for the case where one knows that all indices fit in that range. The end-user API will still upcast the values to `long`.
- `INDICES_CPU` avoids storing any index information on the GPU, and instead records it on the CPU. This is useful when GPU space is at a premium, but involves GPU/CPU copies and lookups.
- `INDICES_IVF` is useful only for composing a GPU index with other indices that can interpret an inverted file ID and offset.
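A minimal sketch of selecting an indices storage option through the index configuration object (sizes are illustrative):

```python
import faiss

d, nlist = 64, 1024
res = faiss.StandardGpuResources()

cfg = faiss.GpuIndexIVFFlatConfig()
cfg.indicesOptions = faiss.INDICES_CPU   # keep user indices in host memory
index = faiss.GpuIndexIVFFlat(res, d, nlist, faiss.METRIC_L2, cfg)
```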
float16 or float32 precision options affect the storage of data in the database (as with `GpuIndexIVFFlat`), or the storage and processing of intermediate data (as with `GpuIndexIVFPQ`). These options are available on all supported GPUs, from Kepler (K40) to Maxwell and up. float16 is often significantly faster and reduces GPU memory consumption, at a possible cost in accuracy: recall@N seems mostly unaffected by float16, but the estimated distances are affected. GPU architectures with native float16 (aka CUDA `half`) math support, like Pascal, are taken advantage of to provide additional speed. `GpuIndexFlat` in float16 mode uses `Hgemm` on Pascal; `SgemmEx` is used on other hardware.
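As an illustration, the float16 options are set through the per-index configuration objects; a sketch (dimensions and list counts are made up):

```python
import faiss

res = faiss.StandardGpuResources()

# float16 storage of the database vectors in a flat GPU index:
cfg_flat = faiss.GpuIndexFlatConfig()
cfg_flat.useFloat16 = True
index_flat = faiss.GpuIndexFlatL2(res, 128, cfg_flat)

# float16 intermediate lookup tables in a GPU IVFPQ index:
cfg_pq = faiss.GpuIndexIVFPQConfig()
cfg_pq.useFloat16LookupTables = True
index_pq = faiss.GpuIndexIVFPQ(res, 128, 1024, 32, 8, faiss.METRIC_L2, cfg_pq)
```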
`k` and `nprobe` must be <= 1024 for all indices.
For `GpuIndexIVFPQ`, the allowed code sizes per encoded vector are 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 48, 56, 64 and 96 bytes.
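Putting these constraints together, a small end-to-end sketch (sizes are illustrative; in recent faiss versions `nprobe` is set as an attribute, and both it and `k` stay within the 1024 limit):

```python
import numpy as np
import faiss

d, nlist, M, nbits = 64, 1024, 16, 8    # 16 bytes per code: an allowed size
xb = np.random.rand(200000, d).astype('float32')

res = faiss.StandardGpuResources()
index = faiss.GpuIndexIVFPQ(res, d, nlist, M, nbits, faiss.METRIC_L2)
index.train(xb)
index.add(xb)

index.nprobe = 64                       # must be <= 1024
D, I = index.search(xb[:100], 100)      # k must be <= 1024 as well
```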
Memory is generally a scarcer resource on the GPU, so there are a few things to take care of when using the GPU indexes (a combined sketch follows this list):
- `GpuIndexIVFPQ` with 56 bytes per code or more requires use of the float16 IVFPQ mode due to shared memory limitations (e.g., 48 x 2^8 x sizeof(float) is 49152 bytes);
- precomputed tables for `GpuIndexIVFPQ` may take up a substantial amount of memory. If you see cudaMalloc errors, disable precomputed tables;
- the indices corresponding to inverted file entries can be stored on the CPU rather than on the GPU: use `indices_options = INDICES_CPU`;
- when memory is really tight, geometric reallocation of the inverted lists can overflow the memory. To avoid this (and generally to increase add speed), call `reserveVecs` on a `GpuIndexIVFPQ` or `GpuIndexIVFFlat` if you know how big the index will be;
- the `StandardGpuResources` object reserves 18% of GPU memory for temporary calculation space by default. If this is too much, the size can be reduced, at a possible cost in speed.
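A sketch combining several of these memory-saving measures, assuming a recent faiss Python build (`reserveMemory` is the pre-sizing call exposed by GPU IVF indexes; sizes are illustrative):

```python
import faiss

d, nlist = 64, 4096
res = faiss.StandardGpuResources()
res.setTempMemory(256 * 1024 * 1024)      # shrink the temporary reservation

cfg = faiss.GpuIndexIVFPQConfig()
cfg.usePrecomputedTables = False          # avoid the large precomputed tables
cfg.indicesOptions = faiss.INDICES_CPU    # keep user indices in host RAM
cfg.useFloat16LookupTables = True         # required at >= 56 bytes per code

index = faiss.GpuIndexIVFPQ(res, d, nlist, 64, 8, faiss.METRIC_L2, cfg)
index.reserveMemory(10 * 1000 * 1000)     # pre-size lists for ~10M vectors
```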
When converting from a CPU index with `index_cpu_to_gpu`, the default `GpuClonerOptions` object is tuned to maximize speed at the cost of memory usage.
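To favor memory over speed instead, the defaults can be overridden before cloning; a sketch (the index file name is hypothetical):

```python
import faiss

res = faiss.StandardGpuResources()

co = faiss.GpuClonerOptions()
co.useFloat16 = True              # store data in float16 to save memory
co.usePrecomputed = False         # drop IVFPQ precomputed tables
co.indicesOptions = faiss.INDICES_CPU

# Hypothetical pre-built CPU index loaded from disk.
index_cpu = faiss.read_index("populated_ivfpq.index")
index_gpu = faiss.index_cpu_to_gpu(res, 0, index_cpu, co)
```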
In Python, the GPU classes are available if GPU support is compiled in. See `benchs/bench_gpu_sift1m.py` for a usage example.
Multiple device support can be obtained by:
- copying the dataset over several GPUs and splitting searches over those datasets with an `IndexProxy`;
- splitting the dataset over the GPUs with an `IndexShards`.
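Both modes are exposed in Python through the `index_cpu_to_all_gpus` helper; a minimal sketch:

```python
import faiss

index_cpu = faiss.IndexFlatL2(64)

# Replicate the index on every visible GPU and split searches across them:
index_replicated = faiss.index_cpu_to_all_gpus(index_cpu)

# Or shard, so that each of the N GPUs holds 1/N of the database:
co = faiss.GpuMultipleClonerOptions()
co.shard = True
index_sharded = faiss.index_cpu_to_all_gpus(index_cpu, co)
```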
GPU faiss is typically 5x-10x faster than the corresponding CPU implementation on a single GPU (see benchmarks and performance information). If multiple GPUs are available in a machine, near-linear speedup over a single GPU (6-7x with 8 GPUs) can be obtained by replicating the index over them. Sharding instead of replicating (i.e., each of N GPUs holds 1/N of the database) also yields a speedup, but a more sub-linear one.
All indices (except possibly `GpuIndexFlat*`) are limited mostly by memory bandwidth (global or shared) rather than by arithmetic throughput. As on the CPU, performance is best with batched queries rather than queries of size 1. Performance suffers with larger `k` nearest-neighbor selection values, especially above 512 or so.