Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Separate index into individual list arrays for IVF-Flat #1176

Closed
tfeher opened this issue Jan 25, 2023 · 1 comment
Closed

[FEA] Separate index into individual list arrays for IVF-Flat #1176

tfeher opened this issue Jan 25, 2023 · 1 comment
Labels
FAISS feature request New feature or request

Comments

@tfeher
Copy link
Contributor

tfeher commented Jan 25, 2023

Is your feature request related to a problem? Please describe.
Currently both IVF-Flat an IVF-PQ stores its index as a single array of all vectors. This has a drawback when we extend the index: we need to recreate the whole index to insert the new values.

See #1170 (comment) for general background information

Describe the solution you'd like

  1. Store IVF-Flat index in separate array for each cluster
  2. Store IVF-PQ index in separate array for each cluster
  3. Amortize cost of growing the array

Details: IVF-Flat

  • IVF-flat index structure is described

    * Inverted list data [size, dim].
    *
    * The data consists of the dataset rows, grouped by their labels (into clusters/lists).
    * Within each list (cluster), the data is grouped into blocks of `kIndexGroupSize` interleaved
    * vectors. Note, the total index length is slightly larger than the source dataset length,
    * because each cluster is padded by `kIndexGroupSize` elements.
    *
    * Interleaving pattern:
    * within groups of `kIndexGroupSize` rows, the data is interleaved with the block size equal to
    * `veclen * sizeof(T)`. That is, a chunk of `veclen` consecutive components of one row is
    * followed by a chunk of the same size of the next row, and so on.
    *
    * __Example__: veclen = 2, dim = 6, kIndexGroupSize = 32, list_size = 31
    *
    * x[ 0, 0], x[ 0, 1], x[ 1, 0], x[ 1, 1], ... x[14, 0], x[14, 1], x[15, 0], x[15, 1],
    * x[16, 0], x[16, 1], x[17, 0], x[17, 1], ... x[30, 0], x[30, 1], - , - ,
    * x[ 0, 2], x[ 0, 3], x[ 1, 2], x[ 1, 3], ... x[14, 2], x[14, 3], x[15, 2], x[15, 3],
    * x[16, 2], x[16, 3], x[17, 2], x[17, 3], ... x[30, 2], x[30, 3], - , - ,
    * x[ 0, 4], x[ 0, 5], x[ 1, 4], x[ 1, 5], ... x[14, 4], x[14, 5], x[15, 4], x[15, 5],
    * x[16, 4], x[16, 5], x[17, 4], x[17, 5], ... x[30, 4], x[30, 5], - , - ,

  • The concatenated list of vectors are stored in data_.

  • We maintain the corresponding indices_ from the original dataset .

  • The clusters are described with sizes and offsets

Proposed changes IVF-Flat
Build

  • The dataset could be stored in a std::vector<device_mdarray<T, extent_2d<IdxT>, row_major>>, where each cluster would have a separate mdarray to store its vectors.
  • Vectors are added to the dataset in the extend method.
  • Instead of calculating new offsets, and moving old data for all clusters, we need to move the cluster data only for the culsters that get new vectors.
  • If we store the clusters in an mdarray that have some empty rows at the end as reserve capacity, then the previous step is not needed, the new vectors can be copied directly after the existing ones (next step by build_index_kernel).
  • build_index_kernel is called on the lists that are getting extended.

Search

Additional steps:

Amortize costs of index growth

  • If needed, then we probably want a user parameter that would control how much extra space do we allocate
@tfeher tfeher added the feature request New feature or request label Jan 25, 2023
@tfeher tfeher changed the title [FEA] Separate index into individual list arrays for IVF-Flat and IVF-PQ [FEA] Separate index into individual list arrays for IVF-Flat Jan 25, 2023
@cjnolet cjnolet added the FAISS label Jan 30, 2023
@achirkin
Copy link
Contributor

Resolved via #1271

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FAISS feature request New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants