[BUG] Large memory overhead while subsampling vectors for IVF methods #2141

Closed
tfeher opened this issue Jan 31, 2024 · 1 comment
tfeher commented Jan 31, 2024

Describe the bug

The IVF methods use random subsampling to create a training set for k-means clustering.

The random sampling algorithm allocates several temporary buffers:

rmm::device_uvector<WeightsT> expWts(len, stream);
rmm::device_uvector<WeightsT> sortedWts(len, stream);
rmm::device_uvector<IdxT> inIdx(len, stream);
rmm::device_uvector<IdxT> outIdxBuff(len, stream);

On top of this, sortPairs also allocates temporary buffers of size O(len).

Here len is the number of vectors in the whole dataset. When indexing DEEP-1B, len = 1e9, and the temporary space becomes prohibitive.
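To put a number on this, here is a rough estimate of the memory held by the four buffers listed above. The concrete types are assumptions for illustration (float for WeightsT, a 64-bit integer for IdxT), and sampling_workspace_bytes is a hypothetical helper, not part of the library. The scratch space used by sortPairs is not included.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes held by the four temporary buffers above,
// each of length len. WeightsT and IdxT mirror the template parameters
// in the snippet; the concrete types used below are assumptions.
template <typename WeightsT, typename IdxT>
constexpr std::uint64_t sampling_workspace_bytes(std::uint64_t len)
{
  // expWts and sortedWts hold WeightsT; inIdx and outIdxBuff hold IdxT.
  return 2 * len * sizeof(WeightsT) + 2 * len * sizeof(IdxT);
}

// DEEP-1B has len = 1e9 vectors. With float weights and 64-bit indices:
// 2 * 1e9 * 4 B + 2 * 1e9 * 8 B = 24e9 B, before any sortPairs scratch space.
static_assert(sampling_workspace_bytes<float, std::int64_t>(1000000000ULL) ==
                24000000000ULL,
              "four O(len) buffers cost about 24 GB for DEEP-1B");
```

Under these assumptions the buffers alone take about 24 GB, regardless of how small the requested training set is.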

Steps/Code to reproduce bug
Run the DEEP-1B test with IVF-PQ and a subsample ratio of 10. It will run out of memory (OOM).

Expected behavior
Subsample vectors with minimal memory overhead.

@tfeher tfeher added bug Something isn't working Vector Search labels Jan 31, 2024
@tfeher tfeher self-assigned this Jan 31, 2024
tfeher commented Jan 31, 2024

A quick fix for this problem is to revert the following PRs:

I am investigating whether we have an alternative quick fix to reduce the memory overhead for the subsampling.

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2024
The random sampling of IVF methods was reverted (#2144) due to large memory utilization (#2141).

This PR improves the memory consumption of subsampling: it is now O(n_train), where n_train is the size of the subsampled dataset.

This PR adds the following new APIs:
- random::excess_sampling (TODO: may simply be named sample_without_replacement)
- matrix::sample_rows
- matrix::gather for host input matrix
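As an illustration of how sampling without replacement can stay within O(n_train) memory, here is a minimal host-side sketch of the idea the name excess_sampling suggests: draw somewhat more than n_train random indices, drop duplicates, and keep n_train of the survivors. This is an assumption about the approach, not the code from the PR. The function sample_rows_excess, the 10% oversampling margin, and the use of std::mt19937_64 are all hypothetical choices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical host-side sketch (not the RAFT API): pick n_train distinct
// row indices from [0, n_rows) while holding only O(n_train) indices.
std::vector<std::int64_t> sample_rows_excess(std::int64_t n_rows,
                                             std::int64_t n_train,
                                             std::uint64_t seed)
{
  assert(n_train >= 0 && n_train <= n_rows);
  std::vector<std::int64_t> idx;
  if (n_train == 0) { return idx; }

  std::mt19937_64 rng(seed);
  std::uniform_int_distribution<std::int64_t> pick(0, n_rows - 1);

  // Oversample so that duplicates rarely leave us short. The margin
  // (10% plus a small constant) is an assumption, not a tuned value.
  const auto n_target = static_cast<std::size_t>(n_train);
  const std::size_t n_excess = n_target + n_target / 10 + 16;
  idx.reserve(n_excess);

  while (true) {
    // Top up to n_excess candidates, then sort and remove duplicates.
    while (idx.size() < n_excess) { idx.push_back(pick(rng)); }
    std::sort(idx.begin(), idx.end());
    idx.erase(std::unique(idx.begin(), idx.end()), idx.end());
    if (idx.size() >= n_target) { break; }
  }

  // The unique indices are sorted, so shuffle before truncating;
  // otherwise the result would be biased toward small row indices.
  std::shuffle(idx.begin(), idx.end(), rng);
  idx.resize(n_target);
  return idx;
}
```

For example, sample_rows_excess(1000000, 1000, 42) returns 1000 distinct indices below 1000000 while allocating space for roughly 1100 candidates. The sketch is only efficient when n_train is much smaller than n_rows; as n_train approaches n_rows the retry loop needs many rounds to collect enough distinct indices.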

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - Artem M. Chirkin (https://github.com/achirkin)
  - Ben Frederickson (https://github.com/benfred)

URL: #2155