Subsampling for IVF-PQ codebook generation #2052

Merged 25 commits on Jan 25, 2024

abc99lr (Contributor) commented Dec 8, 2023

This PR addresses #1901 by subsampling the input dataset for PQ codebook training to reduce the runtime.

Currently, a similar strategy is applied to the per_cluster method, but not to the default per_subset method. This PR closes that gap. As in the subsampling mechanism of the per_cluster method, we pick at minimum 256*max(pq_book_size, pq_dim) input rows for training each codebook.

size_t big_enough = 256ul * std::max<size_t>(index.pq_book_size(), index.pq_dim());
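
To make the threshold concrete, here is a small standalone sketch of the expression above. It assumes a PQ codebook has 2^pq_bits entries; the helper names `pq_book_size` and `min_codebook_training_rows` are illustrative and are not part of the RAFT API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Assumption: a PQ codebook has 2^pq_bits entries.
constexpr std::size_t pq_book_size(std::uint32_t pq_bits)
{
  return std::size_t{1} << pq_bits;
}

// Hypothetical helper mirroring the `big_enough` expression:
// 256 * max(pq_book_size, pq_dim) rows per codebook.
inline std::size_t min_codebook_training_rows(std::uint32_t pq_bits, std::uint32_t pq_dim)
{
  return std::size_t{256} * std::max<std::size_t>(pq_book_size(pq_bits), pq_dim);
}
```

Under that assumption, pq_bits=8 with pq_dim=96 gives 256*256 = 65536 rows, and pq_bits=5 with pq_dim=96 gives 256*96 = 24576 rows.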

The following performance numbers were generated using the Deep-100M dataset. After subsampling, search time and accuracy are not impacted (within ±5%), except for one case where I saw a 9% drop in search performance (using a 10K batch for search). More extensive benchmarking across datasets seems to be needed for justification.

| Dataset | n_iter | n_list | pq_bits | pq_dim | ratio | Original time (s) | Subsampling time (s) | Speedup |
|---|---|---|---|---|---|---|---|---|
| Deep-100M | 25 | 50000 | 4 | 96 | 10 | 129 | 89.5 | 1.44 |
| Deep-100M | 25 | 50000 | 5 | 96 | 10 | 128 | 89.4 | 1.43 |
| Deep-100M | 25 | 50000 | 6 | 96 | 10 | 131 | 90 | 1.46 |
| Deep-100M | 25 | 50000 | 7 | 96 | 10 | 129 | 91.1 | 1.42 |
| Deep-100M | 25 | 50000 | 8 | 96 | 10 | 149 | 93.4 | 1.60 |

Note that after subsampling, PQ codebook generation is no longer a bottleneck in IVF-PQ index building, so further optimization of PQ codebook generation seems unnecessary. Although we could in theory apply the custom kernel approach (#2050) with subsampling, my early tests show that the current GEMM approach performs better than the custom kernel after subsampling.

Using multiple streams could improve performance further by overlapping the kernels for different pq_dim, given that the kernels are small after subsampling and may not fully utilize the GPU. However, as mentioned above, since the entire PQ codebook generation is already fast, this optimization may not be worthwhile.

TODO

  • Benchmark the performance/accuracy impacts on multiple datasets

abc99lr requested a review from a team as a code owner December 8, 2023 22:45

copy-pr-bot bot commented Dec 8, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot added the cpp label Dec 8, 2023
tfeher added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Dec 13, 2023

tfeher (Contributor) commented Dec 13, 2023

/ok to test

tfeher (Contributor) commented Dec 13, 2023

Thanks Rui for the PR! @achirkin could you have a look at the proposed subsampling step?

github-actions bot added the python label Jan 6, 2024

abc99lr (Contributor, Author) commented Jan 9, 2024

Tested the performance of this PR on top of the first-level subsampling PR (#2077) on the Deep-100M dataset with different build parameters. All tests were run on an A100-80GB-PCIe GPU.

pq_codebook_ratio controls the amount of sampling. The dataset fraction used for codebook training is 1/pq_codebook_ratio, similar to the definition of the current ratio variable that controls the amount of the first-level subsampling.
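
As a rough illustration of how the two ratios compose, the sketch below computes the row counts at each level. It assumes plain integer division and ignores any lower bound on the sample size; `subsample_sizes` and `compute_subsample_sizes` are hypothetical names, not RAFT API.

```cpp
#include <cstddef>

struct subsample_sizes {
  std::size_t trainset_rows;  // rows left after the first-level subsampling (1/ratio)
  std::size_t codebook_rows;  // rows used for PQ codebook training (1/pq_codebook_ratio of the above)
};

// Hypothetical helper: apply the first-level ratio, then pq_codebook_ratio.
inline subsample_sizes compute_subsample_sizes(std::size_t n_rows,
                                               std::size_t ratio,
                                               std::size_t pq_codebook_ratio)
{
  const std::size_t trainset_rows = n_rows / ratio;
  const std::size_t codebook_rows = trainset_rows / pq_codebook_ratio;
  return {trainset_rows, codebook_rows};
}
```

For a dataset of 100,000,000 rows with ratio=10, pq_codebook_ratio=5 would leave 2,000,000 rows for codebook training, and pq_codebook_ratio=10 would leave 1,000,000.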

Here is the table for build performance. With codebook subsampling, we see about a 30%-50% speedup, depending on the amount of subsampling the user chooses. Here, the 30%-50% speedup is achieved by using 10%-20% of the input (after the initial subsampling) for codebook training.

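| n_iter | n_list | pq_bits | pq_codebook_ratio | pq_dim | ratio | GPU build time (s) |
|---|---|---|---|---|---|---|
| 25 | 50k | 5 | 1 | 96 | 10 | 130.548 |
| 25 | 50k | 5 | 5 | 96 | 10 | 99.0929 |
| 25 | 50k | 5 | 10 | 96 | 10 | 95.2036 |
| 25 | 50k | 8 | 1 | 96 | 10 | 155.4 |
| 25 | 50k | 8 | 5 | 96 | 10 | 107.101 |
| 25 | 50k | 8 | 10 | 96 | 10 | 101.726 |
| 25 | 50k | 5 | 1 | 64 | 10 | 132.206 |
| 25 | 50k | 5 | 5 | 64 | 10 | 99.123 |
| 25 | 50k | 5 | 10 | 64 | 10 | 95.1221 |
| 25 | 50k | 8 | 1 | 64 | 10 | 141.418 |
| 25 | 50k | 8 | 5 | 64 | 10 | 104.241 |
| 25 | 50k | 8 | 10 | 64 | 10 | 100.206 |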

The search performance is shown in the tables below. The maximum recall difference compared to no codebook subsampling is about 0.38%, and it is a slight recall increase with PQ codebook subsampling, which suggests it is more likely run-to-run variation. I am going to rerun the tests to quantify the run-to-run variation (and will update this PR afterwards). All search results below are without refinement.

n_list=50K, pq_dim=96, pq_bits=5, n_iter=25

| n_probe | recall, pq_codebook_ratio=1 | recall, pq_codebook_ratio=5 | recall, pq_codebook_ratio=10 | recall_diff vs no codebook subsampling, pq_codebook_ratio=1 | recall_diff vs no codebook subsampling, pq_codebook_ratio=5 | recall_diff vs no codebook subsampling, pq_codebook_ratio=10 | QPS, pq_codebook_ratio=1 | QPS, pq_codebook_ratio=5 | QPS, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.80905 | 0.80814 | 0.80889 | 0.00% | -0.11% | -0.02% | 368557 | 367210 | 367712 |
| 30 | 0.84183 | 0.84217 | 0.84103 | 0.00% | 0.04% | -0.10% | 271789 | 269725 | 270331 |
| 40 | 0.85914 | 0.8602 | 0.85879 | 0.00% | 0.12% | -0.04% | 221708 | 220423 | 220543 |
| 50 | 0.86985 | 0.87108 | 0.86936 | 0.00% | 0.14% | -0.06% | 188799 | 188308 | 188017 |
| 100 | 0.89127 | 0.89194 | 0.89099 | 0.00% | 0.08% | -0.03% | 110754 | 110733 | 110534 |
| 200 | 0.90081 | 0.90054 | 0.89973 | 0.00% | -0.03% | -0.12% | 63032.5 | 62878 | 62976.1 |
| 1000 | 0.90511 | 0.90504 | 0.90417 | 0.00% | -0.01% | -0.10% | 16745.9 | 16686.8 | 16644.8 |
| 2000 | 0.90528 | 0.9052 | 0.90434 | 0.00% | -0.01% | -0.10% | 9055.15 | 9013.77 | 9035.94 |
| 5000 | 0.90532 | 0.90525 | 0.90437 | 0.00% | -0.01% | -0.10% | 3967.44 | 3950.08 | 3959.11 |
| 10000 | 0.90532 | 0.90525 | 0.90438 | 0.00% | -0.01% | -0.10% | 2106.13 | 2100.26 | 2106.36 |

n_list=50K, pq_dim=64, pq_bits=5, n_iter=25

| n_probe | recall, pq_codebook_ratio=1 | recall, pq_codebook_ratio=5 | recall, pq_codebook_ratio=10 | recall_diff vs no codebook subsampling, pq_codebook_ratio=1 | recall_diff vs no codebook subsampling, pq_codebook_ratio=5 | recall_diff vs no codebook subsampling, pq_codebook_ratio=10 | QPS, pq_codebook_ratio=1 | QPS, pq_codebook_ratio=5 | QPS, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.67986 | 0.68094 | 0.67951 | 0.00% | 0.16% | -0.05% | 454869 | 453663 | 453699 |
| 30 | 0.70131 | 0.70281 | 0.70159 | 0.00% | 0.21% | 0.04% | 343085 | 342176 | 341031 |
| 40 | 0.71249 | 0.71401 | 0.71279 | 0.00% | 0.21% | 0.04% | 283827 | 282973 | 283161 |
| 50 | 0.71891 | 0.72164 | 0.71963 | 0.00% | 0.38% | 0.10% | 244791 | 244737 | 243534 |
| 100 | 0.73223 | 0.73455 | 0.73301 | 0.00% | 0.32% | 0.11% | 146721 | 146617 | 146339 |
| 200 | 0.73783 | 0.73965 | 0.73801 | 0.00% | 0.25% | 0.02% | 83077.8 | 83123.6 | 83018 |
| 1000 | 0.74076 | 0.74232 | 0.74088 | 0.00% | 0.21% | 0.02% | 21814.6 | 21781.5 | 21796.6 |
| 2000 | 0.74085 | 0.74246 | 0.74092 | 0.00% | 0.22% | 0.01% | 11644.9 | 11627.7 | 11612.7 |
| 5000 | 0.74087 | 0.74246 | 0.74093 | 0.00% | 0.21% | 0.01% | 5041.84 | 5044.43 | 5023.59 |
| 10000 | 0.74088 | 0.74246 | 0.74093 | 0.00% | 0.21% | 0.01% | 2664.48 | 2666.91 | 2656.13 |

n_list=50K, pq_dim=96, pq_bits=8, n_iter=25

| n_probe | recall, pq_codebook_ratio=1 | recall, pq_codebook_ratio=5 | recall, pq_codebook_ratio=10 | recall_diff vs no codebook subsampling, pq_codebook_ratio=1 | recall_diff vs no codebook subsampling, pq_codebook_ratio=5 | recall_diff vs no codebook subsampling, pq_codebook_ratio=10 | QPS, pq_codebook_ratio=1 | QPS, pq_codebook_ratio=5 | QPS, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.84522 | 0.84603 | 0.84618 | 0.00% | 0.10% | 0.11% | 149813 | 149708 | 149599 |
| 30 | 0.88647 | 0.8877 | 0.88793 | 0.00% | 0.14% | 0.16% | 105652 | 105561 | 105556 |
| 40 | 0.90938 | 0.9106 | 0.91088 | 0.00% | 0.13% | 0.16% | 82747.5 | 82683 | 82658.3 |
| 50 | 0.92431 | 0.9247 | 0.92551 | 0.00% | 0.04% | 0.13% | 68229.1 | 68197.3 | 68164.9 |
| 100 | 0.95335 | 0.95436 | 0.95347 | 0.00% | 0.11% | 0.01% | 36994.9 | 36963.2 | 36967.3 |
| 200 | 0.96703 | 0.96733 | 0.96637 | 0.00% | 0.03% | -0.07% | 19682.4 | 19673.2 | 19671.5 |
| 1000 | 0.97381 | 0.97389 | 0.97334 | 0.00% | 0.01% | -0.05% | 4520.13 | 4510.28 | 4514.8 |
| 2000 | 0.97397 | 0.97416 | 0.97366 | 0.00% | 0.02% | -0.03% | 2350.32 | 2352.48 | 2347.96 |
| 5000 | 0.97405 | 0.97423 | 0.97373 | 0.00% | 0.02% | -0.03% | 982.055 | 980.334 | 980.967 |
| 10000 | 0.97406 | 0.97423 | 0.97373 | 0.00% | 0.02% | -0.03% | 505.058 | 505.138 | 504.427 |

n_list=50K, pq_dim=64, pq_bits=8, n_iter=25

| n_probe | recall, pq_codebook_ratio=1 | recall, pq_codebook_ratio=5 | recall, pq_codebook_ratio=10 | recall_diff vs no codebook subsampling, pq_codebook_ratio=1 | recall_diff vs no codebook subsampling, pq_codebook_ratio=5 | recall_diff vs no codebook subsampling, pq_codebook_ratio=10 | QPS, pq_codebook_ratio=1 | QPS, pq_codebook_ratio=5 | QPS, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.79791 | 0.79931 | 0.80015 | 0.00% | 0.18% | 0.28% | 199479 | 199347 | 199316 |
| 30 | 0.83106 | 0.83291 | 0.83217 | 0.00% | 0.22% | 0.13% | 146485 | 146500 | 146392 |
| 40 | 0.84826 | 0.8495 | 0.84924 | 0.00% | 0.15% | 0.12% | 115829 | 115823 | 115772 |
| 50 | 0.85815 | 0.86038 | 0.85999 | 0.00% | 0.26% | 0.21% | 96181.5 | 96139.9 | 96116.3 |
| 100 | 0.87928 | 0.88015 | 0.88063 | 0.00% | 0.10% | 0.15% | 53281.1 | 53268.3 | 53269.6 |
| 200 | 0.88845 | 0.88935 | 0.88959 | 0.00% | 0.10% | 0.13% | 28562.7 | 28556.1 | 28563.8 |
| 1000 | 0.89299 | 0.89405 | 0.89416 | 0.00% | 0.12% | 0.13% | 6651.5 | 6650.96 | 6653.26 |
| 2000 | 0.89313 | 0.89415 | 0.89435 | 0.00% | 0.11% | 0.14% | 3453.13 | 3463.49 | 3455.71 |
| 5000 | 0.89319 | 0.89416 | 0.89438 | 0.00% | 0.11% | 0.13% | 1438.96 | 1436.86 | 1436.74 |
| 10000 | 0.89319 | 0.89416 | 0.89438 | 0.00% | 0.11% | 0.13% | 735.696 | 735.859 | 735.35 |

tfeher (Contributor) commented Jan 12, 2024

Thanks @abc99lr for the measurements! The additional subsampling for PQ codebooks gives a nice improvement in IVF-PQ build time, and I am excited about this change!

In many cases we see less than 0.05% diff in recall, and that looks perfect. But there are also cases where the diff is larger than 0.1%; in those cases we would like to understand whether it is due to run-to-run variation. I am running additional tests with PR #2077, and we will compare the diffs to that.

abc99lr (Contributor, Author) commented Jan 13, 2024

Updates on run-to-run variance: I reran the code (both build and search) three times and found that, even without this PR, the run-to-run variance is 0.37%. Please see the following tables for the recall difference compared to the first run.

The tests below use the Deep-100M dataset on an A100-80GB-PCIe GPU with n_list=50K and n_iter=25.

2nd run vs 1st run:

| n_probe | pq_dim=96, pq_bits=5, pq_codebook_ratio=1 | pq_dim=96, pq_bits=5, pq_codebook_ratio=5 | pq_dim=96, pq_bits=5, pq_codebook_ratio=10 | pq_dim=64, pq_bits=5, pq_codebook_ratio=1 | pq_dim=64, pq_bits=5, pq_codebook_ratio=5 | pq_dim=64, pq_bits=5, pq_codebook_ratio=10 | pq_dim=96, pq_bits=8, pq_codebook_ratio=1 | pq_dim=96, pq_bits=8, pq_codebook_ratio=5 | pq_dim=96, pq_bits=8, pq_codebook_ratio=10 | pq_dim=64, pq_bits=8, pq_codebook_ratio=1 | pq_dim=64, pq_bits=8, pq_codebook_ratio=5 | pq_dim=64, pq_bits=8, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.11% | 0.13% | 0.05% | -0.05% | -0.31% | -0.24% | 0.17% | 0.11% | -0.04% | 0.02% | -0.06% | -0.02% |
| 30 | 0.02% | -0.03% | 0.22% | -0.16% | -0.30% | -0.39% | 0.10% | 0.06% | -0.01% | 0.05% | -0.17% | 0.05% |
| 40 | 0.02% | 0.03% | 0.15% | -0.20% | -0.37% | -0.41% | 0.11% | 0.10% | -0.04% | 0.01% | -0.05% | 0.07% |
| 50 | 0.02% | 0.10% | 0.18% | -0.23% | -0.47% | -0.47% | 0.03% | 0.05% | -0.05% | 0.00% | -0.17% | 0.09% |
| 100 | -0.12% | 0.16% | 0.11% | -0.29% | -0.45% | -0.49% | 0.08% | -0.03% | -0.01% | -0.08% | -0.05% | 0.03% |
| 200 | -0.13% | 0.23% | 0.14% | -0.30% | -0.33% | -0.42% | -0.02% | 0.03% | 0.07% | -0.06% | -0.09% | -0.06% |
| 1000 | -0.11% | 0.25% | 0.09% | -0.31% | -0.36% | -0.42% | -0.03% | 0.04% | 0.05% | -0.08% | -0.10% | -0.08% |
| 2000 | -0.11% | 0.25% | 0.09% | -0.30% | -0.36% | -0.41% | -0.03% | 0.04% | 0.05% | -0.08% | -0.09% | -0.08% |
| 5000 | -0.11% | 0.25% | 0.10% | -0.31% | -0.36% | -0.41% | -0.03% | 0.04% | 0.05% | -0.08% | -0.09% | -0.08% |
| 10000 | -0.11% | 0.25% | 0.09% | -0.31% | -0.36% | -0.41% | -0.03% | 0.04% | 0.04% | -0.08% | -0.09% | -0.08% |

3rd run vs 1st run:

| n_probe | pq_dim=96, pq_bits=5, pq_codebook_ratio=1 | pq_dim=96, pq_bits=5, pq_codebook_ratio=5 | pq_dim=96, pq_bits=5, pq_codebook_ratio=10 | pq_dim=64, pq_bits=5, pq_codebook_ratio=1 | pq_dim=64, pq_bits=5, pq_codebook_ratio=5 | pq_dim=64, pq_bits=5, pq_codebook_ratio=10 | pq_dim=96, pq_bits=8, pq_codebook_ratio=1 | pq_dim=96, pq_bits=8, pq_codebook_ratio=5 | pq_dim=96, pq_bits=8, pq_codebook_ratio=10 | pq_dim=64, pq_bits=8, pq_codebook_ratio=1 | pq_dim=64, pq_bits=8, pq_codebook_ratio=5 | pq_dim=64, pq_bits=8, pq_codebook_ratio=10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | -0.31% | -0.08% | -0.09% | -0.10% | -0.27% | 0.03% | 0.14% | 0.15% | 0.16% | 0.22% | -0.01% | -0.18% |
| 30 | -0.15% | -0.16% | 0.00% | -0.14% | -0.32% | -0.06% | 0.22% | 0.00% | 0.02% | 0.14% | -0.17% | -0.08% |
| 40 | -0.12% | -0.13% | 0.04% | -0.14% | -0.35% | -0.11% | 0.23% | -0.05% | -0.04% | 0.31% | 0.01% | 0.00% |
| 50 | -0.09% | -0.07% | 0.10% | -0.03% | -0.41% | -0.14% | 0.15% | 0.02% | -0.11% | 0.37% | -0.02% | -0.08% |
| 100 | -0.08% | -0.05% | 0.02% | -0.14% | -0.51% | -0.26% | 0.17% | -0.07% | 0.02% | 0.25% | 0.11% | -0.12% |
| 200 | -0.16% | 0.00% | 0.06% | -0.17% | -0.44% | -0.20% | 0.10% | -0.04% | 0.03% | 0.14% | 0.12% | -0.20% |
| 1000 | -0.10% | -0.01% | 0.08% | -0.17% | -0.45% | -0.22% | 0.08% | -0.01% | 0.04% | 0.14% | 0.10% | -0.22% |
| 2000 | -0.10% | -0.01% | 0.07% | -0.17% | -0.46% | -0.21% | 0.09% | -0.01% | 0.04% | 0.14% | 0.10% | -0.21% |
| 5000 | -0.10% | -0.01% | 0.07% | -0.17% | -0.45% | -0.21% | 0.09% | -0.01% | 0.04% | 0.14% | 0.10% | -0.20% |
| 10000 | -0.10% | -0.01% | 0.07% | -0.17% | -0.45% | -0.21% | 0.09% | -0.01% | 0.04% | 0.14% | 0.10% | -0.20% |

I think the 0.38% difference we saw with this PR is acceptable, if we can see similar run-to-run variance with #2077.

The results also show that the run-to-run variance is higher when pq_dim=64 and pq_bits=5, compared to other combinations.
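
One simple way to summarize a column of these run-to-run tables is its largest absolute difference. The sketch below is only an illustration of that summary; `max_abs_diff` is a hypothetical helper and not the script used to produce the numbers in this comment.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Largest absolute value in a list of recall differences (in percent).
// Returns 0.0 for an empty list.
inline double max_abs_diff(const std::vector<double>& diffs_percent)
{
  double worst = 0.0;
  for (double d : diffs_percent) {
    worst = std::max(worst, std::fabs(d));
  }
  return worst;
}
```

Applied to the pq_dim=64, pq_bits=5, pq_codebook_ratio=1 column of the 2nd-run table (-0.05, -0.16, -0.20, -0.23, -0.29, -0.30, -0.31, -0.30, -0.31, -0.31), this gives 0.31.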

abc99lr changed the title from "[WIP] Subsampling for per_subset method of IVF-PQ codebook generation" to "[REVIEW] Subsampling for per_subset method of IVF-PQ codebook generation" Jan 13, 2024

tfeher (Contributor) left a comment

Thanks Rui for the update. The results look great. I have seen similar recall variations, and I think that looks good as well. Just a few small things.

cpp/bench/ann/src/raft/raft_ann_bench_param_parser.h (outdated, resolved)
cpp/include/raft/neighbors/ivf_pq_types.hpp (outdated, resolved)
python/pylibraft/pylibraft/neighbors/ivf_pq/ivf_pq.pyx (outdated, resolved)
abc99lr requested review from achirkin and tfeher January 24, 2024 19:19

tfeher (Contributor) left a comment

Thanks Rui for the update! The PR looks good to me!

cjnolet (Member) commented Jan 24, 2024

/ok to test

cjnolet (Member) commented Jan 25, 2024

/ok to test

cjnolet (Member) commented Jan 25, 2024

/ok to test

cjnolet (Member) commented Jan 25, 2024

/merge

abc99lr (Contributor, Author) commented Jan 25, 2024

Hi @achirkin, I think the change you requested has been made. Could you please approve this PR?

cjnolet dismissed achirkin’s stale review January 25, 2024 04:02

Dismissing to get this in before code freeze. Rui has addressed the request.

cjnolet (Member) commented Jan 25, 2024

/ok to test

cjnolet removed request for a team and achirkin January 25, 2024 04:03

cjnolet (Member) commented Jan 25, 2024

/merge

rapids-bot bot merged commit e272176 into rapidsai:branch-24.02 Jan 25, 2024
61 checks passed

achirkin (Contributor) commented

Sorry for being late, but yes, LGTM! :)

cjnolet added a commit to cjnolet/raft that referenced this pull request Jan 31, 2024
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this pull request Aug 1, 2024
Random sampling of training set for IVF methods was reverted in rapidsai/raft#2144 due to the large memory usage of the subsample method.

Since then, PR rapidsai/raft#2155 has implemented a new random sampling method with improved memory utilization.  Using that we can now enable random sampling of IVF methods (rapidsai/raft#2052 and rapidsai/raft#2077).

Random subsampling has measurable overhead for IVF-Flat, therefore it is only enabled for IVF-PQ.

Authors:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #122
divyegala pushed a commit to divyegala/cuvs that referenced this pull request Aug 7, 2024
Labels: cpp, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python