
perf: in-register lookup table & SIMD for 4bit PQ #3178

Merged
merged 30 commits into lancedb:main on Dec 5, 2024

Conversation

BubbleCal
Contributor

@BubbleCal BubbleCal commented Nov 26, 2024

4-bit PQ is 3x faster than before:

16000,l2,PQ=96x4,DIM=1536
                        time:   [187.17 µs 187.95 µs 188.52 µs]
                        change: [-65.789% -65.641% -65.520%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,cosine,PQ=96x4,DIM=1536
                        time:   [214.16 µs 214.52 µs 214.89 µs]
                        change: [-62.748% -62.594% -62.442%] (p = 0.00 < 0.10)
                        Performance has improved.

16000,dot,PQ=96x4,DIM=1536
                        time:   [190.12 µs 191.27 µs 192.22 µs]
                        change: [-65.496% -65.303% -65.086%] (p = 0.00 < 0.10)
                        Performance has improved.

Posting the 8-bit PQ results here for comparison; in short, 4-bit PQ is about 2x faster with the same index params:

compute_distances: 16000,l2,PQ=96,DIM=1536
                        time:   [405.11 µs 405.72 µs 406.92 µs]
                        change: [-0.2844% +0.1588% +0.6035%] (p = 0.50 > 0.10)
                        No change in performance detected.

compute_distances: 16000,cosine,PQ=96,DIM=1536
                        time:   [419.98 µs 421.05 µs 421.99 µs]
                        change: [-0.2540% +0.1098% +0.4928%] (p = 0.59 > 0.10)
                        No change in performance detected.

compute_distances: 16000,dot,PQ=96,DIM=1536
                        time:   [432.08 µs 433.63 µs 435.69 µs]
                        change: [-25.522% -25.243% -24.938%] (p = 0.00 < 0.10)
                        Performance has improved.
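For readers unfamiliar with the trick: with 4-bit PQ each sub-vector has only 16 centroids, so its distance table fits in a single 128-bit register and lookups can become byte shuffles instead of memory gathers. Below is a scalar sketch of the lookup only (a hypothetical helper, not the PR's SIMD code), assuming a flattened distance table with 16 entries per sub-vector and two 4-bit codes packed per byte:

```rust
// Scalar sketch of 4-bit PQ distance computation for one vector
// (hypothetical helper, not the PR's SIMD code). Each byte of `codes`
// packs two 4-bit centroid indices; `distance_table` holds 16 f32
// distances per sub-vector, flattened.
fn compute_pq_distance_scalar(codes: &[u8], distance_table: &[f32], num_sub_vectors: usize) -> f32 {
    debug_assert_eq!(codes.len() * 2, num_sub_vectors);
    let mut dist = 0.0_f32;
    for (i, &byte) in codes.iter().enumerate() {
        let lo = (byte & 0x0F) as usize; // code for sub-vector 2*i
        let hi = (byte >> 4) as usize;   // code for sub-vector 2*i + 1
        dist += distance_table[(2 * i) * 16 + lo];
        dist += distance_table[(2 * i + 1) * 16 + hi];
    }
    dist
}
```

The SIMD version in the PR performs the same table lookups 16 lanes at a time with an in-register shuffle on the quantized (u8) table.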

Signed-off-by: BubbleCal <[email protected]>
@codecov-commenter

codecov-commenter commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 57.07071% with 170 lines in your changes missing coverage. Please review.

Project coverage is 78.51%. Comparing base (6e84834) to head (5fc527c).

Files with missing lines | Patch % | Lines
rust/lance-linalg/src/simd/u8.rs | 47.35% | 169 Missing ⚠️
rust/lance-index/src/vector/pq/storage.rs | 66.66% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3178      +/-   ##
==========================================
- Coverage   78.62%   78.51%   -0.12%     
==========================================
  Files         243      244       +1     
  Lines       82889    83213     +324     
  Branches    82889    83213     +324     
==========================================
+ Hits        65170    65331     +161     
- Misses      14933    15099     +166     
+ Partials     2786     2783       -3     
Flag | Coverage Δ
unittests | 78.51% <57.07%> (-0.12%) ⬇️


@BubbleCal BubbleCal marked this pull request as ready for review November 28, 2024 08:15
Signed-off-by: BubbleCal <[email protected]>
// let qmax = distance_table
// .chunks(NUM_CENTROIDS)
// .tuple_windows()
// .map(|(a, b)| {
Contributor
Delete these?

Contributor Author
fixed

distances
}

// Quantize the distance table to u8
// returns quantized_distance_table
// used for only 4bit PQ so num_centroids must be 16
Contributor
Can you add a comment about what this returns?

Contributor Author
fixed
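For context on what this quantization does: the f32 distance table is mapped into u8 range with a min/scale pair so that lookups fit in byte lanes. A minimal sketch under assumed names (the function name and return shape here are hypothetical; the PR's actual code may differ):

```rust
// Hypothetical sketch: quantize an f32 distance table to u8.
// Returns (quantized table, min, scale) so a caller can recover
// approximate distances as d ≈ min + (q as f32) * scale.
fn quantize_distance_table(distance_table: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = distance_table.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = distance_table.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 255.0;
    let quantized = distance_table
        .iter()
        .map(|&d| if scale > 0.0 { ((d - min) / scale).round() as u8 } else { 0 })
        .collect();
    (quantized, min, scale)
}
```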

@@ -278,7 +294,7 @@ mod tests {
let pq_codes = Vec::from_iter((0..num_vectors * num_sub_vectors).map(|v| v as u8));
let pq_codes = UInt8Array::from_iter_values(pq_codes);
let transposed_codes = transpose(&pq_codes, num_vectors, num_sub_vectors);
-        let distances = compute_l2_distance(
+        let distances = compute_pq_distance(
Contributor
We don't use dot anymore?

Contributor Author
compute_l2_distance and compute_dot_distance are the same, so we keep only one;
the difference is in building the distance table.

#[derive(Clone, Copy)]
pub struct u8x16(pub __m128i);

/// 16 of 32-bit `f32` values. Use 512-bit SIMD if possible.
Contributor
This is copied from simd/f32?

Contributor Author
fixed

}

#[inline]
pub fn right_shift_4(self) -> Self {
Contributor
Is this API compatible with portable_simd?

Contributor Author
I didn't see a bit-shifting operation in portable_simd.
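For reference, the operation under discussion shifts every byte lane right by 4 to expose the high nibble (the second 4-bit PQ code packed in each byte). A scalar equivalent, as a sketch (the name right_shift_4_scalar is made up here):

```rust
// Scalar model of u8x16::right_shift_4: shift each byte right by 4,
// extracting the high nibble of every lane.
fn right_shift_4_scalar(v: [u8; 16]) -> [u8; 16] {
    v.map(|b| b >> 4)
}
```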

}
#[cfg(target_arch = "loongarch64")]
unsafe {
Self(lasx_xvfrsh_b(self.0, 4))
Contributor
Huh, you figured out how to use loongarch?

Contributor Author
lol no, there's no way to test it; let me remove all the loongarch code for u8x16.

unsafe {
Self(vandq_u8(self.0, vdupq_n_u8(mask)))
}
#[cfg(target_arch = "loongarch64")]
Contributor
Shall we always have a fallback non-SIMD route?

Contributor Author
added

fn reduce_min(&self) -> u8 {
#[cfg(target_arch = "x86_64")]
unsafe {
let low = _mm_and_si128(self.0, _mm_set1_epi8(0xFF_u8 as i8));
Contributor
This is only using SSE? Curious whether there are AVX2 intrinsics to make this even faster.

Contributor Author
I didn't find an AVX2 intrinsic for this, but reduce_min is not used for now.

Contributor
Let's just delete reduce_sum and reduce_min if they are not used.

#[cfg(target_arch = "aarch64")]
unsafe {
Self(vminq_u8(self.0, rhs.0))
}
Contributor
Let's always have a fallback route.

Contributor Author
added

#[case(4, DistanceType::L2, 0.9)]
#[case(4, DistanceType::Cosine, 0.9)]
#[case(4, DistanceType::Dot, 0.8)]
#[case(4, DistanceType::L2, 0.75)]
Contributor
You mentioned the new algorithm can have decent recall? Should we bump this up?

Contributor Author
fixed

let num_vectors = code.len() * 2 / num_sub_vectors;
let mut distances = vec![0.0_f32; num_vectors];
// store the distances in u32 to avoid overflow
Contributor
nit: f32

Contributor Author
fixed

Signed-off-by: BubbleCal <[email protected]>
debug_assert_eq!(dist_table.as_array(), origin_dist_table.as_array());

// compute next distances
let next_indices = vec_indices.right_shift_4();
Contributor
Should we just implement Shr for u8x16? This interface looks weird.

fn shuffle(&self, indices: u8x16) -> Self {
#[cfg(target_arch = "x86_64")]
unsafe {
Self(_mm_shuffle_epi8(self.0, indices.0))
Contributor
Contributor Author
Yeah, I believe so.
I chose u8x16 because it fits in an ARM register.

Contributor
We could implement u8x32 as two 128-bit registers on ARM? In general this can speed up older x86 CPUs a lot, similar to https://github.com/lancedb/lance/blob/main/rust/lance-linalg/src/simd/f32.rs#L462

Contributor Author
Doable, but let's do it in the next PR? It would also require changing the computation logic, because there are only 16 centroids in distance_table for each sub-vector.
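As background on why the shuffle works as a lookup table: _mm_shuffle_epi8 treats one register as a 16-entry byte table and the other as per-lane indices. A scalar model of that behavior (a sketch; it assumes the indices are 4-bit PQ codes in 0..=15, since on x86 a lane is zeroed when the index byte's high bit is set):

```rust
// Scalar model of the byte shuffle used as an in-register lookup table:
// each index lane selects one of the 16 table bytes (indices assumed
// to be masked to the low nibble, matching 4-bit PQ codes).
fn shuffle_lookup(table: [u8; 16], indices: [u8; 16]) -> [u8; 16] {
    indices.map(|i| table[(i & 0x0F) as usize])
}
```

This is the "in-register lookup table" of the PR title: one shuffle performs 16 distance-table lookups at once.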

.into_iter()
.zip(distances.iter_mut())
.for_each(|(d, sum)| {
*sum += d as f32;
Contributor
Is this reduce_sum?

Contributor Author
No, sum is distances[i].
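To make that distinction concrete: the loop adds each lane's u8 partial distance into the corresponding vector's running f32 total, rather than horizontally reducing one register. A scalar sketch (hypothetical helper name):

```rust
// Scalar sketch of the accumulation step: each u8 partial distance is
// added to its vector's f32 accumulator (distances[i]); this is a
// per-vector sum across sub-vectors, not a horizontal reduce.
fn accumulate(partial: &[u8], distances: &mut [f32]) {
    for (d, sum) in partial.iter().zip(distances.iter_mut()) {
        *sum += *d as f32;
    }
}
```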

Signed-off-by: BubbleCal <[email protected]>
@github-actions github-actions bot added the python label Dec 5, 2024
Signed-off-by: BubbleCal <[email protected]>
@BubbleCal BubbleCal requested a review from eddyxu December 5, 2024 07:01
@BubbleCal BubbleCal merged commit 6c7b9fd into lancedb:main Dec 5, 2024
26 checks passed
4 participants