fix: filter out null values when sampling for index training #3404

wjones127 · 2025-01-21T23:57:13Z

We were not filtering out null values when sampling. Because we often call `array.values()` on Arrow arrays, which ignores the null bitmap, we were often silently treating nulls as zeros (or possibly undefined values). The only thing that caught these nulls was an assertion. However, the residualization that occurs with L2 and Cosine often transformed these values, so the null information was lost before the assertion ran, which is why the bug got past previous unit tests.
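To make the failure mode concrete, here is a minimal, std-only sketch. `NullableF32Array` is a hypothetical stand-in for a nullable Arrow array (it is not the Arrow API): a dense values buffer plus a separate validity mask. Reading the raw buffer treats the null slots as zeros and skews anything computed from it, such as a centroid.

```rust
/// Hypothetical stand-in for a nullable Arrow array (not the real Arrow API):
/// a dense values buffer plus a validity mask (true = valid, false = null).
struct NullableF32Array {
    values: Vec<f32>,
    validity: Vec<bool>,
}

impl NullableF32Array {
    fn from_options(items: &[Option<f32>]) -> Self {
        Self {
            // Null slots still occupy a position in the values buffer.
            // This sketch fills them with 0.0; real buffers may hold anything.
            values: items.iter().map(|v| v.unwrap_or(0.0)).collect(),
            validity: items.iter().map(|v| v.is_some()).collect(),
        }
    }

    /// Raw buffer access: the validity mask is ignored.
    fn values(&self) -> &[f32] {
        &self.values
    }

    /// Number of null slots according to the validity mask.
    fn null_count(&self) -> usize {
        self.validity.iter().filter(|v| !**v).count()
    }

    /// Only the values whose validity bit is set.
    fn non_null_values(&self) -> Vec<f32> {
        self.values
            .iter()
            .zip(&self.validity)
            .filter(|(_, valid)| **valid)
            .map(|(v, _)| *v)
            .collect()
    }
}

/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f32]) -> f32 {
    xs.iter().sum::<f32>() / xs.len() as f32
}
```

With the input `[Some(2.0), None, Some(4.0), None]`, the mean over `values()` is pulled toward zero by the two null slots, while the mean over `non_null_values()` uses only the two real entries.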

This PR adds more assertions validating there aren't nulls, and makes sure the sampling code handles null vectors.

Closes #3402
Closes #3400

codecov-commenter · 2025-01-22T00:59:13Z

Codecov Report

Attention: Patch coverage is 92.57143% with 13 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (58c5e27) to head (59a239d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/utils.rs | 90.09% | 5 Missing and 6 partials ⚠️ |
| rust/lance/src/index/vector/builder.rs | 85.71% | 0 Missing and 2 partials ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3404      +/-   ##
==========================================
+ Coverage   78.81%   78.85%   +0.04%     
==========================================
  Files         250      250              
  Lines       91306    91475     +169     
  Branches    91306    91475     +169     
==========================================
+ Hits        71963    72135     +172     
+ Misses      16390    16379      -11     
- Partials     2953     2961       +8

Flag	Coverage Δ
unittests	`78.85% <92.57%> (+0.04%)`	⬆️

BubbleCal · 2025-01-22T03:14:40Z

rust/lance/src/index/vector/ivf.rs

@@ -2215,6 +2222,77 @@ mod tests {
            .await;
    }

+    #[rstest]
+    #[tokio::test]
+    async fn test_create_index_nulls(


I'm thinking: should we add some tests to verify recall? Then we would know whether flat search handles nulls well.

It might be good to modify this test https://github.com/lancedb/lance/blob/main/rust/lance/src/index/vector/ivf/v2.rs so that half of the rows contain nulls.

I was thinking, is there a way to count rows that are present in the index? I assume if it’s null then we don’t write it to the index file, right?

I have updated the test so it asserts we can use search to get all the non-null vectors back. But I am not getting the results I expect. I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

It seems there is no way to count that now. It could be easy for the v3 index, by counting the number of rows in the storage file.

> I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

@BubbleCal Could you help me make sense of the output of this test? https://github.com/lancedb/lance/actions/runs/12918160780/job/36026117407?pr=3404

I was expecting search to only return non-null rows, but it seems like we are getting some null vectors in the results.

wjones127 · 2025-01-27T19:43:51Z

rust/lance/src/index/vector/utils.rs

+        // Need to filter out null values
+        // Use a scan to collect row ids. Then sample from the row ids. Then do take.
+        let row_addrs = dataset
+            .scan()
+            .filter_expr(datafusion_expr::col(column).is_not_null())
+            .with_row_address()
+            .project::<&str>(&[])?
+            .try_into_batch()
+            .await?;


@westonpace How expensive do you think this query is? This is filtering for non-null vectors and getting the row ids. Do you think there are easy optimizations we could do? If so, I'd like to capture that in a ticket.

westonpace

I'm probably just missing something but I don't see where we are sampling.

westonpace · 2025-01-27T19:50:38Z

rust/lance/src/index/vector/pq.rs

@@ -447,6 +447,7 @@ pub async fn build_pq_model(
        "Finished loading training data in {:02} seconds",
        start.elapsed().as_secs_f32()
    );
+    debug_assert_eq!(training_data.logical_null_count(), 0);


Maybe just `assert_eq!`? This isn't a performance-critical section, so better safe than sorry.
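For context on the suggestion: `debug_assert_eq!` is only checked when debug assertions are enabled (by default, not in release builds), while `assert_eq!` is always checked. A minimal sketch of the two variants follows; the function names are hypothetical and the null count is passed in as a plain integer rather than read from an Arrow array.

```rust
/// Hypothetical helper: checked only when debug assertions are enabled,
/// so a release build would let null training data through silently.
fn check_training_data_debug_only(null_count: usize) {
    debug_assert_eq!(null_count, 0, "training data must not contain nulls");
}

/// Hypothetical helper: always checked, in debug and release builds alike.
fn check_training_data_always(null_count: usize) {
    assert_eq!(null_count, 0, "training data must not contain nulls");
}
```

Since the check runs once per training call rather than once per row, the always-on variant costs very little.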

westonpace · 2025-01-27T19:53:22Z

rust/lance/src/index/vector/utils.rs

+        // Use a scan to collect row ids. Then sample from the row ids. Then do take.
+        let row_addrs = dataset
+            .scan()
+            .filter_expr(datafusion_expr::col(column).is_not_null())


What am I missing? At the moment there is no cheap way to scan if a column is/is not null. So this filter will load the entire column into memory? Why do scan->filter->take and not just scan->filter?

Sorry, forgot the sampling bit. It sounds like the best thing for now is scan->filter + reservoir sampling?
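A minimal sketch of what "scan->filter + reservoir sampling" could look like, assuming the scan is modeled as an iterator of `(row_id, Option<vector>)` pairs. This is illustrative only and does not use the Lance scanner API; `XorShift64` is a hypothetical tiny PRNG included so the example needs no external crates.

```rust
/// Hypothetical tiny xorshift PRNG so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must be non-zero.
        Self(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform index in 0..bound (modulo bias is ignored here).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Reservoir sampling (Algorithm R) over a stream of rows, skipping nulls.
/// Keeps at most `k` non-null rows in memory, in a single pass.
fn reservoir_sample_non_null<I>(
    rows: I,
    k: usize,
    rng: &mut XorShift64,
) -> Vec<(u64, Vec<f32>)>
where
    I: Iterator<Item = (u64, Option<Vec<f32>>)>,
{
    let mut reservoir = Vec::with_capacity(k);
    if k == 0 {
        return reservoir;
    }
    // Number of non-null rows seen so far.
    let mut seen = 0usize;
    for (row_id, vector) in rows {
        // Filter step: null vectors never enter the reservoir.
        let vector = match vector {
            Some(v) => v,
            None => continue,
        };
        seen += 1;
        if reservoir.len() < k {
            reservoir.push((row_id, vector));
        } else {
            // Replace a random slot with probability k / seen.
            let j = rng.below(seen);
            if j < k {
                reservoir[j] = (row_id, vector);
            }
        }
    }
    reservoir
}
```

The trade-off is that the whole column is still read once, but only `k` vectors are ever held in memory and no separate take step is needed.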

westonpace · 2025-01-27T20:02:16Z

rust/lance/src/index/vector/utils.rs

        let projection = dataset.schema().project(&[column])?;
        let batch = dataset.sample(sample_size_hint, &projection).await?;
        info!(
            "Sample training data: retrieved {} rows by sampling",
            batch.num_rows()
        );
        batch
+    } else if num_rows > sample_size_hint && is_nullable {


Hmm...shouldn't you be using sample_size_hint? In this branch there are way more rows than we need. E.g. to train an index on a dataset with 1B rows we need 30K partitions, so sample_size_hint will be ~8M. It looks like you're going to read all 1B vectors. Also, I don't see any randomization.

Did you mean to shuffle row_addrs?

FWIW, in the Python, we do this (lance/python/python/lance/sampler.py, line 137 in 58c5e27: `for shard in shards:`):

1. Create a randomized take stream that will eventually take the entire dataset

2. Pull from the take stream and filter out nulls in-memory until we have sample_size_hint rows.

3. Stop pulling from the take stream

To speed up the "randomized take stream" we actually stream random "contiguous shards" that are sized to give us at least 2K take operations if there are no nulls (IIRC)
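The approach described above can be sketched in Rust as follows. This is a simplified, in-memory illustration, not the code in sampler.py or in this PR: the column is a slice of optional vectors, `shard_size` is a hypothetical parameter standing in for the shard sizing logic, and `next_below` is a small LCG used only to avoid external crates.

```rust
/// Hypothetical LCG step returning a value in 0..bound (bound must be > 0).
fn next_below(state: &mut u64, bound: usize) -> usize {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    ((*state >> 33) as usize) % bound
}

/// Visit contiguous shards in random order, dropping nulls in memory,
/// and stop as soon as `sample_size_hint` non-null vectors are collected.
fn sample_non_null_by_shards(
    column: &[Option<Vec<f32>>],
    sample_size_hint: usize,
    shard_size: usize,
    rng_state: &mut u64,
) -> Vec<Vec<f32>> {
    let shard_size = shard_size.max(1);

    // Start offsets of the contiguous shards that cover the whole column.
    let mut shard_starts: Vec<usize> = (0..column.len()).step_by(shard_size).collect();

    // Fisher-Yates shuffle: the stream would eventually take every shard.
    for i in (1..shard_starts.len()).rev() {
        let j = next_below(rng_state, i + 1);
        shard_starts.swap(i, j);
    }

    let mut sample = Vec::new();
    for start in shard_starts {
        // Stop pulling once we have enough rows.
        if sample.len() >= sample_size_hint {
            break;
        }
        let end = (start + shard_size).min(column.len());
        // `flatten` skips the `None` (null) entries of the shard.
        for vector in column[start..end].iter().flatten() {
            if sample.len() >= sample_size_hint {
                break;
            }
            sample.push(vector.clone());
        }
    }
    sample
}
```

If the column holds fewer non-null vectors than `sample_size_hint`, the loop ends up reading every shard and returns all of them, which matches the "eventually take the entire dataset" behavior described above.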

westonpace

Thanks, random_ranges looks useful. Maybe we can simplify the python impls at some point in the future.

github-actions bot added the bug Something isn't working label Jan 21, 2025

wjones127 marked this pull request as ready for review January 22, 2025 00:25

wjones127 requested review from eddyxu, BubbleCal and westonpace January 22, 2025 00:34

BubbleCal reviewed Jan 22, 2025

BubbleCal reviewed Jan 22, 2025

BubbleCal approved these changes Jan 23, 2025

wjones127 added 4 commits January 27, 2025 11:09

fix: filter out null values when sampling for index training

6ed336b

fix duplicate column

299bc5f

pr feedback

2247db2

remove null values when creating index

9bbd79f

wjones127 force-pushed the fix/no-sampling-null-vectors branch from 3393029 to 9bbd79f Compare January 27, 2025 19:12

wjones127 commented Jan 27, 2025

westonpace reviewed Jan 27, 2025

wjones127 added 3 commits January 27, 2025 14:38

sample properly

0a5d1b6

cleanup

31aae7a

revert

59a239d

westonpace approved these changes Jan 28, 2025

wjones127 merged commit bfacd7c into lancedb:main Jan 28, 2025
26 of 27 checks passed

wjones127 deleted the fix/no-sampling-null-vectors branch January 28, 2025 01:32
