Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: filter out null values when sampling for index training #3404

Merged
merged 7 commits into from
Jan 28, 2025

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jan 21, 2025

We were not filtering out null values when sampling. Because we often call array.values() on Arrow arrays, which ignores the null bitmap, we are often silently treating the nulls as zeros (or possibly undefined values). Only thing that caught these nulls is an assertion. However, residualization occurring with L2 and Cosine often meant that these values were transformed and null information was lost before the assertion, which is why it got past previous unit tests.

This PR adds more assertions validating there aren't nulls, and makes sure the sampling code handles null vectors.

Closes #3402
Closes #3400

@github-actions github-actions bot added the bug Something isn't working label Jan 21, 2025
@wjones127 wjones127 marked this pull request as ready for review January 22, 2025 00:25
@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 92.57143% with 13 lines in your changes missing coverage. Please review.

Project coverage is 78.85%. Comparing base (58c5e27) to head (59a239d).

Files with missing lines Patch % Lines
rust/lance/src/index/vector/utils.rs 90.09% 5 Missing and 6 partials ⚠️
rust/lance/src/index/vector/builder.rs 85.71% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3404      +/-   ##
==========================================
+ Coverage   78.81%   78.85%   +0.04%     
==========================================
  Files         250      250              
  Lines       91306    91475     +169     
  Branches    91306    91475     +169     
==========================================
+ Hits        71963    72135     +172     
+ Misses      16390    16379      -11     
- Partials     2953     2961       +8     
Flag Coverage Δ
unittests 78.85% <92.57%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -2215,6 +2222,77 @@ mod tests {
.await;
}

#[rstest]
#[tokio::test]
async fn test_create_index_nulls(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm thinking should we add some tests for verifying recall? then we can know whether flat search handles nulls well.

it might be good to modify this test https://github.com/lancedb/lance/blob/main/rust/lance/src/index/vector/ivf/v2.rs to contain half rows with nulls

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking, is there a way to count rows that are present in the index? I assume if it’s null then we don’t write it to the index file, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the test so it asserts we can use search to get all the non-null vectors back. But I am not getting the results I expect. I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems no such way to count that now, it could be easy for v3 index by counting the num rows of storage file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could use your advice to know what the expected behavior of these indices should be when there are lots of null vectors.

@BubbleCal Could you help me make sense of the output of this test? https://github.com/lancedb/lance/actions/runs/12918160780/job/36026117407?pr=3404

I was expecting search to only return non-null rows, but it seems like we are getting some null vectors in the results.

@wjones127 wjones127 force-pushed the fix/no-sampling-null-vectors branch from 3393029 to 9bbd79f Compare January 27, 2025 19:12
Comment on lines 132 to 140
// Need to filter out null values
// Use a scan to collect row ids. Then sample from the row ids. Then do take.
let row_addrs = dataset
.scan()
.filter_expr(datafusion_expr::col(column).is_not_null())
.with_row_address()
.project::<&str>(&[])?
.try_into_batch()
.await?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace How expensive do you think this query is? This is filtering for non-null vectors and getting the row ids. Do you think there are easy optimizations we could do? If so, I'd like to capture that in a ticket.

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably just missing something but I don't see where we are sampling.

@@ -447,6 +447,7 @@ pub async fn build_pq_model(
"Finished loading training data in {:02} seconds",
start.elapsed().as_secs_f32()
);
debug_assert_eq!(training_data.logical_null_count(), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just assert_eq? This shouldn't be a critical section. Better safe than sorry.

// Use a scan to collect row ids. Then sample from the row ids. Then do take.
let row_addrs = dataset
.scan()
.filter_expr(datafusion_expr::col(column).is_not_null())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What am I missing? At the moment there is no cheap way to scan if a column is/is not null. So this filter will load the entire column into memory? Why do scan->filter->take and not just scan->filter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, forgot the sampling bit. It sounds like the best thing for now is scan->filter + reservoir sampling?

let projection = dataset.schema().project(&[column])?;
let batch = dataset.sample(sample_size_hint, &projection).await?;
info!(
"Sample training data: retrieved {} rows by sampling",
batch.num_rows()
);
batch
} else if num_rows > sample_size_hint && is_nullable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...shouldn't you be using sample_size_hint? In this branch there are way more rows than we need. E.g. to train a dataset with 1B rows we need 30K partitions and so sample_size_hint will be ~8M. It looks like you're going to read all 1B vectors. Also, I don't see any randomization.

Did you mean to shuffle row_addrs?

FWIW, in the python, we do this (

for shard in shards:
):

  • Create a randomized take stream that will eventually take the entire dataset
  • Pull from the take stream and filter out nulls in-memory until we have sample_size_hint rows.
  • Stop pulling from the take stream

To speed up the "randomized take stream" we actually stream random "contiguous shards" that are sized to give us at least 2K take operations if there are no nulls (IIRC)

Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, random_ranges looks useful. Maybe we can simplify the python impls at some point in the future.

@wjones127 wjones127 merged commit bfacd7c into lancedb:main Jan 28, 2025
26 of 27 checks passed
@wjones127 wjones127 deleted the fix/no-sampling-null-vectors branch January 28, 2025 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants