Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: add deletion files to Lance format #920

Merged
merged 6 commits into from
Jun 1, 2023

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented May 31, 2023

Closes #908

@wjones127 wjones127 force-pushed the wjones127/deletion-definition branch from 86e8efd to 742dfa8 Compare May 31, 2023 18:46
@wjones127 wjones127 marked this pull request as ready for review May 31, 2023 18:47
Copy link
Contributor

@chebbyChefNEQ chebbyChefNEQ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I never reviewed rust before. I will do a second pass in the morning just to make sure I'm not missing something obvious.

fn deletion_arrow_schema() -> Arc<Schema> {
Arc::new(Schema::new(vec![Field::new(
"row_id",
DataType::UInt32,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential tangent: maybe we want 64 bit uint to be safe? Just so that we don't have a "small" limit of number of rows we can store.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this isn't well documented, but the row ids elsewhere are a concatenation of the fragment id and the within-fragment row id (which is the position / offset of the row in the file).

lance/rust/src/io/reader.rs

Lines 120 to 123 in 505dbcd

/// Compute row id from `fragment_id` and the `offset` of the row in the fragment.
fn compute_row_id(fragment_id: u64, offset: i32) -> u64 {
(fragment_id << 32) + offset as u64
}

So I think we already have a limit on the number of rows we can store per file based on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense. thank you for explaining!

@wjones127 wjones127 requested a review from eddyxu June 1, 2023 16:14
@wjones127 wjones127 merged commit a410200 into main Jun 1, 2023
@wjones127 wjones127 deleted the wjones127/deletion-definition branch June 1, 2023 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

First pass at deletion support
2 participants