Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for "take" operation to balanced storage #3079

Merged
merged 6 commits into from
Nov 5, 2024

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented Nov 1, 2024

Adds support for "take" to datasets that have balanced storage

Requires balanced storage datasets to use stable ids

Refactors RowAddress slightly to remove id from the method names (e.g. row_id -> row_offset) to help avoid future confusion.

Fixes an intermittent panic in SharedStream (this panic had been seen before but now that we are using SharedStream more it was showing up more frequently)

Begins aligning on new terminology for balanced storage (the old "blob storage" terminology is confusing because other things are also called "blobs":

  • Each field has a storage class
  • There is (still) a storage class called "blob" for large fields
  • These fields are called "sibling fields" and "local/sibling" should be used when contrasting between fields with the default storage class and any other storage class. In a future PR I'll clean up the append/write impl to use this terminology

Closes #3030

@westonpace westonpace marked this pull request as draft November 1, 2024 14:03
@github-actions github-actions bot added enhancement New feature or request python labels Nov 1, 2024
@westonpace
Copy link
Contributor Author

westonpace commented Nov 1, 2024

Leaving in draft until #3064 merges

@westonpace westonpace force-pushed the feat/take-blob-col branch 2 times, most recently from afeb3f5 to 5285b86 Compare November 1, 2024 20:10
@westonpace westonpace marked this pull request as ready for review November 1, 2024 20:10
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 26.18182% with 203 lines in your changes missing coverage. Please review.

Project coverage is 77.16%. Comparing base (ceaf49c) to head (0c75cc9).

Files with missing lines Patch % Lines
rust/lance/src/dataset.rs 5.83% 128 Missing and 1 partial ⚠️
rust/lance/src/dataset/take.rs 23.61% 51 Missing and 4 partials ⚠️
rust/lance/src/index/vector/ivf/builder.rs 0.00% 6 Missing ⚠️
rust/lance-core/src/datatypes/schema.rs 72.22% 5 Missing ⚠️
rust/lance-core/src/utils/address.rs 60.00% 4 Missing ⚠️
rust/lance/src/dataset/optimize.rs 0.00% 2 Missing ⚠️
rust/lance-datafusion/src/projection.rs 95.00% 0 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/flat.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3079      +/-   ##
==========================================
- Coverage   77.31%   77.16%   -0.15%     
==========================================
  Files         240      240              
  Lines       79322    79561     +239     
  Branches    79322    79561     +239     
==========================================
+ Hits        61326    61393      +67     
- Misses      14821    15014     +193     
+ Partials     3175     3154      -21     
Flag Coverage Δ
unittests 77.16% <26.18%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +121 to +125
# TODO: Once https://github.com/lancedb/lance/pull/3041 merges we will
# want to test partial appends. We need to make sure an append of
# non-blob data is supported. In order to do this we need to make
# sure a blob tx is created that marks the row ids as used so that
# the two row id sequences stay in sync.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for testing the interaction

Comment on lines +158 to +161
/// Splits the schema into two schemas, one with default storage class fields and the other with blob storage class fields.
/// If there are no blob storage class fields, the second schema will be `None`.
/// The order of fields is preserved.
pub fn partition_by_storage_class(&self) -> (Self, Option<Self>) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can only top-level fields be StorageClass::Blob?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is simplest. I should probably add that restriction. We can always add support for sub-fields in the future.

@westonpace westonpace merged commit 83439ef into lancedb:main Nov 5, 2024
23 of 24 checks passed
@westonpace westonpace self-assigned this Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support take_blobs when the field is a remote field
3 participants