Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce concept of "storage class" wtih separate dataset for "blob" storage data #3064

Merged
merged 3 commits into from
Nov 1, 2024

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented Oct 29, 2024

I'm open to terms other than "storage class" as well. I also think we might want to use something other than "blob" as it can be easy to confuse the concepts of "blob encoding" (e.g. low page overhead, file-like API) and "blob storage class" (fewer rows per file). In fact, the thresholds can even be different with the blob encoding threshold being 8-16MB but the blob storage class threshold being closer to 128KB

Closes #3029

@github-actions github-actions bot added enhancement New feature or request python labels Oct 29, 2024
@codecov-commenter
Copy link

codecov-commenter commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 56.73624% with 228 lines in your changes missing coverage. Please review.

Project coverage is 77.40%. Comparing base (9e18322) to head (3d5505e).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/io/commit.rs 32.00% 65 Missing and 3 partials ⚠️
rust/lance/src/dataset/transaction.rs 39.70% 36 Missing and 5 partials ⚠️
rust/lance/src/dataset.rs 48.43% 32 Missing and 1 partial ⚠️
rust/lance/src/dataset/blob.rs 25.00% 25 Missing and 2 partials ⚠️
rust/lance/src/dataset/schema_evolution.rs 50.00% 16 Missing ⚠️
rust/lance-core/src/datatypes/field.rs 64.10% 12 Missing and 2 partials ⚠️
rust/lance/src/dataset/write.rs 85.07% 2 Missing and 8 partials ⚠️
rust/lance/src/dataset/scanner.rs 50.00% 8 Missing and 1 partial ⚠️
rust/lance/src/dataset/write/update.rs 75.00% 4 Missing ⚠️
rust/lance/src/io/exec/utils.rs 0.00% 3 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3064      +/-   ##
==========================================
- Coverage   77.57%   77.40%   -0.18%     
==========================================
  Files         240      240              
  Lines       78683    79578     +895     
  Branches    78683    79578     +895     
==========================================
+ Hits        61040    61596     +556     
- Misses      14498    14820     +322     
- Partials     3145     3162      +17     
Flag Coverage Δ
unittests 77.40% <56.73%> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…class are kept in a separate (nested)

blob dataset.  This allows use to eventually change max_bytes_per_file to be much smaller.  "Blob columns" will
have relatively few rows per file.  "Scalar" columns will have many rows per file.  This allows us to optimize
query performance and compaction times.
…rs (e.g. schema) as part of the 'data bytes'
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good. I think I'm fine with storage class, though not sure what other values we will want there.

Comment on lines +183 to +184
/// "" - The default storage class.
/// "blob" - The field is compacted into fewer rows per fragment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the criteria for this entirely based on size of row values? or do you imagine there will eventually be other considerations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Size is the only thing I can think of and I don't expect us to need more than two storage classes for the next year or two.

It's possible we have a special "foreign" storage class or something like that for values stored in remote datasets but I think that would be best solved in a different way.

At the moment the only other class I can really think of is for extremely large values where file-per-value makes more sense.

I also hadn't quite done the math yet. We want to balance:

  • We can fit 8MiB worth of values into a file (min bound on num_values)
  • We don't have more than 10GiB of values in a file (max bound on num_values)

For example, if we start with int32 we need at least 2Mi values (ideally more actually due to compression). Given 10GiB this gives us a surprisingly small max size of 5KiB (anything larger will occupy more than 10GiB for 2Mi values).

If we are generous and make the cutoff at 64KiB then we need to make sure we have at least 128 values per file which should allow us up to 160MIB per value.

If we find that we can't be generous and we really do want the cutoff at...say...8KiB then we need at least 1024 values per file which means no more than 10MiB per value which is starting to get to the point where we might want three storage classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we find that we can't be generous and we really do want the cutoff at...say...8KiB

This logic is based on the 10GiB limit which is a somewhat arbitrary "let's pretend we have at most 100 or so 'big' columns and we want to keep files under 1TiB" criteria which seems a little conservative. At 64KiB we're fine as long as we only have 1 or 2 big columns or can tolerate files over 1TiB.

@westonpace westonpace merged commit c9f8a49 into lancedb:main Nov 1, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support adding data to blob fields
3 participants