feat: introduce concept of "storage class" with separate dataset for "blob" storage data #3064
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3064 +/- ##
==========================================
- Coverage 77.57% 77.40% -0.18%
==========================================
Files 240 240
Lines 78683 79578 +895
Branches 78683 79578 +895
==========================================
+ Hits 61040 61596 +556
- Misses 14498 14820 +322
- Partials 3145 3162 +17
Flags with carried forward coverage won't be shown.
…class are kept in a separate (nested) blob dataset. This allows us to eventually make max_bytes_per_file much smaller. "Blob columns" will have relatively few rows per file, while "scalar" columns will have many rows per file. This lets us optimize both query performance and compaction times.
…rs (e.g. schema) as part of the 'data bytes'
Force-pushed from df10405 to 3d5505e
Seems good. I think I'm fine with "storage class", though I'm not sure what other values we will want there.
/// "" - The default storage class.
/// "blob" - The field is compacted into fewer rows per fragment.
Is the criterion for this based entirely on the size of row values, or do you imagine there will eventually be other considerations?
Size is the only thing I can think of, and I don't expect us to need more than two storage classes for the next year or two.
It's possible we have a special "foreign" storage class or something like that for values stored in remote datasets but I think that would be best solved in a different way.
At the moment the only other class I can really think of is for extremely large values where file-per-value makes more sense.
I also hadn't quite done the math yet. We want to balance:
- We can fit 8MiB worth of values into a file (min bound on num_values)
- We don't have more than 10GiB of values in a file (max bound on num_values)
For example, if we start with int32 we need at least 2Mi values (ideally more actually due to compression). Given 10GiB this gives us a surprisingly small max size of 5KiB (anything larger will occupy more than 10GiB for 2Mi values).
If we are generous and make the cutoff at 64KiB, then we need to make sure we have at least 128 values per file, which allows us up to 80MiB per value (10GiB / 128).
If we find that we can't be generous and we really do want the cutoff at...say...8KiB then we need at least 1024 values per file which means no more than 10MiB per value which is starting to get to the point where we might want three storage classes.
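The bounds above can be sanity-checked with a small sketch. This is purely illustrative arithmetic, not code from the PR; the constant and function names are assumptions.

```python
# Sanity-check the file sizing math from the comment above.
# Bounds from the discussion: a file should hold at least 8 MiB worth of
# values, and no more than 10 GiB worth of values.
MIN_FILE_BYTES = 8 * 1024**2    # min bound on bytes of values per file
MAX_FILE_BYTES = 10 * 1024**3   # max bound on bytes of values per file

def max_value_size(cutoff_bytes: int) -> int:
    """Given a storage-class cutoff (the smallest value routed to the blob
    dataset), return the largest value size that keeps a full file of such
    values under the 10 GiB bound."""
    min_values_per_file = MIN_FILE_BYTES // cutoff_bytes
    return MAX_FILE_BYTES // min_values_per_file

# int32 scalars: at least 2 Mi values per file -> max ~5 KiB per value
assert MIN_FILE_BYTES // 4 == 2 * 1024**2
assert max_value_size(4) == 5 * 1024            # 5 KiB

# 64 KiB cutoff: at least 128 values per file -> max 80 MiB per value
assert max_value_size(64 * 1024) == 80 * 1024**2

# 8 KiB cutoff: at least 1024 values per file -> max 10 MiB per value
assert max_value_size(8 * 1024) == 10 * 1024**2
```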
> If we find that we can't be generous and we really do want the cutoff at...say...8KiB
This logic is based on the 10GiB limit, which is a somewhat arbitrary "let's pretend we have at most 100 or so 'big' columns and we want to keep total file size under 1TiB" criterion, and that seems a little conservative. At 64KiB we're fine as long as we only have 1 or 2 big columns, or can tolerate files over 1TiB.
I'm open to terms other than "storage class" as well. I also think we might want to use something other than "blob", since it's easy to confuse the concepts of "blob encoding" (e.g. low page overhead, file-like API) and "blob storage class" (fewer rows per file). In fact, the two thresholds can even be different, with the blob encoding threshold being 8-16MB but the blob storage class threshold being closer to 128KB.
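To illustrate that distinction, here is a hypothetical routing sketch. The threshold values and the `classify` helper are assumptions made up for this example, not part of the PR:

```python
# Hypothetical sketch: "blob encoding" and "blob storage class" are
# independent decisions with different size thresholds.
BLOB_ENCODING_THRESHOLD = 8 * 1024**2   # ~8-16 MiB: low-overhead, file-like pages
BLOB_STORAGE_THRESHOLD = 128 * 1024     # ~128 KiB: route to the nested blob dataset

def classify(avg_value_bytes: int) -> tuple[str, str]:
    """Return (storage_class, encoding) for a column given its average
    value size. "" is the default storage class, per the PR's doc comment."""
    storage = "blob" if avg_value_bytes >= BLOB_STORAGE_THRESHOLD else ""
    encoding = "blob" if avg_value_bytes >= BLOB_ENCODING_THRESHOLD else "default"
    return storage, encoding

assert classify(1024) == ("", "default")            # small scalar column
assert classify(256 * 1024) == ("blob", "default")  # blob storage, normal encoding
assert classify(32 * 1024**2) == ("blob", "blob")   # large enough for both
```

The middle case is the interesting one: a 256KiB value is big enough to hurt scan performance in the main dataset (so it moves to the blob dataset) but not big enough to need the special blob encoding.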
Closes #3029