feat: add encoder utilities for pushdown #2388
@wjones127 I used the same
Two things that I think
Though would be happy to simplify if we can.
assert_eq!(zone_maps_buffer.parts.len(), 1);
let zone_maps_buffer = zone_maps_buffer.parts.into_iter().next().unwrap();
// TODO: Once reading is available we can check the contents of the zone maps buffer
// TODO: Test out the different types
One type that always has special considerations is floats. We will want to document how we want to handle +/-0, Infinity, and NaN. For V1, we wrote down the semantics here: https://lancedb.github.io/lance/format.html#statistic-values
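To make the float question concrete, here is a minimal sketch of one possible policy. It is an assumption for illustration, not the semantics written down for V1 or implemented in this PR: NaN values are skipped, infinities are ordinary bounds, and a zero bound is widened so that the min is reported as -0.0 and the max as +0.0 (a convention some other columnar formats use). The helper name `float_min_max` is hypothetical.

```rust
/// Hypothetical helper: min/max statistics over f64 values.
/// Returns None when the slice holds no non-NaN value.
fn float_min_max(values: &[f64]) -> Option<(f64, f64)> {
    let mut bounds: Option<(f64, f64)> = None;
    for &v in values {
        if v.is_nan() {
            continue; // assumption: NaN never contributes to the bounds
        }
        bounds = Some(match bounds {
            None => (v, v),
            Some((lo, hi)) => (lo.min(v), hi.max(v)),
        });
    }
    // Assumption: -0.0 and +0.0 compare equal, so a zero bound is widened
    // to cover both signs (min becomes -0.0, max becomes +0.0).
    bounds.map(|(lo, hi)| {
        let lo = if lo == 0.0 { -0.0 } else { lo };
        let hi = if hi == 0.0 { 0.0 } else { hi };
        (lo, hi)
    })
}
```

With this policy, a reader evaluating `x >= 0.0` or `x <= -0.0` against the statistics cannot wrongly skip data because of the sign of zero.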
impl ZoneMapsFieldEncoder {
    fn new_map(&mut self) -> Result<()> {
        // TODO: We should be truncating the min/max values here
Truncating binary/string values seems like a good idea. There are special considerations for the max to make sure you actually get a greater value after truncation.
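The special consideration for the max can be sketched as follows. This is an illustration on raw bytes only, with hypothetical helper names (`truncate_min`, `truncate_max`); it is not the logic of the statistics collector, and UTF-8 strings would additionally need to respect character boundaries. A plain prefix is always a valid lower bound, but for an upper bound the last byte of the prefix that is not 0xFF has to be incremented.

```rust
/// Hypothetical helper: truncate a min bound to at most `len` bytes.
/// A prefix always sorts at or below the original, so it stays a valid lower bound.
fn truncate_min(value: &[u8], len: usize) -> Vec<u8> {
    value[..value.len().min(len)].to_vec()
}

/// Hypothetical helper: truncate a max bound to at most `len` bytes.
/// A plain prefix could sort below the original, so trailing 0xFF bytes are
/// dropped and the last remaining byte is incremented. Returns None when no
/// shorter upper bound exists (every byte of the prefix is 0xFF).
fn truncate_max(value: &[u8], len: usize) -> Option<Vec<u8>> {
    if value.len() <= len {
        return Some(value.to_vec()); // nothing to truncate
    }
    let mut prefix = value[..len].to_vec();
    while let Some(last) = prefix.pop() {
        if last < u8::MAX {
            prefix.push(last + 1);
            return Some(prefix);
        }
    }
    None
}
```

When `truncate_max` returns None, a writer would have to either keep the full value or store no max at all for that zone.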
@wjones127 alright, between the truncation and the float ordering I'm convinced. I'll switch to using the statistics collector in a future PR. This way users don't have to deal with breaking changes, and I think your logic is simpler when it comes to NaN (ignore).
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2388 +/- ##
==========================================
- Coverage 79.98% 79.77% -0.22%
==========================================
Files 200 202 +2
Lines 54713 55219 +506
Branches 54713 55219 +506
==========================================
+ Hits 43764 44051 +287
- Misses 8402 8605 +203
- Partials 2547 2563 +16
/// Serializes into a lance file, without the schema.
///
/// The schema must be provided to deserialize the buffer
fn try_to_mini_lance(&self) -> Result<Bytes>;
Out of curiosity, why try_to_self_described_lance/try_to_mini_lance vs something more descriptive like try_to_self_with_schema/try_to_lance_no_schema? Are mini/self-described common in data format parlance, or are there eventually things other than schema we'll emit from mini vs. self-described?
Out of curiosity, why try_to_self_described_lance/try_to_mini_lance vs something more descriptive like
I could be argued either way. My original goal was to describe the "why" and not the "what". As a potential API user if I see "to_lance_with_schema" and "to_lance_no_schema" I might not know why I would pick one or the other. Still, there are definitely advantages to being more explicit with what is happening.
"self described" -> the resulting buffer can be deserialized without any extra information.
"mini" -> the smallest possible serialization.
Are mini/self-described common in data format parlance
Not that I'm aware of.
there eventually things other than schema we'll emit from mini vs. self-described?
I don't know of anything at the moment.
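The trade-off between the two names can be illustrated with a toy layout. Everything below is hypothetical (the `ToyBatch` type, its methods, and the byte layout) and is not the lance file format; it only shows that the "self described" bytes can be decoded on their own, while the "mini" bytes are smaller but need the schema supplied by the caller.

```rust
/// Toy stand-in for an encoded batch; not the real EncodedBatch type.
struct ToyBatch {
    schema: String,
    data: Vec<u8>,
}

impl ToyBatch {
    /// "Self described": a 4-byte schema length, the schema, then the data.
    fn to_self_described(&self) -> Vec<u8> {
        let schema = self.schema.as_bytes();
        let mut out = Vec::with_capacity(4 + schema.len() + self.data.len());
        out.extend_from_slice(&(schema.len() as u32).to_le_bytes());
        out.extend_from_slice(schema);
        out.extend_from_slice(&self.data);
        out
    }

    /// "Mini": the data only, which is the smallest possible output.
    fn to_mini(&self) -> Vec<u8> {
        self.data.clone()
    }

    /// Decoding the self-described form needs no extra information.
    fn from_self_described(bytes: &[u8]) -> Option<ToyBatch> {
        if bytes.len() < 4 {
            return None;
        }
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&bytes[..4]);
        let schema_end = 4usize.checked_add(u32::from_le_bytes(len_bytes) as usize)?;
        let schema = std::str::from_utf8(bytes.get(4..schema_end)?).ok()?.to_string();
        Some(ToyBatch { schema, data: bytes[schema_end..].to_vec() })
    }

    /// Decoding the mini form requires the caller to provide the schema.
    fn from_mini(bytes: &[u8], schema: &str) -> ToyBatch {
        ToyBatch { schema: schema.to_string(), data: bytes.to_vec() }
    }
}
```

In this toy, the only difference between the two outputs is the schema prefix, which matches the answer above that nothing else is currently expected to differ.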
…ting encoded batch to bytes Reverted back to logic to handle lance field id mapping Fix python bindings with new option Another round of clippy suggestions Fix unit test mistake
This adds a new field encoder (ZoneMapFieldEncoder) that calculates pushdown statistics and places them in the metadata.
It also changes the encoder so that the choice of encoding is configurable. This makes it possible for extensions to register custom encodings. The zone maps encoder is an example of this as it is placed in a special crate for "encodings that rely on datafusion".
It also adds some utilities for converting an `EncodedBatch` to `Bytes` according to the lance file format. This makes it possible to go from `RecordBatch` to `Bytes` using the lance file format.
There is not much testing for the zone maps encoder. More will come when we add support for reading zone maps but I want to keep this PR simple for now.
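For readers unfamiliar with zone maps, the idea behind the pushdown statistics can be sketched roughly as below. This is a simplified illustration on `i64` values with hypothetical names (`Zone`, `build_zone_maps`, `zones_matching_gt`); it does not reflect the encoder's actual types, zone size, or metadata layout. Each zone of rows gets a min and a max, and a reader can skip any zone whose bounds rule out a filter.

```rust
/// Hypothetical zone map entry: min and max of one zone of rows.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Zone {
    min: i64,
    max: i64,
}

/// Builds one entry per `rows_per_zone` rows (the last zone may be shorter).
fn build_zone_maps(values: &[i64], rows_per_zone: usize) -> Vec<Zone> {
    assert!(rows_per_zone > 0, "rows_per_zone must be positive");
    values
        .chunks(rows_per_zone)
        .map(|chunk| Zone {
            min: *chunk.iter().min().expect("chunks are never empty"),
            max: *chunk.iter().max().expect("chunks are never empty"),
        })
        .collect()
}

/// Returns the indices of zones that may contain a row with `value > threshold`.
/// Zones whose max is at or below the threshold can be skipped without reading them.
fn zones_matching_gt(zones: &[Zone], threshold: i64) -> Vec<usize> {
    zones
        .iter()
        .enumerate()
        .filter(|(_, zone)| zone.max > threshold)
        .map(|(index, _)| index)
        .collect()
}
```

For example, with zones of three rows over `[1, 5, 3, 10, 12, 11, 7]`, a filter `value > 6` only needs to read the second and third zones.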