-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add file statistics #3232
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3232 +/- ##
==========================================
- Coverage 78.49% 78.48% -0.02%
==========================================
Files 245 245
Lines 85065 85088 +23
Branches 85065 85088 +23
==========================================
+ Hits 66776 66783 +7
- Misses 15474 15492 +18
+ Partials 2815 2813 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor suggestions. Otherwise looks reasonable :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems fine, mostly just documentation suggestions and one simplification.
rust/lance-file/src/v2/reader.rs
Outdated
|
||
// The file statistics. | ||
pub file_statistics: Arc<FileStatistics>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably calculate these on demand. I think 90% of the time users won't be accessing these and the cost to calculate them is trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to calculate on demand
@@ -146,6 +147,9 @@ def metadata(self) -> LanceFileMetadata: | |||
""" | |||
return self._reader.metadata() | |||
|
|||
def file_statistics(self) -> LanceFileStatistics: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a docstring for this method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
python/src/file.rs
Outdated
#[pyclass(get_all)] | ||
#[derive(Clone, Debug, Serialize)] | ||
pub struct LanceFileStatistics { | ||
columns: Vec<LanceColumnStatistics>, | ||
} | ||
|
||
#[pymethods] | ||
impl LanceFileStatistics { | ||
fn __repr__(&self) -> String { | ||
let column_reprs: Vec<String> = self.columns.iter().map(|col| col.__repr__()).collect(); | ||
format!("LanceFileStatistics(columns=[{}])", column_reprs.join(", ")) | ||
} | ||
} | ||
|
||
#[pyclass(get_all)] | ||
#[derive(Clone, Debug, Serialize)] | ||
pub struct LanceColumnStatistics { | ||
num_pages: usize, | ||
size_bytes: u64, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since LanceColumnStatistics
and LanceFileStatistics
are exposed directly to the user (being re-exported in file.py) we should add docstrings. You can copy the ones I suggested in the rust lib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed, thanks.
Co-authored-by: Weston Pace <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: we could make the repr smaller by removing the Lance
prefix from the class names. It's unnecessary.
thanks, fixed. |
8e6a538
to
17623b2
Compare
This PR adds a python method to get the
size in bytes
andnumber of pages
of each column in a Lance file.these statistics are calculated on demand.
#3221