Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add file statistics #3232

Merged
merged 14 commits into from
Dec 13, 2024
Merged

feat: add file statistics #3232

merged 14 commits into from
Dec 13, 2024

Conversation

broccoliSpicy
Copy link
Contributor

@broccoliSpicy broccoliSpicy commented Dec 11, 2024

This PR adds a python method to get the size in bytes and number of pages of each column in a Lance file.

these statistics are calculated on demand.

#3221

@github-actions github-actions bot added enhancement New feature or request python labels Dec 11, 2024
@codecov-commenter
Copy link

codecov-commenter commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.

Project coverage is 78.48%. Comparing base (99ae761) to head (d0ef2a9).

Files with missing lines Patch % Lines
rust/lance-file/src/v2/reader.rs 0.00% 23 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3232      +/-   ##
==========================================
- Coverage   78.49%   78.48%   -0.02%     
==========================================
  Files         245      245              
  Lines       85065    85088      +23     
  Branches    85065    85088      +23     
==========================================
+ Hits        66776    66783       +7     
- Misses      15474    15492      +18     
+ Partials     2815     2813       -2     
Flag Coverage Δ
unittests 78.48% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor suggestions. Otherwise looks reasonable :)

python/src/file.rs Outdated Show resolved Hide resolved
rust/lance-file/src/v2/reader.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems fine, mostly just documentation suggestions and one simplification.

rust/lance-file/src/v2/reader.rs Outdated Show resolved Hide resolved
Comment on lines 94 to 96

// The file statistics.
pub file_statistics: Arc<FileStatistics>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably calculate these on demand. I think 90% of the time users won't be accessing these and the cost to calculate them is trivial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to calculate on demand

@@ -146,6 +147,9 @@ def metadata(self) -> LanceFileMetadata:
"""
return self._reader.metadata()

def file_statistics(self) -> LanceFileStatistics:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a docstring for this method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines 118 to 137
#[pyclass(get_all)]
#[derive(Clone, Debug, Serialize)]
pub struct LanceFileStatistics {
columns: Vec<LanceColumnStatistics>,
}

#[pymethods]
impl LanceFileStatistics {
fn __repr__(&self) -> String {
let column_reprs: Vec<String> = self.columns.iter().map(|col| col.__repr__()).collect();
format!("LanceFileStatistics(columns=[{}])", column_reprs.join(", "))
}
}

#[pyclass(get_all)]
#[derive(Clone, Debug, Serialize)]
pub struct LanceColumnStatistics {
num_pages: usize,
size_bytes: u64,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since LanceColumnStatistics and LanceFileStatistics are exposed directly to the user (being re-exported in file.py) we should add docstrings. You can copy the ones I suggested in the rust lib.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: we could make the repr smaller by removing the Lance prefix from the class names. It's unnecessary.

python/src/file.rs Show resolved Hide resolved
python/src/file.rs Outdated Show resolved Hide resolved
python/src/file.rs Outdated Show resolved Hide resolved
@broccoliSpicy
Copy link
Contributor Author

small nit: we could make the repr smaller by removing the Lance prefix from the class names. It's unnecessary.

thanks, fixed.

@broccoliSpicy broccoliSpicy merged commit 679b93c into lancedb:main Dec 13, 2024
28 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants