[CHORE] [ScanOperator-Follow-Ons-2] Refactor MicroPartition to have non-optional TableMetadata #1563

jaychia

@jaychia jaychia commented Nov 3, 2023

Also refactors MicroPartition::new into explicit ::new_loaded and ::new_unloaded variants:

  1. Enforces at construction time that TableStatistics is Some whenever the state is TableState::Unloaded
  2. Cleans up client code: TableState no longer needs to be exposed externally, and the TableMetadata calculation is hidden inside the ::new_loaded constructor
  3. Makes ::from_scan_task_batch explicit: it tries ::new_unloaded first, and falls back to ::new_loaded if the ScanTaskBatch does not carry both metadata and statistics
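
For readers without the diff at hand, here is a rough sketch of the constructor split described above. It is not the code from this PR: every struct layout, field name, and the `read` callback are simplified, hypothetical stand-ins chosen only to show how the three constructors might relate to each other.

```rust
// Hypothetical, simplified stand-ins for the real types. Field names and
// layouts are assumptions made for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct TableMetadata {
    pub length: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub struct TableStatistics {
    pub min: i64,
    pub max: i64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Table {
    pub values: Vec<i64>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ScanTaskBatch {
    pub paths: Vec<String>,
    pub metadata: Option<TableMetadata>,
    pub statistics: Option<TableStatistics>,
}

// Private: callers can no longer build a state by hand.
#[derive(Debug)]
enum TableState {
    Unloaded(ScanTaskBatch),
    Loaded(Vec<Table>),
}

#[derive(Debug)]
pub struct MicroPartition {
    state: TableState,
    metadata: TableMetadata,             // non-optional
    statistics: Option<TableStatistics>, // always Some when unloaded
}

impl MicroPartition {
    /// Unloaded data requires both metadata and statistics up front,
    /// so the "statistics is Some" invariant is checked by the signature.
    pub fn new_unloaded(
        batch: ScanTaskBatch,
        metadata: TableMetadata,
        statistics: TableStatistics,
    ) -> Self {
        MicroPartition {
            state: TableState::Unloaded(batch),
            metadata,
            statistics: Some(statistics),
        }
    }

    /// Loaded data computes its own metadata from the tables.
    pub fn new_loaded(tables: Vec<Table>, statistics: Option<TableStatistics>) -> Self {
        let length: usize = tables.iter().map(|t| t.values.len()).sum();
        MicroPartition {
            state: TableState::Loaded(tables),
            metadata: TableMetadata { length },
            statistics,
        }
    }

    /// Try the unloaded path first; fall back to eagerly reading the data
    /// when the batch lacks metadata or statistics. `read` is a hypothetical
    /// callback standing in for the actual file-reading logic.
    pub fn from_scan_task_batch(
        batch: ScanTaskBatch,
        read: impl FnOnce(&ScanTaskBatch) -> Vec<Table>,
    ) -> Self {
        match (batch.metadata.clone(), batch.statistics.clone()) {
            (Some(metadata), Some(statistics)) => Self::new_unloaded(batch, metadata, statistics),
            _ => {
                let tables = read(&batch);
                Self::new_loaded(tables, batch.statistics.clone())
            }
        }
    }

    pub fn len(&self) -> usize {
        self.metadata.length
    }

    pub fn is_loaded(&self) -> bool {
        matches!(self.state, TableState::Loaded(_))
    }

    pub fn statistics(&self) -> Option<&TableStatistics> {
        self.statistics.as_ref()
    }
}
```

In this sketch the invariant from point 1 is carried by the `new_unloaded` signature rather than by a runtime check, and client code only ever calls the constructors and accessors, never `TableState` directly.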

@jaychia jaychia force-pushed the jay/scan-operator-integration-metadata branch from 5389eca to 2a9ba32 Compare November 3, 2023 01:23
@jaychia jaychia merged commit 39492ad into clark/scan-operator-integration Nov 3, 2023
2 checks passed
@jaychia jaychia deleted the jay/scan-operator-integration-metadata branch November 3, 2023 01:28
jaychia added a commit that referenced this pull request Nov 7, 2023
This PR adds an e2e integration for the new `ScanOperator` for reading
from external sources, integrating with logical plan building, logical
-> physical plan translation, physical plan scheduling, physical task
execution, and the actual `MicroPartition`-based reading.

## TODOs (possibly before merging)

- [ ] Implement Python I/O backend at `MicroPartition` level.
- [ ] Implement reads for non-Parquet formats at `MicroPartition` level.
- [x] Consolidate filter/limit pushdowns to use the same `Pushdown`
struct.
- [x] Look to reinstate non-optional `TableMetadata` at the
`MicroPartition` level. (#1563)
- [x] Look to reinstate non-optional `TableStatistics` when data is
unloaded at the `MicroPartition` level. (#1563)
- [x] Integrate with globbing `ScanOperator` implementation. (#1564)
- [ ] Support different row group selection per Parquet file (currently
applies a single row group selection to all files in a scan task
bundle).
- [ ] Misc. cleanup.
- [ ] (?) Add basic validation that `ScanTask` configurations are
compatible when merging into a `ScanTaskBatch` bundle.

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
Co-authored-by: Jay Chia <[email protected]>