
[FEAT][ScanOperator 1/3] Add MVP e2e ScanOperator integration. #1559

Merged: 17 commits into main from clark/scan-operator-integration on Nov 7, 2023

Conversation

clarkzinzow (Contributor) commented Nov 1, 2023:

This PR adds an MVP end-to-end integration of the new `ScanOperator` for reading from external sources, covering logical plan building, logical-to-physical plan translation, physical plan scheduling, physical task execution, and the actual `MicroPartition`-based reads.
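For orientation, here is a minimal sketch of the shape such a scan operator interface could take; the trait and types below are illustrative stand-ins, not the actual definitions added in this PR:

```rust
use std::sync::Arc;

// Stand-in types; the real definitions live in daft-core / daft-scan.
pub struct Schema;
pub type SchemaRef = Arc<Schema>;
pub struct ScanTask;

/// Hypothetical minimal surface of a scan operator: expose the schema of the
/// external source, and enumerate the scan tasks needed to read it.
pub trait ScanOperator: Send + Sync {
    fn schema(&self) -> SchemaRef;
    fn to_scan_tasks(&self) -> Vec<ScanTask>;
}
```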

TODOs (possibly before merging)

samster25 (Member) left a comment:


Did a first-pass review! Great work so far :)

@@ -67,7 +67,7 @@ def read_csv(
     )
     file_format_config = FileFormatConfig.from_csv_config(csv_config)
     if use_native_downloader:
-        storage_config = StorageConfig.native(NativeStorageConfig(io_config))
+        storage_config = StorageConfig.native(NativeStorageConfig(True, io_config))
Member:

@jaychia Did we decide to only use the multithreading backend for the Python runner, or are we just gonna call yolo and use it for Ray too?

Contributor:

I do have a hardcoded default of false for Ray, but only for the Parquet read (https://github.com/Eventual-Inc/Daft/blob/main/daft/io/_parquet.py#L53-L55). We could do it for CSV as well?

Review threads resolved on: src/daft-micropartition/src/micropartition.rs (3), src/daft-plan/src/physical_ops/scan.rs, src/daft-scan/src/lib.rs (3)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Pushdowns {
    /// Optional filters to apply to the source data.
    pub filters: Option<Arc<Vec<ExprRef>>>,
Member:

Should these be `HashSet`s instead, since ordering shouldn't matter for equality and hashing?
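For reference, a hedged sketch of the `HashSet` variant being floated here; `ExprRef` is a stand-in alias, and the real struct has more fields than shown:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Stand-in for the real ExprRef from daft-dsl.
type ExprRef = Arc<String>;

// A HashSet makes equality order-insensitive, but note the trade-off:
// HashSet does not implement Hash, so the struct can no longer derive
// Hash and would need a custom, order-independent Hash impl.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Pushdowns {
    /// Optional filters to apply to the source data.
    pub filters: Option<Arc<HashSet<ExprRef>>>,
}
```

That trade-off may be why a `Vec` plus derived `Hash` is used here.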

Review thread resolved on: src/daft-scan/src/python.rs
jaychia force-pushed the clark/scan-operator-integration branch from b091508 to 9768c21 on November 2, 2023 at 00:04
github-actions bot added the enhancement (New feature or request) label on Nov 2, 2023
Jay Chia and others added 7 commits on November 2, 2023 at 17:21:
…e instead of on `dyn ScanOperator` trait object (#1562)

Performs equality on the Arc pointer instead of relying on logic that checks the underlying `dyn ScanOperator` trait object for equality, which could be tricky because of custom vtable logic (see the sketch below).

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
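A minimal sketch of the pointer-equality approach this commit describes; the wrapper name and trait body are assumptions, not the actual daft-scan code:

```rust
use std::sync::Arc;

// Stand-in trait; the real `ScanOperator` lives in daft-scan.
pub trait ScanOperator {}

// Newtype wrapper so equality can be defined on the Arc itself.
#[derive(Clone)]
pub struct ScanOperatorRef(pub Arc<dyn ScanOperator>);

impl PartialEq for ScanOperatorRef {
    fn eq(&self, other: &Self) -> bool {
        // Compare allocations rather than trait-object contents,
        // sidestepping any reliance on vtable comparisons.
        Arc::ptr_eq(&self.0, &other.0)
    }
}

impl Eq for ScanOperatorRef {}
```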
…on-optional TableMetadata (#1563)

Also refactors `MicroPartition::new` into explicit `::new_loaded` and `::new_unloaded` variants (see the sketch after this commit message):

1. Helps us enforce that `TableStatistics` is `Some` when the state is `TableState::Unloaded` at construction time
2. Cleans up client code, because `TableState` no longer needs to be exposed externally and we can hide the `TableMetadata` calculations inside the `::new_loaded` constructor
3. `::from_scan_task_batch` is very explicit and clean: it tries `::new_unloaded` first, but falls back on `::new_loaded` if the ScanTaskBatch is not provided with both metadata and statistics

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
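A hedged sketch of the constructor split described above; signatures, field names, and the metadata calculation are assumptions, not the actual daft-micropartition API:

```rust
use std::sync::Arc;

// Stand-in types to keep the sketch self-contained.
pub struct Table { pub num_rows: usize }
pub struct TableMetadata { pub length: usize }
pub struct TableStatistics;

pub enum TableState {
    Unloaded,
    Loaded(Arc<Vec<Table>>),
}

pub struct MicroPartition {
    pub state: TableState,
    pub metadata: TableMetadata, // non-optional after this change
    pub statistics: Option<TableStatistics>,
}

impl MicroPartition {
    /// Unloaded construction must be handed statistics up front,
    /// enforcing invariant (1) at construction time.
    pub fn new_unloaded(metadata: TableMetadata, statistics: TableStatistics) -> Self {
        Self {
            state: TableState::Unloaded,
            metadata,
            statistics: Some(statistics),
        }
    }

    /// Loaded construction derives metadata from the materialized tables,
    /// hiding that calculation from client code (point 2).
    pub fn new_loaded(tables: Arc<Vec<Table>>, statistics: Option<TableStatistics>) -> Self {
        let length = tables.iter().map(|t| t.num_rows).sum();
        Self {
            state: TableState::Loaded(tables),
            metadata: TableMetadata { length },
            statistics,
        }
    }
}
```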
…ew scan node builder (#1564)

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
codecov bot commented Nov 3, 2023:

Codecov Report

Merging #1559 (c113d5a) into main (c8fe883) will decrease coverage by 0.36%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##             main    #1559      +/-   ##
==========================================
- Coverage   85.21%   84.85%   -0.36%     
==========================================
  Files          54       54              
  Lines        5121     5165      +44     
==========================================
+ Hits         4364     4383      +19     
- Misses        757      782      +25     
Files Coverage Δ
daft/execution/execution_step.py 92.83% <ø> (ø)
daft/io/_csv.py 94.73% <100.00%> (ø)
daft/io/_parquet.py 100.00% <100.00%> (ø)
daft/table/table_io.py 95.83% <ø> (ø)
daft/table/table.py 81.97% <75.00%> (-0.10%) ⬇️
daft/table/micropartition.py 89.06% <60.00%> (-0.78%) ⬇️
daft/logical/builder.py 86.72% <50.00%> (-2.17%) ⬇️
daft/execution/rust_physical_plan_shim.py 87.50% <52.94%> (-10.69%) ⬇️
daft/io/common.py 65.00% <26.66%> (-23.89%) ⬇️

jaychia changed the title from "[FEAT] [Scan Operator] [New Query Planner] Add MVP e2e ScanOperator integration." to "[FEAT][ScanOperator 1/3] Add MVP e2e ScanOperator integration." on Nov 4, 2023
# This environment variable will make Daft use the new "v2 scans" and MicroPartitions when building Daft logical plans
if os.getenv("DAFT_V2_SCANS", "0") == "1":
    assert (
        os.getenv("DAFT_MICROPARTITIONS", "0") == "1"
    )
Member:

Let's just override this to default to 1 if DAFT_V2_SCANS is set.

Contributor:

Yeah, unfortunately we use `os.getenv("DAFT_MICROPARTITIONS", "0") == "1"` at import time to hot-swap our `Table` implementation, so by the time we hit this code it might be "too late" to override it.

Review threads resolved on: daft/io/common.py (2), src/daft-micropartition/src/micropartition.rs, src/daft-plan/src/physical_plan.rs, src/daft-scan/Cargo.toml, src/daft-scan/src/glob.rs (3)
@@ -14,6 +15,8 @@ pub struct Source {
     /// Information about the source data location.
     pub source_info: Arc<SourceInfo>,

+    // TODO(Clark): Replace these pushdown fields with the Pushdown struct, where the Pushdown struct
+    // would exist on the LegacyExternalInfo struct in SourceInfo.
clarkzinzow (Contributor, Author):

@jaychia Btw, the TODO in the PR description ("Consolidate filter/limit pushdowns to use the same Pushdown struct.") is referring to this. It still should be done, but it definitely doesn't need to block merging this PR, IMO.

    pub file_format_config: Arc<FileFormatConfig>,
    pub schema: SchemaRef,
    pub storage_config: Arc<StorageConfig>,
    // TODO(Clark): Directly use the Pushdowns struct as part of the ScanTask struct?
clarkzinzow (Contributor, Author):

@jaychia The TODO in the PR description ("Consolidate filter/limit pushdowns to use the same Pushdown struct.") was also referring to this: we could use the `Pushdowns` struct as a `pushdowns` field on the `ScanTask` (see the sketch below).
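For illustration, a hedged sketch of that consolidation; all types here are stand-ins rather than the actual daft-scan definitions:

```rust
use std::sync::Arc;

// Stand-ins to keep the sketch self-contained.
pub struct FileFormatConfig;
pub struct Schema;
pub type SchemaRef = Arc<Schema>;
pub struct StorageConfig;
pub type ExprRef = Arc<String>;

pub struct Pushdowns {
    pub filters: Option<Arc<Vec<ExprRef>>>,
    pub limit: Option<usize>,
}

pub struct ScanTask {
    pub file_format_config: Arc<FileFormatConfig>,
    pub schema: SchemaRef,
    pub storage_config: Arc<StorageConfig>,
    // One shared struct instead of separate ad hoc filter/limit fields.
    pub pushdowns: Pushdowns,
}
```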

Review threads resolved on: src/daft-scan/src/glob.rs (2)
jaychia and others added 4 commits on November 6, 2023 at 12:18:
…ves ScanTaskBatch (#1565)

1. Replaces `ScanTask` with `ScanTaskBatch`, which is mostly like `ScanTask` except that it is multi-file
2. This helps us deduplicate the information representation between `ScanTask` and `ScanTaskBatch`

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
Adds various fixes to Daft v2 scans:

1. Fixes how we handle schema hints in the Python logical scan node builder
2. Fixes our ScanGlobOperator not correctly prefixing local paths with `file://` schemes (see the sketch after this commit message)
3. Adds a branch in micropartitions for materializing from a ScanTask that is a CSV
4. Implements column-pruning pushdown rules and correct handling during ScanTask materialization

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
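As an illustration of fix (2), a hedged sketch of the kind of scheme normalization involved; this is not the actual glob.rs code:

```rust
/// Prefix a bare local path with the `file://` scheme, leaving paths that
/// already carry a scheme (s3://, gs://, file://, ...) untouched.
fn ensure_scheme(path: &str) -> String {
    if path.contains("://") {
        path.to_string()
    } else {
        format!("file://{path}")
    }
}

fn main() {
    assert_eq!(ensure_scheme("/tmp/data.csv"), "file:///tmp/data.csv");
    assert_eq!(ensure_scheme("s3://bucket/key.csv"), "s3://bucket/key.csv");
}
```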
from daft.table import Table

PartitionT = TypeVar("PartitionT")


def scan_with_tasks(
    scan_tasks: list[ScanTask],
Contributor:

NOTE: I changed this from a single `ScanTaskBatch` to avoid having to coalesce all the tasks into one fat task.

clarkzinzow (Contributor, Author) left a comment:

@jaychia LGTM overall, can't approve my own PR so leaving this as a comment!

Review thread resolved on: src/daft-micropartition/src/micropartition.rs
jaychia merged commit e176f2e into main on Nov 7, 2023; 36 of 37 checks passed.
jaychia deleted the clark/scan-operator-integration branch on November 7, 2023 at 03:28.
clarkzinzow added a commit that referenced this pull request Nov 8, 2023
…tion` reads (#1578)

This PR adds support for the Python I/O layer to `MicroPartition` reads, thereby adding support for reading `MicroPartition`s from JSON files via the scan operator path.

As a drive-by, this PR also fixes a column-ordering bug when out-of-order column projections are provided to our native CSV reader.

This PR is stacked on top of #1559.
Labels: enhancement (New feature or request)
Projects: none yet
Participants: 3