
wip: allow scanning data in out of order #874

Merged (9 commits into main, May 18, 2023)

Conversation

@wjones127 (Contributor) commented on May 16, 2023:

Closes #861.

TODO:

  • Fix various lifetime issues
  • Make sure ordered case still has buffering
  • Expose in pylance
  • Profile performance for both the ordered and unordered cases (deferred to a follow-up)

@wjones127 (Contributor, Author) commented:

Seeing a ~2x performance improvement in the existing benchmark:

$ git checkout main
$ cargo bench --bench scan -- --save-baseline main
Scan full dataset       time:   [4.2708 ms 4.2930 ms 4.3176 ms]
                        change: [-3.3599% -0.7548% +1.9459%] (p = 0.62 > 0.10)
                        No change in performance detected.
$ git checkout wjones127/861-buffer-unordered
$ cargo bench --bench scan -- --baseline main
Scan full dataset       time:   [2.1718 ms 2.1794 ms 2.1882 ms]
                        change: [-49.475% -48.707% -47.613%] (p = 0.00 < 0.10)
                        Performance has improved.

In a follow-up, I'd like to add a benchmark for scanning multiple files so we can compare the performance of ordered and unordered scans.

@@ -235,6 +235,7 @@ def to_batches(
offset: Optional[int] = None,
nearest: Optional[dict] = None,
batch_readahead: Optional[int] = None,
ordered_scan: bool = True,
@wjones127 (author):

What do we think of this name? Is there another we'd prefer?

@eddyxu:

Sounds reasonable. 👍

Or should we name this batch_ordered? Will people read this flag as having ORDER BY-style sorting semantics?

@wjones127 (author):

I see: batch_ordered, to parallel batch_readahead. The only downside is that I intend ordered_scan to mean that both fragments and batches are scanned in order. For example, if the user passes fragments in a specific order, that order is respected as part of the scan.

Actually, I might add fragment_readahead, which will only apply to out-of-order scans. Arrow C++ has a similar option.

I'm thinking scan_in_order might be a less ambiguous name.

@eddyxu:

sgtm.
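
For context, the behavioral difference being named here maps onto the futures crate's buffered vs. buffer_unordered combinators. A minimal sketch (the scan_in_order flag and the toy stream are illustrative, not the PR's actual API):

use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Hypothetical flag standing in for the option being named above.
    let scan_in_order = false;

    // One future per "batch" to read; stand-ins for real I/O.
    let work = stream::iter(0..6).map(|i| async move { i });

    let results: Vec<i32> = if scan_in_order {
        // Ordered: up to 3 reads in flight, results yielded in input order.
        work.buffered(3).collect::<Vec<_>>().await
    } else {
        // Unordered: same concurrency, results yielded as reads complete.
        work.buffer_unordered(3).collect::<Vec<_>>().await
    };
    println!("{results:?}");
}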


def ordered_scan(self, ordered_scan: bool = True) -> ScannerBuilder:
"""
Whether to scan the dataset in order. If set to False, the scanner may
read fragments concurrently and yield batches out of order.
@eddyxu:

Could you also mention that an out-of-order scan may yield better performance, so users know the benefit of this flag?

@@ -84,6 +84,9 @@ pub struct Scanner {
/// Scan the dataset with a meta column: "_rowid"
with_row_id: bool,

/// Whether to scan in deterministic order (default: true)
ordered: bool,
@eddyxu:

Should we use batch_ordered or ordered_scan here, in case people confuse it with SQL ORDER BY semantics?

.step_by(read_size)
.map(move |start| (batch_id, start..min(start + read_size, rows_in_batch)))
});
let batch_stream = stream::iter(read_params_iter).map(move |(batch_id, range)| {
@eddyxu:

this is smart!
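
As a standalone illustration of that pattern, a hedged sketch with invented values (rows_in_batch, read_size, and batch_id are made up): each batch is split into fixed-size read ranges, with the final range clamped to the batch length.

use std::cmp::min;

fn main() {
    // Illustrative values, not taken from the PR.
    let rows_in_batch = 10;
    let read_size = 4;
    let batch_id = 7;

    // One (batch_id, range) pair per read, as in the diff above.
    let ranges: Vec<_> = (0..rows_in_batch)
        .step_by(read_size)
        .map(|start| (batch_id, start..min(start + read_size, rows_in_batch)))
        .collect();

    assert_eq!(ranges, vec![(7, 0..4), (7, 4..8), (7, 8..10)]);
}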

@eddyxu commented on May 17, 2023:

> Seeing a ~2x performance improvement in the existing benchmark: […] In a follow-up, I'd like to add a benchmark for scanning multiple files so we can compare the performance of ordered and unordered scans.

Would be curious to see the performance difference between scanning a local file on a laptop and scanning from S3.

@wjones127 marked this pull request as ready for review on May 18, 2023, 00:28.
@eddyxu left a review:

Some questions, and some of the documentation needs clarification; not a blocker. The rest LGTM.

.await
.map_err(|e| DataFusionError::from(e))
}
});
@eddyxu:

Do you need a buffered/buffer_unordered here to control the degree of parallelism?

@wjones127 (author):

That's configured in the body of try_new(), where your other comment is. This method just defines the stream of futures.
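
To illustrate that separation, a minimal sketch (make_work and the numbers are invented): the function only builds a lazy stream of futures, and the caller, as try_new() does here, decides the parallelism.

use futures::stream::{self, Stream, StreamExt};
use std::future::Future;

// Builds a lazy stream of futures; no concurrency is decided here.
fn make_work() -> impl Stream<Item = impl Future<Output = u32>> {
    stream::iter(0..8u32).map(|i| async move { i * i })
}

#[tokio::main]
async fn main() {
    // The caller picks how many futures run at once.
    let results: Vec<u32> = make_work().buffered(3).collect::<Vec<_>>().await;
    println!("{results:?}");
}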

open_file(file_fragment, project_schema.clone(), with_row_id)
})
.map_ok(move |reader| {
scan_batches(reader, read_size).buffer_unordered(batch_readahead)
@eddyxu:

So this means it buffers batch_readahead * fragment_readahead batches in memory?

How many actual threads are created in this case? It is not clear to me.

We might want to add some documentation describing the expected number of threads and the number of batches in flight.

@wjones127 (author):

> So this means it buffers batch_readahead * fragment_readahead batches in memory?

I think you're right, but that's also not what we want. It took a while to figure out how to rearrange it, but my latest commit should enforce batch_readahead across fragments: no matter how many fragments are read concurrently, only batch_readahead batches will be buffered. (I put the buffer_unordered after the flatten.)

> How many actual threads are created in this case? It is not clear to me.

As I understand it, tokio will dynamically schedule them across its thread pool, which defaults to one thread per core. From the tokio docs:

> The multi-thread scheduler executes futures on a thread pool, using a work-stealing strategy. By default, it will start a worker thread for each CPU core available on the system.

> We might want to add some documentation describing the expected number of threads and the number of batches in flight.

Agreed. This code can be a little confusing. I've added some comments.
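
To make the rearrangement concrete, a hedged sketch with toy data (three "fragments" of four "batches" each; all names and numbers invented, and the futures crate's flatten_unordered stands in for the PR's actual stream plumbing): because buffer_unordered comes after the flatten, at most batch_readahead batch reads are in flight across all fragments combined.

use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    let batch_readahead = 4;

    let batches: Vec<u32> = stream::iter(0..3u32) // three toy "fragments"
        .map(|frag| {
            // Per-fragment stream of batch futures (stand-ins for real reads).
            stream::iter(0..4u32).map(move |b| async move { frag * 10 + b })
        })
        .flatten_unordered(None) // merge all fragments' batch futures...
        .buffer_unordered(batch_readahead) // ...then cap in-flight reads globally
        .collect::<Vec<_>>()
        .await;

    println!("read {} batches", batches.len());
}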

@eddyxu (May 18, 2023):

> ...but I think my latest commit should make it so batch_readahead is enforced across fragments

I feel it's OK to limit batch_readahead to a single open fragment, which would be least surprising relative to the official pyarrow API:

https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html

> As I understand it, tokio will dynamically schedule them across its thread pool, which defaults to one thread per core.

Oh yes, my previous comment was not clear. I was wondering whether buffer_unordered + try_flatten_unordered leads to the desired batch_readahead * fragment_readahead behavior; in particular, try_flatten_unordered controls the number of fragments opened.

@eddyxu:

It seems to work as desired after reading the docs. Let's merge. :)
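
For readers following along, a sketch of the full arrangement as discussed in this thread (knob values and toy data invented; buffered plus flatten_unordered from the futures crate approximate the open/flatten steps): fragment_readahead bounds concurrent fragment opens, and batch_readahead bounds in-flight batch reads overall.

use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    let fragment_readahead = 2; // bound on concurrent fragment "opens"
    let batch_readahead = 4;    // bound on in-flight batch reads overall

    let batches: Vec<u32> = stream::iter(0..5u32) // five toy "fragments"
        .map(|frag| async move {
            // Stand-in for opening a fragment: resolves to its batch-future stream.
            stream::iter(0..3u32).map(move |b| async move { frag * 10 + b })
        })
        .buffered(fragment_readahead)
        .flatten_unordered(None)
        .buffer_unordered(batch_readahead)
        .collect::<Vec<_>>()
        .await;

    println!("read {} batches", batches.len());
}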

@wjones127 merged commit 8902a74 into main on May 18, 2023.
@wjones127 deleted the wjones127/861-buffer-unordered branch on May 18, 2023, 15:12.
Linked issue: [I/O] Allow a flag for Scanner::to_batches() to return batches in non-deterministic order (#861).