Support for reading multiple sorted files per bucket #731

rahij · 2021-02-17T10:54:39Z

This PR is optimized for a smaller diff from master, as opposed to a smaller diff from the upstream PR. Single-file partitions are now an actual no-op as well.

More details in #730

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSortedMergeScanRDD.scala

rshkv · 2021-02-18T02:23:44Z

Ok. I suppose this is better. BaseFileScanIterator and FileSortedBucketScanIterator are still core code that's unreviewed upstream, which I'm really not a huge fan of. But at least FileScanRDD goes through the old code paths. Thanks for that.

I'll take a closer look at the iterators tomorrow. Think we'll want a second pair of eyes from @lwwmanning or @robert3005 as well.

rahij · 2021-02-18T11:32:24Z

The methods in BaseFileScanIterator are also copy-pasted from FileScanRDD. The main new code is the methods in FileSortedBucketScanIterator.

rshkv · 2021-02-18T14:24:04Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanIterators.scala

+          // The readFunction may read some bytes before consuming the iterator, e.g.,
+          // vectorized Parquet reader. Here we use lazy val to delay the creation of
+          // iterator so that we will throw exception in `getNext`.
+          private lazy val internalIter = readFile(file)


I think you can just pass readFile(currentFile). Why do you need private val file = currentFile?

I have not modified any code copied from FileScanRDD

Oh, never mind, this is new. I'll change it. This was how it was in the upstream PR.
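For reference, a minimal sketch of what the `lazy val` in the snippet above buys, and of passing the file in directly instead of capturing it in a separate `private val`. All names here (`LazyReadSketch`, `FileRows`, the `opened` counter and this toy `readFile`) are hypothetical stand-ins, not the PR's classes:

```scala
object LazyReadSketch {
  // Counts how often the hypothetical reader is opened.
  var opened = 0

  // Hypothetical stand-in for readFile: does some work up front (as a vectorized
  // reader might read bytes) before handing back an iterator.
  def readFile(path: String): Iterator[String] = {
    opened += 1
    Iterator(s"$path:row1", s"$path:row2")
  }

  // The path is a constructor parameter, so no extra `private val file` is needed.
  final class FileRows(path: String) extends Iterator[String] {
    // Not evaluated at construction; the first hasNext/next call triggers readFile,
    // so any exception from opening the file surfaces there.
    private lazy val internalIter = readFile(path)
    override def hasNext: Boolean = internalIter.hasNext
    override def next(): String = internalIter.next()
  }
}
```

Constructing `FileRows` does not open anything; the reader is opened once, on first use.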

rshkv · 2021-02-18T15:16:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanIterators.scala

+    // Set InputFileBlockHolder for the file block's information
+    currentFile = currentIteratorWithRow.getFile
+    InputFileBlockHolder.set(currentFile.filePath, currentFile.start, currentFile.length)


I understand that's to not break input_file_name() (which holds a thread-local of the current file). But I'm not sure that this will work if you read a batch of rows 1..N from different files and only evaluate input_file_name() afterwards. So for row 1 you might get the file name for N.

Not sure. If the tests don't cover it I guess we'll just have to trust.

Also, I wonder why this sets currentFile (which could have side effects in theory, but not in this PR) instead of assigning to a local variable. I guess in case it's read, but I only see the value being read when reading a file. (Which doesn't happen after the first hasNext.)

It's also used in nextIterator in BaseFileScanIterator? Wondering if I should just collapse both of them into a single iterator. The superclass was there from the upstream PR since it had 3 subclasses, but we only have one here. That might be easier to read.
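To make the input_file_name() concern above concrete, here is a toy sketch. It assumes a plain `ThreadLocal` as a stand-in for InputFileBlockHolder; `InputFileHolderSketch`, `Row`, `perRow` and `batched` are hypothetical names, and this says nothing about what Spark's operators actually do with batches:

```scala
object InputFileHolderSketch {
  // Hypothetical stand-in for InputFileBlockHolder: one "current file" per thread.
  private val current = new ThreadLocal[String]
  def set(path: String): Unit = current.set(path)
  def get: String = current.get()

  final case class Row(value: Int, file: String)

  // Rows in merged (sorted) order, interleaved from two files.
  val merged: Seq[Row] = Seq(Row(1, "a"), Row(2, "b"), Row(3, "a"))

  // Row-at-a-time: the holder is read right after each row is produced.
  def perRow(): Seq[String] = merged.map { r => set(r.file); get }

  // Batch-then-evaluate: all rows are pulled first, the holder is read afterwards,
  // so every row reports the file of the last row that was pulled.
  def batched(): Seq[String] = {
    merged.foreach(r => set(r.file))
    merged.map(_ => get)
  }
}
```

In this sketch `perRow()` reports the right file for each row, while `batched()` reports the last file for all of them, which is the failure mode described above.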

rshkv · 2021-02-18T15:21:25Z

sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala

+    // but those files combined together are not globally sorted. With configuration
+    // "spark.sql.sources.bucketing.sortedScan.enabled" being enabled, sort ordering
+    // is preserved by reading those sorted files in sort-merge way.


Can you explain how we assert that?

Maybe some fundamental misunderstanding on my part, but I'd expect the query plans to change if you enable bucketing.sortedScan.enabled and the inputs are sorted. (Compared to when the inputs are sorted but the flag is disabled.) Yet you didn't need to update the testBucketing method.

The BucketedTableTestSpec has a field (expectedSort) to indicate whether there should be a sort in the query plan. We pass that as the negation of whether the flag is enabled. So the testBucketing method already has this code to check the query plan:

    assert(
      joinOperator.left.find(_.isInstanceOf[SortExec]).isDefined == sortLeft,
      s"expected sort in the left child to be $sortLeft but found\n${joinOperator.left}")
    assert(
      joinOperator.right.find(_.isInstanceOf[SortExec]).isDefined == sortRight,
      s"expected sort in the right child to be $sortRight but found\n${joinOperator.right}")
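The idea behind that check can be sketched with a toy plan tree. Nothing below is a Spark class: `Plan`, `Scan`, `Sort`, `Join`, `find` and `planJoin` are hypothetical stand-ins, and `planJoin` only assumes that the planner drops the sort when the sorted scan is enabled:

```scala
object PlanSortCheckSketch {
  // Toy stand-ins for physical plan nodes.
  sealed trait Plan { def children: Seq[Plan] }
  final case class Scan(table: String) extends Plan { val children: Seq[Plan] = Nil }
  final case class Sort(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
  final case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Pre-order search for the first node matching the predicate.
  def find(plan: Plan)(p: Plan => Boolean): Option[Plan] =
    if (p(plan)) Some(plan)
    else plan.children.iterator.map(find(_)(p)).collectFirst { case Some(n) => n }

  // Hypothetical planner: adds a Sort under each join side unless the sorted scan is on.
  def planJoin(sortedScanEnabled: Boolean): Join = {
    def side(t: String): Plan = if (sortedScanEnabled) Scan(t) else Sort(Scan(t))
    Join(side("t1"), side("t2"))
  }
}
```

With the flag on, the expected-sort value is false and no `Sort` is found under either join child; with the flag off, a `Sort` is expected and found.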

rshkv · 2021-02-18T15:26:44Z

Read through the iterator logic and it looks fine to me. The mechanism seems simple enough. (Note to self: In the first hasNext we call initializeHeapWithFirstRows, which creates iterators for all files in the bucket and puts them in a priority queue. Then on every next() we get the next-priority iterator, get its row, then put the iterator back in the queue. Until all iterators are exhausted. Ok.)
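A minimal sketch of the mechanism in that note, assuming plain in-memory iterators in place of file readers. `SortedMergeSketch` and `mergeSorted` are hypothetical names, not the PR's FileSortedBucketScanIterator:

```scala
import scala.collection.mutable

object SortedMergeSketch {
  /** Merges individually sorted iterators into one sorted iterator. */
  def mergeSorted[T](inputs: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
    new Iterator[T] {
      // Built on first use, playing the role of initializeHeapWithFirstRows.
      // PriorityQueue dequeues its largest element, so the ordering on each
      // iterator's head is reversed to get the smallest head first.
      private lazy val heap: mutable.PriorityQueue[BufferedIterator[T]] = {
        val byHead = Ordering.by[BufferedIterator[T], T](_.head).reverse
        val q = mutable.PriorityQueue.empty[BufferedIterator[T]](byHead)
        // Only non-empty inputs enter the heap.
        inputs.map(_.buffered).filter(_.hasNext).foreach(it => q.enqueue(it))
        q
      }

      override def hasNext: Boolean = heap.nonEmpty

      override def next(): T = {
        val it = heap.dequeue()          // iterator with the smallest head
        val row = it.next()              // take its row
        if (it.hasNext) heap.enqueue(it) // put it back unless exhausted
        row
      }
    }
}
```

For example, `SortedMergeSketch.mergeSorted(Seq(Iterator(1, 4, 7), Iterator(2, 5), Iterator(3, 6)))` yields the values in ascending order. An iterator is only advanced after it has been dequeued, so the heap never holds an element whose sort key changes underneath it.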

lwwmanning · 2021-02-18T18:07:33Z

lgtm 👍🏼

rahij added 14 commits February 10, 2021 13:57

add supprot for reading multiple sorted files per bucket

c782604

fix conf and imports

23349ae

keep logic more similar

85320d0

move comment

8ab1f8d

fix imports

fea21b1

rm new import

0581815

fixes

10bf1db

fix tests

edbe575

smaller diff

d520e73

no-op for single file partitions

146598e

wire through options

aa233c2

fixup

15ea5cb

fixup

64f0799

update comment

501c71c

rahij requested review from rshkv and mattsills February 17, 2021 10:54

rahij added 3 commits February 17, 2021 11:07

fix style

eba82f2

move class outside

c04ef31

back to sql conf

181d76f

rshkv reviewed Feb 18, 2021

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSortedMergeScanRDD.scala

rshkv reviewed Feb 18, 2021

rshkv approved these changes Feb 18, 2021

rshkv requested review from robert3005 and lwwmanning February 18, 2021 17:58

rshkv changed the title ~~[smaller diff] add support for reading multiple sorted files per bucket~~ Support for reading multiple sorted files per bucket Feb 19, 2021

rshkv merged commit 3195ed8 into master Feb 19, 2021

rshkv deleted the rr/round2 branch February 19, 2021 10:30

rahij mentioned this pull request Mar 8, 2021

Use non-batch scans during sorted bucketed reads #738

Merged

rahij mentioned this pull request Mar 16, 2021

Support for reading multiple sorted files per bucket #742

Merged
