-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minor: Add repartition_file.slt end to end test for repartitioning files, and supporting tweaks #8505
Conversation
…les, and supporting tweaks
@@ -131,6 +131,10 @@ impl ListingTableUrl { | |||
if is_directory { | |||
fs::create_dir_all(path)?; | |||
} else { | |||
// ensure parent directory exists |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this I can't write into /dir/table_name/1.parquet
if /dir/table_name
doesn't exist already
This is not enabled by deafult, it only affects when the user has specified SINGLE_FILE_OUTPUT
explicitly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah, I still need to fix this, this is a weird hold over from append
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What needs to be fixed? I would like to write up a catalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no reason you should need to create these files ahead of time anymore, it was a consequence of the head request being performed to accommodate the append semantics - I can add it to my list to clean it up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in the object_store LocalFile::put
implementation should automatically create the paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR in #8539
@@ -159,6 +159,7 @@ async fn run_test_file_with_postgres(test_file: TestFile) -> Result<()> { | |||
relative_path, | |||
} = test_file; | |||
info!("Running with Postgres runner: {}", path.display()); | |||
setup_scratch_dir(&relative_path)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without this complete mode didn't clear the scratch dir resulting in confusing errors
SortPreservingMergeExec: [column1@0 ASC NULLS LAST] | ||
--CoalesceBatchesExec: target_batch_size=8192 | ||
----FilterExec: column1@0 != 42 | ||
------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..200], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:200..400], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:400..403, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:197..394]]}, projection=[column1], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually incorrect because data from file1 appended to file2 is not sorted by column1 anymore -- #8451
(note the query gets the right answer because only the first group actually makes any values for 1.parquet as it is tiny)
The test is failing because the order of files listed is not consistent. I am working on a fix for these. Update: fixed |
Thank you for the review @tustvold 🙏 |
Which issue does this PR close?
This is part of #8451
Rationale for this change
There are no good end to end tests of repartitioned scans that I could find (because the default size threshold for repartitioning is 10MB, and most tests use files smaller than this)
What changes are included in this PR?
--complete
mode didn't clear scratch files (I think @tustvold hit this at some point)SINGLE_FILE_OUTPUT
that caused it to fail if the parent directory didn't existAre these changes tested?
Yes -- there are several new tests added
Are there any user-facing changes?
Two bug fixes as described above