Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new configuration item listing_table_ignore_subdirectory #8565

Merged
merged 15 commits into from
Dec 22, 2023

Conversation

Asura7969
Copy link
Contributor

@Asura7969 Asura7969 commented Dec 16, 2023

Which issue does this PR close?

Closes #8524.

Rationale for this change

Consistent behavior with duckdb and hive

What changes are included in this PR?

add config listing_table_ignore_subdirectory
When scanning file paths, whether to ignore subdirectory files, ignored by default (true)

NOTE: When reading a partitioned table, listing_table_ignore_subdirectory is always equal to false, even if set to true

Are these changes tested?

add sqllogictest

Are there any user-facing changes?

ListingTableUrl::contains method add parameters

@github-actions github-actions bot added the core Core DataFusion crate label Dec 16, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 16, 2023
@@ -109,7 +109,7 @@ mod tests {
.read_parquet(
// it was reported that when a path contains // (two consecutive separator) no files were found
// in this test, regardless of parquet_test_data() value, our path now contains a //
format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add regex matching to directories?🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this question

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Asura7969 -- this code looks good to me, but I think we need some test coverage (I am not sure the test in this PR covers the changes)

Could you maybe create a sqllogictest test for this? So like write three files:

Like this

table/1.parquet
table/subdir/2.parquet
table/3.parquet

And then verify that with ignore_subdirectory to true (default) only the rows from 1.parquet and 3.parquet appear in the output but when the setting is set to false then the rows from all three files are present?

There is an example of creating and reading parquet files in a sqllogictest in these two files:

https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/parquet.slt (👋 @hiltontj -- we are already using it!)

https://github.com/apache/arrow-datafusion/blob/main/datafusion/sqllogictest/test_files/repartition_scan.slt

@@ -109,7 +109,7 @@ mod tests {
.read_parquet(
// it was reported that when a path contains // (two consecutive separator) no files were found
// in this test, regardless of parquet_test_data() value, our path now contains a //
format!("{}/..//*/alltypes_plain*.parquet", parquet_test_data()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this question

@@ -424,6 +433,13 @@ mod tests {
let b = ListingTableUrl::parse("../bar/./foo/../baz").unwrap();
assert_eq!(a, b);
assert!(a.prefix.as_ref().ends_with("bar/baz"));

let url = ListingTableUrl::parse("../foo/*.parquet").unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably missing something here, but how does this test the new code? I don't see it passing in ignore_subdirectory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's really not obvious here (actually in ListingTableUrl.contains), I would create a sqllogictest as you suggested

@@ -150,6 +150,7 @@ datafusion.execution.aggregate.scalar_update_factor 10
datafusion.execution.batch_size 8192
datafusion.execution.coalesce_batches true
datafusion.execution.collect_statistics false
datafusion.execution.ignore_subdirectory true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a name that gives some context about when the ignore_subdirectory is actually used?

For example, maybe like listing_table_ignore_subdirectory. Or maybe even it is time to create a whole new category of configuration for listing tables datafusion.listing_table.ignore_subdirectory 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used listing_table_ignore_subdirectory

--------CsvExec: file_groups={2 groups: [[WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-11-13/timestamps.csv], [WORKSPACE_ROOT/datafusion/core/tests/data/partitioned_table/c_date=2018-12-13/timestamps.csv]]}, has_header=false

query I
select count(*) from partition_csv_table;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this test is testing -- in both cases the table has 4 rows (aka there is no data in a subdirectory to ignore, right)?

Copy link
Contributor Author

@Asura7969 Asura7969 Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reading a partitioned table, listing_table_ignore_subdirectory is always equal to false, even if set to true,But this test seems a bit redundant, i will clean up

12

query I
select count(*) from t2_with_subdirectory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is cool to see the different rows but I don't understand the need for all the different tables and explain plans

I think we can get coverage by simply creating the equivalent of t2_with_subdirectory and showing that it returns 12 rows when

set datafusion.execution.listing_table_ignore_subdirectory = false;

And 9 when

set datafusion.execution.listing_table_ignore_subdirectory = true;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion, I understand

# Conflicts:
#	datafusion/sqllogictest/test_files/parquet.slt
Comment on lines +276 to +278
/// When scanning file paths, whether to ignore subdirectory files,
/// ignored by default (true), when reading a partitioned table,
/// `listing_table_ignore_subdirectory` is always equal to false, even if set to true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the description information, do you agree? @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is trying to say 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

## read partition table

CREATE EXTERNAL TABLE csv_with_timestamps (
  name VARCHAR,
  ts TIMESTAMP,
  c_date DATE,
)
STORED AS CSV
PARTITIONED BY (c_date)
LOCATION '../core/tests/data/partitioned_table';

set datafusion.execution.listing_table_ignore_subdirectory = true;

select count(*) from partition_tbale;    ## return 4

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, got it -- thank you -- I will propose a clarification in a follow on PR

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Asura7969 -- this looks great

@@ -276,6 +276,39 @@ LIMIT 10;
0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))
0 2014-08-27T14:00:00Z Timestamp(Millisecond, Some("UTC"))

# Test config listing_table_ignore_subdirectory:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Asura7969 Asura7969 changed the title Add new configuration item ignore_subdirectory Add new configuration item listing_table_ignore_subdirectory Dec 22, 2023
@alamb alamb merged commit ba46434 into apache:main Dec 22, 2023
24 checks passed
@Asura7969 Asura7969 deleted the fix_list_all_files branch December 23, 2023 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

select count(*) from '*.parquet' scans all parquet file recursively including subdirectory
2 participants