Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make CREATE EXTERNAL TABLE format options consistent, remove special syntax for HEADER ROW, DELIMITER and COMPRESSION #10404

Merged
merged 19 commits into from
May 13, 2024

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented May 7, 2024

Which issue does this PR close?

Closes #9945.
Closes #10414

Rationale for this change

let file_format: Arc<dyn FileFormat> = match file_type {
FileType::CSV => {
let mut csv_options = table_options.csv;
csv_options.has_header = cmd.has_header;
csv_options.delimiter = cmd.delimiter as u8;
csv_options.compression = cmd.file_compression_type;
Arc::new(CsvFormat::default().with_options(csv_options))

These lines of ListingTableFactory.create() overwrites the config options by cmd (CreateExternalTable options). Configs coming from OPTIONS('format... are discarded silently.

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
WITH HEADER ROW

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

works as expected, but

statement ok
CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')

query RRB
SELECT * FROM aggregate_simple LIMIT 3;
----

does not, since CREATE EXTERNAL TABLE does not have WITH HEADER ROW, which overwrites csv.has_header to false.

To summarize:
CreateExternalTable has 3 format specific fields:

  1. has_header
  2. delimiter
  3. file_compression_type
    After Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382, CreateExternalTable.options does store these format specific configurations.

What changes are included in this PR?

CreateExternalTable.delimiter and CreateExternalTable.file_compression_type are removed, and they are needed to be given from options, such as

OPTIONS(
    'format.delimiter' ',',
    'format.compression' 'gzip')

For the has_header config, as @alamb #9945 (comment) suggests, we can sustain its usage as WITH HEADER ROW from CreateExternalTable's scope. I refactor the code such that

(WITH HEADER ROW) + (format.has_header = true) => OK with header
(WITH HEADER ROW) + (format.has_header = false) => Error
(WITH HEADER ROW) + (format.has_header = not specified) => OK with header
() + (format.has_header = true) => OK with header
() + (format.has_header = false) => Ok without header
() + (format.has_header = not_specified) => Ok without header

However in the long term, I think removing also WITH HEADER ROW option will be a better design and help having more clear UX.

WITH HEADER ROW is also removed to OPTIONS tab.

Are these changes tested?

Yes.

Are there any user-facing changes?

Users access format specific configurations (delimiter and compression type) via OPTIONS() command.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 7, 2024
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this clean-up PR @berkaysynnada. This almost closes the loop on getting us to a consistent API w.r.t. I/O option handling (e.g. format options). With this PR, the only "weird" thing remaining in the API will be the WITH HEADER ROW top-level specifier (which it shouldn't be, as it is a format-specific option (CSV)).

We organized this PR to keep WITH HEADER ROW and gave it consistent semantics w.r.t. its interaction with format options. Maybe can add a deprecation warning in a follow-on PR and remove it entirely in the future.

@alamb, it'd be great if you can take a look as we discussed this before. Thanks

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada and @ozankabak

This PR looks great to me other than removing support for DELIMTER and COMRESSION in the CREATE EXTERNAL TABLE syntax (and I almost missed that change when skimming this PR initially)

What I suggest is:

  1. Update this PR to just fix the has_header bug in Overwritten Format Configs by CreateExternalTable Options #9945
  2. Start a new discussion / ticket about removing / deprecating the WITH HEADER ROW / COMPRESSION, DELIMITER syntax support.

However in the long term, I think removing also WITH HEADER ROW option will be a better design and help having more clear UX.

I don't really have a strong preference myself, but given how long the other syntax has been around I think other people might have

datafusion/common/src/config.rs Show resolved Hide resolved
datafusion/common/src/config.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved
datafusion/core/src/datasource/listing_table_factory.rs Outdated Show resolved Hide resolved
datafusion/sql/src/parser.rs Show resolved Hide resolved
@ozankabak
Copy link
Contributor

Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER and COMPRESSION like we did for WITH HEADER ROW? We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.

I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.

Start a new discussion / ticket about removing / deprecating the WITH HEADER ROW / COMPRESSION, DELIMITER syntax support.

Makes sense, let's do this

@alamb
Copy link
Contributor

alamb commented May 7, 2024

Thanks for taking a look. Do you suggest having a similar mechanism for DELIMITER and COMPRESSION like we did for WITH HEADER ROW?

Yes, that is my suggestion

We can do that, but it will increase code complexity and we will need maintain that code until we finally remove them to arrive at the consistent syntax we aim.

I agree it will make the code more complex

I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.

I agree -- I will file a follow on ticket discuss removing the options

@alamb
Copy link
Contributor

alamb commented May 7, 2024

I agree -- I will file a follow on ticket discuss removing the options

Filed #10414 to discus removing these options

@berkaysynnada berkaysynnada force-pushed the feature/format-options branch from d751320 to 9d81601 Compare May 9, 2024 12:13
@@ -66,7 +66,6 @@ CREATE EXTERNAL TABLE test_table (
date_col DATE
)
STORED AS PARQUET
WITH HEADER ROW
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to specify header rows for parquet

@berkaysynnada berkaysynnada force-pushed the feature/format-options branch from fec33b7 to fe505a2 Compare May 9, 2024 15:04
@berkaysynnada
Copy link
Contributor Author

This PR is now ready for review. By employing that method, users can choose to maintain the existing behavior if desired.

@@ -68,18 +60,6 @@ CREATE EXTERNAL TABLE t STORED AS CSV STORED AS PARQUET LOCATION 'foo.parquet'
statement error DataFusion error: SQL error: ParserError\("LOCATION specified more than once"\)
CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'foo.csv' LOCATION 'bar.csv'

# Duplicate `WITH HEADER ROW` clause
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable anymore

@@ -52,14 +52,6 @@ CREATE EXTERNAL TABLE t STORED AS CSV WITH HEADER LOCATION 'abc'
statement error DataFusion error: SQL error: ParserError\("Expected BY, found: LOCATION"\)
CREATE EXTERNAL TABLE t STORED AS CSV PARTITIONED LOCATION 'abc'

# Missing `TYPE` in COMPRESSION clause
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable anymore

constraints: vec![],
});
expect_parse_ok(sql, expected)?;

// positive case: it is ok for case insensitive sql stmt with `WITH HEADER ROW` tokens
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not applicable

Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did one more round of review and it looks good to me. Seems like @berkaysynnada threw in a minor improvement of being able to use non-quoted strings in the OPTIONS clause as well, which is great. Creating warnings when WITH HEADER ROW is used and directing users to the OPTIONS clause is also great. Great work @berkaysynnada 🚀

@alamb, feel free to merge once you take a look and all looks good.

@alamb
Copy link
Contributor

alamb commented May 10, 2024

I plan to review this again later today. Thanks @berkaysynnada and @ozankabak

@alamb alamb changed the title Simplify Format Options Make CREATE EXTERNAL TABLE format options consistent, remove special syntax for HEADER ROW, DELIMITER and COMPRESSION May 10, 2024
@alamb alamb added the api change Changes the API exposed to users of the crate label May 10, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada and @ozankabak -- i looked over this PR and I think the code looks really nice 👏

From my perspective this PR also now implements the consensus from #10414 and everyone seems happy with the overall increase in consistency (and there is a workaround for anyone who needs the old special syntax -- kudos to @phillipleblanc 🙏 ).

Something else I noticed while reviewing this PR is that the available options (e.g. format.has_header, format.delimiter, etc do not appear to be documented anywhere -- I filed #10451 to track that

docs/source/user-guide/sql/ddl.md Show resolved Hide resolved
datafusion/core/src/datasource/file_format/csv.rs Outdated Show resolved Hide resolved
{
opt.unwrap_or(false)
} else {
return config_err!("format.has_header can either be true or false");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what this error message means. Does it mean that there are inconsistent settings somehow? Perhaps we could improve the text to give some hint about how the user can fix whatever the problem is 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error occurs here if the user would set such an option: ('format.has_header' 'x') where x is anything other than 'true' or 'false'. I'd like to emphasize that. But as you said, the message may direct the user as if a conflicting options are set. Thanks for the feedback, I am updating it now.

datafusion/sql/src/statement.rs Show resolved Hide resolved
datafusion/sql/src/parser.rs Show resolved Hide resolved
datafusion/sql/src/statement.rs Show resolved Hide resolved
docs/source/user-guide/cli/datasources.md Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented May 10, 2024

I also updated this PR's title to reflect the user visible changes and added the api-change label

@ozankabak
Copy link
Contributor

Thank you. We will address your review feedback and then merge afterwards. 🚀

@berkaysynnada berkaysynnada force-pushed the feature/format-options branch from 08237ea to a817dc8 Compare May 13, 2024 08:30
@alamb alamb merged commit 58cc4e1 into apache:main May 13, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented May 13, 2024

Thanks again for this work @berkaysynnada and the guidance @ozankabak

@Blizzara
Copy link
Contributor

Hey, this changed the behavior of writing and reading CSVs to be slightly inconsistent: now writing defaults to not writing a header, while reading defaults to having a header. Previously both defaulted to having a header, ie. the write side has flipped.

I noticed as our test started failing - a simplified version below, the following worked previously, after this commit it fails:

#[tokio::test]
async fn write_then_read_csv() -> Result<()> {
    let test_path = test_file_path("csv/");

    let ctx = SessionContext::new();

    let df = ctx.sql("SELECT 1 AS col").await?;
    assert_eq!(df.to_owned().count().await?, 1);
    df.write_csv(&test_path, DataFrameWriteOptions::default(), None).await?;

    let df = ctx.read_csv(&test_path, CsvReadOptions::default()).await?;
    assert_eq!(df.count().await?, 1);

    Ok(())
}

but can be fixed by adding Some(CsvOptions::default().with_has_header(true)) to the write_csv (restoring old behavior), or alternatively changing read_csv to have CsvReadOptions::default().has_header(false) to make it match the write.

Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.

@ozankabak
Copy link
Contributor

ozankabak commented May 31, 2024

Hmm, this shouldn't be the case. The logic is supposed to be the following for both reads and writes:

  • Use the explicitly specified header flag, if it exists.
  • Otherwise, consult the session config and use the default value there.

If that is not so, we should open and issue and fix. Thanks for letting us know!

@berkaysynnada
Copy link
Contributor Author

Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change.

Thanks for reporting that. It's surprising that this issue hasn't emerged before. CatalogOptions's has_header value was false as default from the beginning, which is the write options in that case. However, reader's is true:

I believe we should also set it as false while creating it.

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…l syntax for `HEADER ROW`, `DELIMITER` and `COMPRESSION` (apache#10404)

* Simplify format options

* Keep PG copy from tests same

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/core/src/datasource/file_format/csv.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Remove WITH HEADER ROW

* Review Part 1

* .

* Fix failing tests

* Revert "Fix failing tests"

This reverts commit 9d81601.

* Final commit

* Minor

* Review

* Update avro.slt

* Apply suggestions

* Fix imports

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
4 participants