-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make CREATE EXTERNAL TABLE
format options consistent, remove special syntax for HEADER ROW
, DELIMITER
and COMPRESSION
#10404
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this clean-up PR @berkaysynnada. This almost closes the loop on getting us to a consistent API w.r.t. I/O option handling (e.g. format options). With this PR, the only "weird" thing remaining in the API will be the WITH HEADER ROW
top-level specifier (which it shouldn't be, as it is a format-specific option (CSV)).
We organized this PR to keep WITH HEADER ROW
and gave it consistent semantics w.r.t. its interaction with format
options. Maybe can add a deprecation warning in a follow-on PR and remove it entirely in the future.
@alamb, it'd be great if you can take a look as we discussed this before. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada and @ozankabak
This PR looks great to me other than removing support for DELIMTER
and COMRESSION
in the CREATE EXTERNAL TABLE
syntax (and I almost missed that change when skimming this PR initially)
What I suggest is:
- Update this PR to just fix the has_header bug in Overwritten Format Configs by CreateExternalTable Options #9945
- Start a new discussion / ticket about removing / deprecating the
WITH HEADER ROW
/COMPRESSION
,DELIMITER
syntax support.
However in the long term, I think removing also WITH HEADER ROW option will be a better design and help having more clear UX.
I don't really have a strong preference myself, but given how long the other syntax has been around I think other people might have
Thanks for taking a look. Do you suggest having a similar mechanism for I fear the longer we support the current syntax, the harder it will be to fix it. We may end up being stuck with CSV-related options living at the top level.
Makes sense, let's do this |
Yes, that is my suggestion
I agree it will make the code more complex
I agree -- I will file a follow on ticket discuss removing the options |
Filed #10414 to discus removing these options |
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
…i/datafusion-upstream into feature/format-options
d751320
to
9d81601
Compare
This reverts commit 9d81601.
@@ -66,7 +66,6 @@ CREATE EXTERNAL TABLE test_table ( | |||
date_col DATE | |||
) | |||
STORED AS PARQUET | |||
WITH HEADER ROW |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to specify header rows for parquet
fec33b7
to
fe505a2
Compare
This PR is now ready for review. By employing that method, users can choose to maintain the existing behavior if desired. |
@@ -68,18 +60,6 @@ CREATE EXTERNAL TABLE t STORED AS CSV STORED AS PARQUET LOCATION 'foo.parquet' | |||
statement error DataFusion error: SQL error: ParserError\("LOCATION specified more than once"\) | |||
CREATE EXTERNAL TABLE t STORED AS CSV LOCATION 'foo.csv' LOCATION 'bar.csv' | |||
|
|||
# Duplicate `WITH HEADER ROW` clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not applicable anymore
@@ -52,14 +52,6 @@ CREATE EXTERNAL TABLE t STORED AS CSV WITH HEADER LOCATION 'abc' | |||
statement error DataFusion error: SQL error: ParserError\("Expected BY, found: LOCATION"\) | |||
CREATE EXTERNAL TABLE t STORED AS CSV PARTITIONED LOCATION 'abc' | |||
|
|||
# Missing `TYPE` in COMPRESSION clause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not applicable anymore
constraints: vec![], | ||
}); | ||
expect_parse_ok(sql, expected)?; | ||
|
||
// positive case: it is ok for case insensitive sql stmt with `WITH HEADER ROW` tokens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not applicable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did one more round of review and it looks good to me. Seems like @berkaysynnada threw in a minor improvement of being able to use non-quoted strings in the OPTIONS
clause as well, which is great. Creating warnings when WITH HEADER ROW
is used and directing users to the OPTIONS
clause is also great. Great work @berkaysynnada 🚀
@alamb, feel free to merge once you take a look and all looks good.
I plan to review this again later today. Thanks @berkaysynnada and @ozankabak |
CREATE EXTERNAL TABLE
format options consistent, remove special syntax for HEADER ROW
, DELIMITER
and COMPRESSION
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @berkaysynnada and @ozankabak -- i looked over this PR and I think the code looks really nice 👏
From my perspective this PR also now implements the consensus from #10414 and everyone seems happy with the overall increase in consistency (and there is a workaround for anyone who needs the old special syntax -- kudos to @phillipleblanc 🙏 ).
Something else I noticed while reviewing this PR is that the available options (e.g. format.has_header
, format.delimiter
, etc do not appear to be documented anywhere -- I filed #10451 to track that
{ | ||
opt.unwrap_or(false) | ||
} else { | ||
return config_err!("format.has_header can either be true or false"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure what this error message means. Does it mean that there are inconsistent settings somehow? Perhaps we could improve the text to give some hint about how the user can fix whatever the problem is 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An error occurs here if the user would set such an option: ('format.has_header' 'x') where x is anything other than 'true' or 'false'. I'd like to emphasize that. But as you said, the message may direct the user as if a conflicting options are set. Thanks for the feedback, I am updating it now.
I also updated this PR's title to reflect the user visible changes and added the |
Thank you. We will address your review feedback and then merge afterwards. 🚀 |
08237ea
to
a817dc8
Compare
Thanks again for this work @berkaysynnada and the guidance @ozankabak |
Hey, this changed the behavior of writing and reading CSVs to be slightly inconsistent: now writing defaults to not writing a header, while reading defaults to having a header. Previously both defaulted to having a header, ie. the write side has flipped. I noticed as our test started failing - a simplified version below, the following worked previously, after this commit it fails:
but can be fixed by adding Doesn't really matter to me how it works, though I find the mismatch somewhat surprising - but mainly I just wanted to flag this in case this was not expected/intentional change. |
Hmm, this shouldn't be the case. The logic is supposed to be the following for both reads and writes:
If that is not so, we should open and issue and fix. Thanks for letting us know! |
Thanks for reporting that. It's surprising that this issue hasn't emerged before.
I believe we should also set it as false while creating it. |
…l syntax for `HEADER ROW`, `DELIMITER` and `COMPRESSION` (apache#10404) * Simplify format options * Keep PG copy from tests same * Update datafusion/common/src/config.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/src/datasource/file_format/csv.rs Co-authored-by: Andrew Lamb <[email protected]> * Remove WITH HEADER ROW * Review Part 1 * . * Fix failing tests * Revert "Fix failing tests" This reverts commit 9d81601. * Final commit * Minor * Review * Update avro.slt * Apply suggestions * Fix imports --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Mehmet Ozan Kabak <[email protected]>
Which issue does this PR close?
Closes #9945.
Closes #10414
Rationale for this change
datafusion/datafusion/core/src/datasource/listing_table_factory.rs
Lines 67 to 73 in 4bd7c13
These lines of
ListingTableFactory.create()
overwrites the config options bycmd
(CreateExternalTable options). Configs coming fromOPTIONS('format...
are discarded silently.works as expected, but
does not, since
CREATE EXTERNAL TABLE
does not haveWITH HEADER ROW
, which overwritescsv.has_header
tofalse
.To summarize:
CreateExternalTable
has 3 format specific fields:has_header
delimiter
file_compression_type
After Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382,
CreateExternalTable.options
does store these format specific configurations.What changes are included in this PR?
CreateExternalTable.delimiter
andCreateExternalTable.file_compression_type
are removed, and they are needed to be given from options, such asFor thehas_header
config, as @alamb #9945 (comment) suggests, we can sustain its usage asWITH HEADER ROW
from CreateExternalTable's scope. I refactor the code such that(WITH HEADER ROW) + (format.has_header = true) => OK with header
(WITH HEADER ROW) + (format.has_header = false) => Error
(WITH HEADER ROW) + (format.has_header = not specified) => OK with header
() + (format.has_header = true) => OK with header
() + (format.has_header = false) => Ok without header
() + (format.has_header = not_specified) => Ok without header
However in the long term, I think removing alsoWITH HEADER ROW
option will be a better design and help having more clear UX.WITH HEADER ROW is also removed to OPTIONS tab.
Are these changes tested?
Yes.
Are there any user-facing changes?
Users access format specific configurations (delimiter and compression type) via
OPTIONS()
command.