-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reject CREATE TABLE/VIEW with duplicate column names #13517
Conversation
57aba78
to
65e6c46
Compare
/// Whether the table is an infinite streams | ||
pub unbounded: bool, | ||
/// Table(provider) specific options | ||
pub options: HashMap<String, String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add comments what is a key and what is the value in the hashmaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. I don't know what these are yet. This is copied from CreateExternalTable. This whole struct allows public full destruction of CreateExternalTable, without allowing public construction of that struct, so that validation can take place. Would it help if i removed all these doc comments, so that documentation is checked at CreateExternalTable only?
])) | ||
)?) | ||
.unwrap_err().strip_backtrace().to_string(), | ||
"Schema error: Schema contains qualified fields with duplicate unqualified names t1.c1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I feel lower case col names are not easy to read in the error message. Will tweak the schema_err!
output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can wrap them in brackets [...]
to clearly separate from surrounding text, while making it kind of obvious we're not bothering to produce proper quoted identifiers.
65e6c46
to
be87611
Compare
DFSchema checks column for uniqueness, but allows duplicate column names when they are qualified differently. This is because DFSchema plays central role during query planning as a identifier resolution scope. Those checks in their current form should not be there, since they prevent execution of queries with duplicate column aliases, which is legal in SQL. But even with these checks present, they are not sufficient to ensure CREATE TABLE/VIEW is well structured. Table or view columns need to have unique names and there is no qualification involved. This commit adds necessary checks in CREATE TABLE/VIEW DDL structs, ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.
be87611
to
7dafc6f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constraints: Constraints::empty(), | ||
column_defaults: Default::default(), | ||
}, | ||
&CreateExternalTable::builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
/// Creates an in memory table. | ||
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)] | ||
#[non_exhaustive] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.
@@ -192,11 +193,12 @@ impl DdlStatement { | |||
|
|||
/// Creates an external table. | |||
#[derive(Debug, Clone, PartialEq, Eq)] | |||
#[non_exhaustive] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it non exhaustive? That might make it harder to write match
statements where all the fields are matched (and the compiler tells you when new ones are added)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why make it non exhaustive?
To disallow public direct construction. If we allow public direct construction, we can't enforce invariants in the factory method, since it may or may not get called.
That might make it harder to write match statements where all the fields are matched (and the compiler tells you when new ones are added)
💯 and i think this is important use-case. that's why those new into_fields()
are added
@@ -288,8 +290,234 @@ impl PartialOrd for CreateExternalTable { | |||
} | |||
} | |||
|
|||
impl CreateExternalTable { | |||
pub fn new(fields: CreateExternalTableFields) -> Result<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all the fields in CreateExternalTable
are pub
anyway, I don't think introducing CreateExternalTableFields
requires all construction to go through new
🤔
Therefore I would personally suggest removing the CreateExternalTableFields
(and equivalent for MemoryTable) and staying with the struct and the builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given all the fields in
CreateExternalTable
arepub
anyway, I don't think introducingCreateExternalTableFields
requires all construction to go throughnew
🤔
i think it's disallowed by non_exhaustive.
in fact, i did revert this PR changes in datafusion/core/src/catalog_common/listing_schema.rs
and cargo check
failed with
error[E0639]: cannot create non-exhaustive struct using struct expression
--> datafusion/core/src/catalog_common/listing_schema.rs:134:26
|
134 | &CreateExternalTable {
| __________________________^
135 | | schema: Arc::new(DFSchema::empty()),
136 | | name,
137 | | location: table_url,
... |
147 | | column_defaults: Default::default(),
148 | | },
| |_________________________^
For more information about this error, try `rustc --explain E0639`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see -- that was non obvious.
Given that what do you think then about making all the fields not pub
instead? I think that would be less confusing / more obvious.
Also I do think it is worth considering if we could make a smaller / less breaking change here -- I understand the desire to ensure only semantically correct CreateExternalTable
structures can be created, but in my opinion the extra cognative overhead of the *Field
structures makes it significantly harder to work with this code.
What if we
- Kept the CreateExternalTableBuilder (to generate errors during construction)
- Added another check somewhere when it was processed -- for example in
cmd: &CreateExternalTable,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that what do you think then about making all the fields not
pub
instead? I think that would be less confusing / more obvious.
This would be more breaking.
And we would need to add accessor for all these fields, for cases when someone wants to inspect some attributes without deconstructing
Added another check somewhere when it was processed -- for example in
datafusion/datafusion/core/src/datasource/listing_table_factory.rs
That could work, but where could this check be?
The link is currently in core byt the listing table is something we would want to move out.
And this should be the core's guarantee, regardless which table provider is used.
8065096
to
70740c6
Compare
70740c6
to
e0ecbb8
Compare
@@ -288,8 +290,234 @@ impl PartialOrd for CreateExternalTable { | |||
} | |||
} | |||
|
|||
impl CreateExternalTable { | |||
pub fn try_new(fields: CreateExternalTableFields) -> Result<Self> { | |||
let CreateExternalTableFields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonahgao I wonder if you have some time to offer an opinion on this pattern (a way to allow pattern matching but still ensure a struct
has to be valudated as part of construction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think CreateExternalTableFields
makes things too complex. It has exactly the same fields as CreateExternalTable
, and users will feel confused and need to spend time distinguishing and using them.
In my opinion, it is enough to perform all the validity checks in CreateExternalTableBuilder::build
, such as missing required fields and duplicate column names. CreateExternalTable
s typically come from the SQL planner of DataFusion, where we have already used the builder. For other scenarios, we can encourage users in the documentation to use the Builder API to create CreateExternalTable
s .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't block public construction, the downstream projects won't know to migrate to the builder API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside of not migrating is a delayed error message, right? Maybe it is ok if the error doesn't immediately happen on construction for all downstream users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The verification is in the try_new()
method used by the builder. If the builder is not used, and this new try_new
method isn't used directly either, the verification won't run.
I believe the downside of not migrating may be no error message at all.
What is the status of this PR? It appears to be stalled as we haven't reached consensus -- it also has several conflicts. Marking it as a draft as we figure out what to do with it next |
Correct. Apparently the only question was about how to enforce the verification: #13517 (comment) |
That is my personal preference |
Given no-one else was involved so far, your personal preference is a verdict. |
Reject CREATE TABLE/VIEW with duplicate column names
DFSchema checks column for uniqueness, but allows duplicate column names
when they are qualified differently. This is because DFSchema plays
central role during query planning as a identifier resolution scope.
Those checks in their current form should not be there, since they
prevent execution of queries with duplicate column aliases, which is
legal in SQL. But even with these checks present, they are not
sufficient to ensure CREATE TABLE/VIEW is well structured. Table or
view columns need to have unique names and there is no qualification
involved.
This commit adds necessary checks in CREATE TABLE/VIEW DDL structs,
ensuring that CREATE TABLE/VIEW logical plans are valid in that regard.
This PR includes
Encapsulate create table/view construction (to be able to add these checks)
Checks in create table/view construction to validate schema has no duplicate names
fixes CREATE TABLE succeeds when schema has duplicate names, resulting in a table that cannot be selected from #13487
extracted from Support duplicate column aliases in queries #13489