-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: avoid duplicate column name errors from auto-generated aliases #4827
fix: avoid duplicate column name errors from auto-generated aliases #4827
Conversation
If a source has column names in the form `KSQL_COL_x`, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example, ```sql -- schema: ID -> NAME CREATE STREAM S1 (ID INT KEY, NAME STRING) WITH (kafka_topic='input', value_format='JSON'); -- schema: ID -> KSQL_COL_0, NAME CREATE STREAM S2 AS SELECT LEN(STRING), NAME FROM S1; -- both of the following result in a duplicate column error as two columns end up being called KSQL_COL_0: SELECT UCASE(NAME), * FROM S2; SELECT UCASE(NAME), KSQL_COL_0 FROM S2; ``` The issue is that the generation of aliases does not take into account any existing columns in the source(s). This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources. With this change the above queries work: ```sql -- schema: KSQL_COL_1, KSQL_COL_0, NAME SELECT UCASE(NAME), * FROM S2; -- schema: KSQL_COL_1, KSQL_COL_1 SELECT UCASE(NAME), KSQL_COL_0 FROM S2; ``` BREAKING CHANGE: * Any existing persistent queries, e.g. those created with `CREATE STREAM AS SELECT`, `CREATE TABLE AS SELECT` or `INSERT INTO`, will be unaffected: their column names will not change. * Pull queries will be unaffected. * Push queries, which rely on auto-generated column names, may see a change in column names.
Tests that needed updating to pass with the new column names.
Conflicting files ksqldb-common/src/main/java/io/confluent/ksql/name/ColumnNames.java ksqldb-common/src/test/java/io/confluent/ksql/name/ColumnNamesTest.java
You use a |
.boxed() | ||
.collect(Collectors.toSet()); | ||
|
||
return new AliasGenerator(0, used)::next; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if we pass 1 here we should get names more consistent with what we had before (start at offset 1), and a smaller test diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that too and tried it. However, the old code started at zero too. If we switch this to 1 we just get a different set of tests to change.
private int selectItemIndex = 0;
The changes are because the name is no longer controlled by the index of the select expresssion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one optional nit inline
We don't control the column names in the source. It's therefore possible that even a single source could have a column names in the form CREATE STREAM FOO (KSQL_COL_0 INT KEY, KSQL_COL_2147483647 NAME) WITH (...);
SELECT KSQL_COL_0, ABS(KSQL_COL_0) FROM FOO; The above will fail with your proposed strategy as both columns will be called KSQL_COL_0. While we don't control the name of columns, it's unlikely that there will be millions of columns! Hence, tracking the columns we know we can't use is the easiest solution. |
Description
If a source has column names in the form
KSQL_COL_x
, e.g. like those created is a select expression isn't a column reference and isn't aliased, then the existing code can result in an error about duplicate column names if a downstream query also relies on auto-generated column names. For example,The issue is that the generation of aliases does not take into account any existing columns in the source(s).
This commit resolves this issue. Queries now start their generated alias index as one-more-than the maximum of any generated alias in any of the sources.
With this change the above queries work:
BREAKING CHANGE:
Potential breaking change for any query started before version 0.8:
CREATE STREAM AS SELECT
,CREATE TABLE AS SELECT
orINSERT INTO
Testing done
Usual
Reviewing notes:
Reviewer checklist