Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] fix: ignore aliases on where clauses #9854

Closed
wants to merge 1 commit into from

Conversation

villebro
Copy link
Member

@villebro villebro commented May 20, 2020

SUMMARY

On BigQuery, some WHERE clauses on nested fields of RECORD type columns were failing due to aliases being added to all columns in the WHERE clause. This is in violation of the BigQuery spec, but also more generally inconsistent with the ANSI SQL spec:

You cannot reference column aliases from the SELECT list in the WHERE clause.

Source: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#where_clause

Standard SQL disallows references to column aliases in a WHERE clause. This restriction is imposed because when the WHERE clause is evaluated, the column value may not yet have been determined.

Source: https://dev.mysql.com/doc/refman/8.0/en/problems-with-alias.html

While the BigQuery SqlAlchemy dialect should ignore the aliases automatically (see the linked issue and PR on pybigquery), this PR disables aliasing in WHERE clauses for all engines, and fixes the immediate problem for BigQuery. I propose getting this merged ASAP to fix the observed BigQuery bug, but recommend adding unit tests later for this as we add proper tests for officially supported database engines.

Links to related pybigquery issues/PRs:

BEFORE

image
Note the alias being referenced in the WHERE clause

AFTER

image
Note the original field name being referenced in the WHERE clause, with the alias being referenced by the GROUPBY clause as per ANSI SQL spec.

TEST PLAN

CI

ADDITIONAL INFORMATION

FYI: @esilver

Comment on lines -1071 to +1083
if df is not None and not df.empty:
if df is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During testing I noticed that column names were not being renamed back to their original format if the dataframe was empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unrelated but I thought we banished df being None.

@codecov-commenter
Copy link

codecov-commenter commented May 20, 2020

Codecov Report

Merging #9854 into master will increase coverage by 0.12%.
The diff coverage is 88.23%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9854      +/-   ##
==========================================
+ Coverage   71.02%   71.14%   +0.12%     
==========================================
  Files         583      588       +5     
  Lines       30634    30794     +160     
  Branches     3165     3238      +73     
==========================================
+ Hits        21758    21909     +151     
- Misses       8765     8774       +9     
  Partials      111      111              
Flag Coverage Δ
#cypress 54.05% <ø> (+0.55%) ⬆️
#javascript 59.30% <ø> (-0.01%) ⬇️
#python 71.27% <88.23%> (+<0.01%) ⬆️
Impacted Files Coverage Δ
superset/connectors/sqla/models.py 88.63% <88.23%> (+0.03%) ⬆️
...explore/components/AdhocMetricEditPopoverTitle.jsx 84.61% <0.00%> (-8.72%) ⬇️
...ponents/AdhocFilterEditPopoverSimpleTabContent.jsx 86.50% <0.00%> (-7.20%) ⬇️
.../src/explore/components/AdhocFilterEditPopover.jsx 74.46% <0.00%> (-4.26%) ⬇️
...rontend/src/visualizations/FilterBox/FilterBox.jsx 70.83% <0.00%> (-3.34%) ⬇️
.../src/dashboard/components/gridComponents/Chart.jsx 87.64% <0.00%> (-2.25%) ⬇️
.../src/explore/components/controls/SelectControl.jsx 93.33% <0.00%> (-1.97%) ⬇️
...erset-frontend/src/components/ListView/Filters.tsx 88.88% <0.00%> (-1.68%) ⬇️
...t-frontend/src/dashboard/actions/dashboardState.js 58.00% <0.00%> (-1.34%) ⬇️
superset-frontend/src/explore/controls.jsx 80.95% <0.00%> (-1.20%) ⬇️
... and 43 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c117e22...46ae952. Read the comment docs.

@@ -167,15 +167,26 @@ def is_temporal(self) -> bool:
self.type, utils.DbColumnType.TEMPORAL
)

def get_sqla_col(self, label: Optional[str] = None) -> Column:
label = label or self.column_name
def get_sqla_col(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m kind of surprised that SQLAlchemy isn’t dealing with the labels correctly. Are we not constructing the SQL query in the appropriate manner?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the dialect should do this, but apparently the BQ dialect isn't handling this properly (it does it correctly for all other operators except LIKE).

Copy link
Member

@john-bodley john-bodley May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro do you think there is merit in trying to fix the BigQuery DB-API instead, i.e., overriding the visit_label method? Note I've had to do this for a custom dialect.

This would ensure that i) this would be fixed for all BigQuery use cases (and not just those emanating from Superset), and ii) reduce the need for custom Superset overrides.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it, but laziness got the best of me. But why not, let me poke at it real quick and see if it can be fixed there.

Copy link
Member

@john-bodley john-bodley May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've played with this in the past by forcing render_label_as_label=None when calling the super(). I think this may be the issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointers @john-bodley . I just forked it, will hack on it this weekend.

Comment on lines -1071 to +1083
if df is not None and not df.empty:
if df is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat unrelated but I thought we banished df being None.

@villebro villebro changed the title fix: ignore aliases on where clauses [WIP] fix: ignore aliases on where clauses May 21, 2020
@villebro
Copy link
Member Author

@esilver I will be closing this for now, as this is a bug in pybigquery, hence it should be fixed there. I haven't gotten around to crafting a fix yet, but will try to find the time over the coming months.

@villebro villebro closed this Jun 10, 2020
@esilver
Copy link

esilver commented Jun 24, 2020

makes sense, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filter generated alias cause bigquery error: 400 Unrecognized name
4 participants