-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sqllab] force limit queries only when there is no existing limit #5023
Conversation
2894e8a
to
71ca5cc
Compare
Codecov Report
@@ Coverage Diff @@
## master #5023 +/- ##
==========================================
+ Coverage 77.5% 77.52% +0.02%
==========================================
Files 44 44
Lines 8720 8728 +8
==========================================
+ Hits 6758 6766 +8
Misses 1962 1962
Continue to review full report at Codecov.
|
superset/sql_lab.py
Outdated
@@ -171,6 +171,7 @@ def handle_error(msg): | |||
# Limit enforced only for retrieving the data, not for the CTA queries. | |||
superset_query = SupersetQuery(rendered_query) | |||
executed_sql = superset_query.stripped() | |||
SQL_MAX_ROWS = int(app.config.get('SQL_MAX_ROW', None)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None
is implicit on the get method and will make int
raise
In [1]: int(None)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll send out the fir for this.
superset/sql_lab.py
Outdated
@@ -185,7 +186,8 @@ def handle_error(msg): | |||
query.user_id, start_dttm.strftime('%Y_%m_%d_%H_%M_%S')) | |||
executed_sql = superset_query.as_create_table(query.tmp_table_name) | |||
query.select_as_cta_used = True | |||
elif (query.limit and superset_query.is_select()): | |||
elif (not query.limit and superset_query.is_select() and SQL_MAX_ROWS): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait I think you remove all handling of query.limit
where it's defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mistercrunch I tested it and it works correctly.
I changed the place where query.limit is defined upstream and it is None only when it is not specified. It is only in those cases that we append the default limit to the end.
superset/utils.py
Outdated
tokens = sql.split() | ||
try: | ||
if 'limit' in tokens: | ||
limit_pos = tokens.index('limit') + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that looks brittle here, for example limit
can exist within a subquery, or there can be line breaks or tabs as separators instead of spaces. We should use the same method / regex approach as the one I defined. Maybe that method in db_engine_spec can receive max_limit
and only apply alter if the max
_limit is lower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, true. I hadn't thought of the subquery case. But split can handle the tabs and newlines just fine. I'll go ahead to reuse your regex approach as that is more robust.
And your suggestion of adding the logic within the checking function makes sense.
8c86fa1
to
04725b0
Compare
@mistercrunch That makes sense. I took out your regex and made it a helper method in utils. Now I apply the limit upstream if it exists then we modify the query with the max value if the specified value is more than the max value or there is no specified value. Lmk what you think. |
ee75b9f
to
d7eb638
Compare
PING |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be simpler to just make apply_limit_to_sql
extract the LIMIT number, check if <
, and replace conditionally, all using pretty much the same one regex.
Then add unit tests similar to the existing ones confirming that it works as expected.
superset/utils.py
Outdated
"limit specified was not an integer: '{}'".format(specified_limit)) | ||
specified_limit = None | ||
return specified_limit | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return None
is implicit in python
superset/utils.py
Outdated
sql_without_limit = get_query_without_limit(sql) | ||
if sql_without_limit != sql: | ||
limit_clause = sql.partition(sql_without_limit)[2] | ||
limit_match = re.search('limit\s(\d+)', limit_clause, re.IGNORECASE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use the same regex as above to extract the limit number.
superset/utils.py
Outdated
""", '', sql) | ||
|
||
|
||
def get_limit_from_sql(sql): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only works for databases that uses the LIMIT
syntax. Say if you take MS SQL Server that goes for the SELECT TOP 100 * FROM ...
this piece of code doesn't apply.
This calls for moving this logic within db_engine_spec
. I'm thinking apply_limit_to_sql
should be be expected to apply a limit only if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mistercrunch As is, we will append limit <default_limit> to MS SQL Server queries.
I moved the get_limit_from_query to db_engine_spec and that should allow us override the method for dbs with different ways of specifying the limit.
superset/sql_lab.py
Outdated
@@ -171,6 +171,8 @@ def handle_error(msg): | |||
# Limit enforced only for retrieving the data, not for the CTA queries. | |||
superset_query = SupersetQuery(rendered_query) | |||
executed_sql = superset_query.stripped() | |||
SQL_MAX_ROWS = app.config.get('SQL_MAX_ROW', None) | |||
SQL_MAX_ROWS = int(SQL_MAX_ROWS) if SQL_MAX_ROWS else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not assume that SQL_MAX_ROWS
is an int, why casting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I agree that this PR resolves a regression introduced in #4947 but I think (per my comment there) that adding a LIMIT to a query without the user's knowledge isn't ideal and thus it seems more sense to revert #4947. Additionally the work in #4834 with pre-fetching results future dilutes the need to define an explicit limit.
superset/sql_lab.py
Outdated
@@ -171,6 +171,8 @@ def handle_error(msg): | |||
# Limit enforced only for retrieving the data, not for the CTA queries. | |||
superset_query = SupersetQuery(rendered_query) | |||
executed_sql = superset_query.stripped() | |||
SQL_MAX_ROWS = app.config.get('SQL_MAX_ROW', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also app.config.get('SQL_MAX_ROW', None)
is equivalent to app.config.get('SQL_MAX_ROW')
as the second slot is only used to return a non-None value if the key doesn't exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
superset/utils.py
Outdated
limit_match = re.search('limit\s(\d+)', limit_clause, re.IGNORECASE) | ||
if limit_match: | ||
specified_limit = limit_match.group(1) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This try/except is unnecessary as the regex explicitly matched a an integer. If the value was a non-integer limit_match
would be None
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. Taken out completely.
superset/utils.py
Outdated
if limit_match: | ||
specified_limit = limit_match.group(1) | ||
try: | ||
specified_limit = int(specified_limit) if specified_limit else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that it's applicable because the try/except is unnecessary but there's no need to do the if/else logic as the specified_limit
is assigned to None
on line 870.
@@ -196,13 +196,11 @@ def test_run_async_query(self): | |||
self.assertEqual([{'name': 'Admin'}], df.to_dict(orient='records')) | |||
self.assertEqual(QueryStatus.SUCCESS, query.status) | |||
self.assertTrue('FROM tmp_async_1' in query.select_sql) | |||
self.assertTrue('LIMIT 666' in query.select_sql) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were these checks removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
34313e0
to
3ffe373
Compare
3ffe373
to
a4c9e76
Compare
bab9b23
to
1c5dc79
Compare
|
||
@classmethod | ||
def get_limit_from_sql(cls, sql): | ||
limit_pattern = re.compile(r""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that this logic existed previously but shouldn't we use something like sqlparse
to obtain the limit rather than using a regex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explored the sqlparse option but there were no nice way to just get the limit without recursively parsing through the query.
1c5dc79
to
cf3641f
Compare
@mistercrunch I saw the notification of your comment over email but I don't see it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a unit test that makes sure that the lower limit won't be replaced (the original issue we're tackling in this PR)?
On it! |
9ddcb28
to
e264f12
Compare
e264f12
to
a9d7faf
Compare
@john-bodley @mistercrunch Ready for stamp |
@@ -104,15 +104,32 @@ def apply_limit_to_sql(cls, sql, limit, database): | |||
) | |||
return database.compile_sqla_query(qry) | |||
elif LimitMethod.FORCE_LIMIT: | |||
no_limit = re.sub(r""" | |||
sql_without_limit = cls.get_query_without_limit(sql) | |||
return '{sql_without_limit} LIMIT {limit}'.format(**locals()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic can generate sql like
SELECT id,
name
FROM main.ab_permission
LIMIT 100
OFFSET 0 LIMIT 5000
which is wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which engine is this on?
[sqllab] force limit queries only when there is no existing limit
This PR (#4947) implemented a more efficient way to limit queries by attaching the limit statement to the statement. However, it always attaches the default limit to all queries even when a fifferent (possibly smaller) limit is specified in the query. This is because query.limit is always set to the default value, never the specified limit value.
In this PR, I do some query parsing to make sure query.limit is correctly specified upstream. Then the query augnmenting logic is only called when there no limit specified.
@mistercrunch @john-bodley @michellethomas