-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query optimization for LazyTbl join #346
Comments
Hey! There are some places where siuba knows to not create an extra subquery (for example, certain forms of mutate), but in other places the exact construction of raw subqueries hasn't been optimized. I think in most cases it shouldn't affect performance, as query optimizers are pretty good at this stuff! But there is definite room for improvement. I have a refactor on the stable branch that I think improved some aspects of composing queries. I have tried to code rules for simplifying in a few places, like the col_expression_requires_cte function. But as I recall replacing the initial select with just the table (or more likely pulling the table out of a simple select in a join) should work. Would you be open to branching off stable and taking a pass? Happy to review--these kinds of simplifications to the generated SQL are really useful! |
I'm happy to give it a go. Also I agree that most query optimizers shouldn't have issues with subqueries sql but it's less aesthetic that way, especially with many nested joins as is the case for some queries that I run. Additionally, is there an option to set table aliases as currently they're defaulted to |
I experimented with radically simplifying certain queries here (e.g. using * when appropriate), if you want to try it out! It uses the show_query(simplify=True) argument! I think the name of anonymous tables is set by sqlalchemy, so wonder if it's possible to set there? If you hit any snags don't hesitate to let me know. I'm happy to clarify anything! |
Thanks, I'll look into how sqlalchemy sets the default aliases then. |
How do you feel about having an extra parameter in the |
Although I guess if it were to be added there, it would need to be added everywhere for consistency. |
Are you saying a parameter to set the name (or prefix for the name) of an aliased table? SQLAlchemy handles the building of a query, and the compiling of it to text. I suspect setting the name of anonymous tables might happen when compiling. (See this line of the bigquery dialect for an example: https://github.com/googleapis/python-bigquery-sqlalchemy/blob/main/sqlalchemy_bigquery/base.py#L181) |
Ah, I'm seeing the argument to set alias name you mentioned in the method. I would hold off on adding an argument for the alias name to all the functions. The verbs are focused on doing the analysis, but methods like show_query(), or a SQLAlchemy visitor could always change alias names if useful! |
Ok, it would seem to be a slight hassle to add that to all the verbs as well as probably a questionable design choice (as you pointed out it's better to keep their job focused on doing the analysis). I'll take a look into show_query() in that regard then. |
This show_query should now be much more user friendly, since it no longer wraps each table in an initial select: from siuba.data import cars_sql
from siuba import *
cars_sql >> inner_join(_, cars_sql, "cyl") >> select(_.cyl, _.mpg_x) >> show_query() output:
|
I have been experimenting with various verbs and operations on
LazyTbl
. When doing joins it caught my eye that the joined tables themselves enter as subqueries that select everything from them.My code:
The SQL produced:
I imagined initially that I'll get something like:
The subqueries in the first sql can be omitted but siuba doesn't realize it. This prompted me to look at how
join
is implemented in siuba/sql/verbs.py for lazy tables. The two instances that seem to be joined in the end are:So I also looked at the implementation of
LazyTbl
in siuba/sql/verbs.py. Here the default operation is select in the listself.ops = [self.tbl.select()] if ops is None else ops
. Which leads me to believe that by default the join uses this method with the subqueries that select everything from the joined instances and it's also probably not just an issue with the translation inshow_query()
.Is this problem circumventable in some way right now by using a workaround of some sorts (as in most cases having those subqueries is completely unnecessary)?
The text was updated successfully, but these errors were encountered: