-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Query optimization for JOINs to not default to SELECT subqueries (for LazyTbl) #347
Conversation
Hey--thanks for exploring this. I realized I needed to set github actions to run on pull requests. Do you mind merging stable back into this PR, or adding the line below? https://github.com/machow/siuba/blob/stable/.github/workflows/ci.yml#L5 That way I can check whether tests are passing! |
Added the line for the tests. As for the issue with the |
…forced) keys for the on argument.
…ing, we get an equivalent select to the initial one
Additionally, again about the |
… to not crash when callable is passed.
Hey, sorry for not giving a more in depth response! I should have more time to dig in this weekend. There is a sql_on argument in the SQL version of join that takes a function with arguments lhs, rhs for the two SQLAlchemy tables. (Lifted from dbplyr :). For your earlier questions I'd say..
I think some of the unit tests might have failed. Do you mind checking them? To run locally, you should be able to spin up the databases with |
I'm running into issues getting the tests to run and I think it has something to do with me being on Mac. It's more of an issue with the docker-compose setting up the databases. It says that the containers are created and then a bunch of warning s and errors. Errors being that some binaries of some database are not supported. Initially it didn't even create the containers, so I ran a "fix"
I suppose that due to this a large chunk of the tests error out. We could get tests to run here each time I commit as a possible workaround. |
Hey, sorry for sleeping on this. I've refactored the SQL tooling so it doesn't do an initial select, and produces much cleaner show_query results! Thanks for taking the time to work on this. Going to close for now, but definitely please raise an issue if anything more is needed :). here's an example: from siuba.data import cars_sql
from siuba import *
cars_sql >> inner_join(_, cars_sql, "cyl") >> select(_.cyl, _.mpg_x) >> show_query() output:
|
See Issue 346.
The current implementation of JOIN uses SELECT subqueries for the joined instances. The idea here is to abstain from doing that.
A few issues that will need to be resolved:
on
argument becomes more ambiguous since when we have consecutive joins, we would need to specify on which table's column we're joining on. The current implementation seems to be fine with accepting a dictionary{"<table1>_<column1>" : "<table2>_<column2>"}
. However this could be an annoying/bothersome/confusing notation in some cases (especially if tables/columns have underscores in their names)