Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Query optimization for JOINs to not default to SELECT subqueries (for LazyTbl) #347

Closed

Conversation

ivdim1999
Copy link

@ivdim1999 ivdim1999 commented Oct 15, 2021

See Issue 346.

The current implementation of JOIN uses SELECT subqueries for the joined instances. The idea here is to abstain from doing that.

A few issues that will need to be resolved:

  1. What to do with aliasing, so we still want to keep an alias of the tables by default?
  2. A bigger one - now the on argument becomes more ambiguous since when we have consecutive joins, we would need to specify on which table's column we're joining on. The current implementation seems to be fine with accepting a dictionary {"<table1>_<column1>" : "<table2>_<column2>"}. However this could be an annoying/bothersome/confusing notation in some cases (especially if tables/columns have underscores in their names)
  3. Would a standard like this need to be enforced in other places as well/an option to toggle between the current and this way could be added.

@ivdim1999 ivdim1999 requested a review from machow as a code owner October 15, 2021 13:51
@machow
Copy link
Owner

machow commented Oct 16, 2021

Hey--thanks for exploring this. I realized I needed to set github actions to run on pull requests. Do you mind merging stable back into this PR, or adding the line below?

https://github.com/machow/siuba/blob/stable/.github/workflows/ci.yml#L5

That way I can check whether tests are passing!

@ivdim1999
Copy link
Author

Added the line for the tests. As for the issue with the on argument for joins I mentioned , what would your idea for resolving it be. I could imagine accepting "<table>.<column>" would be better as this way no ambiguity with the underscores would happen. Also, there could be an idea of passing a symbolic similar to the other verbs but I'm not sure how that would work given that we have 2 tables that we're working on.

@ivdim1999
Copy link
Author

Additionally, again about the on argument for the JOINs. I can deduce that when a mapping is passed, it will combine all pairs of columns and join on column1=column2 etc. However if I want a more complicated join clause for example (r.column1 = c.column1 OR (r.column1 = 55 AND c.column1= 1)) or something similar, how should this condition be passed? There is an option to accept a callable in some functions so I assume it will be in some way by using that but what would an explicit syntax look like?

@machow
Copy link
Owner

machow commented Oct 22, 2021

Hey, sorry for not giving a more in depth response! I should have more time to dig in this weekend.

There is a sql_on argument in the SQL version of join that takes a function with arguments lhs, rhs for the two SQLAlchemy tables. (Lifted from dbplyr :).

For your earlier questions I'd say..

  1. Let's keep the aliases for now (but it could be a separate show_query simplify related PR?)
  2. Can we support the current ON argument behavior? It seems like if there are multiple joins and we use the .join() method in all cases, then it will keep adding joins on? (But I'd need to double check in SQLAlchemy 1.4).
  3. I'm sure I totally understand, but in SqlAlchemy 1.4 joins are their own object, so we might need to be careful about compatibility? I can dig a bit more this weekend!

I think some of the unit tests might have failed. Do you mind checking them? To run locally, you should be able to spin up the databases with docker-compose up and then just pytest.

@ivdim1999
Copy link
Author

I'm running into issues getting the tests to run and I think it has something to do with me being on Mac. It's more of an issue with the docker-compose setting up the databases. It says that the containers are created and then a bunch of warning s and errors. Errors being that some binaries of some database are not supported. Initially it didn't even create the containers, so I ran a "fix" docker pull --platform linux/x86_64 mysql so I'm getting the following now:

(siuba) ivandimitrov@Ivans-MBP siuba % docker-compose up               
[+] Running 2/0
 ⠿ Container siuba-db-1        Created                                                           0.0s
 ⠿ Container siuba-db_mysql-1  Created                                                           0.0s
Attaching to siuba-db-1, siuba-db_mysql-1
siuba-db-1        | 
siuba-db-1        | PostgreSQL Database directory appears to contain a database; Skipping initialization
siuba-db-1        | 
siuba-db-1        | 2021-10-22 15:29:35.047 UTC [1] LOG:  starting PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
siuba-db-1        | 2021-10-22 15:29:35.047 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
siuba-db-1        | 2021-10-22 15:29:35.047 UTC [1] LOG:  listening on IPv6 address "::", port 5432
siuba-db-1        | 2021-10-22 15:29:35.049 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
siuba-db-1        | 2021-10-22 15:29:35.052 UTC [28] LOG:  database system was shut down at 2021-10-22 15:28:49 UTC
siuba-db-1        | 2021-10-22 15:29:35.054 UTC [1] LOG:  database system is ready to accept connections
siuba-db_mysql-1  | 2021-10-22 15:29:35+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.27-1debian10 started.
siuba-db_mysql-1  | 2021-10-22 15:29:35+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
siuba-db_mysql-1  | 2021-10-22 15:29:35+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.27-1debian10 started.
siuba-db_mysql-1  | 2021-10-22T15:29:36.755522Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.27) starting as process 1
siuba-db_mysql-1  | 2021-10-22T15:29:36.850933Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
siuba-db_mysql-1  | 2021-10-22T15:29:36.874877Z 1 [ERROR] [MY-012585] [InnoDB] Linux Native AIO interface is not supported on this platform. Please check your OS documentation and install appropriate binary of InnoDB.
siuba-db_mysql-1  | 2021-10-22T15:29:36.875205Z 1 [Warning] [MY-012654] [InnoDB] Linux Native AIO disabled.
siuba-db_mysql-1  | 2021-10-22T15:29:37.102977Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
siuba-db_mysql-1  | 2021-10-22T15:29:37.752066Z 0 [Warning] [MY-013746] [Server] A deprecated TLS version TLSv1 is enabled for channel mysql_main
siuba-db_mysql-1  | 2021-10-22T15:29:37.752229Z 0 [Warning] [MY-013746] [Server] A deprecated TLS version TLSv1.1 is enabled for channel mysql_main
siuba-db_mysql-1  | 2021-10-22T15:29:37.777931Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
siuba-db_mysql-1  | 2021-10-22T15:29:37.778497Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
siuba-db_mysql-1  | 2021-10-22T15:29:37.783411Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/var/run/mysqld' in the path is accessible to all OS users. Consider choosing a different directory.
siuba-db_mysql-1  | 2021-10-22T15:29:37.894497Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
siuba-db_mysql-1  | 2021-10-22T15:29:37.896134Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.27'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  MySQL Community Server - GPL.

I suppose that due to this a large chunk of the tests error out. We could get tests to run here each time I commit as a possible workaround.

@machow
Copy link
Owner

machow commented Nov 16, 2022

Hey, sorry for sleeping on this. I've refactored the SQL tooling so it doesn't do an initial select, and produces much cleaner show_query results! Thanks for taking the time to work on this. Going to close for now, but definitely please raise an issue if anything more is needed :).

here's an example:

from siuba.data import cars_sql
from siuba import *
cars_sql >> inner_join(_, cars_sql, "cyl") >> select(_.cyl, _.mpg_x) >> show_query()

output:

SELECT cars_1.cyl, cars_1.mpg AS mpg_x
FROM cars AS cars_1 JOIN cars AS cars_2 ON cars_1.cyl = cars_2.cyl

@machow machow closed this Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants