Type inference func #268
Conversation
Will look to add a few more tests before opening for review.

Should be mostly good for review, but I did have some questions:
Please try to see if you can make it work with an actual DB-level `TEMPORARY` table. See the novella I wrote in my line-level comment for some ideas about how that might go.
db/tables.py (Outdated)

```python
temp_name = "temp_table"
temp_full_name = schema + "." + temp_name
columns = [c.copy() for c in table.columns]
temp_table = Table(temp_name, metadata, *columns)

create_table = f"""
CREATE TABLE {temp_full_name} AS
TABLE {table.schema}.{table.name}
"""
with engine.begin() as conn:
    conn.execute(DDL(create_table))
```
For creating the temporary table, I suggest using the recipe posted by the SQLAlchemy maintainer here: sqlalchemy/sqlalchemy#5687, but modified with the `TEMPORARY` prefix. This will let you quickly create the table with the data you want to use for inference.

Your goal should be to end up with SQL along the lines of:

```sql
CREATE TEMPORARY TABLE my_temp_table AS SELECT * FROM my_orig_table;
```

The `TEMPORARY` prefix will result in the table being automatically dropped either at the end of the current transaction block, or whenever the connection is closed (when using the context manager, this should be when the outermost context-management block ends). One caveat: temporary tables cannot be created in a schema (they exist in a system schema). For details about this schema caveat, and how to set the dropping behavior properly, see the docs here: https://www.postgresql.org/docs/13/sql-createtable.html.
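A minimal sketch of that recipe with the `TEMPORARY` prefix added (the class and table names here are illustrative, not from the PR, and SQLAlchemy 1.4 is assumed):

```python
# Sketch of the custom-DDL recipe from sqlalchemy/sqlalchemy#5687,
# modified with a TEMPORARY prefix. Names are illustrative.
from sqlalchemy import column, select, table
from sqlalchemy.ext.compiler import compiles
from sqlalchemy.sql.expression import ClauseElement, Executable


class CreateTempTableAs(Executable, ClauseElement):
    def __init__(self, name, selectable):
        self.name = name
        self.selectable = selectable


@compiles(CreateTempTableAs)
def _compile_create_temp_table_as(element, compiler, **kw):
    # Render the wrapped SELECT and prepend the CREATE TEMPORARY TABLE clause.
    return "CREATE TEMPORARY TABLE %s AS %s" % (
        element.name,
        compiler.process(element.selectable, **kw),
    )


orig = table("my_orig_table", column("id"))
stmt = CreateTempTableAs("my_temp_table", select(orig))
# conn.execute(stmt) would emit SQL along the lines of:
#   CREATE TEMPORARY TABLE my_temp_table AS SELECT ... FROM my_orig_table
```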
An alternative route would be to try to copy the table in SQLAlchemy (i.e., reflect the table, and then copy the object). You'd then do an `insert().from_select(...)` sort of statement: https://docs.sqlalchemy.org/en/14/core/dml.html?highlight=from_select#sqlalchemy.sql.expression.Insert.from_select.

I suspect that won't be as nice in the end, and it'll be a bit less efficient. If you want to try, you can use `prefixes=["TEMPORARY"]` in your temp table definition, and then use the `table.create(engine)` method of the SA table object (see `create_mathesar_table` for an example of idiomatic table creation). Once the table is created, use the `insert().from_select(...)` statement.
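A rough sketch of that route (hedged: the table and column names are invented for illustration, and `Column(c.name, c.type)` is used in place of `Column.copy()`):

```python
# Sketch of the reflect-and-copy route; table/column names are illustrative.
from sqlalchemy import Column, Integer, MetaData, Table, Text, select

metadata = MetaData()
orig = Table(
    "my_orig_table", metadata,
    Column("id", Integer),
    Column("value", Text),
)

# Copy the column definitions and mark the new table TEMPORARY,
# so temp.create(...) emits CREATE TEMPORARY TABLE.
temp = Table(
    "my_temp_table", metadata,
    *[Column(c.name, c.type) for c in orig.columns],
    prefixes=["TEMPORARY"],
)

# After temp.create(engine), copy the rows over in one statement:
copy_rows = temp.insert().from_select(
    [c.name for c in orig.columns], select(orig)
)
```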
I updated to use the `DDLElement` and `MathesarColumn`s, but am struggling with getting the proper temporary table implementation working. Two issues so far:

- Problems reflecting the table after the type has been changed. Ideally we should be able to do this with `Table(autoload_with=conn)`, but I'm running into a `transaction already begun` error. I'm making sure to run the type inference outside the initial `with conn.begin()` block, so I'm not entirely sure what's happening here. As a temporary fix, we just pass the `MetaData` through all the inference functions, but this leads to cycles since the SQLAlchemy table is never updated.
- An error in the middle of the type inference functions seems to prevent the table from being dropped. When the above error occurs and our first test fails, the subsequent tests also fail with a `temp_table already exists` error. Will add a test for this behavior, but not entirely sure how to fix yet.

If there's anything obvious I'm doing wrong, please let me know! Otherwise, I'll keep hacking away at this tomorrow.
@eito-fis Regarding that: if you want to do anything that feels like "copying" a column, I advise first making it a …
Force-pushed from 8522776 to e17f895.
The function should be mostly up and running now, but still running into an issue with the …
Updated, should be passing all relevant tests now (might still have …

Unfortunately, we might be stuck with having to manually drop the temp table. I did try using …
Please consider changing the temp table to a standard table to avoid confusion. If you think there's still utility in creating it as a temp table, please document why in a comment. Long term, I think we should try to get that working for safety, but I think we can push it to a future issue.

Also, if we aren't using a temp table, I think it should be possible to clean up some (lots) of the ickier context management that's necessary to get access to the temp table. Is that correct? If so, I think it's an even better argument for just giving up on the temp table idea (for the moment) and changing to a standard table.
db/tables.py (Outdated)

```python
def infer_table_column_types(schema, table_name, engine):
    table = reflect_table(table_name, schema, engine)
    temp_name = "temp_table"
```
I think this should be randomly chosen on the fly (or random-ish: think `temp_table_<uuid>` or something similar). The reason is that PostgreSQL puts temp tables into a kind of "global" space, and while they're not accessible from other sessions, they block creation of a table with the same name in those sessions. This would lead to problems if more than one client is connected, trying to use this function.
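For example, a minimal sketch of such a name generator (the helper name is invented):

```python
# Sketch: generate a collision-resistant temp table name per call.
import uuid


def get_temp_table_name():
    # uuid4().hex is 32 lowercase hex characters, so the result is a
    # valid unquoted PostgreSQL identifier well under the 63-byte limit.
    return "temp_table_" + uuid.uuid4().hex
```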
db/tables.py (Outdated)

```python
with engine.connect() as conn:
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))
    with conn.begin():
        temp_table = reflect_table(temp_name, None, conn)
```
It seems from this that the temp table is persisting over these connections. In that case, there isn't really a point in making it an actual temporary table. I.e., we could just use a standard table at that point, which would make some things slightly simpler (and allow for namespacing the created table). It also avoids confusion for our future selves, trying to figure out why we need to drop a temp table.
Correct me if I'm wrong, but I believe this shows the table persisting across several transactions within a single connection? However, given the issue with the connection pooling (see comment below), I agree that there isn't a point to using temp tables. For namespacing, are there schemas we have reserved for this sort of work?
db/tables.py (Outdated)

```python
    # Ensure the temp table is deleted
    with conn.begin():
        temp_table.drop()
    raise e
with conn.begin():
    temp_table.drop()
```
Same as above. If we can't get the temp table to work (since SQLAlchemy won't give up connections), we should probably just use a normal table to keep from confusing people.
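One hedged sketch of what the normal-table version could look like, with a `try/finally` to guarantee cleanup (the helper and all names here are invented for illustration, not the PR's actual code):

```python
# Sketch: run work against a plain (non-temp) copy of a table and
# guarantee the copy is dropped, even if the work raises.
from sqlalchemy import Column, MetaData, Table


def with_table_copy(orig, engine, work):
    metadata = MetaData()
    copy = Table(
        orig.name + "_inference_copy", metadata,
        *[Column(c.name, c.type) for c in orig.columns],
    )
    copy.create(engine)
    try:
        return work(copy)
    finally:
        # Runs on success *and* on error, so no stray table is left behind.
        copy.drop(engine)
```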
db/types/alteration.py (Outdated)

```python
with ExitStack() as stack:
    if conn is not None:
        stack.enter_context(conn.begin())
    else:
        conn = stack.enter_context(engine.begin())
```
For my own information: Why is this move necessary?
To access the temp table, we have to start our transaction using the connection that made the temp table. So here we start the transaction using a connection if there is a connection, and the engine otherwise. (Hopefully we'll be getting rid of this with the move away from temp tables though.)

Here's a snippet to confirm the issue is with connection pools, for future reference:

```python
# Scenario 1: the pooled connection is re-used, so the temp table persists.
with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))

with engine.connect() as conn:
    print(_conn == conn)  # True
    with conn.begin():
        temp_table = reflect_table(temp_name, None, conn)  # No error

# Scenario 2: force a fresh connection, and the temp table is gone.
with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))

# Force a set of new connections
engine.pool.dispose()
engine.pool = engine.pool.recreate()

with engine.connect() as conn:
    print(_conn == conn)  # False
    with conn.begin():
        temp_table = reflect_table(temp_name, None, conn)
        # sqlalchemy.exc.NoSuchTableError: temp_table
```

The problem is that temp tables are by default tied to a connection, but since connections are re-used, the temp table isn't dropped. We could force the connection we used to be recreated, but that ends up being more work than using a non-temp table and making sure to drop it. I'll go ahead and migrate to using non-temp tables.
Force-pushed from 30779fb to 9665ca4.
Updated to use non-temp tables with …
I think this looks good for now. We should reassess using temp tables after the completion of #280 and any other pressing engine or connection modifying work. Long term, we need a way to make absolutely sure that this function can't leave random tables littered about the DB under any conditions.
Related to #200

Adds a function to `db` that takes a table and returns a list of inferred types, corresponding to the columns of the passed table.

Technical details

Updates the old `infer_table_column_types` to `update_table_column_types`, and writes a new `infer_table_column_types`. Uses a `CREATE TABLE AS` postgres statement to copy the passed table into a temporary table, which is then passed to `update_table_column_types`. Finally, we extract the types from the temp table, drop the temp table, and return the types.