Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type inference func #268

Merged
merged 26 commits into from
Jun 28, 2021
Merged

Type inference func #268

merged 26 commits into from
Jun 28, 2021

Conversation

eito-fis
Copy link
Contributor

@eito-fis eito-fis commented Jun 18, 2021

Related to #200

Adds a function to db that takes a table and returns a list of inferred types, corresponding to the columns of the passed table.

Technical details
Updates the old infer_table_column_types to update_table_column_types, and writes a new infer_table_column_types. Uses a CREATE TABLE AS postgres statement to copy the passed table into a temporary table, which is then passed to update_table_column_types. Finally, we extract the types from the temp table, drop the temp table, and return the types.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@eito-fis
Copy link
Contributor Author

Will look to add a few more tests before opening for review.

@kgodey
Copy link
Contributor

kgodey commented Jun 18, 2021

@eito-fis I don't think this entirely resolves #200 so please change your note from "fixes" to "related to" so that #200 isn't auto-closed.

@eito-fis eito-fis marked this pull request as ready for review June 18, 2021 22:03
@eito-fis eito-fis requested review from a team, kgodey and pavish June 18, 2021 22:03
@eito-fis
Copy link
Contributor Author

eito-fis commented Jun 18, 2021

Should be mostly good for review, but I did have some questions:

  • Do we have a set way of creating temporary tables? Right now the function literally makes a table called temp_table, which doesn't seem safe at all.
  • Currently we drop the temp table at the end of the function, which means that an error earlier in the function leaves the table behind. Are there any strategies to mitigate this problem?
  • SQLAlchemy's Column.copy() is deprecated, but I can't find anything to replace it. Is there some functionality I'm missing?

Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to see if you can make it work with an actual DB-level TEMPORARY table. See the novella I wrote in my line-level comment for some ideas about how that might go.

db/tables.py Outdated
Comment on lines 367 to 377
temp_name = "temp_table"
temp_full_name = schema + "." + temp_name
columns = [c.copy() for c in table.columns]
temp_table = Table(temp_name, metadata, *columns)

create_table = f"""
CREATE TABLE {temp_full_name} AS
TABLE {table.schema}.{table.name}
"""
with engine.begin() as conn:
conn.execute(DDL(create_table))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For creating the temporary table, I suggest using the recipe posted by the SQLAlchemy maintainer here: sqlalchemy/sqlalchemy#5687 , but modified with the TEMPORARY prefix. This will let you quickly create the table with the data you want to use for inference.

Your goal should be to end up with SQL along the lines of:

CREATE TEMPORARY TABLE my_temp_table AS SELECT * FROM my_orig_table;

The TEMPORARY prefix will result in the table being automatically dropped either at the end of the current transaction block, or whenever the connection is closed (when using the context manager, this should be when the outermost context-management block ends). One caveat: Temporary tables cannot be created in a schema (they exist in a system schema). For details about this schema caveat, and how to set the dropping behavior properly, see the docs here: https://www.postgresql.org/docs/13/sql-createtable.html .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative route would be to try to copy the table in SQLAlchemy (i.e., reflect the table, and then copy the object). You'd then do an insert().from_select(...) sort of statement: https://docs.sqlalchemy.org/en/14/core/dml.html?highlight=from_select#sqlalchemy.sql.expression.Insert.from_select .

I suspect that won't be as nice in the end, and it'll be a bit less efficient. If you want to try, you can use prefixes=["TEMPORARY"] in your temp table definition, and then use the table.create(engine) method of the SA table object (see create_mathesar_table for an example of idiomatic table creation). Once the table is created, use the insert().from_select(...) statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to use the DDLElement and MathesarColumns, but am struggling with getting the proper temporary table implementation working. Two issues so far:

  • Problems reflecting the table after the type has been changed. Ideally we should be able to do this with Table(autoload_with=conn), but I'm running into a a transaction action has already begun error. Making sure to run the type inference outside the initial with conn.begin() block, so not entirely sure whats happening here. As a temporary fix, we just pass the MetaData through all the inference functions, but this leads to cycles since the sqlalchemy table is never updated.
  • An error in the middle of the type inference functions seems to prevent the table from being dropped. When the above error occurs and our first test fails, the subsequent tests also fail with a temp_table already exists error. Will add a test for this behavior, but not entirely sure how to fix yet.

If there's anything obvious I'm doing wrong, please let me know! Otherwise, I'll keep hacking away at this tomorrow.

@mathemancer
Copy link
Contributor

@eito-fis Regarding Column.copy, the deprecated version didn't actually copy the column in the expected way under all circumstances. So, for now, we need to do copying ourselves. This is the motivation behind the MathesarColumn object. The point of that is to constrain the parts of the column that we're aware of so that we can make sure those parts get copied correctly. See the docstring of that object for more.

If you want to do anything that feels like "copying" a column, I advise first making it a MathesarColumn, then creating a copy by calling MathesarColumn.from_column on that column. (or just using MathesarColumn.from_column on the actual SA Column type). If you need to handle some property that's not included in MathesarColumn, please add it to that class and relevant methods, and then make sure that the from_column method has the expected behavior w.r.t. that property.

@eito-fis eito-fis force-pushed the type_inference_func branch from 8522776 to e17f895 Compare June 22, 2021 18:23
@eito-fis eito-fis requested a review from mathemancer June 22, 2021 18:50
@eito-fis eito-fis mentioned this pull request Jun 22, 2021
7 tasks
@eito-fis
Copy link
Contributor Author

The function should be mostly up and running now, but still running into an issue with the same_name test. Weirdly, it only fails when run after the drop_temp test. When run on its own, or with drop_temp commented out, we don't get a temp_table already exists error. Not entirely sure whats going on there, need to look into it.

@eito-fis
Copy link
Contributor Author

Updated, should be passing all relevant tests now (might still have too many clients error).

Unfortunately, we might be stuck with having to manually drop the temp table. I did try using ON COMMIT DROP instead, but ran into problems. Doing so means we have to execute all type inference commands in a single transaction. But, we check types by intentionally causing the transaction to error and psycopg2 seems to prevent additional commands from executing if an error has already been thrown with a current transaction is aborted, commands ignored until end of transaction block error. Not sure if we can get around this without rewriting the type inference code.

Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider changing the temp table to a standard table to avoid confusion. If you think there's still utility to creating it as a temp table, please document why in a comment. Long term, I think we should try to get that working for safety, but I think we can push it till a future issue.

Also, if we aren't using a temp table, I think it should be possible to clean up some (lots) of the ickier context management that's necessary to get access to the temp table. Is that correct? If so, I think it's an even better argument for just giving up on the temp table idea (for the moment) and changing to a standard table.

db/tables.py Outdated

def infer_table_column_types(schema, table_name, engine):
table = reflect_table(table_name, schema, engine)
temp_name = "temp_table"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be randomly-chosen on the fly (or random-ish: Think temp_table_<uuid> or something similar). The reason is that PostgreSQL puts temp tables into a kind of "global" space, and while they're not accessible from other sessions, they block creation of a table with the same name in those session. This would lead to problems if more than one client is connected, trying to use this function.

db/tables.py Outdated
Comment on lines 439 to 443
with engine.connect() as conn:
with conn.begin():
conn.execute(CreateTempTableAs(temp_name, select_table))
with conn.begin():
temp_table = reflect_table(temp_name, None, conn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems from this that the temp table is persisting over these connections. In that case, there isn't really a point in making it an actual temporary table. I.e., we could just use a standard table at that point, which would make some things slightly simpler (and allow for namespacing the created table). It also avoids confusion for our future selves, trying to figure out why we need to drop a temp table.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but I believe this shows the table persisting across several transactions within a single connection? However, given the issue with the connection pooling (see comment below), I agree that there isn't a point to using temp tables. For namespacing, are there schemas we have reserved for this sort of work?

db/tables.py Outdated
Comment on lines 452 to 457
# Ensure the temp table is deleted
with conn.begin():
temp_table.drop()
raise e
with conn.begin():
temp_table.drop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. If we can't get the temp table to work (since SQLAlchemy won't give up connections), we should probably just use a normal table to keep from confusing people.

Comment on lines 36 to 40
with ExitStack() as stack:
if conn is not None:
stack.enter_context(conn.begin())
else:
conn = stack.enter_context(engine.begin())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own information: Why is this move necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To access the temp table, we have to start our transaction using the connection that made the temp table. So here we start the transaction using a connection if there is a connection, and the engine otherwise. (Hopefully will be getting rid of this with the move from temp tables though.)

@eito-fis
Copy link
Contributor Author

Heres a snippet to confirm the issue is with connection pools for future reference:

 with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))
with engine.connect() as conn:
    print(conn == conn) # True
    with conn.begin(): 
        temp_table = reflect_table(temp_name, None, conn) # No error
 with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))

# Force a set of new connections
engine.pool.dispose()
engine.pool = engine.pool.recreate()

with engine.connect() as conn:
    print(_conn == conn) # False
    with conn.begin(): 
        temp_table = reflect_table(temp_name, None, conn) # sqlalchemy.exc.NoSuchTableError: temp_table

The problem being that the temp tables are default tied to a connection, but since connections are re-used the temp table isn't dropped. We could force the connection we used to be recreated, but that ends up being more work than using a non-temp table and making sure to drop it. I'll go ahead and migrate to using non-temp tables.

@eito-fis eito-fis force-pushed the type_inference_func branch from 30779fb to 9665ca4 Compare June 24, 2021 19:15
@eito-fis
Copy link
Contributor Author

Updated to use non-temp files with {MATHESAR_PREFIX}temp_schema and {MATHESAR_PREFIX}temp_table_{current_epoch} as the schema and table name, respectively. Now also ensures that the table name is unique. Assuming we protect schemas from starting with MATHESAR_PREFIX, I think we should be safe from table name collisions.

@eito-fis eito-fis requested a review from mathemancer June 24, 2021 21:39
Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good for now. We should reassess using temp tables after the completion of #280 and any other pressing engine or connection modifying work. Long term, we need a way to make absolutely sure that this function can't leave random tables littered about the DB under any conditions.

@mathemancer mathemancer merged commit 0bc1a97 into master Jun 28, 2021
@mathemancer mathemancer deleted the type_inference_func branch June 28, 2021 03:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants