Type inference func #268

eito-fis · 2021-06-18T02:27:02Z

Related to #200

Adds a function to db that takes a table and returns a list of inferred types, corresponding to the columns of the passed table.

Technical details
Renames the old infer_table_column_types to update_table_column_types and adds a new infer_table_column_types. The new function uses a PostgreSQL CREATE TABLE AS statement to copy the passed table into a temporary table, which is then passed to update_table_column_types. Finally, it extracts the types from the temp table, drops the temp table, and returns the types.
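
The copy, infer, read, drop flow described above can be sketched roughly as follows. This is a minimal stand-in, not the PR's code: it uses Python's built-in sqlite3 instead of PostgreSQL and SQLAlchemy, and `guess_type`, `infer_column_types`, and `scratch_copy` are hypothetical names. The PR's update_table_column_types does its work in the database, whereas this sketch simply guesses types in Python.

```python
import sqlite3


def guess_type(values):
    """Return the first candidate type that every value can be cast to."""
    for type_name, cast in (("INTEGER", int), ("REAL", float)):
        try:
            for value in values:
                cast(value)
        except (TypeError, ValueError):
            continue  # At least one value does not fit; try the next type.
        return type_name
    return "TEXT"


def infer_column_types(conn, table_name, scratch_name="scratch_copy"):
    """Copy the table, infer one type per column from the copy, drop the copy."""
    conn.execute(f'CREATE TABLE "{scratch_name}" AS SELECT * FROM "{table_name}"')
    try:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [
            row[1] for row in conn.execute(f'PRAGMA table_info("{scratch_name}")')
        ]
        return [
            guess_type(
                [row[0] for row in conn.execute(f'SELECT "{col}" FROM "{scratch_name}"')]
            )
            for col in columns
        ]
    finally:
        # Runs on success and on error, so the copy is not left behind.
        conn.execute(f'DROP TABLE "{scratch_name}"')
```

The original table is never modified; only the scratch copy is inspected, and the `finally` clause removes it even if inference raises.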

Checklist

My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the master branch of the repository
My commit messages follow best practices.
My code follows the established code style of the repository.
I added tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

eito-fis · 2021-06-18T02:27:29Z

Will look to add a few more tests before opening for review.

kgodey · 2021-06-18T14:12:48Z

@eito-fis I don't think this entirely resolves #200 so please change your note from "fixes" to "related to" so that #200 isn't auto-closed.

eito-fis · 2021-06-18T22:34:07Z

Should be mostly good for review, but I did have some questions:

Do we have a set way of creating temporary tables? Right now the function literally makes a table called temp_table, which doesn't seem safe at all.
Currently we drop the temp table at the end of the function, which means that an error earlier in the function leaves the table behind. Are there any strategies to mitigate this problem?
SQLAlchemy's Column.copy() is deprecated, but I can't find anything to replace it. Is there some functionality I'm missing?

mathemancer

Please try to see if you can make it work with an actual DB-level TEMPORARY table. See the novella I wrote in my line-level comment for some ideas about how that might go.

mathemancer · 2021-06-21T16:04:35Z

db/tables.py

+    temp_name = "temp_table"
+    temp_full_name = schema + "." + temp_name
+    columns = [c.copy() for c in table.columns]
+    temp_table = Table(temp_name, metadata, *columns)
+
+    create_table = f"""
+    CREATE TABLE {temp_full_name} AS
+    TABLE {table.schema}.{table.name}
+    """
+    with engine.begin() as conn:
+        conn.execute(DDL(create_table))


For creating the temporary table, I suggest using the recipe posted by the SQLAlchemy maintainer here: sqlalchemy/sqlalchemy#5687 , but modified with the TEMPORARY prefix. This will let you quickly create the table with the data you want to use for inference.

Your goal should be to end up with SQL along the lines of:

CREATE TEMPORARY TABLE my_temp_table AS SELECT * FROM my_orig_table;

The TEMPORARY prefix will result in the table being automatically dropped either at the end of the current transaction block, or whenever the connection is closed (when using the context manager, this should be when the outermost context-management block ends). One caveat: Temporary tables cannot be created in a schema (they exist in a system schema). For details about this schema caveat, and how to set the dropping behavior properly, see the docs here: https://www.postgresql.org/docs/13/sql-createtable.html .
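
To make the connection scoping concrete, here is a small sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (SQLite temp tables are also private to the connection that created them, but the two databases differ in other details, so treat this only as an analogy). The function name `temp_table_visibility` and the table names are hypothetical.

```python
import sqlite3


def temp_table_visibility(db_path):
    """Report who can see a temp table: creator, another session, a later session."""
    creator = sqlite3.connect(db_path)
    other = sqlite3.connect(db_path)
    try:
        creator.execute("CREATE TABLE orig (x)")
        creator.execute("INSERT INTO orig VALUES (1), (2)")
        creator.commit()
        # Same shape as the SQL above: a temp copy filled from the original.
        creator.execute("CREATE TEMPORARY TABLE scratch AS SELECT * FROM orig")
        seen_by_creator = creator.execute("SELECT count(*) FROM scratch").fetchone()[0]
        try:
            other.execute("SELECT count(*) FROM scratch")
            seen_by_other = True
        except sqlite3.OperationalError:
            seen_by_other = False  # "no such table" in the other session
    finally:
        creator.close()
        other.close()

    # After the creating connection is closed, the temp table is gone.
    reopened = sqlite3.connect(db_path)
    try:
        reopened.execute("SELECT count(*) FROM scratch")
        survived_close = True
    except sqlite3.OperationalError:
        survived_close = False
    finally:
        reopened.close()
    return seen_by_creator, seen_by_other, survived_close
```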

An alternative route would be to try to copy the table in SQLAlchemy (i.e., reflect the table, and then copy the object). You'd then do an insert().from_select(...) sort of statement: https://docs.sqlalchemy.org/en/14/core/dml.html?highlight=from_select#sqlalchemy.sql.expression.Insert.from_select .

I suspect that won't be as nice in the end, and it'll be a bit less efficient. If you want to try, you can use prefixes=["TEMPORARY"] in your temp table definition, and then use the table.create(engine) method of the SA table object (see create_mathesar_table for an example of idiomatic table creation). Once the table is created, use the insert().from_select(...) statement.

eito-fis · 2021-06-22

I updated to use the DDLElement and MathesarColumns, but am struggling with getting the proper temporary table implementation working. Two issues so far:

Problems reflecting the table after the type has been changed. Ideally we should be able to do this with Table(autoload_with=conn), but I'm running into a "transaction action has already begun" error. I'm making sure to run the type inference outside the initial with conn.begin() block, so I'm not entirely sure what's happening here. As a temporary fix, we just pass the MetaData through all the inference functions, but this leads to cycles since the SQLAlchemy table is never updated.

An error in the middle of the type inference functions seems to prevent the table from being dropped. When the above error occurs and our first test fails, the subsequent tests also fail with a temp_table already exists error. Will add a test for this behavior, but not entirely sure how to fix yet.

If there's anything obvious I'm doing wrong, please let me know! Otherwise, I'll keep hacking away at this tomorrow.

mathemancer · 2021-06-21T17:44:14Z

@eito-fis Regarding Column.copy, the deprecated version didn't actually copy the column in the expected way under all circumstances. So, for now, we need to do copying ourselves. This is the motivation behind the MathesarColumn object. The point of that is to constrain the parts of the column that we're aware of so that we can make sure those parts get copied correctly. See the docstring of that object for more.

If you want to do anything that feels like "copying" a column, I advise first making it a MathesarColumn, then creating a copy by calling MathesarColumn.from_column on that column. (or just using MathesarColumn.from_column on the actual SA Column type). If you need to handle some property that's not included in MathesarColumn, please add it to that class and relevant methods, and then make sure that the from_column method has the expected behavior w.r.t. that property.

eito-fis · 2021-06-23T00:03:47Z

The function should be mostly up and running now, but I'm still running into an issue with the same_name test. Weirdly, it only fails when run after the drop_temp test. When run on its own, or with drop_temp commented out, we don't get a temp_table already exists error. Not entirely sure what's going on there; I need to look into it.

eito-fis · 2021-06-24T04:53:41Z

Updated, should be passing all relevant tests now (might still have too many clients error).

Unfortunately, we might be stuck with having to manually drop the temp table. I did try using ON COMMIT DROP instead, but ran into problems. Doing so means we have to execute all type inference commands in a single transaction. However, we check types by intentionally causing errors inside the transaction, and once an error has been raised, psycopg2 seems to block any further commands with a "current transaction is aborted, commands ignored until end of transaction block" error. Not sure if we can get around this without rewriting the type inference code.
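
One possible avenue, untested against this code base: PostgreSQL allows a failed statement to be undone with ROLLBACK TO SAVEPOINT so that the surrounding transaction stays usable, and SQLAlchemy exposes savepoints through Connection.begin_nested(). The sketch below only shows the savepoint mechanics, using Python's built-in sqlite3 as a stand-in. SQLite does not abort the whole transaction on an error the way PostgreSQL does, so this is an analogy rather than a reproduction of the psycopg2 error. `try_statements` is a hypothetical helper name.

```python
import sqlite3


def try_statements(conn, statements, savepoint="attempt"):
    """Run statements inside a savepoint; undo all of them if any one fails."""
    conn.execute(f"SAVEPOINT {savepoint}")
    try:
        for statement in statements:
            conn.execute(statement)
    except sqlite3.Error:
        # Undo only the work done since the savepoint; the outer
        # transaction stays open and usable.
        conn.execute(f"ROLLBACK TO SAVEPOINT {savepoint}")
        conn.execute(f"RELEASE SAVEPOINT {savepoint}")
        return False
    conn.execute(f"RELEASE SAVEPOINT {savepoint}")
    return True
```

Whether this would fit the existing inference functions without a rewrite is an open question; the sketch only shows that a failed attempt need not end the enclosing transaction.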

mathemancer

Please consider changing the temp table to a standard table to avoid confusion. If you think there's still utility to creating it as a temp table, please document why in a comment. Long term, I think we should try to get that working for safety, but I think we can push it till a future issue.

Also, if we aren't using a temp table, I think it should be possible to clean up some (lots) of the ickier context management that's necessary to get access to the temp table. Is that correct? If so, I think it's an even better argument for just giving up on the temp table idea (for the moment) and changing to a standard table.

mathemancer · 2021-06-24T17:30:46Z

db/tables.py

+
+def infer_table_column_types(schema, table_name, engine):
+    table = reflect_table(table_name, schema, engine)
+    temp_name = "temp_table"


I think this should be randomly chosen on the fly (or random-ish: think temp_table_<uuid> or something similar). The reason is that PostgreSQL puts temp tables into a kind of "global" space, and while they're not accessible from other sessions, they block creation of a table with the same name in those sessions. This would lead to problems if more than one client is connected and trying to use this function.
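
A minimal sketch of the temp_table_<uuid> idea, using only the standard library. The function name `make_temp_table_name` is hypothetical and not part of the PR.

```python
import uuid


def make_temp_table_name(prefix="temp_table"):
    """Build a table name with a random suffix, e.g. temp_table_<32 hex chars>."""
    # uuid4().hex is 32 lowercase hex characters with no dashes, so the
    # result is a plain identifier that needs no quoting.
    return f"{prefix}_{uuid.uuid4().hex}"
```

With the default prefix the name is 43 characters long, which stays under PostgreSQL's default 63-byte identifier limit.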

mathemancer · 2021-06-24T17:32:55Z

db/tables.py

+    with engine.connect() as conn:
+        with conn.begin():
+            conn.execute(CreateTempTableAs(temp_name, select_table))
+        with conn.begin():
+            temp_table = reflect_table(temp_name, None, conn)


It seems from this that the temp table is persisting over these connections. In that case, there isn't really a point in making it an actual temporary table. I.e., we could just use a standard table at that point, which would make some things slightly simpler (and allow for namespacing the created table). It also avoids confusion for our future selves, trying to figure out why we need to drop a temp table.

eito-fis · 2021-06-24

Correct me if I'm wrong, but I believe this shows the table persisting across several transactions within a single connection? However, given the issue with the connection pooling (see comment below), I agree that there isn't a point to using temp tables. For namespacing, are there schemas we have reserved for this sort of work?

mathemancer · 2021-06-24T17:34:17Z

db/tables.py

+            # Ensure the temp table is deleted
+            with conn.begin():
+                temp_table.drop()
+            raise e
+        with conn.begin():
+            temp_table.drop()


Same as above. If we can't get the temp table to work (since SQLAlchemy won't give up connections), we should probably just use a normal table to keep from confusing people.

mathemancer · 2021-06-24T17:42:14Z

db/types/alteration.py

+    with ExitStack() as stack:
+        if conn is not None:
+            stack.enter_context(conn.begin())
+        else:
+            conn = stack.enter_context(engine.begin())


For my own information: Why is this move necessary?

eito-fis · 2021-06-24

To access the temp table, we have to start our transaction using the connection that made the temp table. So here we start the transaction using the connection if one was passed in, and the engine otherwise. (Hopefully we will be getting rid of this with the move away from temp tables, though.)
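
The "use the caller's connection if there is one, otherwise open our own" pattern can be sketched in isolation as below. This is a stand-in built on Python's sqlite3 and contextlib, not the SQLAlchemy code from the diff; `count_rows` and its parameters are hypothetical.

```python
import sqlite3
from contextlib import ExitStack, closing


def count_rows(table_name, db_path, conn=None):
    """Count rows using the caller's connection if given, else a fresh one."""
    with ExitStack() as stack:
        if conn is None:
            # No connection supplied: open one and close it when the stack exits.
            conn = stack.enter_context(closing(sqlite3.connect(db_path)))
        # Either way, run inside a transaction scope on that connection
        # (commit on success, rollback on error).
        stack.enter_context(conn)
        return conn.execute(f'SELECT count(*) FROM "{table_name}"').fetchone()[0]
```

A connection passed in by the caller is left open afterwards, so a temp table created on it remains reachable; a connection opened by the function itself never sees that temp table.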

eito-fis · 2021-06-24T18:14:43Z

Here's a snippet confirming that the issue is with connection pools, for future reference:

with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))
with engine.connect() as conn:
    print(conn == conn)  # True
    with conn.begin():
        temp_table = reflect_table(temp_name, None, conn)  # No error

with engine.connect() as conn:
    _conn = conn
    with conn.begin():
        conn.execute(CreateTempTableAs(temp_name, select_table))

# Force a set of new connections
engine.pool.dispose()
engine.pool = engine.pool.recreate()

with engine.connect() as conn:
    print(_conn == conn)  # False
    with conn.begin():
        temp_table = reflect_table(temp_name, None, conn)  # sqlalchemy.exc.NoSuchTableError: temp_table

The problem is that temp tables are tied to a connection by default, but since pooled connections are reused rather than closed, the temp table isn't dropped. We could force the connection we used to be recreated, but that ends up being more work than using a non-temp table and making sure to drop it. I'll go ahead and migrate to using non-temp tables.
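
The same effect can be reproduced without SQLAlchemy. The toy pool below is a hypothetical illustration (`TinyPool` and `temp_table_exists` are made-up names, and sqlite3 stands in for PostgreSQL): returning a connection to the pool does not close it, so a temp table created on it is still there on the next checkout, and only disposing of the pool makes it disappear.

```python
import sqlite3


class TinyPool:
    """Toy pool: checking a connection in keeps it open for reuse."""

    def __init__(self):
        self._idle = []

    def checkout(self):
        # Reuse an idle connection if there is one, else open a new one.
        return self._idle.pop() if self._idle else sqlite3.connect(":memory:")

    def checkin(self, conn):
        self._idle.append(conn)  # Deliberately not closed.

    def dispose(self):
        while self._idle:
            self._idle.pop().close()


def temp_table_exists(conn, name):
    """Check this connection's private temp schema for a table."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_temp_master WHERE type = 'table' AND name = ?",
        (name,),
    ).fetchone()
    return row is not None
```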

eito-fis · 2021-06-24T21:39:45Z

Updated to use non-temp tables with {MATHESAR_PREFIX}temp_schema and {MATHESAR_PREFIX}temp_table_{current_epoch} as the schema and table name, respectively. Now also ensures that the table name is unique. Assuming we prevent user schemas from starting with MATHESAR_PREFIX, I think we should be safe from table name collisions.
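
A rough sketch of the naming scheme described above. The value of `MATHESAR_PREFIX`, the helper name `unique_temp_table_name`, and the strategy of bumping the number on a collision are all assumptions made for illustration; the PR states only that the name uses the current epoch and is made unique, not how.

```python
import time

MATHESAR_PREFIX = "mathesar_"  # Assumed value; the real constant lives in the repo.
TEMP_SCHEMA = f"{MATHESAR_PREFIX}temp_schema"


def unique_temp_table_name(existing_names, now=None):
    """Return {MATHESAR_PREFIX}temp_table_{epoch}, avoiding names already taken."""
    epoch = int(time.time() if now is None else now)
    name = f"{MATHESAR_PREFIX}temp_table_{epoch}"
    # Assumption: on a collision, bump the number until the name is free.
    while name in existing_names:
        epoch += 1
        name = f"{MATHESAR_PREFIX}temp_table_{epoch}"
    return name
```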

mathemancer

I think this looks good for now. We should reassess using temp tables after the completion of #280 and any other pressing engine or connection modifying work. Long term, we need a way to make absolutely sure that this function can't leave random tables littered about the DB under any conditions.

eito-fis added 3 commits June 17, 2021 19:05

Add table column type inference function

ebbd6ae

Add test for table type inference function

8972124

Drop temp table at end of func

5cd645e

eito-fis added 3 commits June 18, 2021 14:31

Fix broken update_table_column_types test

713c0d3

Clean up create temp table statement

dab4ee2

Update table inference to test types one at a time

93e8a09

eito-fis marked this pull request as ready for review June 18, 2021 22:03

eito-fis requested review from a team, kgodey and pavish June 18, 2021 22:03

github-actions bot requested review from ghislaineguerin and mathemancer June 18, 2021 22:04

mathemancer requested changes Jun 21, 2021

eito-fis added 6 commits June 21, 2021 16:29

Update to use custom DDL object for table duplication

7867c7b

Remove Column.copy() method calls

d6fc4cd

First pass at using Postgres TEMP table

c94646f

Update table inference to use TEMPORARY prefix table

f0ad6e1

Fix broken type inference tests

7441d86

Ensure that temp table is dropped in case of error

e17f895

eito-fis force-pushed the type_inference_func branch from 8522776 to e17f895 Compare June 22, 2021 18:23

eito-fis requested a review from mathemancer June 22, 2021 18:50

eito-fis added 3 commits June 22, 2021 11:56

Add test for duplicate original table name

61b88dd

Clean up temp table drop logic

c1727fa

Merge branch 'master' into type_inference_func

77ae589

eito-fis mentioned this pull request Jun 22, 2021

Type inference endpoint #276

Merged

7 tasks

eito-fis mentioned this pull request Jun 24, 2021

Schema sqlalchemy engine caching is per-instance #280

Closed

Reflect table instead of copying columns into new table

fb6610b

mathemancer requested changes Jun 24, 2021

eito-fis added 2 commits June 24, 2021 12:03

Revert to using non-temporary tables

bc17c16

Add timestamp to temp table and schema

9665ca4

eito-fis force-pushed the type_inference_func branch from 30779fb to 9665ca4 Compare June 24, 2021 19:15

eito-fis added 3 commits June 24, 2021 14:01

Ensure that temp table name is unique

901cb1f

Fix broken mock in test

ee0a422

Use MATHESAR_PREFIX and proper constants for naming

2fb784d

eito-fis requested a review from mathemancer June 24, 2021 21:39

eito-fis and others added 5 commits June 24, 2021 15:13

Remove left over temp table handling

55d04ae

Merge branch 'master' into type_inference_func

b5ab7d9

Cleanup to use try:except:else pattern

89ff067

Merge branch 'master' into type_inference_func

b06547b

Merge branch 'master' into type_inference_func

22246fc

mathemancer approved these changes Jun 28, 2021

mathemancer merged commit 0bc1a97 into master Jun 28, 2021

mathemancer deleted the type_inference_func branch June 28, 2021 03:47
