Add SQL Support for ADBC Drivers #53869
Conversation
pandas/tests/io/test_sql.py
Outdated
with tm.assert_produces_warning(UserWarning, match="the 'timedelta'"):
    df.to_sql("test_arrow", conn, if_exists="replace", index=False)


@pytest.mark.db
@pytest.mark.parametrize("conn", all_connectable)
def test_dataframe_to_sql_arrow_dtypes_missing(conn, request, nulls_fixture):
    if conn == "postgresql_adbc_conn":
        request.node.add_marker(
            pytest.skip("int8/datetime not implemented yet in adbc driver")
        )
There is no such thing as an 8-bit integer in Postgres, so this test wouldn't round trip in its current form. I think this being an int8 is just a test detail and not anything we actually care about. We can probably bump to int16, though that separately raises the question of how we want to add/extend the ADBC types
What's the canonical way to add ADBC types?
Are you referring to the mapping of an arrow type to a database-specific type?
Yes
That could in theory occur in the driver or could be handled by pandas itself. Since timestamp is a primitive in most databases I think that will be handled by the driver (plan on looking at this today/tomorrow myself for postgres)
Int8 is a little trickier. From the databases I've used, I am not aware of 8-bit integers being all that common, so either the driver implements some compatibility with another type (logically int16) or leaves it to the application to put its data into a supported type. My guess is the ADBC drivers may want to start out strict and force end users to upcast to int16 if that's what they want, but @lidavidm knows best
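Until drivers grow such a compatibility shim, the upcast discussed here lives on the application side. A minimal sketch (the frame, column name, and values are hypothetical) of widening a nullable int8 column before writing it through an ADBC driver:

```python
import pandas as pd

# Hypothetical frame with a nullable 8-bit integer column. Postgres has no
# 8-bit integer type, so the column must be widened before to_sql.
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int8")})

# Upcast to Int16, which maps onto Postgres SMALLINT.
df["a"] = df["a"].astype("Int16")
print(df.dtypes["a"])  # Int16
```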
If there's a strong convention (like timestamp strings in SQLite) I'd like to support it natively, otherwise I think it would be best for the client to be explicit about what it wants (and if there's something the driver can do to help, we can do it - e.g. I think it's reasonable to optionally ingest int8 as SMALLINT in the driver, so long as you aren't concerned about perfect roundtripping)
Well, we kinda do this already for numpy types -> sqlalchemy types, so I'm not opposed to adding that in pandas. I was just curious what the API for defining these is like.
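For illustration only (this is not pandas' actual mapping code, which goes through sqlalchemy type objects), a dtype-kind lookup in the spirit of the existing numpy -> sqlalchemy translation might look like:

```python
import numpy as np

# Illustrative numpy-kind -> generic SQL type lookup; a real mapping would
# also account for integer widths, timezones, and nullable extension types.
_SQL_TYPES = {
    "i": "BIGINT",            # signed integers
    "f": "DOUBLE PRECISION",  # floats
    "b": "BOOLEAN",
    "M": "TIMESTAMP",         # datetime64
    "O": "TEXT",              # python objects / strings
}

def sql_type_for(dtype: np.dtype) -> str:
    """Return a generic SQL type name for a numpy dtype."""
    return _SQL_TYPES.get(dtype.kind, "TEXT")

print(sql_type_for(np.dtype("int64")))    # BIGINT
print(sql_type_for(np.dtype("float32")))  # DOUBLE PRECISION
```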
int8 support was added in apache/arrow-adbc#858 and timestamp is in progress in apache/arrow-adbc#861; guessing these can make it into the next ADBC release, and we can just set that as the minimum supported version to keep things easy
pandas/io/sql.py
Outdated
stmt = f"SELECT * FROM {table_name}"

with self.con.cursor() as cur:
    cur.execute(stmt)
    return cur.fetch_arrow_table().to_pandas()
It would be nice to at minimum support dtype_backend and return arrow-backed types
I think this should always just return arrow-backed types. Related to the other conversation around kwargs, I am unsure of the best way to handle this. If we raise for non-default arguments this wouldn't work; alternately we could exempt the dtype_backend argument from raising for non-default arguments, but arguably it is heavy-handed to require end users to specify that when they are already using the ADBC driver
You'd have to add types_mapper=pd.ArrowDtype for this to work.
Not sure how I'd feel about arrow-backed only; this makes sense, but we went a different way for other readers...
Ah didn't realize that. Thanks for the heads up
xref #51846 for long-term
Adding the new library in install.rst would be good too
AFAICT the pyarrow drivers require pyarrow>=8.0.0 since they use the RecordBatchReader object, but our min version is still pinned at 6.0.0. @phofl I know you've upgraded us in the past with pyarrow - is it too soon to make the jump to 8.0?
We could likely relax Python and PyArrow minimum versions further if needed
Technically we could upgrade for 2.1, but this depends on PDEP-10 as well
What is the link between this and PDEP-10? The ADBC dbapi requires pyarrow to use, but I don't think that impacts how we need to manage it on our end? Or am I overlooking something simple?
PDEP-10 has some language around minimum supported Arrow versions
But note that the PDEP-10 text is even more conservative, so based on that we would not be bumping to pyarrow 8 as a minimum version any time soon (only next year).
IMO there is really no problem in requiring a more recent pyarrow version for certain new features, such as pyarrow 8 for using adbc in our SQL IO, while the general minimum required version is lower.
Yea I agree. That definitely adds complexity to test compat/xfailing, but I think it can be reasonably implemented in another pass at this
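One way such per-feature version gating could be sketched (the minimum version constant and helper name are assumptions for illustration, not pandas' actual compat API):

```python
from packaging.version import Version

# Hypothetical feature gate: the ADBC code path needs pyarrow >= 8.0.0
# even while the global minimum pyarrow version stays lower.
MIN_PYARROW_FOR_ADBC = Version("8.0.0")

def adbc_supported(installed_pyarrow: str) -> bool:
    """Whether the installed pyarrow is new enough for the ADBC code path."""
    return Version(installed_pyarrow) >= MIN_PYARROW_FOR_ADBC

print(adbc_supported("8.0.0"))  # True
print(adbc_supported("6.0.1"))  # False
```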
with pg_dbapi.connect(uri) as conn:
    df2 = pd.read_sql("pandas_table", conn)

The Arrow type system offers a wider array of types that can more closely match
I think this is important enough of a point to make in the changelog, but should also probably put in io.rst. Just wasn't sure how to best structure that yet
+1 to include in io.rst. I think it might be good to add a separate section in io.rst to talk about type mapping (if there isn't one already) and also include sqlalchemy type mapping
Alright, all green and I think I've addressed the comments. Let me know if anything is missing
A few comments, otherwise looks good
OK all green and I think I addressed your feedback @mroeschke
@@ -335,7 +335,8 @@ lxml 4.9.2 xml XML parser for read
SQL databases
^^^^^^^^^^^^^

Installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.
Traditional drivers are installable with ``pip install "pandas[postgresql, mysql, sql-other]"``. ADBC drivers
I think this is no longer accurate since you added them to the ``postgresql`` and ``sql-other`` extras in the pyproject.toml
@@ -345,6 +346,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 ADBC Driver for PostgreSQL
this is in the ``postgresql`` extra now
@@ -345,6 +346,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 ADBC Driver for PostgreSQL
adbc-driver-sqlite 0.8.0 ADBC Driver for SQLite
this is in the ``sql-other`` extra now
Small comments, otherwise LGTM. Any other thoughts, @jorisvandenbossche?
Nice! Thanks @WillAyd. Can have follow-up PRs if needed