Add SQL Support for ADBC Drivers #53869

Merged (78 commits) on Nov 22, 2023

Commits
4f2b760
close to complete implementation
WillAyd Jun 26, 2023
a4ebbb5
working implementation for postgres
WillAyd Jun 26, 2023
b2cd149
sqlite implementation
WillAyd Jun 26, 2023
512bd00
Added ADBC to CI
WillAyd Jun 26, 2023
f49115c
Doc updates
WillAyd Jun 26, 2023
a8512b5
Whatsnew update
WillAyd Jun 26, 2023
c1c68ef
Better optional dependency import
WillAyd Jun 26, 2023
3d7fb15
min versions fix
WillAyd Jun 26, 2023
1093bc8
import updates
WillAyd Jun 27, 2023
926e567
docstring fix
WillAyd Jun 27, 2023
093dd86
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Jun 27, 2023
fcc21a8
doc fixup
WillAyd Jun 27, 2023
88642f7
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Jul 14, 2023
156096d
Updates for 0.6.0
WillAyd Jul 14, 2023
dd26edb
fix sqlite name escaping
WillAyd Jul 20, 2023
4d8a233
more cleanups
WillAyd Jul 20, 2023
5238e69
more 0.6.0 updates
WillAyd Aug 2, 2023
51c6c98
typo
WillAyd Aug 2, 2023
39b462b
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Aug 28, 2023
428c4f7
remove warning
WillAyd Aug 28, 2023
84d95bb
test_sql expectations
WillAyd Aug 28, 2023
a4d5b31
revert whatsnew issues
WillAyd Aug 28, 2023
21b35f6
pip deps
WillAyd Aug 28, 2023
e709d52
Suppress pyarrow warning
WillAyd Aug 28, 2023
6077fa9
Updated docs
WillAyd Aug 28, 2023
5bba566
mypy fixes
WillAyd Aug 28, 2023
236e12b
Remove stacklevel check from test
WillAyd Aug 29, 2023
b35374c
typo fix
WillAyd Aug 29, 2023
8d814e1
compat
WillAyd Aug 30, 2023
cfac2c7
Joris feedback
WillAyd Aug 31, 2023
47caaf1
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Aug 31, 2023
a22e5d1
Better test coverage with ADBC
WillAyd Aug 31, 2023
c51b7f4
cleanups
WillAyd Aug 31, 2023
7f5e6ac
feedback
WillAyd Sep 1, 2023
9ee6255
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Sep 19, 2023
a8b645f
checkpoint
WillAyd Sep 19, 2023
902df4f
more checkpoint
WillAyd Sep 19, 2023
90ca2cb
more skips
WillAyd Sep 20, 2023
d753c3c
updates
WillAyd Sep 20, 2023
d469e24
implement more
WillAyd Sep 21, 2023
2bc11a1
bump to 0.7.0
WillAyd Sep 24, 2023
f205f90
fixups
WillAyd Oct 2, 2023
2755100
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Oct 2, 2023
3577a59
cleanups
WillAyd Oct 2, 2023
c5bf7f8
sqlite fixups
WillAyd Oct 2, 2023
98d22ce
pyarrow compat
WillAyd Oct 2, 2023
4f72010
revert to using pip instead of conda
WillAyd Oct 2, 2023
7223e63
documentation cleanups
WillAyd Oct 2, 2023
c2cd90a
compat fixups
WillAyd Oct 3, 2023
de65ec0
Fix stacklevel
WillAyd Oct 3, 2023
7645727
remove unneeded code
WillAyd Oct 3, 2023
3dc914c
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Oct 16, 2023
6dbaae5
commit after drop in fixtures
WillAyd Oct 16, 2023
3bf550c
close cursor
WillAyd Oct 17, 2023
492301f
Merge branch 'main' into adbc-integration
WillAyd Oct 23, 2023
fc463a4
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Oct 23, 2023
cc72ecd
Merge branch 'main' into adbc-integration
WillAyd Oct 25, 2023
f5fd529
Merge branch 'main' into adbc-integration
WillAyd Oct 30, 2023
1207bc4
fix table dropping
WillAyd Oct 30, 2023
e8d93c7
Merge branch 'main' into adbc-integration
WillAyd Nov 10, 2023
3eed897
Bumped ADBC min to 0.8.0
WillAyd Nov 10, 2023
adef2f2
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Nov 10, 2023
67101fd
documentation
WillAyd Nov 10, 2023
ea5dcb9
doc updates
WillAyd Nov 10, 2023
fb38411
more fixups
WillAyd Nov 10, 2023
a0bed67
documentation fixups
WillAyd Nov 11, 2023
150e267
Merge branch 'main' into adbc-integration
WillAyd Nov 13, 2023
1e77f2b
fixes
WillAyd Nov 13, 2023
97ed24f
more documentation
WillAyd Nov 13, 2023
7dc07da
doc spacing
WillAyd Nov 13, 2023
52ee8d3
doc target fix
WillAyd Nov 14, 2023
1de8488
pyarrow warning compat
WillAyd Nov 14, 2023
21edaea
Merge branch 'main' into adbc-integration
WillAyd Nov 17, 2023
2d077e9
feedback
WillAyd Nov 17, 2023
accbd49
updated io documentation
WillAyd Nov 17, 2023
64b63bd
Merge branch 'main' into adbc-integration
WillAyd Nov 17, 2023
f84f63a
install updates
WillAyd Nov 18, 2023
391d045
Merge remote-tracking branch 'upstream/main' into adbc-integration
WillAyd Nov 21, 2023
2 changes: 2 additions & 0 deletions ci/deps/actions-310.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-311-downstream_compat.yaml
@@ -70,6 +70,8 @@ dependencies:
- pyyaml
- py
- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- dataframe-api-compat>=0.1.7
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-311.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-39-minimum_versions.yaml
@@ -58,6 +58,8 @@ dependencies:
- zstandard=0.19.0

- pip:
- adbc-driver-postgresql==0.8.0
- adbc-driver-sqlite==0.8.0
- dataframe-api-compat==0.1.7
- pyqt5==5.15.8
- tzdata==2022.7
2 changes: 2 additions & 0 deletions ci/deps/actions-39.yaml
@@ -56,5 +56,7 @@ dependencies:
- zstandard>=0.19.0

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- pyqt5>=5.15.8
- tzdata>=2022.7
3 changes: 3 additions & 0 deletions ci/deps/circle-310-arm64.yaml
@@ -54,3 +54,6 @@ dependencies:
- xlrd>=2.0.1
- xlsxwriter>=3.0.5
- zstandard>=0.19.0
- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
4 changes: 3 additions & 1 deletion doc/source/getting_started/install.rst
@@ -335,7 +335,7 @@ lxml 4.9.2 xml XML parser for read
SQL databases
^^^^^^^^^^^^^

Installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.
Traditional drivers are installable with ``pip install "pandas[postgresql, mysql, sql-other]"``.

========================= ================== =============== =============================================================
Dependency Minimum Version pip extra Notes
@@ -345,6 +345,8 @@ SQLAlchemy 2.0.0 postgresql, SQL support for dat
sql-other
psycopg2 2.9.6 postgresql PostgreSQL engine for sqlalchemy
pymysql 1.0.2 mysql MySQL engine for sqlalchemy
adbc-driver-postgresql 0.8.0 postgresql ADBC Driver for PostgreSQL
adbc-driver-sqlite 0.8.0 sql-other ADBC Driver for SQLite
========================= ================== =============== =============================================================

Other data sources
114 changes: 104 additions & 10 deletions doc/source/user_guide/io.rst
@@ -5565,9 +5565,23 @@ SQL queries
-----------

The :mod:`pandas.io.sql` module provides a collection of query wrappers to both
facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction
is provided by SQLAlchemy if installed. In addition you will need a driver library for
your database. Examples of such drivers are `psycopg2 <https://www.psycopg.org/>`__
facilitate data retrieval and to reduce dependency on DB-specific API.

Where available, users may first want to opt for `Apache Arrow ADBC
<https://arrow.apache.org/adbc/current/index.html>`_ drivers. These drivers
should provide the best performance, null handling, and type detection.

.. versionadded:: 2.2.0

Added native support for ADBC drivers

For a full list of ADBC drivers and their development status, see the `ADBC Driver
Implementation Status <https://arrow.apache.org/adbc/current/driver/status.html>`_
documentation.

Where an ADBC driver is not available or may be missing functionality,
users should opt for installing SQLAlchemy alongside their database driver library.
Examples of such drivers are `psycopg2 <https://www.psycopg.org/>`__
for PostgreSQL or `pymysql <https://github.com/PyMySQL/PyMySQL>`__ for MySQL.
For `SQLite <https://docs.python.org/3/library/sqlite3.html>`__ this is
included in Python's standard library by default.
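For instance, the standard-library driver alone is enough to create a table and query it back without installing anything. A minimal sketch using only ``sqlite3`` (the table name and contents are illustrative):

```python
import sqlite3

# An in-memory database requires no extra driver installation
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a INTEGER, b TEXT)")
conn.execute("INSERT INTO data VALUES (1, 'x'), (2, 'y')")

rows = conn.execute("SELECT a, b FROM data ORDER BY a").fetchall()
print(rows)  # [(1, 'x'), (2, 'y')]
conn.close()
```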
@@ -5600,6 +5614,18 @@ In the following example, we use the `SQlite <https://www.sqlite.org/index.html>`
engine. You can use a temporary SQLite database where data are stored in
"memory".

To connect using an ADBC driver, you will want to install ``adbc_driver_sqlite`` using your
package manager. Once installed, you can use the DBAPI interface provided by the ADBC driver
to connect to your database.

.. code-block:: python

import adbc_driver_sqlite.dbapi as sqlite_dbapi

# Create the connection
with sqlite_dbapi.connect("sqlite:///:memory:") as conn:
df = pd.read_sql_table("data", conn)

To connect with SQLAlchemy you use the :func:`create_engine` function to create an engine
object from database URI. You only need to create the engine once per database you are
connecting to.
@@ -5675,9 +5701,74 @@ writes ``data`` to the database in batches of 1000 rows at a time:
SQL data types
++++++++++++++

:func:`~pandas.DataFrame.to_sql` will try to map your data to an appropriate
SQL data type based on the dtype of the data. When you have columns of dtype
``object``, pandas will try to infer the data type.
Ensuring consistent data type management across SQL databases is challenging.
Not every SQL database offers the same types, and even when they do, the
implementation of a given type can vary in ways that have subtle effects on how
types can be preserved.

For the best odds at preserving database types, users are advised to use
ADBC drivers when available. The Arrow type system offers a wider array of
types that more closely match database types than the historical pandas/NumPy
type system. To illustrate, note this (non-exhaustive) listing of types
available in different databases and pandas backends:

+-----------------+-----------------------+----------------+---------+
|numpy/pandas |arrow |postgres |sqlite |
+=================+=======================+================+=========+
|int16/Int16 |int16 |SMALLINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int32/Int32 |int32 |INTEGER |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int64/Int64 |int64 |BIGINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|float32 |float32 |REAL |REAL |
+-----------------+-----------------------+----------------+---------+
|float64 |float64 |DOUBLE PRECISION|REAL |
+-----------------+-----------------------+----------------+---------+
|object |string |TEXT |TEXT |
+-----------------+-----------------------+----------------+---------+
|bool |``bool_`` |BOOLEAN | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns] |timestamp(us) |TIMESTAMP | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | |
+-----------------+-----------------------+----------------+---------+
| |date32 |DATE | |
+-----------------+-----------------------+----------------+---------+
| |month_day_nano_interval|INTERVAL | |
+-----------------+-----------------------+----------------+---------+
| |binary |BINARY |BLOB |
+-----------------+-----------------------+----------------+---------+
| |decimal128 |DECIMAL [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |list |ARRAY [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |struct |COMPOSITE TYPE | |
| | | [#f1]_ | |
+-----------------+-----------------------+----------------+---------+

.. rubric:: Footnotes

.. [#f1] Not implemented as of writing, but theoretically possible
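For illustration only, the pandas-to-PostgreSQL column of the table above can be expressed as a small lookup; ``postgres_type`` is a hypothetical helper, not a pandas API:

```python
# Non-exhaustive mapping transcribed from the table above; a sketch, not pandas internals
PANDAS_TO_POSTGRES = {
    "int16": "SMALLINT",
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float32": "REAL",
    "float64": "DOUBLE PRECISION",
    "object": "TEXT",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
}

def postgres_type(pandas_dtype: str) -> str:
    # Fall back to TEXT for dtypes this sketch does not cover
    return PANDAS_TO_POSTGRES.get(pandas_dtype, "TEXT")

print(postgres_type("int64"))  # BIGINT
```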

If you are interested in preserving database types as faithfully as possible
throughout the lifecycle of your DataFrame, you are encouraged to leverage the
``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql`:

.. code-block:: ipython

# for roundtripping
with pg_dbapi.connect(uri) as conn:
df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy
type system, which often converts SQL types in ways that make them impossible to
round-trip.

In case an ADBC driver is not available, :func:`~pandas.DataFrame.to_sql`
will try to map your data to an appropriate SQL data type based on the dtype of
the data. When you have columns of dtype ``object``, pandas will try to infer
the data type.

You can always override the default type by specifying the desired SQL type of
any of the columns by using the ``dtype`` argument. This argument needs a
@@ -5696,7 +5787,9 @@ default ``Text`` type for string columns:

Due to the limited support for timedelta's in the different database
flavors, columns with type ``timedelta64`` will be written as integer
values as nanoseconds to the database and a warning will be raised.
values as nanoseconds to the database and a warning will be raised. The only
exception to this is when using the ADBC PostgreSQL driver, in which case a
timedelta will be written to the database as an ``INTERVAL``.
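The nanosecond encoding itself is straightforward to reproduce. The following sketch mirrors the arithmetic only; ``timedelta_to_ns`` is a hypothetical helper, not pandas internals:

```python
from datetime import timedelta

def timedelta_to_ns(td: timedelta) -> int:
    # timedelta tracks days/seconds/microseconds; scale each to nanoseconds
    total_us = (td.days * 86_400 + td.seconds) * 1_000_000 + td.microseconds
    return total_us * 1_000

print(timedelta_to_ns(timedelta(days=1, seconds=30)))  # 86430000000000
```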

.. note::

@@ -5711,7 +5804,7 @@ default ``Text`` type for string columns:
Datetime data types
'''''''''''''''''''

Using SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing
Using ADBC or SQLAlchemy, :func:`~pandas.DataFrame.to_sql` is capable of writing
datetime data that is timezone naive or timezone aware. However, the resulting
data stored in the database ultimately depends on the supported data type
for datetime data of the database system being used.
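For instance, when the target column type is not timezone-aware, a common approach is to store the UTC instant. A stdlib sketch of that conversion (the UTC-5 offset shown is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# A timezone-aware timestamp at a hypothetical UTC-5 offset
aware = datetime(2023, 11, 22, 9, 30, tzinfo=timezone(timedelta(hours=-5)))

# When the target column type is not timezone-aware, storing the UTC
# instant keeps the value unambiguous
as_utc = aware.astimezone(timezone.utc)
print(as_utc.isoformat())  # 2023-11-22T14:30:00+00:00
```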
@@ -5802,15 +5895,16 @@ table name and optionally a subset of columns to read.
.. note::

In order to use :func:`~pandas.read_sql_table`, you **must** have the
SQLAlchemy optional dependency installed.
ADBC driver or SQLAlchemy optional dependency installed.

.. ipython:: python

pd.read_sql_table("data", engine)

.. note::

Note that pandas infers column dtypes from query outputs, and not by looking
ADBC drivers will map database types directly back to Arrow types. For other drivers,
note that pandas infers column dtypes from query outputs, and not by looking
up data types in the physical database schema. For example, assume ``userid``
is an integer column in a table. Then, intuitively, ``select userid ...`` will
return integer-valued series, while ``select cast(userid as text) ...`` will
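The note's point can be demonstrated with the standard-library driver alone; a sketch where ``userid`` mirrors the hypothetical column from the note:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (userid INTEGER)")
conn.execute("INSERT INTO t VALUES (42)")

# The same column yields different Python types depending on the query
plain = conn.execute("SELECT userid FROM t").fetchone()[0]
cast = conn.execute("SELECT CAST(userid AS TEXT) FROM t").fetchone()[0]
print(type(plain).__name__, type(cast).__name__)  # int str
conn.close()
```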
91 changes: 91 additions & 0 deletions doc/source/whatsnew/v2.2.0.rst
@@ -89,6 +89,97 @@ a Series. (:issue:`55323`)
)
series.list[0]

.. _whatsnew_220.enhancements.adbc_support:

ADBC Driver support in to_sql and read_sql
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:func:`read_sql` and :meth:`~DataFrame.to_sql` now work with `Apache Arrow ADBC
<https://arrow.apache.org/adbc/current/index.html>`_ drivers. Compared to
traditional drivers used via SQLAlchemy, ADBC drivers should provide
significant performance improvements, better type support and cleaner
nullability handling.

.. code-block:: ipython

import adbc_driver_postgresql.dbapi as pg_dbapi

df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
],
columns=['a', 'b', 'c']
)
uri = "postgresql://postgres:postgres@localhost/postgres"
with pg_dbapi.connect(uri) as conn:
df.to_sql("pandas_table", conn, index=False)

# for roundtripping
with pg_dbapi.connect(uri) as conn:
df2 = pd.read_sql("pandas_table", conn)

    [Review comment, WillAyd (Member, Author)]: I think this is important enough of a point to make in the changelog, but should also probably put in io.rst. Just wasn't sure how to best structure that yet

    [Review comment, Member]: +1 to include in io.rst. I think it might be good to add a separate section in io.rst to talk about type mapping (if there isn't one already) and also include sqlalchemy type mapping

The Arrow type system offers a wider array of types that can more closely match
what databases like PostgreSQL can offer. To illustrate, note this (non-exhaustive)
listing of types available in different databases and pandas backends:

+-----------------+-----------------------+----------------+---------+
|numpy/pandas |arrow |postgres |sqlite |
+=================+=======================+================+=========+
|int16/Int16 |int16 |SMALLINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int32/Int32 |int32 |INTEGER |INTEGER |
+-----------------+-----------------------+----------------+---------+
|int64/Int64 |int64 |BIGINT |INTEGER |
+-----------------+-----------------------+----------------+---------+
|float32 |float32 |REAL |REAL |
+-----------------+-----------------------+----------------+---------+
|float64 |float64 |DOUBLE PRECISION|REAL |
+-----------------+-----------------------+----------------+---------+
|object |string |TEXT |TEXT |
+-----------------+-----------------------+----------------+---------+
|bool |``bool_`` |BOOLEAN | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns] |timestamp(us) |TIMESTAMP | |
+-----------------+-----------------------+----------------+---------+
|datetime64[ns,tz]|timestamp(us,tz) |TIMESTAMPTZ | |
+-----------------+-----------------------+----------------+---------+
| |date32 |DATE | |
+-----------------+-----------------------+----------------+---------+
| |month_day_nano_interval|INTERVAL | |
+-----------------+-----------------------+----------------+---------+
| |binary |BINARY |BLOB |
+-----------------+-----------------------+----------------+---------+
| |decimal128 |DECIMAL [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |list |ARRAY [#f1]_ | |
+-----------------+-----------------------+----------------+---------+
| |struct |COMPOSITE TYPE | |
| | | [#f1]_ | |
+-----------------+-----------------------+----------------+---------+

.. rubric:: Footnotes

.. [#f1] Not implemented as of writing, but theoretically possible

If you are interested in preserving database types as faithfully as possible
throughout the lifecycle of your DataFrame, you are encouraged to leverage the
``dtype_backend="pyarrow"`` argument of :func:`~pandas.read_sql`:

.. code-block:: ipython

# for roundtripping
with pg_dbapi.connect(uri) as conn:
df2 = pd.read_sql("pandas_table", conn, dtype_backend="pyarrow")

This will prevent your data from being converted to the traditional pandas/NumPy
type system, which often converts SQL types in ways that make them impossible to
round-trip.

For a full list of ADBC drivers and their development status, see the `ADBC Driver
Implementation Status <https://arrow.apache.org/adbc/current/driver/status.html>`_
documentation.

.. _whatsnew_220.enhancements.other:

Other enhancements
Expand Down
2 changes: 2 additions & 0 deletions environment.yml
@@ -113,6 +113,8 @@ dependencies:
- pygments # Code highlighting

- pip:
- adbc-driver-postgresql>=0.8.0
- adbc-driver-sqlite>=0.8.0
- dataframe-api-compat>=0.1.7
- sphinx-toggleprompt # conda-forge version has stricter pins on jinja2
- typing_extensions; python_version<"3.11"
2 changes: 2 additions & 0 deletions pandas/compat/_optional.py
@@ -15,6 +15,8 @@
# Update install.rst & setup.cfg when updating versions!

VERSIONS = {
"adbc-driver-postgresql": "0.8.0",
"adbc-driver-sqlite": "0.8.0",
"bs4": "4.11.2",
"blosc": "1.21.3",
"bottleneck": "1.3.6",