Document Comparison to pandas? #812

Closed
WillAyd opened this issue Jun 16, 2023 · 11 comments · Fixed by #1940

@WillAyd (Contributor) commented Jun 16, 2023

I was experimenting with the ADBC PostgreSQL driver in comparison to the equivalent pandas SQL read/write functions. I put a rough draft of that up on my blog:

https://willayd.com/leveraging-the-adbc-driver-in-analytics-workflows.html

Do you think any of that is worth integrating into the documentation here? I'm not sure how much we want to highlight differences against other tools in the space.
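
For context, here's a minimal sketch of the kind of read comparison the post makes, assuming a local PostgreSQL instance and an existing example table (the URI and table name below are made up):

import pandas as pd
import sqlalchemy
import adbc_driver_postgresql.dbapi

uri = "postgresql://postgres:password@localhost:5432/postgres"  # hypothetical connection string

# pandas via SQLAlchemy: rows are materialized as Python objects before
# being assembled into a DataFrame
engine = sqlalchemy.create_engine(uri)
df_pandas = pd.read_sql("SELECT * FROM example", con=engine)

# ADBC: the result set is fetched as Arrow data and only converted to pandas at the end
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM example")
        table = cur.fetch_arrow_table()  # pyarrow.Table
df_adbc = table.to_pandas()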

@lidavidm (Member)

This is quite cool, thanks for sharing!

I wonder what the best way might be to integrate this. Much of this is really about how Arrow as a whole compares to Pandas (e.g. the data types).

Maybe we could consider explicit "if you did this in Pandas, do this with ADBC" examples? There are some examples going into the next release: https://arrow.apache.org/adbc/main/python/recipe/postgresql.html

The other thing could be highlighting your post somehow (retweeting it?)

Medium-to-long term, I was actually hoping we could integrate ADBC directly into the Pandas read/write SQL functions.
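
For instance, the write side could be shown with ADBC's bulk ingestion; a rough sketch (URI and table name are again made up):

import pyarrow as pa
import adbc_driver_postgresql.dbapi

uri = "postgresql://postgres:password@localhost:5432/postgres"  # hypothetical
data = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cur:
        # Create the target table from Arrow data in a single bulk operation
        cur.adbc_ingest("example_copy", data, mode="create")
    conn.commit()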

@WillAyd (Contributor, Author) commented Jun 16, 2023

That recipes page looks nice - I'll see what I can add there.

@datapythonista @MarcoGorelli do you think this is worth tweeting from the pandas account? I have a Mastodon account so I can post there, but this might be good for Twitter users. Though it's arguably also strange for pandas to share without more context on what this could mean for pandas itself; I'm pretty indifferent.

As far as your medium-to-long-term goal, I don't want to speak for the entire pandas team just yet, but I agree it would be good to integrate directly. The SQL part of the pandas codebase has a lot of legacy cruft and isn't as actively maintained as other parts, so pandas stands to gain a lot from using ADBC internally.

@MarcoGorelli

nice!

Regarding posting - access to the Twitter account is in the 1Password vault (see Joris' email); if you join, you should be able to access it and post.

@WillAyd (Contributor, Author) commented Jun 26, 2023

FYI, I started the pandas integration in pandas-dev/pandas#53869. It looks like we aren't too far off from meeting the pandas requirements; we just need int8 support for the postgres driver and datetime support for the postgres/sqlite drivers.

@lidavidm (Member)

Oh that's great!

For SQLite: is there a standard date/time/datetime encoding? That seems to be the main issue with stuffing those values in SQLite.

@lidavidm (Member)

Actually, I'll just take a look at what pandas does currently when I get a chance, and then think about how to mimic that.

@WillAyd (Contributor, Author) commented Jun 26, 2023

pandas will just defer to SQLAlchemy or sqlite3. I think both store those values as ISO strings. Here are the relevant docs:

https://docs.python.org/3/library/sqlite3.html#default-adapters-and-converters
https://docs.sqlalchemy.org/en/20/dialects/sqlite.html#date-and-time-types
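
For illustration, the stdlib behavior boils down to something like this (output shown as a comment):

import datetime
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (dt TIMESTAMP)")
# sqlite3's default adapter serializes datetime.datetime to an ISO 8601 string
con.execute("INSERT INTO t VALUES (?)", (datetime.datetime(2023, 1, 1),))
print(con.execute("SELECT dt, typeof(dt) FROM t").fetchone())
# ('2023-01-01 00:00:00', 'text')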

@lidavidm (Member)

Cool, thanks.

It looks like read_sql has you explicitly specify which columns to read as datetimes, so we can probably add an option for that to the SQLite driver. Though it might be easier/more consistent to do it as a post-processing step instead of in-driver...? But given the layers in between, it may be valuable to support it directly anyway.
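
A rough sketch of the post-processing approach, assuming the value was stored as an ISO 8601 string (the file, table, and column names here are illustrative):

import adbc_driver_sqlite.dbapi
import pandas as pd

with adbc_driver_sqlite.dbapi.connect("test.db") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM test")
        df = cur.fetch_df()  # the "dt" column comes back as strings

# Parse the ISO strings into datetime64 outside the driver
df["dt"] = pd.to_datetime(df["dt"])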

@WillAyd (Contributor, Author) commented Jun 26, 2023

Yeah, the parse_dates argument is there in case the driver itself cannot infer the date, which lets pandas apply its own inference logic. But it isn't always required, and it's usually preferable to let the driver handle it. With sqlite you can see it maintains this on a roundtrip:

>>> import pandas as pd
>>> from sqlalchemy import create_engine
>>> df = pd.DataFrame([[pd.Timestamp("2023-01-01")]], columns=["dt"]) 
>>> engine = create_engine('sqlite://', echo=False)
>>> df.to_sql("test", con=engine, index=False)
>>> pd.read_sql("test", con=engine).dtypes
dt    datetime64[ns]
dtype: object

@lidavidm (Member)

Ah, I see, thanks. In that case, maybe the right option to provide is some way to map the SQLite column type to a date/time/datetime Arrow type and format string, and then Pandas can configure it to mimic the standard library sqlite3 module. (Though it sounds like SQLAlchemy can do this itself as well from that reference.)

@lidavidm (Member)

To circle back here almost exactly a year later, I've opened a PR with brief examples of using ADBC with Pandas: #1940
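
For reference, a minimal sketch of using an ADBC connection from pandas, assuming pandas 2.2 or newer (which accepts ADBC connections in read_sql/to_sql); the URI is hypothetical:

import pandas as pd
import adbc_driver_postgresql.dbapi

uri = "postgresql://postgres:password@localhost:5432/postgres"  # hypothetical

df = pd.DataFrame({"a": [1, 2, 3]})
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    # pandas can hand the ADBC connection to to_sql/read_sql directly
    df.to_sql("example", con=conn, index=False, if_exists="replace")
    round_tripped = pd.read_sql("SELECT * FROM example", con=conn)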

lidavidm added a commit that referenced this issue Jun 25, 2024