Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding some support for PyArrow Date and Datetimes to Rust #969

Closed
wants to merge 9 commits into from

Conversation

HeardACat
Copy link

@HeardACat HeardACat commented Sep 4, 2021

Which issue does this PR close?

Attempts to close #949.

Rationale for this change

This adds some (probably not ideal) support for converting pyarrow date/timestamp fields to rust. Since the datatypes are not uniquely identified by id field.

Requires #967 - otherwise tests will fail.
Maybe this approach is redundant due to: #873

Also I understand that this might not be the proper way to do this - keen to understand more. Also I'm not well versed in rust, so I'll understand if the PR requires a lot more work.

Any pointers/directions are appreciated.

What changes are included in this PR?

Are there any user-facing changes?

@mmuru
Copy link
Contributor

mmuru commented Sep 5, 2021

@charliec443:You should check str_ob is date or timestamp and call data_type_date or data_type_timestamp method instead of calling them based on the id since it could be any pyarrow datatypes e.g Date64. Also, you should add tase case for it. If needed, I could provide a test case for it.

@houqp houqp mentioned this pull request Sep 5, 2021
@HeardACat
Copy link
Author

Thanks @mmuru for your feedback. I'll shall add some tests first and then get back to you on the latter bit. With respect to the id side of things.

@HeardACat
Copy link
Author

Actually could @mmuru provide test cases? I don't think I 100% understand the issue you're pointing out.
Sorry about that.

python/src/types.rs Outdated Show resolved Hide resolved
@mmuru
Copy link
Contributor

mmuru commented Sep 6, 2021

@charliec443: Now, you fixed other date and time related issues. Added test case looks good. Here is the my test case before your fix.

import pyarrow as pa
import pytest
from datafusion import ExecutionContext
from datafusion import functions as f
import datetime
from . import generic as helpers

@pytest.fixture
def df():
    ctx = ExecutionContext()

    # create a RecordBatch and a new DataFrame from it
    batch = pa.RecordBatch.from_arrays(
        [helpers.data_datetime("s"),helpers.data_date32(), helpers.data_date64()],
        names=["ts", "dt1", "dt2"],
    )

    return ctx.create_dataframe([[batch]])


def test_select_ts_date(df):
    df = df.select(f.col('ts'),f.col('dt1'),f.col('dt2'))

    # execute and collect the first (and only) batch
    result = df.collect()[0]

    assert result.column(0) == helpers.data_datetime("s")
    assert result.column(1) == helpers.data_date32()
    assert result.column(2) == helpers.data_date64()

@kszucs
Copy link
Member

kszucs commented Sep 6, 2021

@charliec443 indeed, this change is redundant since apache/arrow-rs#691 already handles the DataType exchange between pyarrow and arrow-rs (hence arrow-datafusion as well). The actual conversion happens here using the C data interface.

#873 should expose the functionality to datafusion.

@mmuru
Copy link
Contributor

mmuru commented Sep 13, 2021

@houqp: Currently, Arrow date and timestamp datatypes not supported on the Python binding. It is a blocker for the parquet and other data sources have these columns and not able to use datafusion. Do we know an ETA for #873? Otherwise, we need this fix for the short term till #873 gets merged. Thanks.

@houqp
Copy link
Member

houqp commented Sep 13, 2021

I think #873 might need to wait for the arrow 6.x release, which might take some time.

@alamb
Copy link
Contributor

alamb commented Sep 22, 2021

Arrow 6 is planned for mid-October, so a few weeks away

@alamb alamb added the stale-pr label Oct 26, 2021
@alamb
Copy link
Contributor

alamb commented Oct 26, 2021

Marking PRs that haven't had activity in over a month as 'stale-pr' to help me filter the list. Please remove the label or let me know if "stale" is not the correct designation

@kszucs
Copy link
Member

kszucs commented Oct 26, 2021

The new C data interface based approach in #873 should supersede the python string representation based one.

@houqp
Copy link
Member

houqp commented Nov 5, 2021

Thank you for your work on this @charliec443. Closing it for now since #873 has already been merged into master.

@houqp houqp closed this Nov 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants