BUG: test_read_parquet_pandas_index[pyarrow] is broken at main due to pyarrow 12.0 #6072

Closed
2 of 3 tasks
mvashishtha opened this issue May 3, 2023 · 2 comments · Fixed by #6223
Labels: bug 🦗 (Something isn't working), CI, P2 (Minor bugs or low-priority feature requests)

mvashishtha (Collaborator) commented May 3, 2023

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

pytest modin/pandas/test/test_io.py::TestParquet::test_read_parquet_pandas_index[pyarrow]

Issue Description

The failure appears in this CI run: https://github.com/modin-project/modin/actions/runs/4865646453/jobs/8676331782?pr=6064

The test passes with the older pyarrow 11.0.0.

Expected Behavior

The test should pass.

Error Logs

=================================== FAILURES ===================================
_____________ TestParquet.test_read_parquet_pandas_index[pyarrow] ______________

self = <modin.pandas.test.test_io.TestParquet object at 0x7f0029e61730>
engine = 'pyarrow'

    @pytest.mark.xfail(
        condition="config.getoption('--simulate-cloud').lower() != 'off'",
        reason="The reason of tests fail in `cloud` mode is unknown for now - issue #3264",
    )
    def test_read_parquet_pandas_index(self, engine):
        # Ensure modin can read parquet files written by pandas with a non-RangeIndex object
        pandas_df = pandas.DataFrame(
            {
                "idx": np.random.randint(0, 100_000, size=2000),
                "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
                # Can't do interval index right now because of this bug fix that is planned
                # to be apart of the pandas 1.5.0 release: https://github.com/pandas-dev/pandas/pull/46034
                # "idx_interval": pandas.interval_range(start=0, end=2000),
                "idx_periodrange": pandas.period_range(
                    start="2017-01-01", periods=2000
                ),
                "A": np.random.randint(0, 100_000, size=2000),
                "B": ["a", "b"] * 1000,
                "C": ["c"] * 2000,
            }
        )
        # Older versions of pyarrow do not support Arrow to Parquet
        # schema conversion for duration[ns]
        # https://issues.apache.org/jira/browse/ARROW-6780
        if version.parse(pa.__version__) >= version.parse("8.0.0"):
            pandas_df["idx_timedelta"] = pandas.timedelta_range(
                start="1 day", periods=2000
            )
    
        # There is a non-deterministic bug in the fastparquet engine when we
        # try to set the index to the datetime column. Please see:
        # https://github.com/dask/fastparquet/issues/796
        if engine == "pyarrow":
            pandas_df["idx_datetime"] = pandas.date_range(
                start="1/1/2018", periods=2000
            )
    
        for col in pandas_df.columns:
            if col.startswith("idx"):
                with ensure_clean(".parquet") as unique_filename:
                    pandas_df.set_index(col).to_parquet(unique_filename)
                    # read the same parquet using modin.pandas
>                   eval_io(
                        "read_parquet",
                        # read_parquet kwargs
                        path=unique_filename,
                        engine=engine,
                    )

modin/pandas/test/test_io.py:1525: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modin/pandas/test/utils.py:1008: in eval_io
    call_eval_general()
modin/pandas/test/utils.py:990: in call_eval_general
    eval_general(
modin/pandas/test/utils.py:940: in eval_general
    comparator(*values, **(comparator_kwargs or {}))
modin/pandas/test/utils.py:731: in df_equals
    assert_frame_equal(
pandas/_libs/testing.pyx:52: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: DataFrame.index are different
E   
E   DataFrame.index values are different (100.0 %)
E   [left]:  Int64Index([17167, 17168, 17169, 17170, 17171, 17172, 17173, 17174, 17175,
E               17176,
E               ...
E               19157, 19158, 19159, 19160, 19161, 19162, 19163, 19164, 19165,
E               19166],
E              dtype='int64', name='idx_periodrange', length=2000)
E   [right]: PeriodIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
E                '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
E                '2017-01-09', '2017-01-10',
E                ...
E                '2022-06-14', '2022-06-15', '2022-06-16', '2022-06-17',
E                '2022-06-18', '2022-06-19', '2022-06-20', '2022-06-21',
E                '2022-06-22', '2022-06-23'],
E               dtype='period[D]', name='idx_periodrange', length=2000)

pandas/_libs/testing.pyx:167: AssertionError
=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestParquet::test_read_parquet_pandas_index[pyarrow] - AssertionError: DataFrame.index are different

DataFrame.index values are different (100.0 %)
[left]:  Int64Index([17167, 17168, 17169, 17170, 17171, 17172, 17173, 17174, 17175,
            17176,
            ...
            19157, 19158, 19159, 19160, 19161, 19162, 19163, 19164, 19165,
            19166],
           dtype='int64', name='idx_periodrange', length=2000)
[right]: PeriodIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
             '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
             '2017-01-09', '2017-01-10',
             ...
             '2022-06-14', '2022-06-15', '2022-06-16', '2022-06-17',
             '2022-06-18', '2022-06-19', '2022-06-20', '2022-06-21',
             '2022-06-22', '2022-06-23'],
            dtype='period[D]', name='idx_periodrange', length=2000)
= 1 failed, 2035 passed, 341 skipped, 332 xfailed, 1 xpassed, 793 warnings in 489.19s (0:08:09) =
Error: Process completed with exit code 1.

Installed Versions


INSTALLED VERSIONS
------------------
commit           : 8817093cb2f16be3709513b1517d8a4bc679900c
python           : 3.8.16.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.5.0
Version          : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

Modin dependencies
------------------
modin            : 0.20.0+37.g8817093cb
ray              : 2.4.0
dask             : 2023.4.1
distributed      : 2023.4.1
hdk              : None

pandas dependencies
-------------------
pandas           : 1.5.3
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 66.0.0
pip              : 23.0.1
Cython           : 0.29.34
pytest           : 7.3.1
hypothesis       : None
sphinx           : 7.0.0
blosc            : None
feather          : 0.4.1
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : 2.9.6
jinja2           : 3.1.2
IPython          : 8.12.1
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : None
brotli           : None
fastparquet      : 2022.12.0
fsspec           : 2023.4.0
gcsfs            : None
matplotlib       : 3.7.1
numba            : None
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : 0.19.1
pyarrow          : 12.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : 2023.4.0
scipy            : 1.10.1
snappy           : None
sqlalchemy       : 1.4.45
tables           : 3.8.0
tabulate         : None
xarray           : 2023.1.0
xlrd             : 2.0.1
xlwt             : None
zstandard        : None
tzdata           : None

mvashishtha added the bug 🦗 (Something isn't working), P0 (Highest priority tasks requiring immediate fix), and CI labels on May 3, 2023
mvashishtha self-assigned this on May 3, 2023

noloerino (Collaborator) commented

Prior to pyarrow 12.0.0, pandas returned an Int64Index for the index that's failing the assertion, and it looks like we still return one. However, this type is deprecated: https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.Int64Index.html
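
To make the mismatch in the error log concrete, here is a minimal standard-library sketch. It assumes the integers in the `[left]` Int64Index are daily period ordinals, that is, day counts from 1970-01-01 (which is how pandas represents `period[D]` values internally). The helper name `day_ordinal_to_date` is hypothetical and only serves the illustration.

```python
from datetime import date, timedelta

# Assumption: the Int64Index values are day counts from the Unix epoch.
UNIX_EPOCH = date(1970, 1, 1)


def day_ordinal_to_date(ordinal):
    """Convert a count of days since 1970-01-01 into a calendar date."""
    return UNIX_EPOCH + timedelta(days=ordinal)


# First and last values of the [left] Int64Index in the error log.
first_ordinal, last_ordinal = 17167, 19166

first_date = day_ordinal_to_date(first_ordinal)
last_date = day_ordinal_to_date(last_ordinal)

# These match the first and last entries of the [right] PeriodIndex.
print(first_date.isoformat(), last_date.isoformat())
```

If that assumption holds, both sides of the assertion carry the same 2000 days; only the index type differs, which is consistent with the values being reported as 100 % different.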

I wasn't able to track down a specific breaking commit in pyarrow's release notes, but apache/arrow#34404 may be tangentially related. The change in pyarrow's behavior may be just for compatibility with pandas 2.0.0, so we may need to just pin pyarrow to a working version until we update our own pandas version.

mvashishtha removed their assignment on May 15, 2023
mvashishtha added the P2 (Minor bugs or low-priority feature requests) label and removed the P0 (Highest priority tasks requiring immediate fix) label on May 24, 2023

h-vetinari commented

I reiterate my plea to please stop pinning pandas so hard:

You are not responsible for pandas-bugs, pandas is (and your users should be free to have the bugs in pandas fixed independently of which version you test in CI; if pandas releases a new patch release tomorrow, you're actively preventing your users from getting those fixes, because you claim you need to vet them for them first, which is paternalistic at best).

Pandas 2.0 is almost two months old. Compatibility with a new pandas release should be a release blocker IMO, so releasing 0.21 without it is really suboptimal. Now you're forcing your users to forgo the benefits of pyarrow 12 (let alone pandas 2.0) because of a single test failure in your CI. Respectfully, this is absurd. Xfail the test, and unskip it once it's fixed.
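
As a rough illustration of the xfail suggestion, the sketch below shows one way a version gate could be computed with only the standard library. The helper names `parse_release` and `period_index_roundtrip_affected` are hypothetical, and the threshold of 12 comes only from what this issue reports (failing on 12.0.0, passing on 11.0.0); whether later pyarrow releases behave the same way is not established here.

```python
import re


def parse_release(version_string):
    """Return the leading numeric release components as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"not a version string: {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def period_index_roundtrip_affected(pyarrow_version):
    """Hypothetical gate: True for pyarrow 12 and later.

    Assumption: only the 12.0.0 failure and the 11.0.0 pass are known
    from the report, so no upper bound is encoded.
    """
    return parse_release(pyarrow_version) >= (12,)


print(period_index_roundtrip_affected("11.0.0"))
print(period_index_roundtrip_affected("12.0.0"))
```

The resulting boolean could be passed as the `condition` of a `pytest.mark.xfail` marker, in the same way the existing test already uses `condition=` and `reason=`, so that the test is expected to fail only on the affected pyarrow versions and no pin is needed.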
