BUG: test_read_parquet_pandas_index[pyarrow] is broken at main due to pyarrow 12.0 #6072

mvashishtha · 2023-05-03T00:05:48Z

Modin version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest released version of Modin.
I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

pytest modin/pandas/test/test_io.py::TestParquet::test_read_parquet_pandas_index[pyarrow]

Issue Description

The failure is here: https://github.com/modin-project/modin/actions/runs/4865646453/jobs/8676331782?pr=6064

The test passes on the older pyarrow 11.0.0.

Expected Behavior

The test should pass.

Error Logs

=================================== FAILURES ===================================
_____________ TestParquet.test_read_parquet_pandas_index[pyarrow] ______________

self = <modin.pandas.test.test_io.TestParquet object at 0x7f0029e61730>
engine = 'pyarrow'

    @pytest.mark.xfail(
        condition="config.getoption('--simulate-cloud').lower() != 'off'",
        reason="The reason of tests fail in `cloud` mode is unknown for now - issue #3264",
    )
    def test_read_parquet_pandas_index(self, engine):
        # Ensure modin can read parquet files written by pandas with a non-RangeIndex object
        pandas_df = pandas.DataFrame(
            {
                "idx": np.random.randint(0, 100_000, size=2000),
                "idx_categorical": pandas.Categorical(["y", "z"] * 1000),
                # Can't do interval index right now because of this bug fix that is planned
                # to be apart of the pandas 1.5.0 release: https://github.com/pandas-dev/pandas/pull/46034
                # "idx_interval": pandas.interval_range(start=0, end=2000),
                "idx_periodrange": pandas.period_range(
                    start="2017-01-01", periods=2000
                ),
                "A": np.random.randint(0, 100_000, size=2000),
                "B": ["a", "b"] * 1000,
                "C": ["c"] * 2000,
            }
        )
        # Older versions of pyarrow do not support Arrow to Parquet
        # schema conversion for duration[ns]
        # https://issues.apache.org/jira/browse/ARROW-6780
        if version.parse(pa.__version__) >= version.parse("8.0.0"):
            pandas_df["idx_timedelta"] = pandas.timedelta_range(
                start="1 day", periods=2000
            )
    
        # There is a non-deterministic bug in the fastparquet engine when we
        # try to set the index to the datetime column. Please see:
        # https://github.com/dask/fastparquet/issues/796
        if engine == "pyarrow":
            pandas_df["idx_datetime"] = pandas.date_range(
                start="1/1/2018", periods=2000
            )
    
        for col in pandas_df.columns:
            if col.startswith("idx"):
                with ensure_clean(".parquet") as unique_filename:
                    pandas_df.set_index(col).to_parquet(unique_filename)
                    # read the same parquet using modin.pandas
>                   eval_io(
                        "read_parquet",
                        # read_parquet kwargs
                        path=unique_filename,
                        engine=engine,
                    )

modin/pandas/test/test_io.py:1525: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modin/pandas/test/utils.py:1008: in eval_io
    call_eval_general()
modin/pandas/test/utils.py:990: in call_eval_general
    eval_general(
modin/pandas/test/utils.py:940: in eval_general
    comparator(*values, **(comparator_kwargs or {}))
modin/pandas/test/utils.py:731: in df_equals
    assert_frame_equal(
pandas/_libs/testing.pyx:52: in pandas._libs.testing.assert_almost_equal
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   AssertionError: DataFrame.index are different
E   
E   DataFrame.index values are different (100.0 %)
E   [left]:  Int64Index([17167, 17168, 17169, 17170, 17171, 17172, 17173, 17174, 17175,
E               17176,
E               ...
E               19157, 19158, 19159, 19160, 19161, 19162, 19163, 19164, 19165,
E               19166],
E              dtype='int64', name='idx_periodrange', length=2000)
E   [right]: PeriodIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
E                '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
E                '2017-01-09', '2017-01-10',
E                ...
E                '2022-06-14', '2022-06-15', '2022-06-16', '2022-06-17',
E                '2022-06-18', '2022-06-19', '2022-06-20', '2022-06-21',
E                '2022-06-22', '2022-06-23'],
E               dtype='period[D]', name='idx_periodrange', length=2000)

pandas/_libs/testing.pyx:167: AssertionError
=========================== short test summary info ============================
FAILED modin/pandas/test/test_io.py::TestParquet::test_read_parquet_pandas_index[pyarrow] - AssertionError: DataFrame.index are different

DataFrame.index values are different (100.0 %)
[left]:  Int64Index([17167, 17168, 17169, 17170, 17171, 17172, 17173, 17174, 17175,
            17176,
            ...
            19157, 19158, 19159, 19160, 19161, 19162, 19163, 19164, 19165,
            19166],
           dtype='int64', name='idx_periodrange', length=2000)
[right]: PeriodIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
             '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
             '2017-01-09', '2017-01-10',
             ...
             '2022-06-14', '2022-06-15', '2022-06-16', '2022-06-17',
             '2022-06-18', '2022-06-19', '2022-06-20', '2022-06-21',
             '2022-06-22', '2022-06-23'],
            dtype='period[D]', name='idx_periodrange', length=2000)
= 1 failed, 2035 passed, 341 skipped, 332 xfailed, 1 xpassed, 793 warnings in 489.19s (0:08:09) =
Error: Process completed with exit code 1.

Installed Versions


INSTALLED VERSIONS
------------------
commit           : 8817093cb2f16be3709513b1517d8a4bc679900c
python           : 3.8.16.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.5.0
Version          : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:22 PDT 2022; root:xnu-8020.121.3~4/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

Modin dependencies
------------------
modin            : 0.20.0+37.g8817093cb
ray              : 2.4.0
dask             : 2023.4.1
distributed      : 2023.4.1
hdk              : None

pandas dependencies
-------------------
pandas           : 1.5.3
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 66.0.0
pip              : 23.0.1
Cython           : 0.29.34
pytest           : 7.3.1
hypothesis       : None
sphinx           : 7.0.0
blosc            : None
feather          : 0.4.1
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : 2.9.6
jinja2           : 3.1.2
IPython          : 8.12.1
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : None
brotli           : None
fastparquet      : 2022.12.0
fsspec           : 2023.4.0
gcsfs            : None
matplotlib       : 3.7.1
numba            : None
numexpr          : 2.8.4
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : 0.19.1
pyarrow          : 12.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : 2023.4.0
scipy            : 1.10.1
snappy           : None
sqlalchemy       : 1.4.45
tables           : 3.8.0
tabulate         : None
xarray           : 2023.1.0
xlrd             : 2.0.1
xlwt             : None
zstandard        : None
tzdata           : None

noloerino · 2023-05-03T01:15:01Z

Prior to pyarrow 12.0.0, pandas returned an Int64Index for the index that's failing the assertion, and it looks like Modin still does. However, this type is deprecated: https://pandas.pydata.org/pandas-docs/version/1.5/reference/api/pandas.Int64Index.html

I wasn't able to track down a specific breaking commit in pyarrow's release notes, but apache/arrow#34404 may be tangentially related. The change in pyarrow's behavior may be just for compatibility with pandas 2.0.0, so we may need to just pin pyarrow to a working version until we update our own pandas version.
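
One observation that may help when reading the assertion output: the integers on the Int64Index side look like day counts since the Unix epoch for the dates on the PeriodIndex side. The stdlib-only sketch below checks that for the first and last values copied from the log above. The helper name `day_ordinal_to_date` is made up for illustration, and the reading that the index is coming back as raw integer storage rather than as a period type is an assumption, not something confirmed in this thread.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def day_ordinal_to_date(ordinal: int) -> date:
    """Interpret an integer as a number of days since the Unix epoch."""
    return EPOCH + timedelta(days=ordinal)


# First and last values of each index, copied from the assertion output.
int64_first, int64_last = 17167, 19166
period_first, period_last = date(2017, 1, 1), date(2022, 6, 23)

# Compare the decoded integers with the dates shown on the PeriodIndex side.
print(day_ordinal_to_date(int64_first) == period_first)
print(day_ordinal_to_date(int64_last) == period_last)
```

If both comparisons hold, the two sides carry the same underlying data and differ only in index type, which would be consistent with a change in how the period type is restored on read.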

h-vetinari · 2023-05-25T09:02:00Z

I reiterate my plea to please stop pinning pandas so hard:

You are not responsible for pandas-bugs, pandas is (and your users should be free to have the bugs in pandas fixed independently of which version you test in CI; if pandas releases a new patch release tomorrow, you're actively preventing your users from getting those fixes, because you claim you need to vet them for them first, which is paternalistic at best).

Pandas 2.0 is almost two months old - compatibility with new pandas releases should be a release blocker IMO, so releasing 0.21 without that is really suboptimal. Now you're forcing your users to forgo the benefits of pyarrow 12 (let alone pandas 2.0) because of a single test failure in your CI. Respectfully, this is absurd. Xfail the test, and unskip it once it's fixed.
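
As a rough illustration of the "xfail instead of pin" suggestion, the sketch below computes a version-dependent condition that could feed `pytest.mark.xfail`. It is not necessarily what the eventual fix did. The helper names `release_tuple` and `affected_by_6072` are hypothetical, and the `12.0.0` threshold is an assumption taken from the issue title. The test in the log uses `packaging.version` for this kind of comparison; the small stdlib parser here is only a stand-in to keep the example dependency-free.

```python
import re


def release_tuple(version: str, width: int = 3) -> tuple:
    """Return the leading numeric release of a version string, zero-padded.

    Pre-release or dev suffixes (e.g. "12.0.0.dev5") are ignored, so such
    builds are treated like the corresponding release.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric release found in {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    return parts + (0,) * max(0, width - len(parts))


def affected_by_6072(pyarrow_version: str, first_affected: str = "12.0.0") -> bool:
    """True when the given pyarrow version is at or above the assumed threshold."""
    return release_tuple(pyarrow_version) >= release_tuple(first_affected)


# Hypothetical use in a test module (assumes pytest and pyarrow are installed):
#
#     import pyarrow as pa
#     import pytest
#
#     @pytest.mark.xfail(
#         condition=affected_by_6072(pa.__version__),
#         reason="index type mismatch with pyarrow >= 12, see issue #6072",
#     )
#     def test_read_parquet_pandas_index(self, engine):
#         ...
print(affected_by_6072("11.0.0"), affected_by_6072("12.0.0"))
```

Keeping the condition tied to the installed version means the test still runs and must pass on older pyarrow, while users are free to install the newer release.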

mvashishtha added the bug 🦗 (Something isn't working), P0 (Highest priority tasks requiring immediate fix), and CI labels May 3, 2023

mvashishtha self-assigned this May 3, 2023

mvashishtha mentioned this issue May 3, 2023

BUG: cap pyarrow < 12 as a temporary fix for Int64Index test_read_parquet_pandas failure #6074

Closed

mvashishtha removed their assignment May 15, 2023

mvashishtha added the P2 (Minor bugs or low-priority feature requests) label and removed the P0 (Highest priority tasks requiring immediate fix) label May 24, 2023

h-vetinari mentioned this issue May 25, 2023

Cap pyarrow<12 in meta.yml conda-forge/modin-feedstock#75

Closed

5 tasks

anmyachev added a commit to anmyachev/modin that referenced this issue May 31, 2023

FIX-modin-project#6072: unpin pyarrow and xfail test

04610db

Signed-off-by: Anatoly Myachev <[email protected]>

anmyachev mentioned this issue May 31, 2023

FIX-#6072: unpin pyarrow and xfail test_read_parquet_pandas_index test #6223

Merged

7 tasks

YarShev closed this as completed in #6223 May 31, 2023

YarShev pushed a commit that referenced this issue May 31, 2023

FIX-#6072: unpin pyarrow and xfail test_read_parquet_pandas_index t…

f0f9ffe

…est (#6223) Signed-off-by: Anatoly Myachev <[email protected]>

zmbc mentioned this issue Aug 7, 2023

FEAT-#6417: Add support for filters to read_parquet #6442

Merged

7 tasks

vnlitvinov mentioned this issue Aug 7, 2023

Incorrect condition in test_io masks Modin bug #6467

Open
