Bug: [2.1.0] DMatrix creation from Arrow-backed pandas Dataframes can trigger (ArrowInvalid: Zero copy conversions not possible with boolean types) #10504

cvm-a · 2024-06-29T17:36:16Z

We have an issue with upgrading XGBoost from 2.0.3 to 2.1.0. We use Arrow-backed types for our pandas dataframes, and if there are boolean columns (1 bit per element, which saves memory for our pandas manipulations), we can't create a DMatrix from the dataframe.

Repro code:

import pandas as pd
import pyarrow as pa
import xgboost

tab1 = pa.table({"a": pa.array([1, 2, 3, 4]), "b": pa.array([True, False, False, True])})
df1 = tab1.to_pandas(types_mapper=pd.ArrowDtype)

xgboost.DMatrix(df1)

In 2.0.3, this constructs a DMatrix, but in 2.1.0, this raises an error (last few frames of the stack trace):

    605     raise ValueError(f"DataFrame for {meta} cannot have multiple columns")
    607 feature_names, feature_types = pandas_feature_info(
    608     data, meta, feature_names, feature_types, enable_categorical
    609 )
--> 611 arrays = pandas_transform_data(data)
    612 return PandasTransformed(arrays), feature_names, feature_types

File <install_path>/python3.11/site-packages/xgboost/data.py:540, in pandas_transform_data(data)
    538     result.append(cat_codes(data[col]))
    539 elif is_pa_ext_dtype(dtype):
--> 540     result.append(pandas_pa_type(data[col]))
    541 elif is_nullable_dtype(dtype):
    542     result.append(nu_type(data[col]))

File <install_path>/python3.11/site-packages/xgboost/data.py:468, in pandas_pa_type(ser)
    461 zero_copy = chunk.null_count == 0
    462 # Alternately, we can use chunk.buffers(), which returns a list of buffers and
    463 # we need to concatenate them ourselves.
    464 # FIXME(jiamingy): Is there a better way to access the arrow buffer along with
    465 # its mask?
    466 # Buffers from chunk.buffers() have the address attribute, but don't expose the
    467 # mask.
--> 468 arr: np.ndarray = chunk.to_numpy(zero_copy_only=zero_copy, writable=False)
    469 arr, _ = _ensure_np_dtype(arr, arr.dtype)
    470 return arr

File <install_path>/python3.11/site-packages/pyarrow/array.pxi:1587, in pyarrow.lib.Array.to_numpy()

File <install_path>/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Zero copy conversions not possible with boolean types

Environment:
Python 3.11.6
OS : Darwin
machine : arm64
pandas : 2.2.2
numpy : 1.24.4
pyarrow : 16.1.0


cvm-a · 2024-06-29T17:47:28Z

If I try to create a DMatrix directly from the pyarrow table, we get the same "ArrowInvalid: Zero copy conversions not possible with boolean types" in 2.1.0, but in 2.0.3 we get a different error:

File <installpath>/python3.11/site-packages/xgboost/data.py:1118, in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical, data_split_mode)
   1114     return _from_pandas_series(
   1115         data, missing, threads, enable_categorical, feature_names, feature_types
   1116     )
   1117 if _is_arrow(data):
-> 1118     return _from_arrow(
   1119         data, missing, threads, feature_names, feature_types, enable_categorical
   1120     )
   1121 if _has_array_protocol(data):
   1122     array = np.asarray(data)

File <installpath>/python3.11/site-packages/xgboost/data.py:737, in _from_arrow(data, missing, nthread, feature_names, feature_types, enable_categorical)
    732 import pyarrow as pa
    734 if not all(
    735     pa.types.is_integer(t) or pa.types.is_floating(t) for t in data.schema.types
    736 ):
--> 737     raise ValueError(
    738         "Features in dataset can only be integers or floating point number"
    739     )
    740 if enable_categorical:
    741     raise ValueError("categorical data in arrow is not supported yet.")

ValueError: Features in dataset can only be integers or floating point number

PyArrow bools are bit-packed, so they need to be unpacked from 1-bit bools to 1-byte bools for NumPy, which requires a copy.

trivialfis · 2024-06-30T07:04:47Z

Thank you for sharing.

This test

xgboost/python-package/xgboost/testing/data.py

Line 206 in 09d32f1

df = pd.DataFrame(

xgboost/tests/python/test_with_pandas.py

Line 512 in 09d32f1

m_etype = DMatrixT(df, enable_categorical=True, label=y)

is not doing what it's supposed to be doing.

trivialfis added the type: bug label Jun 30, 2024

trivialfis mentioned this issue Jul 1, 2024

Fix boolean array for arrow-backed DF. #10527

Merged

trivialfis closed this as completed in #10527 Jul 2, 2024
