Bug: [2.1.0] DMatrix creation from Arrow-backed pandas Dataframes can trigger (ArrowInvalid: Zero copy conversions not possible with boolean types) #10504

Closed
cvm-a opened this issue Jun 29, 2024 · 2 comments · Fixed by #10527

cvm-a commented Jun 29, 2024

We have an issue with upgrading XGBoost from 2.0.3 to 2.1.0. We use Arrow-backed types for our pandas dataframes, and if there are boolean columns (1 bit per element, which saves memory for our pandas manipulations), we can't create a DMatrix from the dataframe.

Repro code:

import pandas as pd
import pyarrow as pa
import xgboost

tab1 = pa.table({"a": pa.array([1, 2, 3, 4]), "b": pa.array([True, False, False, True])})
df1 = tab1.to_pandas(types_mapper=pd.ArrowDtype)

xgboost.DMatrix(df1)

In 2.0.3, this constructs a DMatrix, but in 2.1.0, this raises an error (last few frames of the stack trace):

    605     raise ValueError(f"DataFrame for {meta} cannot have multiple columns")
    607 feature_names, feature_types = pandas_feature_info(
    608     data, meta, feature_names, feature_types, enable_categorical
    609 )
--> 611 arrays = pandas_transform_data(data)
    612 return PandasTransformed(arrays), feature_names, feature_types

File <install_path>/python3.11/site-packages/xgboost/data.py:540, in pandas_transform_data(data)
    538     result.append(cat_codes(data[col]))
    539 elif is_pa_ext_dtype(dtype):
--> 540     result.append(pandas_pa_type(data[col]))
    541 elif is_nullable_dtype(dtype):
    542     result.append(nu_type(data[col]))

File <install_path>/python3.11/site-packages/xgboost/data.py:468, in pandas_pa_type(ser)
    461 zero_copy = chunk.null_count == 0
    462 # Alternately, we can use chunk.buffers(), which returns a list of buffers and
    463 # we need to concatenate them ourselves.
    464 # FIXME(jiamingy): Is there a better way to access the arrow buffer along with
    465 # its mask?
    466 # Buffers from chunk.buffers() have the address attribute, but don't expose the
    467 # mask.
--> 468 arr: np.ndarray = chunk.to_numpy(zero_copy_only=zero_copy, writable=False)
    469 arr, _ = _ensure_np_dtype(arr, arr.dtype)
    470 return arr

File <install_path>/python3.11/site-packages/pyarrow/array.pxi:1587, in pyarrow.lib.Array.to_numpy()

File <install_path>/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Zero copy conversions not possible with boolean types

Environment:
Python 3.11.6
OS : Darwin
machine : arm64
pandas : 2.2.2
numpy : 1.24.4
pyarrow : 16.1.0

cvm-a (Author) commented Jun 29, 2024

If I try to create a DMatrix directly from the pyarrow table, 2.1.0 raises the same "ArrowInvalid: Zero copy conversions not possible with boolean types" error, but 2.0.3 raises a different one:

File <installpath>/python3.11/site-packages/xgboost/data.py:1118, in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical, data_split_mode)
   1114     return _from_pandas_series(
   1115         data, missing, threads, enable_categorical, feature_names, feature_types
   1116     )
   1117 if _is_arrow(data):
-> 1118     return _from_arrow(
   1119         data, missing, threads, feature_names, feature_types, enable_categorical
   1120     )
   1121 if _has_array_protocol(data):
   1122     array = np.asarray(data)

File <installpath>/python3.11/site-packages/xgboost/data.py:737, in _from_arrow(data, missing, nthread, feature_names, feature_types, enable_categorical)
    732 import pyarrow as pa
    734 if not all(
    735     pa.types.is_integer(t) or pa.types.is_floating(t) for t in data.schema.types
    736 ):
--> 737     raise ValueError(
    738         "Features in dataset can only be integers or floating point number"
    739     )
    740 if enable_categorical:
    741     raise ValueError("categorical data in arrow is not supported yet.")

ValueError: Features in dataset can only be integers or floating point number

PyArrow booleans are bit-packed (1 bit per value), so they have to be unpacked into 1-byte booleans for NumPy, which rules out a zero-copy conversion.
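
To illustrate the layout difference, here is a minimal NumPy-only sketch. It imitates Arrow's bit-packed boolean layout (least-significant bit first) with `np.packbits`; it does not touch a real Arrow buffer, and the variable names are made up for the example.

```python
import numpy as np

values = np.array([True, False, False, True])

# Pack 4 booleans into a single byte, least-significant bit first,
# which mirrors how Arrow lays out boolean data.
packed = np.packbits(values, bitorder="little")

# NumPy has no 1-bit dtype, so getting a usable bool array back means
# allocating one byte per value, i.e. a copy.
unpacked = np.unpackbits(packed, count=len(values), bitorder="little").astype(bool)
```

Here `packed` occupies one byte while `unpacked` occupies four, so the two buffers cannot share memory.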

trivialfis (Member) commented Jun 30, 2024

Thank you for sharing.

This test


m_etype = DMatrixT(df, enable_categorical=True, label=y)

is not doing what it's supposed to be doing.
