Bug: [2.1.0] DMatrix creation from Arrow-backed pandas Dataframes can trigger (ArrowInvalid: Zero copy conversions not possible with boolean types)
#10504
Closed
cvm-a opened this issue
Jun 29, 2024
· 2 comments
· Fixed by #10527
We have an issue upgrading XGBoost from 2.0.3 to 2.1.0. We use Arrow-backed types for our pandas DataFrames, and if there are boolean columns (1 bit per element, which saves memory for our pandas manipulations), we can't create a DMatrix from the DataFrame.
In 2.0.3, constructing a DMatrix from such a DataFrame succeeds, but in 2.1.0 it raises an error (last few frames of the stack trace):
605 raise ValueError(f"DataFrame for {meta} cannot have multiple columns")
607 feature_names, feature_types = pandas_feature_info(
608 data, meta, feature_names, feature_types, enable_categorical
609 )
--> 611 arrays = pandas_transform_data(data)
612 return PandasTransformed(arrays), feature_names, feature_types
File <install_path>/python3.11/site-packages/xgboost/data.py:540, in pandas_transform_data(data)
538 result.append(cat_codes(data[col]))
539 elif is_pa_ext_dtype(dtype):
--> 540 result.append(pandas_pa_type(data[col]))
541 elif is_nullable_dtype(dtype):
542 result.append(nu_type(data[col]))
File <install_path>/python3.11/site-packages/xgboost/data.py:468, in pandas_pa_type(ser)
461 zero_copy = chunk.null_count == 0
462 # Alternately, we can use chunk.buffers(), which returns a list of buffers and
463 # we need to concatenate them ourselves.
464 # FIXME(jiamingy): Is there a better way to access the arrow buffer along with
465 # its mask?
466 # Buffers from chunk.buffers() have the address attribute, but don't expose the
467 # mask.
--> 468 arr: np.ndarray = chunk.to_numpy(zero_copy_only=zero_copy, writable=False)
469 arr, _ = _ensure_np_dtype(arr, arr.dtype)
470 return arr
File <install_path>/python3.11/site-packages/pyarrow/array.pxi:1587, in pyarrow.lib.Array.to_numpy()
File <install_path>/python3.11/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowInvalid: Zero copy conversions not possible with boolean types
Environment:
Python 3.11.6
OS : Darwin
machine : arm64
pandas : 2.2.2
numpy : 1.24.4
pyarrow : 16.1.0
If we try to create a DMatrix directly from the pyarrow table, we get the same "ArrowInvalid: Zero copy conversions not possible with boolean types" error in 2.1.0, but in 2.0.3 we get a different error:
File <installpath>/python3.11/site-packages/xgboost/data.py:1118, in dispatch_data_backend(data, missing, threads, feature_names, feature_types, enable_categorical, data_split_mode)
1114 return _from_pandas_series(
1115 data, missing, threads, enable_categorical, feature_names, feature_types
1116 )
1117 if _is_arrow(data):
-> 1118 return _from_arrow(
1119 data, missing, threads, feature_names, feature_types, enable_categorical
1120 )
1121 if _has_array_protocol(data):
1122 array = np.asarray(data)
File <installpath>/python3.11/site-packages/xgboost/data.py:737, in _from_arrow(data, missing, nthread, feature_names, feature_types, enable_categorical)
732 import pyarrow as pa
734 if not all(
735 pa.types.is_integer(t) or pa.types.is_floating(t) for t in data.schema.types
736 ):
--> 737 raise ValueError(
738 "Features in dataset can only be integers or floating point number"
739 )
740 if enable_categorical:
741 raise ValueError("categorical data in arrow is not supported yet.")
ValueError: Features in dataset can only be integers or floating point number
PyArrow bools are bit-packed, so they need to be unpacked from 1-bit bools to 1-byte bools before NumPy can use them.
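To illustrate why a zero-copy view is not possible here, below is a minimal standard-library sketch of the unpacking step. It assumes the Arrow layout for boolean data (one bit per value, least-significant bit first within each byte). The helper name `unpack_bitmap` is hypothetical and is not part of pyarrow or XGBoost. It only shows that producing one byte per value requires allocating a new buffer.

```python
def unpack_bitmap(packed: bytes, length: int) -> bytearray:
    """Expand an LSB-first bitmap into one byte (0 or 1) per value.

    Hypothetical helper for illustration only; not a pyarrow or XGBoost API.
    """
    if length > len(packed) * 8:
        raise ValueError("bitmap is shorter than the requested length")
    out = bytearray(length)  # new allocation: this is why zero-copy cannot work
    for i in range(length):
        # Value i lives in byte i // 8, at bit position i % 8.
        out[i] = (packed[i // 8] >> (i % 8)) & 1
    return out


# Pack ten booleans into two bytes, mimicking the bit-packed layout.
values = [True, False, True, True, False, False, False, True, True, False]
packed = bytearray((len(values) + 7) // 8)
for i, v in enumerate(values):
    if v:
        packed[i // 8] |= 1 << (i % 8)

# Unpack back to one byte per value, the layout a NumPy bool array expects.
unpacked = unpack_bitmap(bytes(packed), len(values))
print(len(packed), len(unpacked))
```

In pyarrow itself, the copying conversion is requested by passing `zero_copy_only=False` to `Array.to_numpy`. The traceback above shows `zero_copy_only` being set to `True` whenever the chunk has no nulls, which is the case that fails for boolean columns.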