In some recent benchmarking with spark-rapids, I noticed slower performance on a particular Parquet-format dataset than in past recorded runs with early pre-release pyspark API versions. Looking back through past results, the slowdown appears to have started a while ago. After building and profiling a couple of these past versions, I think I've narrowed it down to this change: https://github.com/dmlc/xgboost/pull/8284/files#diff-e02cb86420040cfdaa950d0f5b5d1d50149a7df9c03d13e23b2c7c4f9426d762R249
It seems that converting non-feature columns (such as the label) to NumPy here, rather than simply leaving them as `pd.Series`, can lead to significant slowdowns (nearly 50% longer runtime on a subset of the mortgage dataset) when `feature_cols is not None`, i.e. when running with spark-rapids GPU-accelerated data loading + `gpu_hist`, reading from columnar Parquet format. @wbo4958 Can you take a look? (cc @trivialfis)
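For reference, a minimal illustrative sketch of the two label-handling paths being compared (the function names here are hypothetical, not the actual xgboost pyspark internals): stacking a plain numeric column element by element adds a full extra Python-level pass and a copy per batch, whereas the Series can be consumed as-is.

```python
import numpy as np
import pandas as pd

def label_via_stack(label: pd.Series) -> np.ndarray:
    # Element-wise stacking materializes a new NumPy array: an extra
    # O(n) pass over the column and a copy for every batch of rows.
    return np.stack(label.to_numpy(copy=False))

def label_as_series(label: pd.Series) -> pd.Series:
    # Leaving the label as a pd.Series lets downstream DMatrix
    # construction consume the existing buffer without conversion.
    return label

if __name__ == "__main__":
    y = pd.Series(np.random.rand(1_000_000), name="label")
    stacked = label_via_stack(y)   # slow path
    direct = label_as_series(y)    # fast path
    assert np.allclose(stacked, direct.to_numpy())
```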
I just reproduced this issue. I tested the mortgage dataset (28 features and 1 label, 118,781,111 rows in total) on the latest xgboost and the RAPIDS 23.02 release.
The xgboost JVM package takes 88.256s to finish, while xgboost pyspark takes 222.49s. After commenting out the `stack_series` call for the label column, xgboost pyspark takes only 117s, and the accuracies are the same. Confirmed with @trivialfis that `stack_series` is not needed for a single numeric column. I will put up a PR to fix it.
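As a rough sketch of what such a fix could look like (an assumption only; the actual PR may be structured differently): apply the stacking helper only when a column genuinely holds per-row arrays, and pass a plain numeric column such as the label through unchanged.

```python
import numpy as np
import pandas as pd

def prepare_column(col: pd.Series):
    # Hypothetical helper: stack only when each row holds an array
    # (e.g. a vector feature column); return a plain numeric column
    # such as the label untouched so no extra copy is made.
    if len(col) > 0 and isinstance(col.iloc[0], (list, np.ndarray)):
        return np.stack(col.to_numpy(copy=False))
    return col

# Example: the label stays a Series, vector rows get stacked to (3, 3).
label = pd.Series([0.0, 1.0, 0.0], name="label")
vectors = pd.Series([np.ones(3) for _ in range(3)], name="features")
assert isinstance(prepare_column(label), pd.Series)
assert prepare_column(vectors).shape == (3, 3)
```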