Regression in pyspark fit time for columnar input + spark-rapids + gpu #9016

Closed · eordentlich opened this issue Apr 9, 2023 · 2 comments · Fixed by #9088
eordentlich commented Apr 9, 2023

In some recent benchmarking with spark-rapids, I noticed slower performance than in past recorded runs with early pre-release pyspark API versions on a particular parquet-format dataset. Looking back through past results, the slowdown appears to have been introduced quite a while ago. After building and profiling a couple of those past versions, I think I have narrowed it down to this change: https://github.com/dmlc/xgboost/pull/8284/files#diff-e02cb86420040cfdaa950d0f5b5d1d50149a7df9c03d13e23b2c7c4f9426d762R249
It seems that converting the non-feature columns (such as labels) to NumPy here, instead of simply leaving them as pd.Series, can lead to significant slowdowns (nearly 50% longer fit time on a subset of the mortgage dataset) when feature_cols is not None, i.e. when running with spark-rapids GPU-accelerated data loading + gpu_hist and reading from columnar parquet format.
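For illustration, here is a minimal standalone sketch (not the actual xgboost code path; the Series size and the use of np.stack as a stand-in for stack_series are assumptions) of why element-wise stacking of a plain numeric column is so much more expensive than a zero-copy view of the same buffer:

```python
import time

import numpy as np
import pandas as pd

# A stand-in for a label column: one plain float per row.
labels = pd.Series(np.random.rand(1_000_000))

t0 = time.perf_counter()
# Costly path: np.stack iterates over the values one by one, wrapping
# each element as a 0-d array before concatenating.
stacked = np.stack(labels.to_numpy())
t1 = time.perf_counter()
# Cheap path: the Series already wraps a contiguous float64 buffer,
# so a view involves no per-element work.
view = labels.to_numpy(copy=False)
t2 = time.perf_counter()

assert np.array_equal(stacked, view)
print(f"stack: {t1 - t0:.3f}s  view: {t2 - t1:.6f}s")
```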

@wbo4958, can you take a look? (cc @trivialfis)

wbo4958 commented Apr 10, 2023

Hi @eordentlich, thanks for the detailed finding. I will take a look at it today.

wbo4958 commented Apr 25, 2023

Sorry for the late response.

I just reproduced this issue. I tested the mortgage dataset (28 features and 1 label), 118,781,111 rows in total, on the latest xgboost and the RAPIDS 23.02 release.

The xgboost JVM package takes 88.256 s to finish, while xgboost pyspark takes 222.49 s. After commenting out the stack_series call for the label column, xgboost pyspark takes only 117 s, and the accuracies are identical. As confirmed with @trivialfis, stack_series is not needed for a single numeric column. I will put up a PR to fix it.
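For reference, a minimal sketch of what such a fix could look like; the helper name, the object-dtype check, and the vector-valued case are assumptions for illustration, not the actual change in #9088:

```python
import numpy as np
import pandas as pd

def series_to_array(series: pd.Series) -> np.ndarray:
    """Hypothetical helper: stack only when each row holds an
    array-like value; pass a single numeric column through as a
    zero-copy NumPy view."""
    if series.dtype == object:
        # Rows are array-like (e.g. per-row feature vectors), so
        # stacking is needed to produce a 2-D array.
        return np.stack(series.to_list())
    # Plain numeric column (e.g. labels): no per-element copy needed.
    return series.to_numpy(copy=False)

# Example: a numeric label column takes the cheap path.
y = pd.Series([0.0, 1.0, 0.0])
assert series_to_array(y).dtype == np.float64
```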
