Regression in pyspark fit time for columnar input + spark-rapids + gpu #9016

Closed · eordentlich opened this issue Apr 9, 2023 · 2 comments · Fixed by #9088
eordentlich commented Apr 9, 2023

In some recent benchmarking with spark-rapids, I noticed slower performance than in past recorded runs with early pre-release pyspark API versions on a particular parquet-format dataset. Looking back through past results, the slowdown appears to have been introduced quite a while ago. After building and profiling a couple of those past versions, I think I have narrowed it down to this change: https://github.com/dmlc/xgboost/pull/8284/files#diff-e02cb86420040cfdaa950d0f5b5d1d50149a7df9c03d13e23b2c7c4f9426d762R249
It seems that converting the non-feature columns (such as labels) to NumPy here, instead of simply leaving them as pd.Series, can lead to significant slowdowns (nearly 50% longer fit time on a subset of the mortgage dataset) when feature_cols is not None, i.e. when running with spark-rapids GPU-accelerated data loading + gpu_hist and reading from columnar parquet format.
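For illustration, here is a minimal standalone sketch (not the actual xgboost code path; the Series size and the use of np.stack as a stand-in for stack_series are assumptions) of why element-wise stacking of a plain numeric column is so much more expensive than a zero-copy view of the same buffer:

```python
import time

import numpy as np
import pandas as pd

# A stand-in for a label column: one plain float per row.
labels = pd.Series(np.random.rand(1_000_000))

t0 = time.perf_counter()
# Costly path: np.stack iterates over the values one by one, wrapping
# each element as a 0-d array before concatenating.
stacked = np.stack(labels.to_numpy())
t1 = time.perf_counter()
# Cheap path: the Series already wraps a contiguous float64 buffer,
# so a view involves no per-element work.
view = labels.to_numpy(copy=False)
t2 = time.perf_counter()

assert np.array_equal(stacked, view)
print(f"stack: {t1 - t0:.3f}s  view: {t2 - t1:.6f}s")
```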

@wbo4958, can you take a look? (cc @trivialfis)

wbo4958 commented Apr 10, 2023

Hi @eordentlich, thanks for the detailed finding. I will take a look at it today.

wbo4958 commented Apr 25, 2023

Sorry for the late response.

I just reproduced this issue. I tested the mortgage dataset (28 features and 1 label), 118,781,111 rows in total, on the latest xgboost and the RAPIDS 23.02 release.

The xgboost JVM package takes 88.256 s to finish, while xgboost pyspark takes 222.49 s. After commenting out the stack_series call for the label column, xgboost pyspark takes only 117 s, and the accuracies are identical. As confirmed with @trivialfis, stack_series is not needed for a single numeric column. I will put up a PR to fix it.
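For reference, a minimal sketch of what such a fix could look like; the helper name, the object-dtype check, and the vector-valued case are assumptions for illustration, not the actual change in #9088:

```python
import numpy as np
import pandas as pd

def series_to_array(series: pd.Series) -> np.ndarray:
    """Hypothetical helper: stack only when each row holds an
    array-like value; pass a single numeric column through as a
    zero-copy NumPy view."""
    if series.dtype == object:
        # Rows are array-like (e.g. per-row feature vectors), so
        # stacking is needed to produce a 2-D array.
        return np.stack(series.to_list())
    # Plain numeric column (e.g. labels): no per-element copy needed.
    return series.to_numpy(copy=False)

# Example: a numeric label column takes the cheap path.
y = pd.Series([0.0, 1.0, 0.0])
assert series_to_array(y).dtype == np.float64
```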
