[QST] - Spark3 question #5335

eyalhir74 · 2022-02-19T18:21:36Z

I'm trying to run some queries on big data. I've taken a portion of our data (only 43GB) and tested a query with 15 fields in two scenarios:

1. 24 CPU cores with 200 files, up to 400MB per file
2. X CPU cores with one V100 GPU with 10 files, each about 4+GB, as per the tuning guide suggestions

The GPU is mostly idle, and the GPU run is much slower than the CPU run. Running Spark on the GPU with the 400MB files is slow as well.

I'm using the following command to run the GPU code:
$SPARK_HOME/bin/spark-shell --master "local[10]" --driver-memory 50g --conf spark.locality.wait=0s --conf spark.rapids.memory.pinnedPool.size=30G --conf spark.sql.files.maxPartitionBytes=256m --conf spark.rapids.sql.concurrentGpuTasks=2 --conf spark.plugins=com.nvidia.spark.SQLPlugin --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}

Changing maxPartitionBytes, concurrentGpuTasks, or any other parameter doesn't seem to have any effect.
As far as I can see, most of the time neither the network I/O nor the GPU is busy.

Any ideas would be highly appreciated.
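
For a rough feel of how spark.sql.files.maxPartitionBytes translates into scan tasks, here is a minimal Python sketch of the split-size rule that stock Spark 3.x applies when planning file scans, as far as I understand it. It assumes splittable files (for example Parquet or ORC; the thread does not say which format is used), treats each file as exactly 4300MB (a made-up figure standing in for "4+GB"), and ignores how Spark later packs splits into partitions, so the counts are only estimates. The function names are hypothetical and not part of any Spark API.

```python
import math

MB = 1024 * 1024


def max_split_bytes(file_sizes, max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB, min_partition_num=10):
    """Approximate the split size Spark 3.x picks for a file scan.

    max_partition_bytes -> spark.sql.files.maxPartitionBytes (default 128MB)
    open_cost_in_bytes  -> spark.sql.files.openCostInBytes (default 4MB)
    min_partition_num   -> stands in for the default parallelism,
                           assumed to be 10 here because of local[10]
    """
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    bytes_per_core = total_bytes // min_partition_num
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


def estimated_splits(file_sizes, **kwargs):
    """Estimate how many splits the files are cut into (assumes splittable files)."""
    split = max_split_bytes(file_sizes, **kwargs)
    return sum(math.ceil(size / split) for size in file_sizes)


# Hypothetical input: ten files of 4300MB each.
files = [4300 * MB] * 10
print(estimated_splits(files, max_partition_bytes=256 * MB))  # 170
print(estimated_splits(files))                                # 340 with the 128MB default
```

Under these assumptions the setting does change the number of scan tasks, so if changing it has no visible effect on runtime, the bottleneck is probably elsewhere.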

Answered by jlowe

jlowe · 2022-02-22T14:53:17Z

With the GPU being mostly idle, I'm wondering about two possibilities:

1. Is the entire query eligible to run on the GPU? There are costs to transitioning between CPU and GPU, and this could potentially cause some of the slowdown.
2. Is the query mostly bound by the filesystem read?

To answer the first question, you could run with the config spark.rapids.sql.explain set to true, and then you should see log messages for any portions of queries that are not on the GPU (and why they're not on the GPU). Depending on how many rows are being processed by nodes not on the GPU, it could contribute substantially to the slowdown you're seeing. Also, if there are portions of the query not running on the GPU, then the reduced parallelism of the GPU setup (10 cores vs. 24) will impact the query performance.

If the query is dominated by filesystem access, then running the query with fewer than half of the CPU cores (10 vs. 24) could significantly slow down the GPU run. Fetching the raw data (as opposed to decoding it) is still done by the CPU, so this could be a significant contributor to the slowdown in comparison. To help answer this question, you could try running with more CPU cores in your GPU-configured setup and see how that impacts the query. Separately, you could use the Spark SQL web UI to examine the graphical query plan and see if the bufferTime metric for the GpuFileSourceScanExec or BatchScanExec is significantly higher than the gpuDecodeTime. The former metric is how long the tasks spent reading the raw data from the filesystem, while the latter reflects how much time the tasks spent waiting for the GPU to decode the raw data after it was fetched from the filesystem.
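
The bufferTime vs. gpuDecodeTime comparison can be sketched as a small helper. This is only an illustration: it assumes the two values were copied by hand from the scan node in the Spark SQL web UI and converted to milliseconds, and the `ratio` threshold for "significantly higher" is an arbitrary assumption, not something the plugin defines. The function name is hypothetical.

```python
def scan_bottleneck(buffer_time_ms, gpu_decode_time_ms, ratio=2.0):
    """Classify a scan node from two metrics read off the Spark SQL web UI.

    buffer_time_ms     -- time tasks spent reading raw data from the filesystem
    gpu_decode_time_ms -- time tasks spent waiting for the GPU to decode that data
    ratio              -- hypothetical threshold for "significantly higher"
    """
    if buffer_time_ms < 0 or gpu_decode_time_ms < 0:
        raise ValueError("metric values must be non-negative")
    if buffer_time_ms > ratio * gpu_decode_time_ms:
        return "io-bound"       # filesystem read dominates; more CPU cores may help
    if gpu_decode_time_ms > ratio * buffer_time_ms:
        return "decode-bound"   # GPU decode dominates
    return "balanced"


# Made-up numbers for illustration only.
print(scan_bottleneck(90_000, 5_000))
```

An "io-bound" result would point toward the second possibility above, where the CPU-side read rather than the GPU limits the query.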

viadea (Collaborator) · 2022-02-22T17:24:55Z

To help with the first possibility mentioned by @jlowe, we have a workload qualification doc here:
https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-workload-qualification.html
Since you already have a GPU Spark environment, you can refer to option #3 in that doc.

Set spark.rapids.sql.explain=all, then check the Spark driver log for any CPU-fallback messages.
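
When the driver log is long, the fallback lines can be pulled out with a short script. The sketch below is a hedged example, not an official tool: it assumes the explain lines look like the ones quoted in this thread, where a line starting with `!` (after optional indentation) names something that "cannot run on GPU because ...". The function name and regular expression are my own.

```python
import re

# Matches lines such as:
#   !Exec <ShuffleExchangeExec> cannot run on GPU because ...
# The "<Name>" part is optional because it may be missing in copied logs.
FALLBACK_RE = re.compile(
    r"^\s*!(\w+)(?:\s+<(\w+)>)?.*?cannot run on GPU because\s+(.*)$"
)


def find_fallbacks(log_text):
    """Return (kind, name, reason) tuples for every '!' line in the log text."""
    results = []
    for line in log_text.splitlines():
        match = FALLBACK_RE.match(line)
        if match:
            kind, name, reason = match.groups()
            results.append((kind, name, reason.strip()))
    return results


sample = """\
  !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced
    @Expression <AttributeReference> InfoList#1123 could run on GPU
"""
for kind, name, reason in find_fallbacks(sample):
    print(kind, name, reason)
```

Lines that do not start with `!`, such as the "could run on GPU" line in the sample, are skipped.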

eyalhir74 (Author) · 2022-02-23T07:20:55Z

Wow, that's a ton of information. Thank you both!
I am first trying to create bigger input files (as suggested in the tuning guide). Currently I have files of 300-500MB, and I'm trying to merge them into bigger files.
Once this is done, I'll explore all the tips you've mentioned and report back.

Thanks!

eyalhir74 (Author) · 2022-03-01T06:49:14Z

@viadea Thanks for the input, very helpful :)
I have a huge dataset and huge amounts of data to process, and it seems this is still a bit challenging with RAPIDS.

I've added the following as per the comments in the explain output
--conf spark.rapids.sql.explain=all --conf spark.rapids.sql.variableFloatAgg.enabled=true --conf spark.rapids.sql.castDecimalToFloat.enabled=true --conf spark.rapids.sql.incompatibleOps.enabled=true

As far as I can tell, these are the remaining issues preventing the query from running entirely on the GPU:
`!Exec cannot run on GPU because ArrayTypes or MapTypes in grouping expressions are not supported

        !Exec <ShuffleExchangeExec> cannot run on GPU because not all partitioning can be replaced; Columnar exchange without columnar children is inefficient

          !Partitioning <HashPartitioning> cannot run on GPU because hash_key expression AttributeReference sort_array(InfoList#1123, true)#2078 (ArrayType(StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)),true) is not supported); hash_key expression AttributeReference CASE WHEN isnull(map_keys(pv_supplyFeaturesStat#1134)) THEN null ELSE array_intersect(sort_array(map_keys(pv_supplyFeaturesStat#1134), true), [READ_MORE,EXPLORE_MORE,TABOOLA_REMINDER,NEXT_UP]) END#2080 (ArrayType(StringType,false) is not supported)

          !Exec <HashAggregateExec> cannot run on GPU because not all expressions can be replaced; ArrayTypes or MapTypes in grouping expressions are not supported

              !Expression <SortArray> sort_array(InfoList#1123, true) cannot run on GPU because expression SortArray sort_array(InfoList#1123, true) produces an unsupported type ArrayType(StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)),true); array expression AttributeReference InfoList#1123 (child StructType(StructField(experimentId,LongType,true), StructField(experimentLayerTemplateId,LongType,true), StructField(experimentVariantId,LongType,true)) is not supported)
                @Expression <AttributeReference> InfoList#1123 could run on GPU
                @Expression <Literal> true could run on GPU

                !NOT_FOUND <ArrayIntersect> array_intersect(sort_array(map_keys(pv_supplyFeaturesStat#1134), true), [READ_MORE,EXPLORE_MORE,TABOOLA_REMINDER,NEXT_UP]) cannot run on GPU because no GPU enabled version of expression class org.apache.spark.sql.catalyst.expressions.ArrayIntersect could be found

`

Is there anything further I can try to make it run on the GPU?
The query also gets killed by Spark; I'll have a look at this as well.

thanks
Eyal

jlowe · 2022-03-01T22:21:23Z

The RAPIDS Accelerator does not currently support hash partitioning on ArrayType, nor does it support sort_array on ArrayType. #3715 tracks sort_array, and I've filed #4887 to track adding support for GPU hashing of ArrayType.

eyalhir74 (Author) · 2022-03-07T05:04:15Z

@jlowe I've updated #4900 with all the missing ops I've encountered so far.

jlowe · 2022-03-08T14:49:02Z

Are there further questions for this issue, or is it covered by the other issues?
