-
A few things: it sounds like you aren't using the Spark readers directly, like Parquet or ORC, but going through the Hudi reader. I'm assuming that reads on the CPU and then has to do a columnar-to-row conversion, which can be a lot of overhead. You can check the metrics in the Spark UI SQL tab to see how much time is spent there versus the rest of the query. The readers are one place we accelerate very well, and they go directly to GPU.
Unless you really need the timezone to be Asia/Shanghai, I suggest you run your Spark cluster with the timezone set to UTC.
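A minimal sketch of what that could look like on spark-submit, assuming the plugin is checking the JVM default zone (the "Actual default zone id: Asia/Shanghai" message below suggests it is), so both the SQL session timezone and the driver/executor JVM timezones get set to UTC:
# Sketch only: run both the SQL session and the driver/executor JVMs in UTC
--conf spark.sql.session.timeZone=UTC \
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC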
Which operation is it reporting as not supported?
So this is weird. What is the rest of this error? I assume it failed to load. Look in the executor logs for the failed executor to get more information. The other thing that would be useful is a screenshot of the SQL query (Spark UI SQL tab), if that is something you can share.
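In case it helps, a quick way to pull the relevant lines on a standalone worker, assuming the default work directory layout ($SPARK_HOME/work/&lt;app-id&gt;/&lt;executor-id&gt;/stderr) and with a hypothetical application id as a placeholder:
# APP_ID is hypothetical; substitute the application id shown in the Spark UI
APP_ID=app-20221028124646-0001
# Standalone executors write stdout/stderr under the worker's work dir by default
grep -iE "error|exception" $SPARK_HOME/work/$APP_ID/*/stderr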
-
@abellina I fixed it; I got the error because I had put the jar in the wrong folder on some worker nodes.
-
@tgravescs Thanks for your reply. I have tried setting concurrent GPU tasks to 2. Overall the job still runs in 20 minutes, the same as on CPU, but uses much less resource.
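For reference, a hedged sketch of how those knobs relate in the submit command further down in this thread: with 4 executor cores and a task GPU amount of 0.25, up to 4 tasks are scheduled on the one GPU per executor, and spark.rapids.sql.concurrentGpuTasks caps how many of them run GPU work at the same time.
# Sketch only, mirroring the submit command below:
# 4 cores * 0.25 GPU per task = 4 tasks scheduled per executor/GPU;
# concurrentGpuTasks then limits how many of those use the GPU concurrently.
--conf spark.executor.cores=4 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.rapids.sql.concurrentGpuTasks=2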
-
We have set up a standalone Spark cluster with four workers (each worker node has one GPU installed).
Spark version: 3.1.3
Hudi version: 0.9
RAPIDS jar: rapids-4-spark_2.12-22.10.0.jar
Below is what we added to spark-env.sh:
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/xxx/getGpusResources.sh"
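For completeness, the discovery script referenced above typically just reports the GPU indices that nvidia-smi sees, as JSON in the shape Spark expects. A minimal sketch along the lines of the example shipped with Spark (the actual contents of /xxx/getGpusResources.sh may differ):
#!/usr/bin/env bash
# Emit this worker's GPU addresses as a ResourceInformation JSON,
# e.g. {"name": "gpu", "addresses":["0"]}
ADDRS=$(nvidia-smi --query-gpu=index --format=csv,noheader | paste -sd, - | sed 's/,/","/g')
echo "{\"name\": \"gpu\", \"addresses\":[\"$ADDRS\"]}"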
Below are my spark-submit parameters (I have tried spark.sql.shuffle.partitions at 4/16/64/220/400 but saw no big difference):
$SPARK_HOME/bin/spark-submit \
--class DataWorkflowMain \
--master spark://${MASTER_HOST}:7077 \
--deploy-mode client \
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.rapids.sql.concurrentGpuTasks=4 \
--conf spark.kryo.registrator=com.nvidia.spark.rapids.GpuKryoRegistrator \
--driver-memory 10G \
--conf spark.executor.memory=40G \
--conf spark.executor.cores=4 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.rapids.sql.explain=ALL \
--conf spark.rapids.memory.pinnedPool.size=4G \
--conf spark.locality.wait=0s \
--conf spark.sql.shuffle.partitions=64 \
--conf spark.sql.adaptive=true \
--conf spark.rapids.sql.enabled=true \
--conf spark.sql.adaptive.coalescePartitions.enabled=true \
--conf spark.sql.files.maxPartitionBytes=512m \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf 'spark.sql.legacy.parquet.datetimeRebaseModeInRead'=CORRECTED \
--conf 'spark.sql.legacy.parquet.datetimeRebaseModeInRead'=LEGACY \
--conf spark.sql.session.timeZone=UTC \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter' \
--conf spark.worker.resourcesFile=${SPARK_RAPIDS_DIR}/bin/getGpusResources.sh \
--conf spark.executor.resourcesFile=${SPARK_RAPIDS_DIR}/bin/getGpusResources.sh \
--name $job_name \
/home/hadoop/xxx-jar-with-dependencies.jar
Below is the job info:
The ETL job calls the Hudi API to load the data, registers it as a temp table in Spark SQL, and the rest is pure table joins in Spark SQL.
Below are the "cannot run on GPU" messages I extracted from the log:
! cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.LocalTableScanExec
!Exec cannot run on GPU because unsupported data types in output: TimestampType
!Expression AppointmentDate#755 cannot run on GPU because expression AttributeReference AppointmentDate#755 produces an unsupported type TimestampType
!Expression cast(TradeInValueUpliftAmount#2054 as decimal(20,4)) cannot run on GPU because Only UTC zone id is supported. Actual default zone id: Asia/Shanghai
!Exec cannot run on GPU because the BroadcastHashJoin this feeds is not on the GPU
Is there any tuning parameter to enable GPU support for the above operations?
I have GPUs installed and I can see that each worker is using one GPU. Why do I still see error messages like the one below (many of them)?
2022-10-28 12:46:46,467 ERROR scheduler.TaskSchedulerImpl: Lost executor 999 on 10.120.39.19: Unable to create executor due to com.nvidia.spark.SQLPlugin