TPCDS queries on Gluten+Velox in EMR is considerably slower than OSS Spark #4940

sagarlakshmipathy · 2024-03-12T21:13:48Z

Backend

VL (Velox)

Bug description

[Expected behavior] Faster query runs compared to OSS Spark
[Actual behavior] OSS Spark runs in about half the time taken by Gluten+Velox Spark.

Spark version

None

Spark configurations

Gluten+Velox+Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=30g --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

OSS Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

System information

Environment: Amazon EMR - 10 workers, 1 driver all m5.4xlarge
OS: Amazon Linux 2

Relevant logs

Wondering what you need me to capture that'll help you

zhouyuan · 2024-03-13T09:17:03Z

Hi @sagarlakshmipathy
Could you also share the per-query performance numbers? On TPCDS, Q72 is still a problem for Gluten and needs some special configuration. Here is a related discussion:
#1775

Are you testing with Hudi tables by any chance?
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
Hudi support is not ready in Gluten yet. The Hudi scan falls back to vanilla Spark code, and its row-based output is then converted by a RowToColumnar step (a memcpy) before it reaches Gluten's native operators. That conversion adds a lot of overhead.

thanks,
-yuan
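
One rough way to check for the fallback described above is to count row/columnar transitions in the physical plan text of a query. The sketch below is only an illustration: the operator-name patterns (names ending in `Transformer` for native operators, names containing `RowTo...Columnar` or `ColumnarToRow` for transitions) are assumptions about how a Gluten build prints its plan, and `SAMPLE_PLAN` is a made-up plan, not output captured from this cluster.

```python
import re
from collections import Counter

# Assumed naming patterns; verify them against the plan your Gluten build prints.
TRANSITION = re.compile(r"\b(RowTo\w*Columnar\w*|\w*ColumnarToRow\w*)\b")
NATIVE = re.compile(r"\b\w+Transformer\b")

def summarize_plan(plan_text):
    """Classify each non-empty plan line as transition, native, or vanilla."""
    counts = Counter()
    for line in plan_text.splitlines():
        if TRANSITION.search(line):
            counts["transition"] += 1
        elif NATIVE.search(line):
            counts["native"] += 1
        elif line.strip():
            counts["vanilla"] += 1
    return counts

# Hypothetical plan text, written by hand for illustration only.
SAMPLE_PLAN = """\
VeloxColumnarToRowExec
+- ProjectExecTransformer [ss_item_sk#1]
   +- FilterExecTransformer (ss_quantity#2 > 10)
      +- RowToVeloxColumnar
         +- FileScan parquet hudi.store_sales[ss_item_sk#1,ss_quantity#2]
"""

summary = summarize_plan(SAMPLE_PLAN)
print(dict(summary))
```

In spark-shell, the plan text for a real query can be taken from `df.queryExecution.executedPlan.toString` or read off the output of `df.explain()`. A vanilla scan sitting under a row-to-columnar transition is the pattern that would match the Hudi fallback.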

sagarlakshmipathy · 2024-03-14T07:57:52Z

Query ID	Gluten Velox Spark Hudi (ms)	OSS Spark Hudi (ms)
1	22040	16699
2	60531	33095
3	61031	25965
4	360561	172286
5	140865	72149
6	48038	22890
7	106637	44359
8	45072	19636

I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a gh issue/discussion I can +1 to?
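
For reference, the slowdown implied by the eight timings in the table above can be computed directly. This is a small sketch that only restates the reported numbers as ratios; the variable names are illustrative.

```python
# (query id, Gluten+Velox ms, OSS Spark ms), copied from the table above.
TIMINGS_MS = [
    (1, 22040, 16699),
    (2, 60531, 33095),
    (3, 61031, 25965),
    (4, 360561, 172286),
    (5, 140865, 72149),
    (6, 48038, 22890),
    (7, 106637, 44359),
    (8, 45072, 19636),
]

# Slowdown factor per query: Gluten time divided by OSS Spark time.
per_query = {q: round(g / o, 2) for q, g, o in TIMINGS_MS}

total_gluten = sum(g for _, g, _ in TIMINGS_MS)
total_oss = sum(o for _, _, o in TIMINGS_MS)
overall = round(total_gluten / total_oss, 2)

for q, ratio in per_query.items():
    print(f"q{q}: {ratio:.2f}x")
print(f"overall: {overall:.2f}x")
```

Across these eight queries Gluten+Velox takes roughly 2.1x as long as OSS Spark overall, with per-query ratios between about 1.3x (Q1) and 2.4x (Q7).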

zhouyuan · 2024-03-18T00:47:06Z

This is quite likely due to the fallback when scanning Hudi tables. Here is the tracking issue for the unified data lake design; Iceberg and Delta Lake are both supported now (though not 100%).
#3378

Thanks,
-yuan

my7ym · 2024-07-13T20:20:58Z

@sagarlakshmipathy Hey, may I know your setups & configurations for running Gluten on EMR? Thanks!

sagarlakshmipathy added the bug and triage labels Mar 12, 2024

zhouyuan added the performance label Mar 13, 2024
