TPCDS queries on Gluten+Velox in EMR is considerably slower than OSS Spark #4940

Open
sagarlakshmipathy opened this issue Mar 12, 2024 · 4 comments
Labels: bug (Something isn't working), performance, triage

Comments

@sagarlakshmipathy

Backend

VL (Velox)

Bug description

[Expected behavior] Faster query runs compared to OSS Spark.
[Actual behavior] OSS Spark finishes in about half the time taken by Gluten+Velox Spark.

Spark version

None

Spark configurations

Gluten+Velox+Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cores 5 --num-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.plugins=io.glutenproject.GlutenPlugin --conf spark.memory.offHeap.enabled=true --conf spark.memory.offHeap.size=30g --conf spark.shuffler=org.apache.spark.shuffle.sort.ColumnarShuffleManager --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'

OSS Spark

./spark-3.4.1-bin-hadoop3/bin/spark-shell --master yarn --deploy-mode client --driver-memory 19g --executor-memory 19g --executor-cornum-executors 32 --jars /home/hadoop/hudi-spark3.4-bundle_2.12-0.14.1.jar,/home/hadoop/hudi-benchmarks-0.1-SNAPSHOT.jar --packages org.apache.hadoop:hadoop-aws:3.2.4 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain --conf spark.sql.catalogImplementation=in-memory --conf spark.ui.proxyBase="" --conf 'spark.eventLog.enabled=true' --conf 'spark.eventLog.dir=hdfs:///var/log/spark/apps'
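When two long launch commands are compared by eye, it is easy to miss which settings actually differ. The following is a minimal sketch, not part of the original report, that extracts the `--conf key=value` pairs from each command and lists the keys that appear only in the Gluten run. The helper name `parse_confs` is hypothetical, and the two command strings are abbreviated copies of the commands above (only some `--conf` flags are kept).

```python
import shlex


def parse_confs(command: str) -> dict:
    """Collect every `--conf key=value` pair from a spark-shell command line."""
    tokens = shlex.split(command)  # handles quoted arguments such as 'k=v'
    confs = {}
    for flag, arg in zip(tokens, tokens[1:]):
        if flag == "--conf" and "=" in arg:
            key, value = arg.split("=", 1)
            confs[key] = value
    return confs


# Abbreviated copies of the two commands from the issue (not the full commands).
GLUTEN_CMD = (
    "spark-shell "
    "--conf spark.plugins=io.glutenproject.GlutenPlugin "
    "--conf spark.memory.offHeap.enabled=true "
    "--conf spark.memory.offHeap.size=30g "
    "--conf spark.shuffler=org.apache.spark.shuffle.sort.ColumnarShuffleManager "
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
    "--conf spark.sql.catalogImplementation=in-memory "
    "--conf 'spark.eventLog.enabled=true'"
)
OSS_CMD = (
    "spark-shell "
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
    "--conf spark.sql.catalogImplementation=in-memory "
    "--conf 'spark.eventLog.enabled=true'"
)

gluten = parse_confs(GLUTEN_CMD)
oss = parse_confs(OSS_CMD)

# Keys set only for the Gluten run, and shared keys whose values differ.
only_in_gluten = sorted(set(gluten) - set(oss))
changed = sorted(k for k in set(gluten) & set(oss) if gluten[k] != oss[k])

print("Only in Gluten run:", only_in_gluten)
print("Different values:", changed)
```

The keys are reproduced exactly as written in the issue. Spark itself selects the shuffle implementation through `spark.shuffle.manager`, so the `spark.shuffler` key may be worth double-checking.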

System information

Environment: Amazon EMR - 10 workers, 1 driver all m5.4xlarge
OS: Amazon Linux 2

Relevant logs

Wondering what you need me to capture that'll help you
@sagarlakshmipathy sagarlakshmipathy added bug Something isn't working triage labels Mar 12, 2024
@zhouyuan
Contributor

Hi @sagarlakshmipathy
Could you also share the per-query performance numbers? On TPC-DS, Q72 is still troublesome for Gluten and needs some special configuration. Here is a related discussion:
#1775

Are you testing with HUDI tables by any chance?
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
Hudi support is not ready in Gluten yet. The Hudi scan runs in vanilla Spark code and is connected to the Gluten native operators through a RowtoColumn (memcpy) conversion, which adds a lot of overhead.

thanks,
-yuan
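One way to check whether a query falls back as described above is to look at the physical plan text (for example, the output of `EXPLAIN`) and count row/columnar transition nodes against native nodes. The sketch below is only an illustration: the function name `summarize_plan` is hypothetical, the name patterns (`...Transformer` for native operators, `RowTo...Columnar` and `ColumnarToRow` for transitions) are assumptions about how the operators are printed, and the sample plan is made up rather than taken from this issue.

```python
import re

# Assumed name fragments; adjust them to what your plan output actually shows.
TRANSITION = re.compile(r"RowTo\w*Columnar|ColumnarToRow")
NATIVE_SUFFIX = "Transformer"


def summarize_plan(plan_text: str) -> dict:
    """Bucket the operator on each plan line as transition, native, or other."""
    summary = {"transition": [], "native": [], "other": []}
    for line in plan_text.splitlines():
        if not line.strip() or line.lstrip().startswith("=="):
            continue  # skip blank lines and "== Physical Plan ==" style headers
        match = re.search(r"[A-Za-z]\w*", line)  # first identifier on the line
        if match is None:
            continue
        name = match.group(0)
        if TRANSITION.search(name):
            summary["transition"].append(name)
        elif name.endswith(NATIVE_SUFFIX):
            summary["native"].append(name)
        else:
            summary["other"].append(name)
    return summary


# Hypothetical plan text, for illustration only.
SAMPLE_PLAN = """\
== Physical Plan ==
VeloxColumnarToRowExec
+- ProjectExecTransformer [c1]
   +- FilterExecTransformer (c1 > 0)
      +- RowToVeloxColumnar
         +- FileScan parquet [c1]
"""

result = summarize_plan(SAMPLE_PLAN)
for bucket, names in result.items():
    print(bucket, len(names), names)
```

A scan that appears under "other" with a row-to-columnar transition directly above it would be consistent with the fallback described in this comment.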

@sagarlakshmipathy
Author

| Query ID | Gluten Velox Spark Hudi (ms) | OSS Spark Hudi (ms) |
|---|---|---|
| 1 | 22040 | 16699 |
| 2 | 60531 | 33095 |
| 3 | 61031 | 25965 |
| 4 | 360561 | 172286 |
| 5 | 140865 | 72149 |
| 6 | 48038 | 22890 |
| 7 | 106637 | 44359 |
| 8 | 45072 | 19636 |

I didn't bother running the rest of them. I am testing Hudi tables with Gluten. Is there a gh issue/discussion I can +1 to?
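To put the per-query numbers above in perspective, the short sketch below computes the slowdown of the Gluten+Velox run relative to OSS Spark for each query and for the eight queries combined. The timings are copied from the table above; the variable and function names are illustrative only.

```python
# (query id, Gluten+Velox Hudi ms, OSS Spark Hudi ms), copied from the table above
TIMINGS = [
    (1, 22040, 16699),
    (2, 60531, 33095),
    (3, 61031, 25965),
    (4, 360561, 172286),
    (5, 140865, 72149),
    (6, 48038, 22890),
    (7, 106637, 44359),
    (8, 45072, 19636),
]


def slowdown(gluten_ms: int, oss_ms: int) -> float:
    """Return how many times longer the Gluten run took than the OSS run."""
    return gluten_ms / oss_ms


ratios = {qid: slowdown(g, o) for qid, g, o in TIMINGS}
for qid, ratio in ratios.items():
    print(f"Q{qid}: {ratio:.2f}x")

total_gluten = sum(g for _, g, _ in TIMINGS)
total_oss = sum(o for _, _, o in TIMINGS)
overall = slowdown(total_gluten, total_oss)
print(f"Overall: {total_gluten} ms vs {total_oss} ms ({overall:.2f}x)")
```

Across these eight queries the Gluten run takes about 2.08 times as long in total, which matches the "half the time" observation in the bug description.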

@zhouyuan
Contributor

This is quite likely due to the fallback when scanning Hudi tables. Here is the issue tracking the unified data lake design. Iceberg and Delta Lake are both supported now, although not 100%.
#3378

Thanks,
-yuan

@my7ym

my7ym commented Jul 13, 2024

@sagarlakshmipathy Hey, may I know your setups & configurations for running Gluten on EMR? Thanks!
