dropDuplicates
Comet is 4x slower than Spark for the following query:
```python
spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet") \
    .repartition("ss_item_sk") \
    .dropDuplicates(["ss_item_sk", "ss_quantity"]) \
    .write.parquet("output.parquet")
```
If I remove the dropDuplicates call, I see similar performance between Comet and Spark.
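For context, Spark's optimizer rewrites `dropDuplicates` over a column subset into an aggregate (group by the subset columns, `first()` of the remaining columns), so the query above ends up exercising Comet's aggregation operator. The sketch below is a hand-written roughly equivalent formulation that can be used to confirm the same slowdown appears on the plain groupBy/agg path; the session setup, overwrite mode, and the equivalence idea are illustrative additions, not from the original report.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Illustrative local session; the report runs this against TPC-DS SF100 data.
spark = SparkSession.builder.appName("dedup-repro").getOrCreate()

df = spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet") \
    .repartition("ss_item_sk")

# dropDuplicates(["ss_item_sk", "ss_quantity"]) is planned as an aggregate:
# group by the key columns and take first() of every other column.
keys = ["ss_item_sk", "ss_quantity"]
others = [c for c in df.columns if c not in keys]

deduped = df.groupBy(*keys).agg(*[F.first(c).alias(c) for c in others])
deduped.write.mode("overwrite").parquet("output.parquet")
```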
Disabling Comet aggregation improves the runtime from 230s to 64s, which is slightly faster than Spark at 78s.
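For anyone reproducing this, the following is a minimal sketch of disabling only Comet's aggregate operator while leaving the rest of Comet enabled. The config key `spark.comet.exec.aggregate.enabled` is an assumption on my part; verify the exact name against the Comet configuration docs for your version.

```python
from pyspark.sql import SparkSession

# Assumes a session already configured with the Comet plugin; only the
# operator-level switch below is the point of this sketch.
spark = SparkSession.builder.appName("dedup-comet-agg-off").getOrCreate()

# Assumption: the per-operator switch is spark.comet.exec.aggregate.enabled.
# Check the Comet configuration guide if your version uses a different key.
spark.conf.set("spark.comet.exec.aggregate.enabled", "false")

spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet") \
    .repartition("ss_item_sk") \
    .dropDuplicates(["ss_item_sk", "ss_quantity"]) \
    .write.parquet("output.parquet")
```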