Improve performance of `dropDuplicates` #1275

andygrove · 2025-01-13T19:44:38Z

What is the problem the feature request solves?

Comet is 4x slower than Spark for the following query:

spark.read.parquet("/mnt/bigdata/tpcds/sf100/store_sales.parquet") \
    .repartition("ss_item_sk") \
    .dropDuplicates(["ss_item_sk", "ss_quantity"]) \
    .write.parquet("output.parquet")

If I remove the dropDuplicates call, I see similar performance between Comet and Spark.

Describe the potential solution

No response

Additional context

No response


andygrove · 2025-01-13T19:48:47Z

Disabling Comet aggregation reduces the runtime from 230s to 64s, which is slightly faster than Spark at 78s.
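For context on why the aggregation setting matters here: Spark's optimizer rewrites `dropDuplicates` on a subset of columns into an aggregate that groups by those columns and keeps the first value of every other column, so the deduplication runs through the aggregate operator. The following is a minimal pure-Python sketch of that equivalence, not Comet's or Spark's implementation. The helper name `drop_duplicates` and the sample rows are hypothetical and chosen only for illustration; the sketch keeps the first row encountered per key, whereas Spark does not guarantee which row survives.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

Row = Dict[str, Any]


def drop_duplicates(rows: Iterable[Row], subset: Sequence[str]) -> List[Row]:
    """Keep one row per distinct combination of the `subset` columns.

    Hypothetical helper: mirrors GROUP BY <subset> with first(...) applied
    to the remaining columns.
    """
    first_seen: Dict[Tuple[Any, ...], Row] = {}
    for row in rows:
        # The grouping key is the tuple of values in the subset columns.
        key = tuple(row[col] for col in subset)
        # setdefault stores the row only the first time a key is seen.
        first_seen.setdefault(key, row)
    return list(first_seen.values())


# Illustrative sample data (not taken from the benchmark dataset).
rows = [
    {"ss_item_sk": 1, "ss_quantity": 5, "ss_ticket_number": 100},
    {"ss_item_sk": 1, "ss_quantity": 5, "ss_ticket_number": 101},
    {"ss_item_sk": 1, "ss_quantity": 7, "ss_ticket_number": 102},
    {"ss_item_sk": 2, "ss_quantity": 5, "ss_ticket_number": 103},
]
deduped = drop_duplicates(rows, ["ss_item_sk", "ss_quantity"])
```

Because the work is a keyed aggregation, its cost depends on how the engine hashes and groups the key columns, which is consistent with the runtime changing when Comet's aggregate operator is disabled.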

andygrove added the enhancement and performance labels Jan 13, 2025

kazuyukitanimura added this to the 0.6.0 milestone Jan 14, 2025
