Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ClickBench benchmarks with DataFusion 36 #9404

Closed
alamb opened this issue Feb 29, 2024 · 15 comments
Closed

Update ClickBench benchmarks with DataFusion 36 #9404

alamb opened this issue Feb 29, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 29, 2024

Like #8789

Is your feature request related to a problem or challenge?

DataFusion 36 has been released. https://crates.io/crates/datafusion/36.0.0

It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version.

It appears we still are at 34: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/results/partitioned.json

Describe the solution you'd like

Perhaps we can follow the model of ClickHouse/ClickBench#159 @kmitchener )

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Feb 29, 2024

@kmitchener any chance you are willing to run the benchmark numbers again for DF 36?

@alamb
Copy link
Contributor Author

alamb commented Feb 29, 2024

(I am especially keen to see how #8827 has helped)

@pmcgleenon
Copy link
Contributor

@kmitchener @alamb I'm happy to help out here if needed

@pmcgleenon
Copy link
Contributor

@alamb @kmitchener I went ahead and ran the benchmarks against 36.0.0 - the results are available on this branch

I've added them to the handy html from here which includes the results from releases 33 and 34 for comparison

I accidentally created a PR against the ClickBench repo. I was hoping to get your feedback on these results first before doing that....

Screenshot 2024-03-07 at 19 55 26

@pmcgleenon
Copy link
Contributor

pmcgleenon commented Mar 7, 2024

I have a basic question about the time being reported for the queries. If you take a look at the shell script
https://github.com/ClickHouse/ClickBench/blob/main/datafusion/run.sh#L34

the variable RES holds the result, it's taking the text from the 7th column in the 2nd line

RES=`datafusion-cli -f $CREATE_SQL_FILE /tmp/query.sql 2>&1 | grep "Query took" | sed -n 2p | awk '{print $7}'`

If I run the cli there are two query times being reported (assuming one for creating the table and one for the query as per the note in the run.sh.

/arrow-datafusion/datafusion-cli/target/release/datafusion-cli -f create_partitioned.sql /tmp/query.sql 2>&1
DataFusion CLI v36.0.0
0 rows in set. Query took 0.034 seconds.

+---------------------+-----------+
| m                   | pageviews |
+---------------------+-----------+
| 2013-07-15T12:40:00 | 513       |
| 2013-07-15T12:41:00 | 457       |
| 2013-07-15T12:42:00 | 470       |
| 2013-07-15T12:43:00 | 468       |
| 2013-07-15T12:44:00 | 453       |
| 2013-07-15T12:45:00 | 462       |
| 2013-07-15T12:46:00 | 481       |
| 2013-07-15T12:47:00 | 458       |
| 2013-07-15T12:48:00 | 466       |
| 2013-07-15T12:49:00 | 467       |
+---------------------+-----------+
10 rows in set. Query took 0.154 seconds.

I just wanted to double-check that using the time from the 2nd line of the output for the result is correct? In the example above, the reported result is 0.034 seconds.

@Dandandan
Copy link
Contributor

Dandandan commented Mar 8, 2024

@pmcgleenon second time looks correct to me.

it looks like the script will return 0.154 rather than 0.034 because the grep is done first.

@pmcgleenon
Copy link
Contributor

@Dandandan thanks for the confirmation, yep the result will be 0.154 in the example above

@pmcgleenon
Copy link
Contributor

For reference, here's the clickbench PR with the update ClickHouse/ClickBench#178

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2024

@alamb @kmitchener I went ahead and ran the benchmarks against 36.0.0 - the results are available on this branch

Thank you very much @pmcgleenon

I accidentally created a PR against the ClickBench repo. I was hoping to get your feedback on these results first before doing that....

I took a quick look at the results and in general they looked reasonable to me

One thing I noticed is that there didn't seem to be a DataFusion 36 run for single file parquet (only partitioned).

Also, I wonder if we should remove the older datafusion versions 🤔

Initial performance observation

The only query that seems to have gotten substantially faster is Q9: I think the improvement is due to #8721 from @korowa

It was hoping to see a bigger improvement due to #8827 but it seems that has not yet been released. So we'll have to try with DataFusion 37 again,

There are some queries that show slight degregation in speed -- I am not sure if that is realted to measurement variance or if we have increased our per partition file overheard or something. It would be nice to see the numbers for single file and see if they showed a similar pattern

@pmcgleenon
Copy link
Contributor

pmcgleenon commented Mar 8, 2024

Thanks @alamb. I've fixed the file so that the Datafusion single file is visible for release 36. I've included 33, 34, 36 in this file so the results can be compared easily with the previous results - the previous versions will not be visible on the Clickbench overall results.

The published clickbench file will only include the latest datafusion release (36.0.0 tag) - this is available here

For a quick comparison here's the single file results for Datafusion 36

Screenshot 2024-03-08 at 22 26 09

Yep Q9 is showing a decent improvement.

@alamb
Copy link
Contributor Author

alamb commented Mar 9, 2024

I think the numbers are good to publish.

In terms of the slowdown I agree it is likely something real (and we are tracking something similar in #8836)

Perhaps there is per-file or per-partition overhead that has increased in 34.0.0 that we haven't gotten to the bottom of. @Ted-Jiang I wonder if this could be related to caching the file metadata -- for example I wonder if using something like https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManager.html would improve the performance as it would avoid some small IOs at the start of the query

It sort of feels like cheating however to cache the metadata, though I suppose it would be allowed 🤔

@Ted-Jiang is there some way to test via configuration setting if caching the per-file metadata would help these queries? I couldn't find any way to enable Caching by default for the Cache Manager.

If it turns out to make a difference, maybe we could provide a simple LRU type cache by default in DataFusion 🤔

@Ted-Jiang
Copy link
Member

Ted-Jiang commented Mar 11, 2024

@alamb Sorry for the late response.

@Ted-Jiang is there some way to test via configuration setting if caching the per-file metadata would help these queries?

There are two kinds of cache in datafusion ListFilesCache and FileStatisticsCache

  1. ListFilesCache: Before I change this into session level, i think it is already on in global level (so no config setting in df 🤣 ). it will reuse the result under the same path of list files.
  2. FileStatisticsCache: This cache is specifically used for Cost-Based Optimization (CBO), It is typically disabled and only activated within our internal system to cache FileStatistics, which are used for join selection.

If it turns out to make a difference, maybe we could provide a simple LRU type cache by default in DataFusion

This sounds like great, like linked_hash_map, i would like to help this.

@pmcgleenon
Copy link
Contributor

@alamb the Clickbench PR with the datafusion 36 updates has been merged ClickHouse/ClickBench#178

So I think we are good here!

Screenshot 2024-03-22 at 07 36 20

@alamb alamb closed this as completed Mar 22, 2024
@alamb
Copy link
Contributor Author

alamb commented Mar 22, 2024

Thank you very much @pmcgleenon 🙏

@alamb
Copy link
Contributor Author

alamb commented Jul 20, 2024

Filed #11567 to track udating for DataFusion 40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants