-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update ClickBench benchmarks with DataFusion 36 #9404
Comments
@kmitchener any chance you are willing to run the benchmark numbers again for DF 36? |
(I am especially keen to see how #8827 has helped) |
@kmitchener @alamb I'm happy to help out here if needed |
@alamb @kmitchener I went ahead and ran the benchmarks against 36.0.0 - the results are available on this branch I've added them to the handy html from here which includes the results from releases 33 and 34 for comparison I accidentally created a PR against the ClickBench repo. I was hoping to get your feedback on these results first before doing that.... |
I have a basic question about the time being reported for the queries. If you take a look at the shell script the variable RES holds the result, it's taking the text from the 7th column in the 2nd line
If I run the cli there are two query times being reported (assuming one for creating the table and one for the query as per the note in the
I just wanted to double-check that using the time from the 2nd line of the output for the result is correct? In the example above, the reported result is 0.034 seconds. |
@pmcgleenon second time looks correct to me. it looks like the script will return 0.154 rather than 0.034 because the grep is done first. |
@Dandandan thanks for the confirmation, yep the result will be 0.154 in the example above |
For reference, here's the clickbench PR with the update ClickHouse/ClickBench#178 |
Thank you very much @pmcgleenon
I took a quick look at the results and in general they looked reasonable to me One thing I noticed is that there didn't seem to be a DataFusion 36 run for single file parquet (only partitioned). Also, I wonder if we should remove the older datafusion versions 🤔 Initial performance observationThe only query that seems to have gotten substantially faster is Q9: I think the improvement is due to #8721 from @korowa It was hoping to see a bigger improvement due to #8827 but it seems that has not yet been released. So we'll have to try with DataFusion 37 again, There are some queries that show slight degregation in speed -- I am not sure if that is realted to measurement variance or if we have increased our per partition file overheard or something. It would be nice to see the numbers for single file and see if they showed a similar pattern |
Thanks @alamb. I've fixed the file so that the Datafusion single file is visible for release 36. I've included 33, 34, 36 in this file so the results can be compared easily with the previous results - the previous versions will not be visible on the Clickbench overall results. The published clickbench file will only include the latest datafusion release (36.0.0 tag) - this is available here For a quick comparison here's the single file results for Datafusion 36 Yep Q9 is showing a decent improvement. |
I think the numbers are good to publish. In terms of the slowdown I agree it is likely something real (and we are tracking something similar in #8836) Perhaps there is per-file or per-partition overhead that has increased in 34.0.0 that we haven't gotten to the bottom of. @Ted-Jiang I wonder if this could be related to caching the file metadata -- for example I wonder if using something like https://docs.rs/datafusion/latest/datafusion/execution/cache/cache_manager/struct.CacheManager.html would improve the performance as it would avoid some small IOs at the start of the query It sort of feels like cheating however to cache the metadata, though I suppose it would be allowed 🤔 @Ted-Jiang is there some way to test via configuration setting if caching the per-file metadata would help these queries? I couldn't find any way to enable Caching by default for the Cache Manager. If it turns out to make a difference, maybe we could provide a simple LRU type cache by default in DataFusion 🤔 |
@alamb Sorry for the late response.
There are two kinds of cache in datafusion
This sounds like great, like |
@alamb the Clickbench PR with the datafusion 36 updates has been merged ClickHouse/ClickBench#178 So I think we are good here! |
Thank you very much @pmcgleenon 🙏 |
Filed #11567 to track udating for DataFusion 40 |
Like #8789
Is your feature request related to a problem or challenge?
DataFusion 36 has been released. https://crates.io/crates/datafusion/36.0.0
It would be great to update ClickBench https://benchmark.clickhouse.com/ with runs from the latest version.
It appears we still are at 34: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/results/partitioned.json
Describe the solution you'd like
Perhaps we can follow the model of ClickHouse/ClickBench#159 @kmitchener )
Describe alternatives you've considered
No response
Additional context
No response
The text was updated successfully, but these errors were encountered: