Extend the Parquet writer's dictionary encoding benchmark. #16591
Conversation
@@ -202,8 +202,8 @@ NVBENCH_BENCH_TYPES(BM_parq_write_encode, NVBENCH_TYPE_AXES(d_type_list))
   .set_name("parquet_write_encode")
   .set_type_axes_names({"data_type"})
   .set_min_samples(4)
-  .add_int64_axis("cardinality", {0, 1000})
-  .add_int64_axis("run_length", {1, 32});
+  .add_int64_axis("cardinality", {1, 1000, 10'000, 100'000, 1'000'000})
Are there any resource/infrastructure constraints that we need to consider here? We can perhaps remove one or two entries from here if needed.
I think we can drop 1. Given the table size, it's effectively the same as 1000.
Also, should 1M just be 0? Not sure if the goal is to have unique elements (AFAIK row count is typically lower than 1M).
/ok to test
what's the reason for this change?
First of all, welcome back. Greg wanted me to push any updates I made to the benchmark for #16541. That said, not all of my local changes (even wider extended ranges) need to be pushed upstream if they aren't needed.
…::static_map` (#16541)

Part of #12261. This PR refactors the dictionary encoding in Parquet writers to migrate from `cuco::legacy::static_map` to `cuco::static_map` for building the dictionaries.

### Performance Results

The changes result in a +0.08% average speed improvement and a +16.22% average memory footprint increase (stemming from the sizes adjusted by `cuco::make_window_extent` due to the [prime gap](https://en.wikipedia.org/wiki/Prime_gap)) across the benchmark cases extended from #16591.

Currently, we do see a roughly 8% speed improvement in the map insert and find kernels, which is counteracted by the map init and map collect kernels as they have to process 16.22% more slots. With a cuco bump, the average speed improvement will increase from +0.08% to +3% and the memory footprint change will go back from +16.22% to +0%.

### Hardware used for benchmarking

```
`NVIDIA RTX 5880 Ada Generation`
* SM Version: 890 (PTX Version: 860)
* Number of SMs: 110
* SM Default Clock Rate: 18446744071874 MHz
* Global Memory: 23879 MiB Free / 48632 MiB Total
* Global Memory Bus Peak: 960 GB/sec (384-bit DDR @10001MHz)
* Max Shared Memory: 100 KiB/SM, 48 KiB/Block
* L2 Cache Size: 98304 KiB
* Maximum Active Blocks: 24/SM
* Maximum Active Threads: 1536/SM, 1024/Block
* Available Registers: 65536/SM, 65536/Block
* ECC Enabled: No
```

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- David Wendt (https://github.com/davidwendt)

URL: #16541
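To illustrate the footprint point: if the valid map extents come from a sparse, precomputed set of primes, rounding a requested capacity up to the next valid entry can allocate noticeably more slots than requested. The sketch below is purely conceptual; the extent list and numbers are made up for illustration, and this is not `cuco::make_window_extent`'s actual implementation.

```cpp
// Conceptual sketch only; NOT cuco's implementation of make_window_extent.
// If valid map extents come from a sparse, precomputed list of primes,
// rounding a requested capacity up to the next valid entry allocates more
// slots than asked for. The extent list below is made up for illustration.
#include <array>
#include <cstddef>
#include <iostream>

constexpr std::array<std::size_t, 4> valid_extents{104'729, 611'953, 1'299'709, 15'485'863};

std::size_t round_up_to_valid_extent(std::size_t requested)
{
  for (auto extent : valid_extents) {
    if (extent >= requested) { return extent; }
  }
  return valid_extents.back();  // a real implementation would handle larger requests
}

int main()
{
  std::size_t const requested = 1'120'000;  // e.g. dictionary keys / desired load factor
  std::size_t const allocated = round_up_to_valid_extent(requested);
  double const extra = 100.0 * (static_cast<double>(allocated) - static_cast<double>(requested)) /
                       static_cast<double>(requested);
  std::cout << "requested " << requested << " slots, allocated " << allocated << " ("
            << extra << "% extra)\n";
}
```

With a requested capacity of ~1.12M slots, rounding up to 1,299,709 gives roughly 16% extra, which is the same ballpark as the footprint change reported above.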
/ok to test
/ok to test
-  .add_int64_axis("cardinality", {0, 1000})
-  .add_int64_axis("run_length", {1, 32});
+  .add_int64_axis("cardinality", {0, 1000, 10'000, 100'000})
+  .add_int64_axis("run_length", {1, 32, 64});
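For readers less familiar with nvbench: every combination of the registered axis values becomes one benchmark state, so the axes above produce 4 × 3 = 12 states per data type. Below is a minimal, illustrative sketch of the registration pattern; the function name and the empty body are placeholders, not the actual cudf benchmark.

```cpp
// Illustrative nvbench sketch (not the actual cudf benchmark): each value
// combination of the registered axes is run as a separate benchmark state.
#include <nvbench/nvbench.cuh>

void parquet_write_encode_sketch(nvbench::state& state)
{
  auto const cardinality = state.get_int64("cardinality");
  auto const run_length  = state.get_int64("run_length");

  state.exec([&](nvbench::launch& launch) {
    // In the real benchmark: generate a table whose columns have `cardinality`
    // distinct values repeated in runs of `run_length`, then time the Parquet
    // writer's dictionary encoding. Omitted here.
    (void)cardinality;
    (void)run_length;
    (void)launch;
  });
}

NVBENCH_BENCH(parquet_write_encode_sketch)
  .set_name("parquet_write_encode_sketch")
  .set_min_samples(4)
  .add_int64_axis("cardinality", {0, 1000, 10'000, 100'000})
  .add_int64_axis("run_length", {1, 32, 64});
// (Registration only; a benchmark executable also needs nvbench's main.)
```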
Sorry, missed this the first time around. To me, 32 is already a very high run length. I think we should instead add 4 or 8. I'm fine with 64 if you do see a significant difference in file size and/or performance compared to 32.
No significant differences with 64 for the benchmarks we ran either.
I vote for 4 or 8, then
All done!
/ok to test
Would be nice to know how much time this increases in benchmark runs.
If it is not available now, follow up with Randy on benchmark runs.
/merge
Results are in #16541 (here), for which we are extending this. Each new benchmark in the matrix takes roughly 0.5 s to run on my workstation (AMD Threadripper + RTX Ada 5880), so it should be roughly a 4 s increase in total time (8 new benchmarks).
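A quick sanity check of that estimate, using the axis counts from the diff above and the ~0.5 s per-case figure quoted here (illustrative only):

```cpp
// Back-of-the-envelope check of the runtime estimate. Axis counts are taken
// from the diff in this PR; ~0.5 s per case is the figure quoted above.
#include <iostream>

int main()
{
  int const old_cases = 2 * 2;  // cardinality {0, 1000} x run_length {1, 32}
  int const new_cases = 4 * 3;  // cardinality {0, 1000, 10'000, 100'000} x run_length {1, 32, 64}
  int const added     = new_cases - old_cases;  // 8 additional combinations
  double const extra_seconds = added * 0.5;     // ~4 s (the data_type axis multiplies this further)
  std::cout << added << " new cases, roughly " << extra_seconds << " s of extra runtime\n";
}
```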
Description
This PR extends the data cardinality and run-length ranges for the existing Parquet writer's encoding benchmark.
Checklist