Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend the Parquet writer's dictionary encoding benchmark. #16591

Merged
merged 14 commits into from
Sep 10, 2024
4 changes: 2 additions & 2 deletions cpp/benchmarks/io/parquet/parquet_writer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -202,8 +202,8 @@ NVBENCH_BENCH_TYPES(BM_parq_write_encode, NVBENCH_TYPE_AXES(d_type_list))
.set_name("parquet_write_encode")
.set_type_axes_names({"data_type"})
.set_min_samples(4)
.add_int64_axis("cardinality", {0, 1000})
.add_int64_axis("run_length", {1, 32});
.add_int64_axis("cardinality", {1, 1000, 10'000, 100'000, 1'000'000})
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any resource/infrastructure constraints that we need to consider here? We can perhaps remove one or two entries from here if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can drop 1. Given the table size, it's effectively the same as 1000.
Also, should 1M just be 0? Not sure if the goal is to have unique elements (AFAIK row count is typically lower than 1M).

.add_int64_axis("run_length", {1, 32, 64});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, missed this the first time around. To me, 32 is already a very high run length. I think we should instead add 4 or 8. I'm fine with 64 if you do see a significant difference in file size and/or performance compared to 32.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No significant differences with 64 for the benchmarks we ran either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for 4 or 8, then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All done!


NVBENCH_BENCH_TYPES(BM_parq_write_io_compression, NVBENCH_TYPE_AXES(io_list, compression_list))
.set_name("parquet_write_io_compression")
Expand Down
Loading