Extend the Parquet writer's dictionary encoding benchmark. #16591
@@ -202,8 +202,8 @@ NVBENCH_BENCH_TYPES(BM_parq_write_encode, NVBENCH_TYPE_AXES(d_type_list))
   .set_name("parquet_write_encode")
   .set_type_axes_names({"data_type"})
   .set_min_samples(4)
-  .add_int64_axis("cardinality", {0, 1000})
-  .add_int64_axis("run_length", {1, 32});
+  .add_int64_axis("cardinality", {1, 1000, 10'000, 100'000, 1'000'000})
+  .add_int64_axis("run_length", {1, 32, 64});
Review comment: Sorry, missed this the first time around. To me, 32 is already a very high run length. I think we should instead add 4 or 8. I'm fine with 64 if you do see a significant difference in file size and/or performance compared to 32.

Reply: No significant differences with 64 for the benchmarks we ran either.

Review comment: I vote for 4 or 8, then.

Reply: All done!
NVBENCH_BENCH_TYPES(BM_parq_write_io_compression, NVBENCH_TYPE_AXES(io_list, compression_list))
  .set_name("parquet_write_io_compression")
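For context, here is a minimal NVBench sketch of how integer axes like these fan out into benchmark runs. The function name, the generated host data, and the timed placeholder are hypothetical stand-ins; the real `BM_parq_write_encode` body is outside this diff and is not shown here.

```cpp
// Sketch only: illustrates how NVBench expands add_int64_axis values into a
// cartesian product of runs, and how a benchmark body reads them. The data
// generation and timed work below are placeholders, not cuDF's benchmark.
#include <nvbench/nvbench.cuh>

#include <cstdint>
#include <numeric>
#include <vector>

static void BM_axis_sketch(nvbench::state& state)
{
  // Each run gets one (cardinality, run_length) combination from the axes.
  auto const cardinality = state.get_int64("cardinality");
  auto const run_length  = state.get_int64("run_length");

  // Stand-in input: `cardinality` distinct values, each repeated `run_length`
  // times, mimicking what these axes control in the generated random table.
  std::vector<int64_t> data;
  data.reserve(static_cast<std::size_t>(cardinality * run_length));
  for (int64_t v = 0; v < cardinality; ++v) {
    data.insert(data.end(), static_cast<std::size_t>(run_length), v);
  }

  state.exec([&data](nvbench::launch&) {
    // Placeholder for the timed work (the real benchmark encodes/writes Parquet).
    [[maybe_unused]] auto sum = std::accumulate(data.begin(), data.end(), int64_t{0});
  });
}

NVBENCH_BENCH(BM_axis_sketch)
  .set_name("axis_sketch")
  .add_int64_axis("cardinality", {1, 1000, 10'000})
  .add_int64_axis("run_length", {1, 4, 32});
```

With the axes in the diff above, NVBench would run 5 cardinality values times 3 run lengths, i.e. 15 configurations per data type, which is part of what the review below weighs against benchmark runtime.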
Review comment: Are there any resource/infrastructure constraints that we need to consider here? We can perhaps remove one or two entries from here if needed.

Reply: I think we can drop 1. Given the table size, it's effectively the same as 1000. Also, should 1M just be 0? Not sure if the goal is to have unique elements (AFAIK row count is typically lower than 1M).
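A small sketch of the distinction raised here, assuming (as the question implies but does not confirm) that a cardinality of 0 means "no cap, values effectively unique" while a positive value caps the number of distinct values. The helper below is hypothetical and is not cuDF's data generator.

```cpp
// Hypothetical helper spelling out the interpretation under discussion:
// cardinality 0 = unbounded (capped only by the row count), positive values
// cap the number of distinct values. This is an assumption drawn from the
// review question, not from cuDF's actual generator.
#include <algorithm>
#include <cstdint>
#include <iostream>

int64_t effective_cardinality(int64_t requested_cardinality, int64_t num_rows)
{
  if (requested_cardinality == 0) { return num_rows; }  // every row may be distinct
  return std::min(requested_cardinality, num_rows);     // cannot exceed the row count
}

int main()
{
  // For tables with fewer than 1'000'000 rows, a cap of 1'000'000 and "no cap"
  // select the same number of distinct values, which is the reviewer's point.
  constexpr int64_t num_rows = 500'000;
  std::cout << effective_cardinality(0, num_rows) << '\n';          // 500000
  std::cout << effective_cardinality(1'000'000, num_rows) << '\n';  // 500000
  std::cout << effective_cardinality(1000, num_rows) << '\n';       // 1000
}
```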