-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Row AVG
accumulator support Decimal type
#5973
Conversation
AVG
accumulator support Decimal type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- @yjshen could you please review the changes to the row format?
DataType::Decimal128(p, s) => { | ||
match accessor.get_u64_opt(self.state_index()) { | ||
None => Ok(ScalarValue::Decimal128(None, p, s)), | ||
Some(0) => Ok(ScalarValue::Decimal128(None, p, s)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we translate 0
--> null
here? (it is also done for Float64 below)?
I see you are just following the existing pattern, but it seems like this could be incorrect?
Maybe we could add a test that calls AVG
on (-1
and 1
) to see if we get 0
or NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do some test on this today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alamb, the value at the state_index
is for the count rather than the sum. When the count is 0, for the average, it should be NULL.
For query 18, I think the plan is problematic, it is the Join order and build side selection, the bottleneck is not the Aggregations.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mingmwang @yahoNanJing! Looks great to me.
Could we also add two roundtrip (DecimalArray
-> vec<u8>
-> DecimalArray
) tests in datafusion/row/src/lib.rs
for the newly introduced Decimal type? One for a null-free case and one for a nullable case.
Sure, I will add the two tests in a following PR. |
🎉 |
Which issue does this PR close?
Closes #5892.
Rationale for this change
Improve the Aggregate performance for decimal type
What changes are included in this PR?
Are these changes tested?
I had test this on my local Mac.
For TPCH-q17, there is at about 25% improvement.
Before this PR:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
Query 17 iteration 0 took 2623.2 ms and returned 1 rows
Query 17 iteration 1 took 2580.4 ms and returned 1 rows
Query 17 iteration 2 took 2575.7 ms and returned 1 rows
Query 17 avg time: 2593.11 ms
After this PR:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(17), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
Query 17 iteration 0 took 1920.4 ms and returned 1 rows
Query 17 iteration 1 took 1877.0 ms and returned 1 rows
Query 17 iteration 2 took 1878.1 ms and returned 1 rows
Query 17 avg time: 1891.81 ms
For TPCH-q18, it seems there is little improvement.
Before this PR:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(18), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
Query 18 iteration 0 took 1359.0 ms and returned 57 rows
Query 18 iteration 1 took 1285.2 ms and returned 57 rows
Query 18 iteration 2 took 1278.0 ms and returned 57 rows
Query 18 avg time: 1307.40 ms
After this PR:
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: Some(18), debug: false, iterations: 3, partitions: 1, batch_size: 8192, path: "./parquet_data", file_format: "parquet", mem_table: false, output_path: None, disable_statistics: false, enable_scheduler: false }
Query 18 iteration 0 took 1335.8 ms and returned 57 rows
Query 18 iteration 1 took 1257.7 ms and returned 57 rows
Query 18 iteration 2 took 1237.9 ms and returned 57 rows
Query 18 avg time: 1277.12 ms
Are there any user-facing changes?