Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

single_distinct_to_groupby no longer drops qualifiers #4050

Merged
merged 11 commits into from
Nov 2, 2022

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Oct 31, 2022

Which issue does this PR close?

Closes #4049
Closes #3820

Includes changes from #4084

Rationale for this change

Simple bug fix so that we can serialize plans containing a single distinct aggregate

What changes are included in this PR?

  • New unit test to demonstrate the issue
  • Move TableReference from sql to common
  • Update logic in DFSChema::index_of_column_by_name

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Oct 31, 2022
@andygrove andygrove added the bug Something isn't working label Nov 1, 2022
@andygrove andygrove marked this pull request as draft November 1, 2022 01:31
@andygrove andygrove marked this pull request as ready for review November 1, 2022 14:31
@andygrove andygrove requested a review from alamb November 1, 2022 14:31
@github-actions github-actions bot added the sql SQL Planner label Nov 1, 2022
"| 2 | 2 |",
"| 4 | 1 |",
"+-----+------------------------+",
"+-------+------------------------+",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why this plan would produce t.val as the column name but the query above it (SELECT val, count(distinct dict) FROM t GROUP BY val) still produced val 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only fixed this for the single_distinct_to_groupby case but it would be good to fix this consistently for all cases. I will see if I can fix in this PR, or file a new issue if it not trivial

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually. I'm confused now as well ... perhaps this name should not have been qualified in the first place? I'll take another look

Copy link
Member Author

@andygrove andygrove Nov 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this now. The test above is referencing a column named t.val and the physical plan drops column qualifiers:

        Expr::Column(c) => {
            let idx = input_dfschema.index_of_column(c)?;
            Ok(Arc::new(Column::new(&c.name, idx)))
        }

The updated test is referencing a column with an alias of t.val, and the physical plan uses the alias name, hence the different output.

@@ -137,12 +137,14 @@ fn optimize(plan: &LogicalPlan) -> Result<LogicalPlan> {
// - aggr expr
let mut alias_expr: Vec<Expr> = Vec::new();
for (alias, original_field) in group_expr_alias {
alias_expr.push(col(&alias).alias(original_field.name()));
alias_expr.push(col(&alias).alias(original_field.qualified_name()));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change

Comment on lines 2 to +3
Projection: part.p_brand, part.p_type, part.p_size, COUNT(DISTINCT partsupp.ps_suppkey) AS supplier_cnt
Projection: group_alias_0 AS p_brand, group_alias_1 AS p_type, group_alias_2 AS p_size, COUNT(alias1) AS COUNT(DISTINCT partsupp.ps_suppkey)
Projection: group_alias_0 AS part.p_brand, group_alias_1 AS part.p_type, group_alias_2 AS part.p_size, COUNT(alias1) AS COUNT(DISTINCT partsupp.ps_suppkey)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this logical plan was invalid before this PR because the outer projection was looking for part.p_brand and the inner projection only provided p_brand.

@andygrove andygrove marked this pull request as draft November 2, 2022 13:47
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Nov 2, 2022
@andygrove andygrove marked this pull request as ready for review November 2, 2022 15:10
@andygrove andygrove marked this pull request as draft November 2, 2022 15:21
@andygrove andygrove marked this pull request as ready for review November 2, 2022 15:26
@andygrove andygrove requested a review from alamb November 2, 2022 15:29
@andygrove andygrove requested a review from Dandandan November 2, 2022 16:00
Comment on lines +462 to +481
let fields = result[0]
.schema()
.fields()
.iter()
.map(|f| {
let simple_name = match f.name().find('.') {
Some(i) => f.name()[i + 1..].to_string(),
_ => f.name().to_string(),
};
f.clone().with_name(simple_name)
})
.collect();
let result_schema = SchemaRef::new(Schema::new(fields));
let result = result
.iter()
.map(|b| {
RecordBatch::try_new(result_schema.clone(), b.columns().to_vec())
.map_err(|e| e.into())
})
.collect::<Result<Vec<_>>>()?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't use the physical schema from query execution because it has ?table? as the qualifier so this code just strips field names down to simple names

Comment on lines +207 to +218
// the original field may now be aliased with a name that matches the
// original qualified name
let table_ref: TableReference = field.name().as_str().into();
match table_ref {
TableReference::Partial { schema, table } => {
schema == qq && table == name
}
TableReference::Full { schema, table, .. } => {
schema == qq && table == name
}
_ => false,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not crazy about this but I don't see how else we can handle this unless we introduce some sort of alias-with-qualifier concept

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps aliases should be Column instead of String?

enum Expr {
  Alias(expr: Box<Expr>, alias: Column>
  ...
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think alias's can be fully qualified identifiers

I think the core disconnect is that datafusion 'flattens' compound identifiers into strings for column names which can then be referred to by anything that reads the plan that makes them

Postgres deals with this type of issue by using indexes to identify columns, but that ends up with its own problems as the indexes invariably get mixed up sometimes and debugging it is super tricky.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @andygrove -- thank you


let bytes = logical_plan_to_bytes(&plan)?;
let logical_round_trip = logical_plan_from_bytes(&bytes, &ctx)?;
assert_eq!(format!("{:?}", plan), format!("{:?}", logical_round_trip));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines +207 to +218
// the original field may now be aliased with a name that matches the
// original qualified name
let table_ref: TableReference = field.name().as_str().into();
match table_ref {
TableReference::Partial { schema, table } => {
schema == qq && table == name
}
TableReference::Full { schema, table, .. } => {
schema == qq && table == name
}
_ => false,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think alias's can be fully qualified identifiers

I think the core disconnect is that datafusion 'flattens' compound identifiers into strings for column names which can then be referred to by anything that reads the plan that makes them

Postgres deals with this type of issue by using indexes to identify columns, but that ends up with its own problems as the indexes invariably get mixed up sometimes and debugging it is super tricky.

@andygrove
Copy link
Member Author

Thanks for the review @alamb. With this PR we can now support 21 out of 22 TPC-H queries in DataFusion and Ballista. Getting close ..

@alamb
Copy link
Contributor

alamb commented Nov 2, 2022

So close!

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Nov 2, 2022
@andygrove andygrove merged commit 8d6448e into apache:master Nov 2, 2022
@andygrove andygrove deleted the single-distinct-serde branch November 2, 2022 20:01
@ursabot
Copy link

ursabot commented Nov 2, 2022

Benchmark runs are scheduled for baseline = 396b5aa and contender = 8d6448e. 8d6448e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core Core DataFusion crate optimizer Optimizer rules sql SQL Planner
Projects
None yet
3 participants