Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement qualified expression alias #8008

Closed
gruuya opened this issue Oct 31, 2023 · 9 comments · Fixed by #7981
Closed

Implement qualified expression alias #8008

gruuya opened this issue Oct 31, 2023 · 9 comments · Fixed by #7981
Labels
enhancement New feature or request

Comments

@gruuya
Copy link
Contributor

gruuya commented Oct 31, 2023

Mailing List Thread: https://lists.apache.org/thread/9bvwjsxrt0o5tqfyf47wfw81z0kjgp7o

Is your feature request related to a problem or challenge?

The problem I'm encountering is related to #7981. Namely:

  • when a plan gets optimized into a equivalent representation using other primitive plans it is necessary to have the optimized output match the schema of the originally created plan
  • the primitive plan combination in question can involve columnized expressions as output field names (e.g. with aggregations), so in order to align it with the original schema aliasing is required
  • there in lies the problem—all logical plans with a schema use exprlist_to_fields to generate the initial schema, however this function will always result in a unqualified field for Expr::Alias unlike for Expr::Column, hence the schemas can never match

Describe the solution you'd like

Introduce a new enum variant along the lines of:

pub struct QualifiedAlias {
    pub expr: Box<Expr>,
    pub relation: OwnedTableReference,
    pub name: String,
}

or alternatively extend the existing alias expression to accommodate for an optional relation:

pub struct Alias {
    pub expr: Box<Expr>,
    pub relation: Option<OwnedTableReference>,
    pub name: String,
}

This would allow for 1-1 mapping between fields and column aliases.

Describe alternatives you've considered

Everything else seems like a hack:

  • always strip qualifiers with DFSchema::strip_qualifiers when building the schema in plan constructors; this will result in conflicts for joins
  • push the qualifier down into the field name like in a5cff4e; the issue here is the unexpected derived column naming
  • make ExprSchemable::to_field try to parse the qualifer and name from an alias name; seems error prone, confusing and potentially problematic for column names with dots

Additional context

It seems like previously this could have been done with Projection::try_new_with_schema, but it looks like the overriding schema isn't being propagated through the optimizers after #7919.

@gruuya gruuya added the enhancement New feature or request label Oct 31, 2023
@gruuya
Copy link
Contributor Author

gruuya commented Nov 3, 2023

To make this more concrete, here's a test (e.g. in datafusion/optimizer/src/common_subexpr_eliminate.rs) that demonstrates the issue

#[test]
fn plan_schema_changed() -> Result<()> {
    let original_plan = test_table_scan()?;

    // Transformed plan, that has columns renamed
    let new_plan = LogicalPlanBuilder::from(original_plan.clone())
        .aggregate(vec![col("a")], vec![max(col("b")), min(col("c"))])?
        .build()?;

    // Attempt to rename columns to original
    let projection_plan = LogicalPlanBuilder::from(new_plan)
        .project(vec![
            col("a"),
            col("MAX(test.b)").alias("b"),
            col("MIN(test.c)").alias("c"),
        ])?
        .build()?;

    // This will fail since the qualifiers are stripped when building a schema from any expression other than `Expr::Column`
    assert_eq!(
        original_plan.schema(),
        projection_plan.schema(),
    );

    Ok(())
}

This fails with:

assertion `left == right` failed
  left: DFSchema { fields: [DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "a", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "b", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "c", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }
 right: DFSchema { fields: [DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "a", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "b", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "c", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1, 2], nullable: false, mode: Single }] } }

Note that above the functional dependencies are also different, but that is not what is being asserted in-between optimization runs.

@alamb
Copy link
Contributor

alamb commented Nov 4, 2023

there in lies the problem—all logical plans with a schema use exprlist_to_fields to generate the initial schema, however this function will always result in a unqualified field for Expr::Alias unlike for Expr::Column, hence the schemas can never match

Somehow, Projection doesn't seem to have this problem.

Is the problem in your example that the original schema is test.b (not b) so when you alias MAX(test.b) to b you are in fact changing the schema.

Shouldn't it be something more like this (I can't remember if you have to explicitly build qualified identifiers specially):

    // Attempt to rename columns to original
    let projection_plan = LogicalPlanBuilder::from(new_plan)
        .project(vec![
            col("a"),
            col("MAX(test.b)").alias("test.b"),
            col("MIN(test.c)").alias("test.c"),
        ])?
        .build()?;

@gruuya
Copy link
Contributor Author

gruuya commented Nov 4, 2023

Somehow, Projection doesn't seem to have this problem.

Yeah I think this problem only surfaces when an initial plan gets transformed into a combination of other plans (one of which is an aggregation) during optimizations, such as in the case of DISTINCT ON (though not bare DISTINCT, since then the grouping fields have the same relation/name as the initial selection list), but we still want to abide by the original plan schema (as is the case in all logical optimizations).

Shouldn't it be something more like this (I can't remember if you have to explicitly build qualified identifiers specially):

Not really, because the name of the alias always gets crammed into the DFField::name:

  left: DFSchema { fields: [DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "a", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "b", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "c", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [] } }
 right: DFSchema { fields: [DFField { qualifier: Some(Bare { table: "test" }), field: Field { name: "a", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "test.b", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }, DFField { qualifier: None, field: Field { name: "test.c", data_type: UInt32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} } }], metadata: {}, functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1, 2], nullable: false, mode: Single }] } }

In fact that's why I opened this issue, because I see no way of building qualified fields via expressions currently.

LogicalPlan::SubqueryAlias does address the problem partially, but is not a general solution. In other words the base plan could involve joins or derived fields, so the output schema may include fields with no qualifiers as well as fields with different qualifiers and there's no adequate way to match that schema. Whereas if there was a solution like above ExprSchemable::to_fields could be extended so that for each individual field besides the name a relation could be passed on, e.g. something like:

diff --git a/datafusion/expr/src/expr_schema.rs b/datafusion/expr/src/expr_schema.rs
index 025b74eb5..61d1ffe53 100644
--- a/datafusion/expr/src/expr_schema.rs
+++ b/datafusion/expr/src/expr_schema.rs
@@ -295,6 +295,13 @@ impl ExprSchemable for Expr {
                 self.nullable(input_schema)?,
             )
             .with_metadata(self.metadata(input_schema)?)),
+            Expr::QualifiedAlias(qa) => Ok(DFField::new(
+                qa.relation.clone(),
+                &qa.name,
+                self.get_type(input_schema)?,
+                self.nullable(input_schema)?,
+            )
+                .with_metadata(self.metadata(input_schema)?)),
             _ => Ok(DFField::new_unqualified(
                 &self.display_name()?,
                 self.get_type(input_schema)?,

and so the correct projection would look something like:

        let projection_plan = LogicalPlanBuilder::from(new_plan)
            .project(
                new_plan
                    .schema()
                    .fields()
                    .iter()
                    .zip(original_plan.schema().fields().iter())
                    .map(|(new_field, old_field)| {
                        col(new_field.name())
                            .alias_qualified(old_field.qualifier(), old_field.name())
                    }),
            )?
            .build()?;

@waynexia
Copy link
Member

waynexia commented Nov 6, 2023

I think one reason causing this issue is that the qualifier is sometimes present in the expr name (or column name), and sometimes is not:

however this function will always result in a unqualified field for Expr::Alias unlike for Expr::Column, hence the schemas can never match

In the previous example:

.project(vec![
            col("a"),
            col("MAX(test.b)").alias("b"),
            col("MIN(test.c)").alias("c"),
        ])?

The first column's name is a, but this doesn't mean it doesn't have a qualifier, and the qualifier won't appear. The qualifier is always a pain point when aliasing the new expr. We have to choose whether or not to include the qualifier in the alias string.

From this point, maybe Alias should contain a qualifier or relation like Column? But I am concerned that having both Alias and AliasQualified at the same time would be confusing.

@gruuya
Copy link
Contributor Author

gruuya commented Nov 6, 2023

I think one reason causing this issue is that the qualifier is sometimes present in the expr name (or column name), and sometimes is not:

Actually the qualifier is never inferred from an expression name, as demonstrated in ExprSchemable::to_fields, it is left blank in all cases except Expr::Column in which case it gets passed on to the schema/field.

The first column's name is a, but this doesn't mean it doesn't have a qualifier, and the qualifier won't appear. The qualifier is always a pain point when aliasing the new expr. We have to choose whether or not to include the qualifier in the alias string.

In case of projecting col("a") above the reason that column does have a qualifier is that LogicalPlanBuilder::project first calls normalize_col against the original schema, which populates the qualifier into the Expr::Column expression. Only then does it call Projection::try_new with the normalized expressions, which in turn invokes ExprSchemable::to_fields to get the final schema. Note that this can't work for renamed columns since they're not present in the input schema. Again, including the qualifier in the alias name itself will only result in the qualifier becoming a part of the DFField::field's name, but will leave the actual DFField::qualifier empty.

From this point, maybe Alias should contain a qualifier or relation like Column? But I am concerned that having both Alias and AliasQualified at the same time would be confusing.

Yeah, having both might be confusing (was inspired by Expr::QualifiedWildcard and Expr::Wildcard), though extending the existing enum could also be more work.

@alamb
Copy link
Contributor

alamb commented Nov 9, 2023

I think the reasoning in this ticket makes sense to me. Thank you for raising it @gruuya

I do agree with @waynexia that adding a Expr::QualifiedAlias would both be confusing as well make subtle bugs more likely (e.g. using Expr::Alias instead of using Expr::QualiiedAlias) and thus adding an optional relation to Expr::Alias is probably the "cleanest" way to do this

Adding such a field would, of course, as @gruuya says, force us to go through many locations in the code to add handling of the new relation but this is probably a valuable exercise to ensure the relation is handled when appropriate

Perhaps @jackwener or @liukun4515 or @ozankabak have some additional thoughts to share

@alamb
Copy link
Contributor

alamb commented Nov 9, 2023

There is also a mailing list thread about this: https://lists.apache.org/thread/9bvwjsxrt0o5tqfyf47wfw81z0kjgp7o. I am copying what @jdye64 said here so the discussion is in the same place:

From: Jeremy Dyer [email protected]Subject: Re: [DataFusion] Introduce qualified alias expressionsDate: 2023-11-08 19:23 -05:00List: [email protected]
From an Arrow Datafusion Python consumer standpoint this would make using datafusion so so much easier. Can’t
speak to the ramifications to upstream or other projects but would love to see this and help if needed

Thanks,
Jeremy Dyer

@alamb
Copy link
Contributor

alamb commented Nov 9, 2023

BTW here is a proposal for removing Expr::QualifiedWildcard: #8105

@gruuya
Copy link
Contributor Author

gruuya commented Nov 10, 2023

I've taken the liberty of implementing this via 8ebdc13 inside the existing PR #7981, as it turned out to be simpler than I anticipated.

If it's preferable for this to be done separately please let me know, and I'll open a new PR instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants