Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

move the type coercion out of the optimizer and refactor the optimizer #3582

Closed
liukun4515 opened this issue Sep 22, 2022 · 10 comments
Closed
Labels
enhancement New feature or request

Comments

@liukun4515
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

When I do this pr https://github.com/apache/arrow-datafusion/pull/3396/files#
I am stuck by many issue about the type in the optimizer framework.

In other SQL or database system, the data type coercion should be done before any optimization.

But we do it in the physical phase, after the @andygrove work and add the TypeCoercion rule in the optimizer. It can be done in the logical phase.

After the https://github.com/apache/arrow-datafusion/pull/3396/files# merge, we need to do the refactor.

If the data type is not right, all the operation and optimization will meet some wired issue.

cc @andygrove @alamb

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

@liukun4515 liukun4515 added the enhancement New feature or request label Sep 22, 2022
@alamb
Copy link
Contributor

alamb commented Sep 22, 2022

I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.

I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.

Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions

@Dandandan
Copy link
Contributor

I agree - the type coercion should be a mandatory pass that is done after planning.

@liukun4515
Copy link
Contributor Author

I agree that type coercion doesn't really belong in an "optimizer" as it actually changes the meaning of the exprs (on purpose) where the other optimzier passes are supposed to keep the meaning of the plan the same, but make it faster in some way.

I recommend changing the code to explicitly run the type coercion logic immediately prior to running any other optimization passes. I also think the type coercion logic should not depend on any other simplification (such as constant folding) -- it should depend only on the types.

Once type coercion is done, then we can run the expr simplifier pass to clean up / simplify the expressions

You got it.

@andygrove
Copy link
Member

Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:

  override def typeCoercionRules: List[Rule[LogicalPlan]] =
    UnpivotCoercion ::
    WidenSetOperationTypes ::
    new CombinedTypeCoercionRule(
      InConversion ::
      PromoteStrings ::
      DecimalPrecision ::
      BooleanEquality ::
      FunctionArgumentConversion ::
      ConcatCoercion ::
      MapZipWithCoercion ::
      EltCoercion ::
      CaseWhenCoercion ::
      IfCoercion ::
      StackCoercion ::
      Division ::
      IntegralDivision ::
      ImplicitTypeCasts ::
      DateTimeOperations ::
      WindowFrameCoercion ::
      StringLiteralCoercion :: Nil) :: Nil

@liukun4515
Copy link
Contributor Author

Spark has an Analysis phase that runs before the optimizer. Maybe we can learn from their model. Within the analysis phase, they have a number of type coercion rules that they run:

  override def typeCoercionRules: List[Rule[LogicalPlan]] =
    UnpivotCoercion ::
    WidenSetOperationTypes ::
    new CombinedTypeCoercionRule(
      InConversion ::
      PromoteStrings ::
      DecimalPrecision ::
      BooleanEquality ::
      FunctionArgumentConversion ::
      ConcatCoercion ::
      MapZipWithCoercion ::
      EltCoercion ::
      CaseWhenCoercion ::
      IfCoercion ::
      StackCoercion ::
      Division ::
      IntegralDivision ::
      ImplicitTypeCasts ::
      DateTimeOperations ::
      WindowFrameCoercion ::
      StringLiteralCoercion :: Nil) :: Nil

Yes, I think we can follow other database or sql system.

But It's long way to go and refactor.

My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.

@alamb
Copy link
Contributor

alamb commented Sep 23, 2022

My plan is to do some refactor, and make good base for the next big refactor like the spark or other system.

Sounds like a great plan to me

@alamb
Copy link
Contributor

alamb commented Oct 5, 2022

Here is a PR that contributes to this goal: #3728

@liukun4515
Copy link
Contributor Author

Here is a PR that contributes to this goal: #3728

After this pr merged I will do other Expr for type coercion

@liukun4515
Copy link
Contributor Author

after supporting all expr for type coercion, we can move this rule out of the optimizer, and use it as a separate module

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants