Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression #25925

Closed
wants to merge 2 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Sep 25, 2019

What changes were proposed in this pull request?

This patch proposes to skip PlanExpression when doing subexpression elimination on executors.

Why are the changes needed?

Subexpression elimination can possibly cause NPE when applying on execution subquery expression like ScalarSubquery on executors. It is because PlanExpression wraps query plan. To compare query plan on executor when eliminating subexpression, can cause unexpected error, like NPE when accessing transient fields.

The NPE looks like:

[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID   3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info]  at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info]  at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info]  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info]  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info]  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test.

expr.find(_.isInstanceOf[LambdaVariable]).isDefined ||
// `PlanExpression` wraps query plan. To compare query plans of `PlanExpression` on executor,
// can cause unexpected error.
expr.isInstanceOf[PlanExpression[_]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to only skip it on executors? This may still be useful when we codegen at driver side.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like check TaskContext.get != null?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah that's a good idea!

@viirya viirya changed the title [SPARK-29239][SQL] Subquery should not cause NPE when eliminating subexpression [SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression Sep 25, 2019
expr.find(_.isInstanceOf[LambdaVariable]).isDefined
expr.find(_.isInstanceOf[LambdaVariable]).isDefined ||
// `PlanExpression` wraps query plan. To compare query plans of `PlanExpression` on executor,
// can cause unexpected error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there isn't any other reason, shall we mention NPE specifically instead of unexpected error? Both SPARK-29239 and SPARK-29221 are due to NPE.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I updated this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding including NPE after unexpected error? IMHO unexpected error is actually correct (we can't predict which error we will get), but it would help much if we enumerate known errors as well.

Copy link
Contributor

@HeartSaVioR HeartSaVioR Sep 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I realized my browser didn't show the comment from @viirya . It's just a 2 cents and like NPE seems OK to me.

@SparkQA
Copy link

SparkQA commented Sep 25, 2019

Test build #111333 has finished for PR 25925 at commit 3a3bde0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 25, 2019

Test build #111359 has finished for PR 25925 at commit 75109a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expr.find(_.isInstanceOf[LambdaVariable]).isDefined ||
// `PlanExpression` wraps query plan. To compare query plans of `PlanExpression` on executor,
// can cause error like NPE.
(expr.isInstanceOf[PlanExpression[_]] && TaskContext.get != null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for curiosity, does this issue happen in interpreted code path as well? e.g. we send PlanExpression to executor side and eval it, and hit NPE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC,EquivalentExpressions is only used in the codegen mode now, e.g., GenerateUnsafeProjection uses this class in common subexpr elimination, but `InterpretedUnsafeProject does not elimnate common subexprs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand your question correctly. But PlanExpressions of a SparkPlan are evaluated and updated (e.g., ExecSubqueryExpression.updateResult) with values before a query begins to run. The values are kept in PlanExpression, and on executor side when to call eval of PlanExpression, it simply returns the kept value. I think we do not really evaluate a PlanExpression at executor side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah got it, so the kept value is serialized and sent to executor side in interpreted code path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue also reminds me that it's better to always do codegen at driver side, even if whole-stage-codegen is false. We can investigate it later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Please let me know if you have some ideas later.

@cloud-fan cloud-fan closed this in b8b59d6 Sep 26, 2019
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@viirya can you send another PR to 2.4? I tried to backport but it has conflicts.

@viirya
Copy link
Member Author

viirya commented Sep 26, 2019

@cloud-fan Ok. Let me create a backport PR.

@viirya
Copy link
Member Author

viirya commented Sep 26, 2019

@cloud-fan I just tried to make a backport.

In branch-2.4, Project uses an UnsafeProjection API which is not covered by CODEGEN_FACTORY_MODE config yet. So can not have an end-to-end test like this.

WDYT? Should we make a backport with unit test against EquivalentExpressions, or skip backport?

@cloud-fan
Copy link
Contributor

do we have to trigger the bug via CODEGEN_FACTORY_MODE=CODEGEN_ONLY? This is a runtime exception and I think we can't fallback anyway.

@viirya
Copy link
Member Author

viirya commented Sep 26, 2019

do we have to trigger the bug via CODEGEN_FACTORY_MODE=CODEGEN_ONLY? This is a runtime exception and I think we can't fallback anyway.

It captures NonFatal so actually it can fallback when NPE.

@cloud-fan
Copy link
Contributor

It captures NonFatal when compiling the code, not running the code, right?

@viirya
Copy link
Member Author

viirya commented Sep 26, 2019

Yes, it captures NonFatal when compiling the code. NPE happens when generating the code to be compiled on executor.

@cloud-fan
Copy link
Contributor

ah I see. Then we can skip backporting as users can still run the query

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants