[SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer. #1706
Conversation
Can one of the admins verify this patch?
@staple can you add …
Sure, fixed it.
Hi @staple, thanks for working on this and sorry it took me so long to review it! Now that the 1.1 release is on its way it would be great to revisit this. Overall, the changes you propose seem pretty reasonable to me. A few notes: …
Can one of the admins verify this patch?
Force-pushed from 0b9878c to 80a27dc.
Ok, I merged with master. The primary master change that conflicted with this patch was the addition of a call to the new ExtractPythonUdfs rule in HiveContext. For now I decided to leave the explicit call to ExtractPythonUdfs in place, rather than incorporate it into the Analyzer along with the CreateTables and PreInsertionCasts rules I moved there. I am not aware that ExtractPythonUdfs can affect query plan resolution, and I haven’t checked what the consequences would be if the rule were applied multiple times (to multiple PythonUDFs), which could happen if it were added to an Analyzer batch.
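To make the two placements concrete, here is a rough sketch; the surrounding names (`analyzer`, `logical`) follow the usual SQLContext query-execution convention and are assumptions here, not verbatim source:

```scala
// Placement kept in this merge: apply ExtractPythonUdfs once, explicitly,
// to the plan the analyzer has already produced.
lazy val analyzed: LogicalPlan = ExtractPythonUdfs(analyzer(logical))

// Alternative discussed: register it as an analyzer extension rule, in
// which case its batch would re-apply it until reaching a fixed point --
// the multiple-application behavior that hasn't been verified.
// override val extendedRules = ExtractPythonUdfs :: Nil
```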
For #1846, either before or after works for me.
```scala
/**
 * Override to provide additional rules for the "Resolution" batch.
 */
val extendedRules: List[Rule[LogicalPlan]] = Nil
```
In general I'd use the more general `Seq` instead of `List` for any declared interfaces.
I'd made it a `List` in order to use the triple-colon (`:::`) concat operator in Analyzer, for consistency with the existing code that uses double colon (`::`), but sure, I can change it to `Seq` since that's preferred.
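For reference, a minimal sketch of the two alternatives under discussion; the rule names are illustrative, and this is not the exact Analyzer source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// With List, extension rules compose via the list concat operator :::,
// matching the existing ::-style construction of rule lists:
val withList: List[Rule[LogicalPlan]] =
  (ResolveRelations :: ResolveReferences :: Nil) ::: extendedRules

// With the more general Seq, the same composition uses ++ instead:
val withSeq: Seq[Rule[LogicalPlan]] =
  Seq(ResolveRelations, ResolveReferences) ++ extendedRules
```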
Oh, hmmm, it's actually probably a bug that …
ok to test
QA tests have started for PR 1706 at commit …
Ok, sure, I'll move ExtractPythonUdfs as you suggest. Thanks for taking a look!
QA tests have finished for PR 1706 at commit …
Ok, I fixed the test failure (I forgot to update a commit after a merge) and addressed the review comments.
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
Yeah, sorry, our testing infra has been kinda flaky recently. Asked to test again.
Jenkins, test this please. |
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
I fixed a compilation error caused by a merge issue with an upstream commit from earlier today that did not surface as an explicit conflict.
Jenkins, test this please. |
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
Thanks for doing this! Merged to master.
LogicalPlan contains a ‘resolved’ attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan’s implementation of ‘resolved’ verifies that the types of its children’s columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-2781 for an example.
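To illustrate the Union case, here is a simplified paraphrase of how its ‘resolved’ attribute behaves, based on the description above; it is a sketch, not the verbatim Catalyst source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, LogicalPlan}

// Union reports resolved == true only when its children are resolved AND
// the children agree on column types. A plan equivalent to
//   SELECT 1 AS c  UNION ALL  SELECT 'x' AS c
// therefore stays unresolved, yet prior to this patch nothing checked
// that flag before execution.
case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {
  override def output = left.output

  override lazy val resolved: Boolean =
    childrenResolved &&
      left.output.map(_.dataType) == right.output.map(_.dataType)
}
```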
This patch adds two checks to the analyzer’s CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the ‘problem’ unresolved plan to be included in the TreeNodeException for reporting. Then as a backstop the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance.
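In the shape of a Catalyst rule, the two checks amount to roughly the following; this is a sketch of the behavior described above, and the message text and details of the merged rule may differ:

```scala
import org.apache.spark.sql.catalyst.errors.TreeNodeException
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object CheckResolution extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    plan.transform {
      // Check 1: report the lowest plan that is unresolved even though all
      // of its children are resolved, so the exception names the culprit.
      case p if !p.resolved && p.childrenResolved =>
        throw new TreeNodeException(p, "Unresolved plan found")
    } match {
      // Check 2 (backstop): `resolved` is defined recursively, so checking
      // the root verifies that the entire tree is resolved.
      case p if !p.resolved =>
        throw new TreeNodeException(p, "Unresolved plan in query plan tree")
      case p => p
    }
  }
}
```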
Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables, PreInsertionCasts, and ExtractPythonUdfs rules manually after the analyzer runs. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.
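Concretely, the move amounts to something like the following override in HiveContext; this is a sketch consistent with the description, and the exact constructor arguments are an assumption:

```scala
// HiveContext supplies its hive-specific rules to the shared Analyzer
// instead of applying them manually after analysis.
@transient
override protected[sql] lazy val analyzer =
  new Analyzer(catalog, functionRegistry, caseSensitive = false) {
    override val extendedRules =
      catalog.CreateTables ::
      catalog.PreInsertionCasts ::
      ExtractPythonUdfs :: Nil
  }
```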
It’s worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case, creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than run as a separate batch, because the hive rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, among others, uses a query plan whose root is a Union and whose children are each a hive InsertIntoTable.
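A schematic of that idempotence argument, as a sketch only; `createHiveTable` and `lookupRelation` are illustrative stand-ins, not actual method names:

```scala
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule

object CreateTables extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case InsertIntoCreatedTable(db, tableName, child) =>
      // Side effect: create the hive table. This runs at most once
      // because...
      createHiveTable(db, tableName, child.output)
      // ...the matched node is rewritten into InsertIntoTable, so this
      // case can no longer match on later iterations of the batch.
      InsertIntoTable(lookupRelation(db, tableName), Map.empty, child, overwrite = false)
  }
}
```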
Mixing the custom hive rules with the standard analyzer rules initially resulted in an additional failure because of policy differences between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as “true” / “false”, while spark sql casts booleans to strings as “1” / “0” (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to “1” / “0” is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces “true” / “false” instead.) I believe “true” / “false” should be the behavior for spark sql, so I changed the behavior so booleans are converted to “true” / “false”, consistent with hive, and none of the existing spark tests failed.
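The cast change itself is small; in sketch form (a fragment shaped like the boolean-to-string conversion, with the exact surrounding Cast code an assumption):

```scala
// Before: the BooleanCasts path effectively produced numeric strings,
//   (b: Boolean) => if (b) "1" else "0"
// After: match hive's rendering of booleans.
val booleanToString: Boolean => String = b => if (b) "true" else "false"

booleanToString(true)   // "true"
booleanToString(false)  // "false"
```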
Finally, in some initial testing with hive, it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, “” ) -> “true”, while an explicit cast produces an all-caps string, e.g. CAST( TRUE AS STRING ) -> “TRUE”. The change I’ve made just converts to lowercase strings in all cases. I believe it is at least more correct than the existing spark sql implementation, where all Cast expressions become “1” / “0”.