[SPARK-2781][SQL] Check resolution of LogicalPlans in Analyzer. #1706
Conversation
Can one of the admins verify this patch?
@staple can you add …
Sure, fixed it.
Hi @staple, thanks for working on this and sorry it took me so long to review it! Now that the 1.1 release is on its way it would be great to revisit this. Overall, the changes you propose seem pretty reasonable to me. A few notes: …
Can one of the admins verify this patch?
Force-pushed from 0b9878c to 80a27dc.
Ok, I merged with master. The primary master change that conflicted with this patch was the addition of a call to the new ExtractPythonUdfs rule in HiveContext. For now I decided to leave the explicit call to ExtractPythonUdfs in place, rather than incorporate it into the Analyzer along with the CreateTables and PreInsertionCasts rules I moved there. I am not aware that ExtractPythonUdfs can affect query plan resolution, and I haven’t checked what the consequences would be if the rule were applied multiple times (to multiple PythonUDFs), which could happen if it were added to an Analyzer batch.
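To make the two placements concrete, here is a rough sketch; the surrounding names (`analyzer`, `logical`) follow the usual SQLContext query-execution convention and are assumptions here, not verbatim source:

```scala
// Placement kept in this merge: apply ExtractPythonUdfs once, explicitly,
// to the plan the analyzer has already produced.
lazy val analyzed: LogicalPlan = ExtractPythonUdfs(analyzer(logical))

// Alternative discussed: register it as an analyzer extension rule, in
// which case its batch would re-apply it until reaching a fixed point --
// the multiple-application behavior that hasn't been verified.
// override val extendedRules = ExtractPythonUdfs :: Nil
```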
For #1846, either before or after works for me.
```scala
/**
 * Override to provide additional rules for the "Resolution" batch.
 */
val extendedRules: List[Rule[LogicalPlan]] = Nil
```
In general I'd use the more general `Seq` instead of `List` for any declared interfaces.
I'd made it a `List` in order to use the triple-colon (`:::`) concat operator in Analyzer, for consistency with the existing code that uses double colon (`::`), but sure, I can change it to `Seq` since that's preferred.
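For reference, a minimal sketch of the two alternatives under discussion; the rule names are illustrative, and this is not the exact Analyzer source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// With List, extension rules compose via the list concat operator :::,
// matching the existing ::-style construction of rule lists:
val withList: List[Rule[LogicalPlan]] =
  (ResolveRelations :: ResolveReferences :: Nil) ::: extendedRules

// With the more general Seq, the same composition uses ++ instead:
val withSeq: Seq[Rule[LogicalPlan]] =
  Seq(ResolveRelations, ResolveReferences) ++ extendedRules
```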
Oh, hmmm, it's actually probably a bug that …
ok to test
QA tests have started for PR 1706 at commit …
Ok, sure, I'll move ExtractPythonUdfs as you suggest. Thanks for taking a look!
QA tests have finished for PR 1706 at commit …
Ok, I fixed the test failure (I forgot to update a commit after a merge) and addressed the review comments.
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
Yeah, sorry, our testing infra has been kinda flaky recently. Asked to test again.
Jenkins, test this please. |
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
I fixed a compilation error caused by a merge issue with an upstream commit from earlier today that did not surface as an explicit conflict.
Jenkins, test this please. |
QA tests have started for PR 1706 at commit …
QA tests have finished for PR 1706 at commit …
Thanks for doing this! Merged to master.
LogicalPlan contains a ‘resolved’ attribute indicating that all of its execution requirements have been resolved. This attribute is not checked before query execution. The analyzer contains a step to check that all Expressions are resolved, but this is not equivalent to checking all LogicalPlans. In particular, the Union plan’s implementation of ‘resolved’ verifies that the types of its children’s columns are compatible. Because the analyzer does not check that a Union plan is resolved, it is possible to execute a Union plan that outputs different types in the same column. See SPARK-2781 for an example.
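To illustrate the Union case, here is a simplified paraphrase of how its ‘resolved’ attribute behaves, based on the description above; it is a sketch, not the verbatim Catalyst source:

```scala
import org.apache.spark.sql.catalyst.plans.logical.{BinaryNode, LogicalPlan}

// Union reports resolved == true only when its children are resolved AND
// the children agree on column types. A plan equivalent to
//   SELECT 1 AS c  UNION ALL  SELECT 'x' AS c
// therefore stays unresolved, yet prior to this patch nothing checked
// that flag before execution.
case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {
  override def output = left.output

  override lazy val resolved: Boolean =
    childrenResolved &&
      left.output.map(_.dataType) == right.output.map(_.dataType)
}
```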
This patch adds two checks to the analyzer’s CheckResolution rule. First, each logical plan is checked to see if it is not resolved despite its children being resolved. This allows the ‘problem’ unresolved plan to be included in the TreeNodeException for reporting. Then as a backstop the root plan is checked to see if it is resolved, which recursively checks that the entire plan tree is resolved. Note that the resolved attribute is implemented recursively, and this patch also explicitly checks the resolved attribute on each logical plan in the tree. I assume the query plan trees will not be large enough for this redundant checking to meaningfully impact performance.
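In the shape of a Catalyst rule, the two checks amount to roughly the following; this is a sketch of the behavior described above, and the message text and details of the merged rule may differ:

```scala
import org.apache.spark.sql.catalyst.errors.TreeNodeException
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object CheckResolution extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    plan.transform {
      // Check 1: report the lowest plan that is unresolved even though all
      // of its children are resolved, so the exception names the culprit.
      case p if !p.resolved && p.childrenResolved =>
        throw new TreeNodeException(p, "Unresolved plan found")
    } match {
      // Check 2 (backstop): `resolved` is defined recursively, so checking
      // the root verifies that the entire tree is resolved.
      case p if !p.resolved =>
        throw new TreeNodeException(p, "Unresolved plan in query plan tree")
      case p => p
    }
  }
}
```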
Because this patch starts validating that LogicalPlans are resolved before execution, I had to fix some cases where unresolved plans were passing through the analyzer as part of the implementation of the hive query system. In particular, HiveContext applies the CreateTables, PreInsertionCasts, and ExtractPythonUdfs rules manually after the analyzer runs. I moved these rules to the analyzer stage (for hive queries only), in the process completing a code TODO indicating the rules should be moved to the analyzer.
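Concretely, the move amounts to something like the following override in HiveContext; this is a sketch consistent with the description, and the exact constructor arguments are an assumption:

```scala
// HiveContext supplies its hive-specific rules to the shared Analyzer
// instead of applying them manually after analysis.
@transient
override protected[sql] lazy val analyzer =
  new Analyzer(catalog, functionRegistry, caseSensitive = false) {
    override val extendedRules =
      catalog.CreateTables ::
      catalog.PreInsertionCasts ::
      ExtractPythonUdfs :: Nil
  }
```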
It’s worth noting that moving the CreateTables rule means introducing an analyzer rule with a significant side effect - in this case, creating a hive table. The rule will only attempt to create a table once even if its batch is executed multiple times, because it converts the InsertIntoCreatedTable plan it matches against into an InsertIntoTable. Additionally, these hive rules must be added to the Resolution batch rather than run as a separate batch, because the hive rules may be needed to resolve non-root nodes, leaving the root to be resolved on a subsequent batch iteration. For example, the hive compatibility test auto_smb_mapjoin_14, among others, uses a query plan whose root is a Union and whose children are each a hive InsertIntoTable.
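A schematic of that idempotence argument, as a sketch only; `createHiveTable` and `lookupRelation` are illustrative stand-ins, not actual method names:

```scala
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule

object CreateTables extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case InsertIntoCreatedTable(db, tableName, child) =>
      // Side effect: create the hive table. This runs at most once
      // because...
      createHiveTable(db, tableName, child.output)
      // ...the matched node is rewritten into InsertIntoTable, so this
      // case can no longer match on later iterations of the batch.
      InsertIntoTable(lookupRelation(db, tableName), Map.empty, child, overwrite = false)
  }
}
```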
Mixing the custom hive rules with the standard analyzer rules initially resulted in an additional failure because of policy differences between spark sql and hive when casting a boolean to a string. Hive casts booleans to strings as “true” / “false”, while spark sql casts booleans to strings as “1” / “0” (causing the cast1.q test to fail). This behavior is a result of the BooleanCasts rule in HiveTypeCoercion.scala, and from looking at the implementation of BooleanCasts I think converting to “1” / “0” is potentially a programming mistake. (If the BooleanCasts rule is disabled, casting produces “true” / “false” instead.) I believe “true” / “false” should be the behavior for spark sql, so I changed the behavior so booleans are converted to “true” / “false”, consistent with hive, and none of the existing spark tests failed.
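The cast change itself is small; in sketch form (a fragment shaped like the boolean-to-string conversion, with the exact surrounding Cast code an assumption):

```scala
// Before: the BooleanCasts path effectively produced numeric strings,
//   (b: Boolean) => if (b) "1" else "0"
// After: match hive's rendering of booleans.
val booleanToString: Boolean => String = b => if (b) "true" else "false"

booleanToString(true)   // "true"
booleanToString(false)  // "false"
```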
Finally, in some initial testing with hive, it appears that an implicit type coercion of boolean to string results in a lowercase string, e.g. CONCAT( TRUE, “” ) -> “true”, while an explicit cast produces an all-caps string, e.g. CAST( TRUE AS STRING ) -> “TRUE”. The change I’ve made just converts to lowercase strings in all cases. I believe it is at least more correct than the existing spark sql implementation, where all Cast expressions become “1” / “0”.