-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12558][SQL][WIP] AnalysisException when multiple functions applied in GROUP BY clause #10520
Conversation
@@ -233,6 +233,18 @@ private[hive] case class HiveGenericUDF(funcWrapper: HiveFunctionWrapper, childr | |||
udfType != null && udfType.deterministic() | |||
} | |||
|
|||
override def semanticEquals(other: Expression): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment why the version in Expression doesn't work? What about it is too strict?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nongli Thanks a lot for your feedback. In this case , we have two instances of HiveGenericUDF one from the grouping expression and other from aggregation expression. While doing a semantic equality between them , we unwrap the case class and fall through to the last case statement which does equality between two HiveFunctionWrapper(s) and is failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nongli @cloud-fan We could also override equals in HiveFunctionWrapper to just compare the functionClassName. Would this be a better option ? Please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to override HiveFunctionWrapper.equals
, if the HiveFunctionWrapper.instance
has no contribution to equality.
@cloud-fan Hi Wenchen, thanks for your comments. I went through the processing. I think its ok to just override equals and compare the udf class names here. The number of arguments, their semantic equality and deterministic property is handled in the semanticEquals method in expression.scala. Please let me know what you think. |
@@ -125,6 +125,12 @@ private[hive] object HiveShim { | |||
// for Serialization | |||
def this() = this(null) | |||
|
|||
override def equals(other: Any): Boolean = other match { | |||
case a: HiveFunctionWrapper => | |||
functionClassName == a.functionClassName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @yhuai , Does same function name means exactly same function in hive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal I feel it is general fine to compare the class name (it is what we do in 1.5. In 1.5, instance is not a part of the constructor). We added instance when we added the support of hive's GenericUDFMacro. I guess we can add a check to see if GenericUDFMacro
is the class name. If so, we also check instance. Otherwise, we just use the class name. Also, I guess you need to override hashCode as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai Thanks Yin. Will make the changes per your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yhuai Have addressed your comments. Can you please take a look to see if it looks ok ?
Can you explain more why the semanticEquals at https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L152 fails? |
@yhuai Hi Yin, Here is what i had while answering the question from Nong Li. Please let me know if you have more questions that i need to find the answers for. In this case , we have two instances of HiveGenericUDF one from the grouping expression and other from aggregation expression. While doing a semantic equality between them , we unwrap the case class and fall through to the last case statement which does equality between two HiveFunctionWrapper(s) and is failing. Let me know what you think. |
so, two |
@yhuai Yes Yin. |
@dilipbiswal Sorry. I do not think I understand the cause. We only set the |
@yhuai Hi Yin, let me double check. As far as i remember, we come to the last case in expression.scala:semanticEqual with two HiveFunctionWrapper and was failing in equal. With a overriden equal it worked. Let me debug this further and get back... |
@yhuai Hi Yin, i remember that in this reproduction, the instances were actually having values i.e they were not null. I will double check on this and get back. |
@yhuai So the instance is authored through createFunction() while computing returnInspector. So we think we need to cache the instance in this case as its not a simpleUDF. Let me know your thoughts. |
@yhuai Hello Yin, did you have any further insights on this ? Is it safe to override equal to ignore the instance while comparing to see if two functions are same ? |
54eeec0
to
24b32d5
Compare
Objects.hashCode(functionClassName, instance) | ||
} else { | ||
functionClassName.hashCode() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems we need to correct the format?
Test build #2371 has finished for PR 10520 at commit
|
@yhuai Hi Yin, Could you please look at the changes when you get a chance. It turns out that we can't compare the classname and instance for equality for GenericUDFMacro as the instances are different. I am instead comparing the underlying function that is held by GenericUDFMacro. Please let me know what you think. |
test this please |
Test build #49275 has finished for PR 10520 at commit
|
Can you remove the |
Actually, let me change the title while merging it. |
LGTM. Merging to master and branch 1.6. |
…in GROUP BY clause cloud-fan Can you please take a look ? In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue. Author: Dilip Biswal <[email protected]> Closes #10520 from dilipbiswal/spark-12558. (cherry picked from commit dc7b387) Signed-off-by: Yin Huai <[email protected]> Conflicts: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveShim.scala
test("Hive UDF in group by") { | ||
Seq(Tuple1(1451400761)).toDF("test_date").registerTempTable("tab1") | ||
val count = sql("select date(cast(test_date as timestamp))" + | ||
" from tab1 group by date(cast(test_date as timestamp))").count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it will be good to use withTempTable, which will automatically drop the temp table. Also, it will be more robust if we create a temp function based on hive's org.apache.hadoop.hive.ql.udf.generic.GenericUDFToDate
. So, even if we later have a native function called date, we can still test the hive udf. We can do the test change in a follow-up pr (the pr can re-use the jira number).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realize it after I merge the PR. Let's have a pr to improve the test. Thanks!
@yhuai Thank you very much !! |
@yhuai I will make the change in a follow up PR re-using the same JIRA. |
Great. Thank you! |
…plied in GROUP BY clause Addresses the comments from Yin. #10520 Author: Dilip Biswal <[email protected]> Closes #10758 from dilipbiswal/spark-12558-followup.
…plied in GROUP BY clause Addresses the comments from Yin. #10520 Author: Dilip Biswal <[email protected]> Closes #10758 from dilipbiswal/spark-12558-followup. (cherry picked from commit db9a860) Signed-off-by: Yin Huai <[email protected]> Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDFSuite.scala
@cloud-fan Can you please take a look ?
In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.