[SPARK-12321][SQL] JSON format for TreeNode (use reflection) #10311
Conversation
Test build #47742 has finished for PR 10311 at commit
Test build #47743 has finished for PR 10311 at commit
Scala always puts the parameter names in the class files, so you can recover them with runtime reflection:

```scala
def parameterNames: Seq[String] = {
  import scala.reflect.runtime.universe._
  val m = runtimeMirror(this.getClass.getClassLoader)
  val classSymbol = m.staticClass(this.getClass.getName)
  val t = classSymbol.selfType
  val formalTypeArgs = t.typeSymbol.asClass.typeParams
  val TypeRef(_, _, actualTypeArgs) = t
  val constructorSymbol = t.member(nme.CONSTRUCTOR)
  val params = if (constructorSymbol.isMethod) {
    constructorSymbol.asMethod.paramss
  } else {
    // Find the primary constructor, and use its parameter ordering.
    val primaryConstructorSymbol: Option[Symbol] = constructorSymbol.asTerm.alternatives.find(
      s => s.isMethod && s.asMethod.isPrimaryConstructor)
    if (primaryConstructorSymbol.isEmpty) {
      sys.error("Internal SQL error: Product object did not have a primary constructor.")
    } else {
      primaryConstructorSymbol.get.asMethod.paramss
    }
  }
  params.head.map(_.name.toString)
}
```

```
scala> sql("SELECT 1").queryExecution.logical.parameterNames
res1: Seq[String] = List(projectList, child)
```

Note that's just a quick draft and you could probably simplify it (or pull this out into a utility, since I think we do it in several places now).
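Following the "pull this out into a utility" suggestion, the draft could be condensed into a small standalone helper along these lines (the name `constructorParamNames` is made up for this sketch, and it uses `termNames`/`paramLists`, the post-2.10 spellings of `nme`/`paramss`):

```scala
import scala.reflect.runtime.universe._

// Sketch of a reusable utility: resolve a class's primary-constructor
// parameter names via runtime reflection (hypothetical name).
def constructorParamNames(cls: Class[_]): Seq[String] = {
  val mirror = runtimeMirror(cls.getClassLoader)
  val tpe = mirror.classSymbol(cls).selfType
  val ctor = tpe.member(termNames.CONSTRUCTOR).asTerm.alternatives
    .collectFirst { case s if s.isMethod && s.asMethod.isPrimaryConstructor => s.asMethod }
    .getOrElse(sys.error(s"No primary constructor for ${cls.getName}"))
  // Only the first parameter list, mirroring the draft's params.head
  ctor.paramLists.head.map(_.name.toString)
}

case class Point(x: Int, y: Int)

println(constructorParamNames(classOf[Point]))  // List(x, y)
```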
Other small comments:
```scala
params.flatten
```

This is for curried constructors: they have more than one parameter list, and we should combine them.
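A quick illustration of why the flatten is needed (class and parameter names here are made up for the example):

```scala
import scala.reflect.runtime.universe._

// A curried constructor has more than one parameter list, so the
// reflective view of its parameters is a List of Lists.
case class Curried(a: Int, b: String)(c: Double)

val ctor = typeOf[Curried].member(termNames.CONSTRUCTOR).asMethod
println(ctor.paramLists.map(_.map(_.name.toString)))
// List(List(a, b), List(c))
println(ctor.paramLists.flatten.map(_.name.toString))
// List(a, b, c)
```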
```scala
params.flatten.map { p =>
```

This is for curried constructors: they have more than one parameter list, and we should flatten and combine them.
Can you double check that this doesn't cause problems with inner classes (like in the REPL).
I checked, this doesn't work for inner classes. So if a user creates an inner product class and uses it as part of a TreeNode, we can't deserialize it from JSON. The problem is not about this flatten, but that we are missing an extra parameter: the outer pointer. I think we can fix it later.
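The outer-pointer problem is visible directly with Java reflection (toy classes, just for illustration):

```scala
// An inner class's JVM constructor takes a hidden extra parameter:
// a reference to the enclosing instance (the "outer pointer").
class Outer {
  case class Inner(x: Int)
}

val outer = new Outer
val ctor = classOf[outer.Inner].getDeclaredConstructors.head

// Two JVM-level parameters for a single declared field (Outer, Int).
// Deserializing from JSON would need an outer instance we don't have.
println(ctor.getParameterTypes.toList)
```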
Sure, I was only asking to make sure we weren't breaking our inner class handling that we do for datasets since I think we are now sharing this code.
Test build #47819 has finished for PR 10311 at commit
Test build #47837 has finished for PR 10311 at commit
Test build #47826 has finished for PR 10311 at commit
```scala
var logicalRelations = logicalPlan.collect { case l: LogicalRelation => l }
try {
  val jsonBackPlan = fromJSON(toJSON(logicalPlan)).asInstanceOf[LogicalPlan]
  val normalized = jsonBackPlan transformDown {
```

Can you comment why you need to do this? (i.e. RDDs/data aren't serializable to JSON?)
This is looking really good!
Test build #47927 has finished for PR 10311 at commit
```scala
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Statistics}

object TreeNodeJsonFormatter {
```
I made this an object instead of inlining the code in TreeNode or making it a trait, because we may need to access some classes in the core or hive module to handle them specially, and then we may have to move it to another module.
I'd really like to avoid moving them out. Better would be to have a trait JsonSerializable that we can mix in, or something. We should also figure out whether we really need to be able to serialize them or not. For leaf nodes, I think grabbing the schema and some details would be sufficient.
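A minimal sketch of what such a mixin could look like (apart from the trait name suggested above, everything here is hypothetical; the real TreeNode code renders through json4s, not the hand-rolled string building below):

```scala
// Hypothetical mix-in sketch: each node exposes its fields and the
// trait renders them, instead of a separate formatter object.
trait JsonSerializable {
  def jsonFields: Seq[(String, String)]
  def toJSON: String =
    jsonFields
      .map { case (k, v) => s""""$k" : "$v"""" }
      .mkString("{ ", ", ", " }")
}

// Toy leaf node, just to exercise the trait.
case class DemoLeaf(name: String, dataType: String) extends JsonSerializable {
  def jsonFields = Seq("name" -> name, "dataType" -> dataType)
}

println(DemoLeaf("a", "integer").toJSON)
// { "name" : "a", "dataType" : "integer" }
```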
Now it can pass all tests in the sql module!
Test build #47980 has finished for PR 10311 at commit
It's weird that I can't reproduce the failed tests reported by Jenkins locally. Adding a better error message to see what's happening.
Test build #48012 has finished for PR 10311 at commit
Test build #48014 has finished for PR 10311 at commit
Test build #48026 has finished for PR 10311 at commit
```diff
@@ -463,4 +479,244 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
   }
   s"$nodeName(${args.mkString(",")})"
 }
 
+def toJSON: String = {
+  pretty(render(jsonValue))
```
We probably don't want this to be pretty by default. That's going to make it harder to parse in Spark SQL.
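The difference in a toy form (the rendering below is hand-rolled for illustration; the actual fix would just swap json4s's `pretty(render(jsonValue))` for `compact(render(jsonValue))`):

```scala
// Toy compact vs. pretty rendering of the same fields, to show why
// compact output is friendlier for line-oriented parsing and logging.
val fields = Seq("class" -> "Literal", "value" -> "1")

val compactJson =
  fields.map { case (k, v) => s""""$k":"$v"""" }.mkString("{", ",", "}")
val prettyJson =
  fields.map { case (k, v) => s"""  "$k" : "$v"""" }.mkString("{\n", ",\n", "\n}")

println(compactJson)                    // single line
println(prettyJson.split("\n").length)  // 4
```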
Only one minor comment, otherwise LGTM.
Test build #48027 has finished for PR 10311 at commit
Test build #48051 has finished for PR 10311 at commit
Thanks, merging to master.
It looks like this patch may have introduced some rare, non-deterministic flakiness into certain tests. See https://spark-tests.appspot.com/tests/org.apache.spark.sql.hive.execution.SQLQuerySuite/udf_java_method, for example, which has two examples of cases where logical plans could not be parsed to JSON: https://spark-tests.appspot.com/test-logs/33413063
Hi @JoshRosen, I opened #10430 as a hot fix, if you think this is urgent. I'll spend more time to come up with a proper fix.
Hi @cloud-fan, can you explain in which cases we can use this feature, or the motivation for it?
@scwf the goal here is to be able to do later analysis of queries that users are running with Spark / Spark SQL. |
Got it, thanks @marmbrus :)
#10311 introduces some rare, non-deterministic flakiness for hive udf tests; see #10311 (comment). I can't reproduce it locally and may need more time to investigate; a quick solution is to bypass the hive tests for json serialization. Author: Wenchen Fan <[email protected]> Closes #10430 from cloud-fan/hot-fix.
An alternative solution for apache#10295: instead of implementing a json format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`. Here I use pre-order traversal to flatten a plan tree into a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list.

Example json for the logical plan tree:

```json
[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Sort",
  "num-children" : 1,
  "order" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
    "num-children" : 1,
    "child" : 0,
    "direction" : "Ascending"
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "i",
    "dataType" : "integer",
    "nullable" : true,
    "metadata" : { },
    "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  } ] ],
  "global" : false,
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
  "num-children" : 1,
  "projectList" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
    "num-children" : 1,
    "child" : 0,
    "name" : "i",
    "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.Add",
    "num-children" : 2,
    "left" : 0,
    "right" : 1
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "a",
    "dataType" : "integer",
    "nullable" : true,
    "metadata" : { },
    "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children" : 0,
    "value" : "1",
    "dataType" : "integer"
  } ], [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
    "num-children" : 1,
    "child" : 0,
    "name" : "j",
    "exprId" : { "id" : 11, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.Multiply",
    "num-children" : 2,
    "left" : 0,
    "right" : 1
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "a",
    "dataType" : "integer",
    "nullable" : true,
    "metadata" : { },
    "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  }, {
    "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
    "num-children" : 0,
    "value" : "2",
    "dataType" : "integer"
  } ] ],
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
  "num-children" : 0,
  "output" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "a",
    "dataType" : "integer",
    "nullable" : true,
    "metadata" : { },
    "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
    "qualifiers" : [ ]
  } ] ],
  "data" : [ ]
} ]
```

Author: Wenchen Fan <[email protected]> Closes apache#10311 from cloud-fan/toJson-reflection.
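The reconstruction scheme described above (a pre-order list plus a `num-children` field per node) can be sketched with a toy node type (names hypothetical; the real code rebuilds Catalyst plan nodes, not these):

```scala
// Toy pre-order reconstruction: each entry is (name, numChildren);
// a node's children are the entries immediately following it.
case class Node(name: String, children: Seq[Node])

// Returns the rebuilt subtree plus the unconsumed tail of the list.
def reconstruct(flat: List[(String, Int)]): (Node, List[(String, Int)]) = {
  val (name, numChildren) = flat.head
  var rest = flat.tail
  val children = Seq.fill(numChildren) {
    val (child, remaining) = reconstruct(rest)
    rest = remaining
    child
  }
  (Node(name, children), rest)
}

// Matches the shape of the example: Sort -> Project -> LocalRelation.
val flat = List(("Sort", 1), ("Project", 1), ("LocalRelation", 0))
val (tree, leftover) = reconstruct(flat)
println(tree)
```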
#10311 introduces some rare, non-deterministic flakiness for hive udf tests; see #10311 (comment). I can't reproduce it locally and may need more time to investigate; a quick solution is to bypass the hive tests for json serialization. Author: Wenchen Fan <[email protected]> Closes #10430 from cloud-fan/hot-fix. (cherry picked from commit 8543997) Signed-off-by: Michael Armbrust <[email protected]>