[SPARK-12813][SQL] Eliminate serialization for back to back operations #10747

marmbrus · 2016-01-13T23:58:14Z

The goal of this PR is to eliminate unnecessary translations when there are back-to-back MapPartitions operations. In order to achieve this I also made the following simplifications:

Operators no longer hold encoders; instead they hold only the expressions they need. The benefits are twofold: the expressions are visible to transformations, so they go through the normal resolution/binding process, and now that they are visible we can change them on a case-by-case basis.
Operators no longer have type parameters. Since the engine is responsible for its own type checking, having the types visible to the compiler was an unnecessary complication. We still leverage the Scala compiler in the companion factory when constructing a new operator, but after that the types are discarded.
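
A minimal sketch of the "typed companion factory, untyped operator" idea described above. Everything here is a toy stand-in and an assumption for illustration: `Expr`, `ToyEncoder`, and `ToyMapPartitions` are hypothetical names, not Spark's classes, and the child plan is reduced to a string.

```scala
// Toy stand-in for a Catalyst expression (hypothetical, not Spark's Expression).
final case class Expr(description: String)

// Hypothetical type class that supplies the expressions for a type.
trait ToyEncoder[T] {
  def deserializer: Expr      // turns a row into a T
  def serializer: Seq[Expr]   // turns a T into output columns
}

object ToyEncoder {
  implicit val intEncoder: ToyEncoder[Int] = new ToyEncoder[Int] {
    val deserializer: Expr = Expr("row -> Int")
    val serializer: Seq[Expr] = Seq(Expr("Int -> value"))
  }
  implicit val stringEncoder: ToyEncoder[String] = new ToyEncoder[String] {
    val deserializer: Expr = Expr("row -> String")
    val serializer: Seq[Expr] = Seq(Expr("String -> value"))
  }
}

// The operator has no type parameters and holds only expressions.
final case class ToyMapPartitions(
    func: Iterator[Any] => Iterator[Any],
    input: Expr,
    serializer: Seq[Expr],
    child: String)

object ToyMapPartitions {
  // The compiler checks that func maps T to U and that encoders exist for both.
  // Once the operator is built, T and U are no longer part of its type.
  def apply[T, U](func: Iterator[T] => Iterator[U], child: String)(
      implicit in: ToyEncoder[T], out: ToyEncoder[U]): ToyMapPartitions =
    new ToyMapPartitions(
      func.asInstanceOf[Iterator[Any] => Iterator[Any]],
      in.deserializer,
      out.serializer,
      child)
}
```

Calling `ToyMapPartitions[Int, String]((it: Iterator[Int]) => it.map(_.toString), "t")` yields a plain `ToyMapPartitions` whose `input` and `serializer` fields are ordinary values that a rewrite rule could inspect or replace.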

Deferred to a follow-up PR:

Remove as much of the resolution/binding from Dataset/GroupedDataset as possible. We should still eagerly check resolution, though, and throw an error in the case of mismatches for an as operation.
Eliminate serializations in more cases by adding more cases to EliminateSerialization.
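
The back-to-back case that this PR targets can be pictured with a small standalone model. This is only a sketch under stated assumptions: `Plan`, `Relation`, `MapOp`, and `SerializationSketch` are hypothetical toy types, object types are represented as strings, and the rewrite below is a simplified illustration rather than the rule added in this PR.

```scala
// Toy plan nodes (hypothetical, not Spark's LogicalPlan).
sealed trait Plan
final case class Relation(name: String) extends Plan

// A MapPartitions-like operator over objects.
final case class MapOp(
    name: String,
    inType: String,                 // object type the function consumes
    outType: String,                // object type the function produces
    child: Plan,
    readsObjects: Boolean = false,  // true: takes the child's objects directly, no deserialization
    emitsObjects: Boolean = false   // true: hands objects to the parent, no serialization
) extends Plan

object SerializationSketch {
  // When a MapOp sits directly on another MapOp whose output object type
  // matches its input type, skip the serialize/deserialize pair between them.
  def eliminate(plan: Plan): Plan = plan match {
    case parent @ MapOp(_, in, _, child: MapOp, false, _) if child.outType == in =>
      val newChild = eliminate(child.copy(emitsObjects = true))
      parent.copy(child = newChild, readsObjects = true)
    case m: MapOp => m.copy(child = eliminate(m.child))
    case other => other
  }

  // Counts how many row <-> object conversions the plan performs.
  def conversions(plan: Plan): Int = plan match {
    case m: MapOp =>
      (if (m.readsObjects) 0 else 1) + (if (m.emitsObjects) 0 else 1) + conversions(m.child)
    case _ => 0
  }
}
```

In this model, two chained operators with matching types drop from four conversions to two, while a pair whose types differ (say the child produces `Int` and the parent consumes `Long`) is left untouched.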

marmbrus · 2016-01-13T23:58:35Z

/cc @cloud-fan

SparkQA · 2016-01-14T00:16:24Z

Test build #49353 has finished for PR 10747 at commit 4615c96.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- trait ObjectOperator extends LogicalPlan
- case class MapPartitions(
- case class AppendColumns(
- case class MapGroups(
- case class CoGroup(
- trait ObjectOperator extends SparkPlan
- case class MapPartitions(
- case class AppendColumns(
- case class MapGroups(
- case class CoGroup(

marmbrus · 2016-01-14T00:54:09Z

test this please

SparkQA · 2016-01-14T02:36:57Z

Test build #49360 has finished for PR 10747 at commit ee7f3c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-01-14T08:20:17Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/EliminateSerializationSuite.scala

+  test("back to back MapPartitions") {
+    val input = LocalRelation('_1.int, '_2.int)
+    val plan =
+      MapPartitions(func,


Should we have a test case for a plan that cannot be eliminated?

Oh yeah, I guess I forgot to push it.

There's a test here and an end-to-end one in DatasetSuite now.

cloud-fan · 2016-01-14T18:26:17Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

-        encoderFor[U],
-        encoderFor[U].schema.toAttributes,
-        logicalPlan))
+      MapPartitions[T, U](func, logicalPlan))


This is different from the previous version: we now pass only the type parameter T to MapPartitions and build a new, unresolved encoder there, whereas before this PR we passed a resolvedTEncoder. Does this PR break the encoder's life cycle?

This is just pushing the lifecycle of the encoder into the analyzer / physical operators where it belongs.

SparkQA · 2016-01-14T20:14:42Z

Test build #49404 has finished for PR 10747 at commit ecde6e5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-01-14T21:01:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala

+    input: Expression,
+    serializer: Seq[NamedExpression],
+    child: LogicalPlan) extends UnaryNode with ObjectOperator {
+  override def output: Seq[Attribute] = serializer.map(_.toAttribute)


Can we just use serializer.map(_.toAttribute.newInstance) here? Then we don't need to add NamedExpression.newInstance.

That would return different expression IDs any time the function was called, whereas we want to fix the expression IDs when the NamedExpression is created.
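
To illustrate the difference between minting IDs inside `output` and fixing them at construction time, here is a small standalone sketch. `ExprIds`, `ToyAttribute`, `UnstableOutput`, and `StableOutput` are hypothetical names made up for this example; they only loosely mirror the idea of expression IDs and are not Spark's API.

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical global ID source.
object ExprIds {
  private val counter = new AtomicLong(0)
  def next(): Long = counter.getAndIncrement()
}

// Toy attribute: a name plus a unique ID.
final case class ToyAttribute(name: String, id: Long) {
  def newInstance(): ToyAttribute = copy(id = ExprIds.next())
}

// Variant 1: IDs are minted inside `output`, so every call returns fresh IDs.
final class UnstableOutput(names: Seq[String]) {
  def output: Seq[ToyAttribute] = names.map(n => ToyAttribute(n, 0L).newInstance())
}

// Variant 2: IDs are minted once, when the operator is constructed,
// and `output` always exposes those same attributes.
final class StableOutput(names: Seq[String]) {
  private val serializer: Seq[ToyAttribute] =
    names.map(n => ToyAttribute(n, ExprIds.next()))
  def output: Seq[ToyAttribute] = serializer
}
```

With the first variant, a parent operator that recorded an ID from one call to `output` would not find it in the result of a later call; the second variant keeps the IDs stable for the lifetime of the operator.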

SparkQA · 2016-01-15T00:17:02Z

Test build #49417 has finished for PR 10747 at commit c34aacf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2016-01-15T00:36:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BoundAttribute.scala

@@ -31,7 +31,7 @@ import org.apache.spark.sql.types._
 case class BoundReference(ordinal: Int, dataType: DataType, nullable: Boolean)
  extends LeafExpression with NamedExpression {


Unrelated question: why does BoundReference extend NamedExpression?

It's kind of a hack, but sometimes after transforms we end up with BoundReferences in place of fields that were AttributeReferences, and so there were class cast exceptions. We might be able to remove this someday, or now?

cloud-fan · 2016-01-15T00:41:50Z

LGTM

marmbrus · 2016-01-15T00:53:06Z

Thanks for reviewing! Merging to master.

[SPARK-12813][SQL] Eliminate serialization for back to back operations

4615c96

marmbrus added 2 commits January 13, 2016 16:21

Merge remote-tracking branch 'apache/master' into encoderExpressions

4c19ecb

style

ee7f3c6

rxin reviewed Jan 14, 2016

more readability fixes

ecde6e5

cloud-fan reviewed Jan 14, 2016

add more comments

c34aacf

cloud-fan reviewed Jan 15, 2016

asfgit closed this in cc7af86 Jan 15, 2016

marmbrus deleted the encoderExpressions branch March 8, 2016 00:04
