[SPARK-6986][SQL] Use Serializer2 in more cases. #5849
Conversation
Merged build triggered.
Merged build started.
Test build #31612 has started for PR 5849 at commit
Test build #31612 has finished for PR 5849 at commit
Merged build finished. Test FAILed.
Test FAILed.
Merged build triggered.
Merged build started.
Test build #31657 has started for PR 5849 at commit
Test build #31657 timed out for PR 5849 at commit
Merged build finished. Test FAILed.
Test FAILed.
It seems the reason for those test failures is that we are buffering records on the reader side of the shuffle, and we are currently using mutable rows, which require an explicit copy before buffering. @sryza Is there any place in the sort-based shuffle where we buffer key-value pairs?
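To make the buffering problem concrete, here is a minimal standalone sketch (hypothetical code, not Spark's): when a reader buffers a reused mutable row, every buffered entry aliases the same object, so later mutations clobber earlier records.

```scala
import scala.collection.mutable.ArrayBuffer

// A single reused "mutable row" (here just an array) and a buffer that
// stands in for the sorter's record buffer.
val mutableRow = new Array[Int](1)
val buffer = ArrayBuffer.empty[Array[Int]]

for (v <- Seq(1, 2, 3)) {
  mutableRow(0) = v
  buffer += mutableRow // stores a reference, not a snapshot
}

// Every buffered entry now reads 3; the values 1 and 2 are lost.
// Copying the row before buffering (e.g. mutableRow.clone()) avoids this.
```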
Merged build triggered.
Merged build started.
Test build #32025 has started for PR 5849 at commit
Test build #32025 has finished for PR 5849 at commit
Merged build finished. Test FAILed.
Test FAILed.
The stack trace is:
@sryza Is there any particular reason that
Merged build triggered.
Merged build started.
Test build #32084 has started for PR 5849 at commit
Test build #32084 has finished for PR 5849 at commit
Merged build finished. Test PASSed.
Test PASSed.
```diff
 override def readKey[T: ClassTag](): T = {
-  readKeyFunc()
-  key.asInstanceOf[T]
+  readKeyFunc().asInstanceOf[T]
 }
```
Does it make a performance difference if we move the cast to the line where we define `readKeyFunc`? If we did that, I think we'd be doing one cast vs. casting on each record.
Ah, I guess we don't have the class tag for T when we create the deserialization function, so this approach looks fine to me.
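The point about the class tag can be illustrated with a small sketch (hypothetical code, not the actual `SparkSqlSerializer2` internals): the deserialization function is created once, before any type parameter is in scope, so the best it can do is return `Any`, and the cast to `T` has to happen at the call site on each record.

```scala
import scala.reflect.ClassTag

// The deserialization function is built once, with no type parameter
// in scope, so it can only return Any.
val readKeyFunc: () => Any = () => "some key"

// T (and its ClassTag) is only in scope here, inside readKey, so the
// asInstanceOf cast happens per call rather than once where
// readKeyFunc is defined.
def readKey[T: ClassTag](): T = readKeyFunc().asInstanceOf[T]

val key: String = readKey[String]()
```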
```scala
() => {
  // If the schema is null, the returned function does nothing when it gets called.
  if (schema != null) {
    var i = 0
    val mutableRow = new GenericMutableRow(schema.length)
```
@yhuai and I chatted about this offline. The reason that we need to perform this copy is that this patch allows SqlSerializer2 to be used in cases where the shuffle performs a sort. In HashShuffleReader, Spark ends up passing the iterator returned from this deserializer to ExternalSorter, which buffers rows because it needs to sort them based on their contents.

I think that we only need to copy the row in cases where we're shuffling with a key ordering. To avoid unnecessary copying in other cases, I think that we can extend `SparkSqlSerializer2`'s constructor to accept a boolean flag that indicates whether we should copy, and thread that flag all the way down to here. In `Exchange`, where we create the serializer, we can check whether the shuffle will use a keyOrdering; if it does, then we'll enable copying. Avoiding this copy in other cases should provide a nice performance boost for aggregation queries.
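The flag-threading idea above can be sketched as follows (class and method names here are illustrative, not Spark's actual API): the serializer reuses one mutable row for speed, and only hands out defensive copies when the caller says downstream code will buffer rows for sorting.

```scala
// Stand-in for a row; Spark's real row classes differ.
final class SketchRow(val values: Array[Int]) {
  def copyRow(): SketchRow = new SketchRow(values.clone())
}

// copyRows would be decided where the serializer is created: true iff
// the shuffle has a key ordering and will buffer rows for sorting.
class Serializer2Sketch(schemaLength: Int, copyRows: Boolean) {
  private val mutableRow = new SketchRow(new Array[Int](schemaLength))

  def deserializeRow(input: Array[Int]): SketchRow = {
    Array.copy(input, 0, mutableRow.values, 0, schemaLength)
    // Only pay for a copy when downstream code buffers rows.
    if (copyRows) mutableRow.copyRow() else mutableRow
  }
}
```

With `copyRows = false` every call returns the same object (fast, but unsafe to buffer); with `copyRows = true` each call returns an independent snapshot.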
Merged build triggered.
Merged build started.
Test build #32158 has started for PR 5849 at commit
Test build #32158 has finished for PR 5849 at commit
Merged build finished. Test PASSed.
Test PASSed.
LGTM overall, especially since this code seems to be well covered by tests.
Thanks for reviewing it. I am merging it to master and branch 1.4. |
With 0a2b15c, the serialization stream and deserialization stream have enough information to determine whether they are handling a key-value pair, a key, or a value. It is safe to use `SparkSqlSerializer2` in more cases.

Author: Yin Huai <[email protected]>

Closes #5849 from yhuai/serializer2MoreCases and squashes the following commits:

53a5eaa [Yin Huai] Josh's comments.
487f540 [Yin Huai] Use BufferedOutputStream.
8385f95 [Yin Huai] Always create a new row at the deserialization side to work with sort merge join.
c7e2129 [Yin Huai] Update tests.
4513d13 [Yin Huai] Use Serializer2 in more places.

(cherry picked from commit 3af423c)
Signed-off-by: Yin Huai <[email protected]>