
[SPARK-24117][SQL] Unified the getSizePerRow #21189

Closed

wants to merge 2 commits into from

Conversation

@wangyum (Member) commented Apr 28, 2018

What changes were proposed in this pull request?

This PR unifies the per-row size estimation into a single getSizePerRow helper, because the same computation is duplicated in many places. For example:

  1. LocalRelation.scala#L80
  2. SizeInBytesOnlyStatsPlanVisitor.scala#L36

How was this patch tested?

Existing tests.
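For context, the unified helper can be sketched as follows. This is a simplified, self-contained model, not the actual Spark source: the real EstimationUtils.getSizePerRow operates on Seq[Attribute], while here each column is represented only by its data type's defaultSize in bytes:

```scala
object SizeEstimation {
  // Estimated size of one row: 8 bytes of row object overhead plus the
  // default size of each column's data type. The Seq[Int] parameter is a
  // stand-in for Seq[Attribute] to keep the sketch self-contained.
  def getSizePerRow(defaultSizes: Seq[Int]): Long =
    8L + defaultSizes.map(_.toLong).sum
}
```

For example, a row with a single IntegerType column (defaultSize = 4) is estimated at 12 bytes, whereas the pre-PR call sites that only summed defaultSize would have estimated 4.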

SparkQA commented Apr 28, 2018

Test build #89955 has finished for PR 21189 at commit cd41538.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -178,7 +179,7 @@ class MemoryDataWriter(partition: Int, outputMode: OutputMode)
* Used to query the data that has been written into a [[MemorySinkV2]].
*/
case class MemoryPlanV2(sink: MemorySinkV2, override val output: Seq[Attribute]) extends LeafNode {
private val sizePerRow = output.map(_.dataType.defaultSize).sum
@gatorsmile (Member) commented Apr 30, 2018
Contributor commented:

I wouldn't think it's possible.


  sink.addBatch(1, 4 to 6)
  plan.invalidateStatsCache()
- assert(plan.stats.sizeInBytes === 24)
+ assert(plan.stats.sizeInBytes === 72)
Member commented:

MemorySinkV2 is mainly for testing. I think the stats changes will not impact anything, right? @tdas @jose-torres

Contributor commented:

It shouldn't impact anything, but abstractly it seems strange that this unification would cause the stats to change? What are we doing differently to cause this, and how confident are we this won't happen to production sinks?

Contributor commented:

It seems we previously forgot to count the row object overhead (8 bytes) in the memory stream.
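That overhead accounts for the changed assertion in the test above. A hedged worked example, assuming the memory sink holds 6 rows with a single IntegerType column (defaultSize = 4 bytes):

```scala
val rows = 6                  // e.g. two batches of three rows each
val oldSizePerRow = 4         // data type default size only
val newSizePerRow = 8 + 4     // 8-byte row object overhead now included

// Old estimate: 6 * 4 = 24 bytes; new estimate: 6 * 12 = 72 bytes,
// matching the updated assertion in the test.
assert(rows * oldSizePerRow == 24)
assert(rows * newSizePerRow == 72)
```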

Contributor commented:

SGTM then

- val childRowSize = p.child.output.map(_.dataType.defaultSize).sum + 8
- val outputRowSize = p.output.map(_.dataType.defaultSize).sum + 8
+ val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+ val outputRowSize = EstimationUtils.getSizePerRow(p.output)
Member commented:

@cloud-fan
Contributor commented:
LGTM

SparkQA commented May 8, 2018

Test build #90365 has finished for PR 21189 at commit f72084e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

asfgit closed this in 487faf1 on May 8, 2018
5 participants