
[SPARK-24117][SQL] Unified the getSizePerRow #21189

Closed

wants to merge 2 commits into from

Conversation

@wangyum (Member) commented Apr 28, 2018

What changes were proposed in this pull request?

This PR unifies the per-row size estimation into a single getSizePerRow helper, because the same computation is duplicated in many places. For example:

  1. LocalRelation.scala#L80
  2. SizeInBytesOnlyStatsPlanVisitor.scala#L36

How was this patch tested?

Existing tests.
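For context, the unified helper can be sketched as follows. This is a simplified, self-contained model, not the actual Spark source: the real EstimationUtils.getSizePerRow operates on Seq[Attribute], while here each column is represented only by its data type's defaultSize in bytes:

```scala
object SizeEstimation {
  // Estimated size of one row: 8 bytes of row object overhead plus the
  // default size of each column's data type. The Seq[Int] parameter is a
  // stand-in for Seq[Attribute] to keep the sketch self-contained.
  def getSizePerRow(defaultSizes: Seq[Int]): Long =
    8L + defaultSizes.map(_.toLong).sum
}
```

For example, a row with a single IntegerType column (defaultSize = 4) is estimated at 12 bytes, whereas the pre-PR call sites that only summed defaultSize would have estimated 4.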

SparkQA commented Apr 28, 2018

Test build #89955 has finished for PR 21189 at commit cd41538.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -178,7 +179,7 @@ class MemoryDataWriter(partition: Int, outputMode: OutputMode)
* Used to query the data that has been written into a [[MemorySinkV2]].
*/
case class MemoryPlanV2(sink: MemorySinkV2, override val output: Seq[Attribute]) extends LeafNode {
private val sizePerRow = output.map(_.dataType.defaultSize).sum
@gatorsmile (Member) commented Apr 30, 2018
Contributor commented:

I wouldn't think it's possible.


  sink.addBatch(1, 4 to 6)
  plan.invalidateStatsCache()
- assert(plan.stats.sizeInBytes === 24)
+ assert(plan.stats.sizeInBytes === 72)
Member commented:

MemorySinkV2 is mainly for testing. I think the stats changes will not impact anything, right? @tdas @jose-torres

Contributor commented:

It shouldn't impact anything, but abstractly it seems strange that this unification would cause the stats to change? What are we doing differently to cause this, and how confident are we this won't happen to production sinks?

Contributor commented:

It seems we previously forgot to count the row object overhead (8 bytes) in the memory stream.
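That overhead accounts for the changed assertion in the test above. A hedged worked example, assuming the memory sink holds 6 rows with a single IntegerType column (defaultSize = 4 bytes):

```scala
val rows = 6                  // e.g. two batches of three rows each
val oldSizePerRow = 4         // data type default size only
val newSizePerRow = 8 + 4     // 8-byte row object overhead now included

// Old estimate: 6 * 4 = 24 bytes; new estimate: 6 * 12 = 72 bytes,
// matching the updated assertion in the test.
assert(rows * oldSizePerRow == 24)
assert(rows * newSizePerRow == 72)
```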

Contributor commented:

SGTM then

- val childRowSize = p.child.output.map(_.dataType.defaultSize).sum + 8
- val outputRowSize = p.output.map(_.dataType.defaultSize).sum + 8
+ val childRowSize = EstimationUtils.getSizePerRow(p.child.output)
+ val outputRowSize = EstimationUtils.getSizePerRow(p.output)
Member commented:

@cloud-fan
Contributor commented:
LGTM

SparkQA commented May 8, 2018

Test build #90365 has finished for PR 21189 at commit f72084e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master!

asfgit closed this in 487faf1 on May 8, 2018
5 participants