-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23214][SQL] cached data should not carry extra hint info #20394
Conversation
// However, we also want to keep the hint info after cache lookup. Here we skip the hint | ||
// node, so that the returned caching plan won't replace the hint node and drop the hint info | ||
// from the original plan. | ||
case hint: ResolvedHint => hint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small clean up for #20365
@@ -62,7 +62,7 @@ case class InMemoryRelation( | |||
@transient child: SparkPlan, | |||
tableName: Option[String])( | |||
@transient var _cachedColumnBuffers: RDD[CachedBatch] = null, | |||
val batchStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator, | |||
val sizeInBytesStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated but this name is more accurate.
val df1 = spark.createDataFrame(Seq((1, "4"), (2, "2"))).toDF("key", "value") | ||
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"))).toDF("key", "value") | ||
val df1 = Seq((1, "4"), (2, "2")).toDF("key", "value") | ||
val df2 = Seq((1, "1"), (2, "2")).toDF("key", "value") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some code style fixing
Test build #86637 has finished for PR 20394 at commit
|
@@ -62,7 +62,7 @@ case class InMemoryRelation( | |||
@transient child: SparkPlan, | |||
tableName: Option[String])( | |||
@transient var _cachedColumnBuffers: RDD[CachedBatch] = null, | |||
val batchStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator, | |||
val sizeInBytesStats: LongAccumulator = child.sqlContext.sparkContext.longAccumulator, | |||
statsOfPlanToCache: Statistics = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove null
LGTM |
}.getOrElse(currentFragment) | ||
lookupCachedData(currentFragment) | ||
.map(_.cachedRepresentation.withOutput(currentFragment.output)) | ||
.getOrElse(currentFragment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pinging me, @cloud-fan . I see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
// returned by cache lookup should not have hint info. If we lookup the cache with a | ||
// semantically same plan with a different hint info, `CacheManager.useCachedData` will take | ||
// care of it and retain the hint info in the lookup input plan. | ||
statsOfPlanToCache.copy(hints = HintInfo()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I agree with this. If we cache a plan with a hint, then it is reasonable to expect that the hint is still in the plan. We do the same with temporary views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new behavior we introduced in 2.3. I will first keep the behavior unchanged and merge it to 2.3.
We can have more discussion in the next release.
Test build #86643 has finished for PR 20394 at commit
|
LGTM Thanks! Merged to master/2.3 |
## What changes were proposed in this pull request? This is a regression introduced by #19864 When we lookup cache, we should not carry the hint info, as this cache entry might be added by a plan having hint info, while the input plan for this lookup may not have hint info, or have different hint info. ## How was this patch tested? a new test. Author: Wenchen Fan <[email protected]> Closes #20394 from cloud-fan/cache. (cherry picked from commit 5b5447c) Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?
This is a regression introduced by #19864
When we lookup cache, we should not carry the hint info, as this cache entry might be added by a plan having hint info, while the input plan for this lookup may not have hint info, or have different hint info.
How was this patch tested?
a new test.