[SPARK-26917][SQL] Further reduce locks in CacheManager #24028
Conversation
merge in spark
looks reasonable
Seems OK to me. I can't think of a material difference in behavior here that could cause a problem. The tradeoff seems to be the cost of clone() vs less time spent with the lock. I could believe that's a win but wonder if it's only a win under heavy load or at scale -- would it materially slow things down for other cases?
}
val plansToUncache = cachedDataCopy.filter(cd => shouldRemove(cd.plan))
It seems we always use while loops when performance should be a concern:
https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
No actual performance impact between the two patterns in your heavy workload?
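For concreteness, a hypothetical side-by-side of the two traversal styles in question (the types and names are stand-ins, not Spark code); as the replies below note, the predicate itself is usually the dominant cost either way:

import scala.collection.mutable.ArrayBuffer

object TraversalStyles {
  // Stand-in for a cached entry.
  final case class Cached(plan: String)

  // Functional style: concise; per-element cost is a call through the predicate.
  def selectWithFilter(cached: IndexedSeq[Cached], shouldRemove: Cached => Boolean): IndexedSeq[Cached] =
    cached.filter(shouldRemove)

  // While-loop style from the linked guide: avoids the generic traversal
  // machinery, which can matter in very hot code paths.
  def selectWithWhile(cached: IndexedSeq[Cached], shouldRemove: Cached => Boolean): IndexedSeq[Cached] = {
    val out = new ArrayBuffer[Cached](cached.length)
    var i = 0
    while (i < cached.length) {
      val cd = cached(i)
      if (shouldRemove(cd)) out += cd
      i += 1
    }
    out.toIndexedSeq
  }
}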
I suspect that it's the logic in shouldRemove that takes the time here, and can be done without the lock.
Yes, the problem is that the "shouldRemove" function is passed into this call. If that call is expensive, it causes the lock to be held for arbitrarily long amounts of time.
"shouldRemove" when called from "recacheByPath" causes a full traversal of the entire logical plan tree for every cached plan. In the process of doing this it will regenerate path strings for every file referenced by every single plan. In our situations at least this is easily many orders of magnitude more memory overhead than a shallow copy of the list.
Is this going to put pressure on the heap or GC with the "temp" copy?
Good question; I had the same question in mind. I just think that it is OK since this is a shallow copy of a list of objects with two fields.
I don't think this will be a significant allocation as it's a shallow copy. At smaller scale, it shouldn't matter much. I'd take the win at larger scale.
ok to test
Shall we use some "copy-on-write" thread-safe collections instead of doing the clone manually?
Test build #103294 has finished for PR 24028 at commit
retest this please
Test build #103304 has finished for PR 24028 at commit
In response to @cloud-fan, I updated this PR to store the cached data in a Scala immutable IndexedSeq. As a result, there is no longer any need for read locks at all: cachedData is changed to a var, and all we need is a write lock for things that want to update the var. From a code-clarity perspective I think this solution is cleaner. In addition to removing read locks, it removes several Java-to-Scala collection conversions.
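Roughly, that shape looks like the following self-contained sketch (names are illustrative; the real CacheManager stores CachedData entries and has more operations):

// Readers take a plain volatile read of an immutable IndexedSeq; only writers
// synchronize while swapping in a new sequence (copy-on-write).
final case class CachedEntry(name: String)

final class ImmutableSeqCache {
  @volatile private var entries: IndexedSeq[CachedEntry] = IndexedSeq.empty

  // No lock needed: a volatile read yields a consistent, immutable snapshot.
  def isEmpty: Boolean = entries.isEmpty
  def lookup(name: String): Option[CachedEntry] = entries.find(_.name == name)

  def add(e: CachedEntry): Unit = this.synchronized {
    entries = e +: entries
  }

  def removeMatching(shouldRemove: CachedEntry => Boolean): Seq[CachedEntry] = {
    // The potentially expensive predicate runs against a snapshot, outside the lock.
    val toRemove = entries.filter(shouldRemove)
    this.synchronized {
      // Re-read `entries` under the lock and drop by identity, so entries added
      // concurrently since the snapshot are kept.
      entries = entries.filterNot(cd => toRemove.exists(_ eq cd))
    }
    toRemove
  }
}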
Looks good. I couldn't see any new ways that concurrent updates to cachedData could cause a problem. I suppose you could clear it while uncaching plans and add something back after it was cleared, but that was already possible.
Test build #103344 has finished for PR 24028 at commit
Looks good except for one minor comment
Test build #103375 has finished for PR 24028 at commit
Test build #103384 has finished for PR 24028 at commit
Test build #103400 has finished for PR 24028 at commit
}

/** Checks if the cache is empty. */
def isEmpty: Boolean = readLock {
can we remove the readLock method?
I think we can remove writeLock as well, and simply use this.synchronized
I made this change, but while doing that I realized there was an issue with the earlier change to the "partition" function. See comment below
val (plansToUncache, remainingPlans) = cachedData.partition(cd => shouldRemove(cd.plan))
// If a new plan is cached by a different thread at this point, it will be in the cachedData object,
// but not in plansToUncache or remainingPlans. So the next line will remove it.
writeLock {
cachedData = remainingPlans
}
So I reverted this back to the previous behavior.
}
}
@transient @volatile
private var cachedData = IndexedSeq[CachedData]()
We'd better add some comments to explain the access pattern of cachedData, e.g. adding or removing elements should be done by creating a new seq, guarded by this.synchronized.
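One possible wording for such a comment on the declaration quoted above (illustrative, not the committed text):

/**
 * Cached plans, kept in an immutable IndexedSeq so that reads need no lock.
 * Never mutate the sequence in place: to add or remove entries, build a new
 * sequence and assign it to this var while holding `this.synchronized`.
 */
@transient @volatile
private var cachedData = IndexedSeq[CachedData]()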
@@ -101,11 +79,11 @@ class CacheManager extends Logging {
       sparkSession.sessionState.executePlan(planToCache).executedPlan,
       tableName,
       planToCache)
-    writeLock {
+    this.synchronized {
nit: please use space instead of tab.
}
cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
val plansToUncache = cachedData.filter(cd => shouldRemove(cd.plan))
this.synchronized {
ditto
Test build #103443 has finished for PR 24028 at commit
Test build #103448 has started for PR 24028 at commit
It looks like the build for the latest commit got messed up. It says here that it is still going, but on the Jenkins server it looks like it completed.
LGTM, pending Jenkins
It seems the Jenkins run stopped in the middle?
retest this please
Test build #103478 has finished for PR 24028 at commit
retest this please
Test build #103485 has finished for PR 24028 at commit
Test build #103504 has finished for PR 24028 at commit
Thanks, merging to master!
val needToRecache = cachedData.filter(condition)
this.synchronized {
  // Remove the cache entry before creating a new one.
  cachedData = cachedData.filterNot(cd => needToRecache.exists(_ eq cd))
I talked with @maropu. Since a write to a volatile reference is atomic, I think we could remove this.synchronized if we allowed the operation cachedData.filterNot(cd => needToRecache.exists(_ eq cd)) to run on multiple threads. Since the performance bottleneck was in the read lock, though, I do not recommend applying this change right away.
The write itself is atomic, but you need the this.synchronized to ensure that the value you are writing is computed from the most up-to-date value of cachedData. The value is read as part of cachedData.filterNot, and you need to make sure cachedData doesn't change between the time it is read and the time it is written.
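A small, self-contained sketch of the lost-update race described here (illustrative; the sleeps just widen the window):

object LostUpdateSketch extends App {
  @volatile private var data: IndexedSeq[Int] = IndexedSeq(1, 2, 3)

  // Read-modify-write with no lock: the final write is atomic, but it is based
  // on a snapshot that may be stale by the time it is written back.
  def unsafeRemoveEven(): Unit = {
    val snapshot = data
    Thread.sleep(50)                        // window where another thread can update `data`
    data = snapshot.filterNot(_ % 2 == 0)   // clobbers anything added in the meantime
  }

  // The adder synchronizes, but that does not help if the remover's
  // read-modify-write is not done under the same lock.
  def add(x: Int): Unit = this.synchronized { data = data :+ x }

  val remover = new Thread(() => unsafeRemoveEven())
  val adder = new Thread(() => { Thread.sleep(10); add(99) })
  remover.start(); adder.start(); remover.join(); adder.join()

  // Typically prints Vector(1, 3): the concurrently added 99 was lost, which is
  // why the filterNot must re-read and write back under the same lock.
  println(data)
}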
What changes were proposed in this pull request?
Further load increases in our production environment have shown that even the read locks can cause some contention, since they contain a mechanism that turns a read lock into an exclusive lock if a writer has been starved out. This PR reduces the potential for lock contention even further than #23833. Additionally, it uses more idiomatic scala than the previous implementation.
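For intuition on that claim, here is a small standalone experiment (illustrative only, not part of the PR): with java.util.concurrent's ReentrantReadWriteLock in its default non-fair mode, a new reader typically queues behind a waiting writer instead of sharing the read lock, so even "read-only" callers can stall.

import java.util.concurrent.locks.ReentrantReadWriteLock

object ReadLockContention extends App {
  val lock = new ReentrantReadWriteLock()

  // Reader 1 holds the read lock for a while.
  val longReader = new Thread(() => {
    lock.readLock().lock()
    try Thread.sleep(2000) finally lock.readLock().unlock()
  })

  // A writer arrives and parks, waiting for the read lock to be released.
  val writer = new Thread(() => {
    lock.writeLock().lock()
    try Thread.sleep(10) finally lock.writeLock().unlock()
  })

  // Reader 2 arrives after the writer: it typically queues behind the pending
  // writer rather than sharing the read lock with reader 1, so this "read"
  // can stall for roughly the remaining two seconds.
  val lateReader = new Thread(() => {
    val start = System.nanoTime()
    lock.readLock().lock()
    try println(s"late reader waited ${(System.nanoTime() - start) / 1e6} ms")
    finally lock.readLock().unlock()
  })

  longReader.start(); Thread.sleep(100)
  writer.start(); Thread.sleep(100)
  lateReader.start()
  Seq(longReader, writer, lateReader).foreach(_.join())
}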
@cloud-fan & @gatorsmile This is a relatively minor improvement to the previous CacheManager changes. At this point, I think we finally are doing the minimum possible amount of locking.
How was this patch tested?
Has been tested on a live system where the blocking was causing major issues and it is working well.
CacheManager has no explicit unit test but is used in many places internally as part of the SharedState.