[SPARK-20100][SQL] Refactor SessionState initialization #17433

hvanhovell · 2017-03-26T10:07:07Z

What changes were proposed in this pull request?

The current SessionState initialization code path is quite complex. Part of the creation is done in the SessionState companion objects, part is done inside the SessionState class, and part is done by passing functions.

This PR refactors this code path and consolidates SessionState initialization into a builder class. The SessionState itself will no longer do any initialization and becomes just a placeholder for the various Spark SQL internals. This also lays the groundwork for two future improvements:

1. This provides us with a start for removing the HiveSessionState. Removing the HiveSessionState would also require us to move resource loading into a separate class, and to (re)move the Hive metadata client.
2. This makes it easier to customize the Spark Session. Currently you will need to create a custom version of the builder; I have added hooks to facilitate this. A future step will be to create a semi-stable API on top of this.
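
To make the description above concrete, here is a minimal sketch of the builder shape. It is not the actual Spark code: Conf, Analyzer, State, BaseBuilder, CustomBuilder and extraRules are simplified stand-ins invented for illustration. Only the overall idea (a passive state object plus a builder with overridable pieces and hooks) is taken from this PR.

```scala
// Simplified stand-ins; none of these are the real Spark classes.
object BuilderSketch {
  final case class Conf(settings: Map[String, String])
  final case class Analyzer(conf: Conf, rules: Seq[String])

  // The state is a plain holder: it performs no initialization of its own.
  final class State(val conf: Conf, val analyzer: Analyzer)

  // The builder owns the wiring. Components are lazy so that a subclass
  // can override one of them without touching the others.
  class BaseBuilder(parent: Option[State] = None) {
    // Reuse the parent's settings when cloning, otherwise start empty.
    protected lazy val conf: Conf =
      parent.map(_.conf).getOrElse(Conf(Map.empty))

    // Hook: subclasses add rules here instead of rebuilding the analyzer.
    protected def extraRules: Seq[String] = Nil

    protected lazy val analyzer: Analyzer =
      Analyzer(conf, Seq("ResolveRelations") ++ extraRules)

    def build(): State = new State(conf, analyzer)
  }

  // A customized session only overrides the hook.
  class CustomBuilder(parent: Option[State] = None) extends BaseBuilder(parent) {
    override protected def extraRules: Seq[String] = Seq("MyRule")
  }
}
```

With this shape, `new BuilderSketch.CustomBuilder().build()` yields a state whose analyzer carries the extra rule, while the state class itself stays free of construction logic.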

How was this patch tested?

Existing tests.

hvanhovell · 2017-03-26T10:08:30Z

cc @cloud-fan

hvanhovell · 2017-03-26T10:08:40Z

cc @kunalkhamar

SparkQA · 2017-03-26T12:36:07Z

Test build #75234 has finished for PR 17433 at commit 711582f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class BaseSessionStateBuilder(
class SessionStateBuilder(
class SessionFunctionResourceLoader(session: SparkSession) extends FunctionResourceLoader
class HiveSessionStateBuilder(session: SparkSession, parentState: Option[SessionState] = None)

gatorsmile · 2017-03-27T04:37:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala

@@ -42,6 +43,8 @@ class SparkPlanner(
      InMemoryScans ::
      BasicOperators :: Nil)

+  def extraPlanningStrategies: Seq[Strategy] = Nil


How about adding comments for this func too?
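
As an illustration of what a documented hook could look like, here is a small sketch. Strategy and Planner are stand-ins, not the real SparkPlanner, and placing the extra strategies ahead of the built-in ones is an assumption made for the example.

```scala
// Stand-in types for illustration only; not the real Spark planner.
object PlannerSketch {
  // A name is enough to show ordering.
  final case class Strategy(name: String)

  class Planner {
    /**
     * Override to provide additional planning strategies.
     * In this sketch they are tried before the built-in ones
     * (an assumption, not a statement about Spark's behavior).
     */
    def extraPlanningStrategies: Seq[Strategy] = Nil

    // Built-in strategies, with the hook's strategies prepended.
    def strategies: Seq[Strategy] =
      extraPlanningStrategies ++
        Seq(Strategy("InMemoryScans"), Strategy("BasicOperators"))
  }
}
```

A subclass that overrides `extraPlanningStrategies` gets its strategies into the list without redefining `strategies` itself.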

cloud-fan · 2017-03-27T04:47:28Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+   * [[SessionState]]'s clone functionality. Make sure to override this when implementing your own
+   * [[SessionStateBuilder]].
+   */
+  protected def newBuilder: NewBuilder


Shall we make it a method instead of returning a function?

Nit: newBuilder -> newBuilder()

If I make this a method, then it captures the entire builder and, as a result, the parent SessionState; the latter can cause garbage collection issues if we use cloneSession a lot and have relatively short-lived sessions.

Should we move this to be the first def in BaseSessionStateBuilder, for readability? A future API user would see it next to type NewBuilder and also realize sooner that they need to provide an implementation.
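
The capture concern in this thread can be sketched roughly as follows. Session, State, BaseBuilder and DefaultBuilder are simplified stand-ins (the real classes take more parameters), and the settings map is a hypothetical placeholder for the session's configuration. The point is only that the clone function closes over a local function value, not over the builder.

```scala
// Stand-in types for illustration only; not the real Spark classes.
object CloneSketch {
  final class Session(val name: String)

  // The state keeps a clone function, but no reference to its builder.
  final class State(
      val session: Session,
      val settings: Map[String, String],
      cloneFn: (Session, State) => State) {
    def cloneFor(newSession: Session): State = cloneFn(newSession, this)
  }

  abstract class BaseBuilder(session: Session, parent: Option[State]) {
    type NewBuilder = (Session, Option[State]) => BaseBuilder

    // Returns a function that creates a fresh builder.
    protected def newBuilder: NewBuilder

    protected def createClone: (Session, State) => State = {
      // Evaluate once into a local value, so the lambda below refers
      // to `createBuilder` only and not to any member of this builder.
      val createBuilder = newBuilder
      (s, st) => createBuilder(s, Option(st)).build()
    }

    def build(): State = {
      val settings =
        parent.map(_.settings).getOrElse(Map("origin" -> session.name))
      new State(session, settings, createClone)
    }
  }

  class DefaultBuilder(session: Session, parent: Option[State] = None)
      extends BaseBuilder(session, parent) {
    // The function literal does not touch any member of this instance.
    override protected def newBuilder: NewBuilder = new DefaultBuilder(_, _)
  }
}
```

Cloning then goes through `state.cloneFor(otherSession)`, which builds a new state from the old one using only the stored function.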

gatorsmile · 2017-03-27T05:07:46Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+@InterfaceStability.Unstable
+class SessionStateBuilder(
+    session: SparkSession,
+    state: Option[SessionState] = None)


Keep the original name parentState?

gatorsmile · 2017-03-27T05:12:02Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

-    sparkConf.getAll.foreach { case (k, v) =>
-      sqlConf.setConfString(k, v)
+/**
+ * Session based [[FunctionResourceLoader]].


Actually, this is not session-based. So far, the resource loader is shared across sessions.

Ok, let me change this.

BTW: my first follow-up will be to move SessionState.addJar into the function resource loader, and have a Hive aware resource loader.

gatorsmile · 2017-03-27T06:03:07Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+  /**
+   * Build the [[SessionState]].
+   */
+  def build: SessionState = {


Nit: build -> build()

Yeah, build does have a side effect. Done.
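
For readers unfamiliar with the convention behind this nit: in Scala, parameterless methods without side effects are usually declared without parentheses, while side-effecting ones are declared and called with `()`. A tiny sketch, using a hypothetical Builder with a made-up counter as its side effect:

```scala
// Hypothetical example of the naming convention; unrelated to Spark's builder.
object ParensSketch {
  class Builder {
    private var builds = 0

    // Pure accessor: declared and called without parentheses.
    def buildCount: Int = builds

    // Side-effecting: declared and called with parentheses.
    def build(): String = {
      builds += 1
      s"state-$builds"
    }
  }
}
```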

gatorsmile · 2017-03-27T06:23:10Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+  /**
+   * Function used to make clones of the session state.
+   */
+  protected def createClone: (SparkSession, SessionState) => SessionState = {


Does it make sense to mark createClone final?

It does, however I would like to keep the builder as open as possible.

cloud-fan · 2017-03-27T06:31:07Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+/**
+ * Helper class for using SessionStateBuilders during tests.
+ */
+private[sql] trait WithTestConf { self: BaseSessionStateBuilder =>


If it's only used in tests, shall we move this class to test?

This is used in TestHive, which is part of main instead of test (no idea why that is, btw).
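
For context, a mixin of this kind could look roughly like the sketch below. The trait in the diff uses a self-type on BaseSessionStateBuilder; this sketch uses plain inheritance instead to stay small. BaseBuilder, defaultSettings, TestBuilder and the setting key are hypothetical stand-ins.

```scala
// Stand-in types for illustration only; not the real Spark classes.
object TestConfSketch {
  class BaseBuilder {
    protected def defaultSettings: Map[String, String] =
      Map("shuffle.partitions" -> "200")

    // Lazy so that a mixin can replace it before it is first used.
    protected lazy val conf: Map[String, String] = defaultSettings

    def build(): Map[String, String] = conf
  }

  // Mixin that layers test-specific settings over the defaults.
  trait WithTestConf extends BaseBuilder {
    def overrideConfs: Map[String, String]

    // Entries on the right-hand side of ++ win on key collisions.
    override protected lazy val conf: Map[String, String] =
      defaultSettings ++ overrideConfs
  }

  class TestBuilder extends BaseBuilder with WithTestConf {
    override def overrideConfs: Map[String, String] =
      Map("shuffle.partitions" -> "5")
  }
}
```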

gatorsmile · 2017-03-27T06:36:03Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+  /**
+   * Logical query plan analyzer for resolving unresolved attributes and relations.
+   *
+   * Note: this depends on the `conf` and `catalog` field.


Nit: field -> fields

Conceptually, yes, but we also expose SQLContext to the external data source APIs for relation resolution.

Yeah, this also depends on the SessionState and its sqlContext wrapper. The main goal of these notes was to document the dependencies within the builder.

gatorsmile · 2017-03-27T06:43:12Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+ */
+@Experimental
+@InterfaceStability.Unstable
+abstract class BaseSessionStateBuilder(


I am wondering if we need to create a separate file for this class?

Moving this to a separate file may be a good idea.

gatorsmile · 2017-03-27T07:19:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala

-      hadoopConf,
-      parser)
-
+  private[sql] def copyStateTo(target: SessionCatalog): Unit = {


How about changing it to copy the state from a source SessionCatalog into the current one?

private[sql] def copyState(source: SessionCatalog): Unit

So we have to synchronize on the source catalog (the target catalog is thread-safe since it is not visible to the outside world yet), and I would like to keep the locking internal to the source catalog. I will add some explanation.
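
A rough sketch of the locking argument, using a stand-in Catalog class (the fields and accessor names here are simplified assumptions, not the real SessionCatalog API). Because the copy method lives on the source, the source can take its own lock and callers never have to know about it:

```scala
// Stand-in catalog for illustration only; not the real SessionCatalog.
object CatalogSketch {
  import scala.collection.mutable

  class Catalog {
    private var currentDb: String = "default"
    private val tempTables = mutable.HashMap.empty[String, String]

    def setCurrentDatabase(db: String): Unit = synchronized { currentDb = db }
    def createTempView(name: String, plan: String): Unit =
      synchronized { tempTables(name) = plan }
    def getCurrentDatabase: String = synchronized { currentDb }
    def listTempViews: Set[String] = synchronized { tempTables.keySet.toSet }

    // Called on the SOURCE: `synchronized` locks the source catalog, which
    // other threads may be mutating. The target is assumed to be freshly
    // created and not yet shared, so it is written without its own lock.
    def copyStateTo(target: Catalog): Unit = synchronized {
      target.currentDb = currentDb
      tempTables.foreach { case (name, plan) => target.tempTables.put(name, plan) }
    }
  }
}
```

With the inverse signature, `copyState(source)`, the method would run on the target and would have to reach into the source's lock from outside.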

gatorsmile · 2017-03-27T07:29:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -146,6 +141,11 @@ abstract class Optimizer(sessionCatalog: SessionCatalog, conf: CatalystConf)
        s.withNewPlan(newPlan)
    }
  }
+
+  /**
+   * Override to provide additional rules for the operator optimization batch.


Not sure whether we will need to split the Operator Optimizations batch into smaller independent batches or move some rules out of this batch in the future. If so, the position of these extra rules becomes unstable. We might need to explain that in the comment.

Anything in catalyst can be changed between Spark versions, this hook included.

SparkQA · 2017-03-27T13:01:23Z

Test build #75260 has finished for PR 17433 at commit ecf7998.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-03-27T16:10:28Z

Looks good to me. It looks much cleaner after this PR! Thank you!

It sounds like this PR also partially resolves the JIRA: https://issues.apache.org/jira/browse/SPARK-18127

kunalkhamar

Looking close!

kunalkhamar · 2017-03-27T20:00:09Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+  /**
+   * Internal catalog managing functions registered by the user.
+   *
+   * This either gets cloned from a pre-existing version or cloned from the build-in registry.


super nit: built-in registry

kunalkhamar · 2017-03-27T20:46:19Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+ */
+@Experimental
+@InterfaceStability.Unstable
+abstract class BaseSessionStateBuilder(


Moving this to a separate file may be a good idea.

kunalkhamar · 2017-03-27T21:12:54Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala

- * @param queryExecutionCreator Lambda to create a [[QueryExecution]] from a [[LogicalPlan]]
- * @param plannerCreator Lambda to create a planner that takes into account Hive-specific strategies
+ * @param optimizer Logical query plan optimizer.
+ * @param planner Planner that converts optimized logical plans to physical plans


nit: the comment used to include "planner takes into account Hive-specific strategies"; let's add that back for completeness?

kunalkhamar · 2017-03-27T21:21:15Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala

  }

  /**
-   * Create an logical query plan `Analyzer` with rules specific to a `HiveSessionState`.
+   * An logical query plan `Analyzer` with rules specific to Hive.


super nit: "A logical ..."

kunalkhamar · 2017-03-27T21:46:11Z

sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala

+   * [[SessionState]]'s clone functionality. Make sure to override this when implementing your own
+   * [[SessionStateBuilder]].
+   */
+  protected def newBuilder: NewBuilder


Should we move this to be the first def in BaseSessionStateBuilder, for readability? A future API user would see it next to type NewBuilder and also realize sooner that they need to provide an implementation.

kunalkhamar · 2017-03-27T21:54:09Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala

-      initHelper.streamingQueryManager,
-      queryExecutionCreator,
-      initHelper.plannerCreator)
+    new TestHiveSessionStateBuilder(this, parentSessionState).build


kunalkhamar · 2017-03-27T21:58:05Z

sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala

-      }
-    })
+  override lazy val sessionState: SessionState = {
+    new TestSQLSessionStateBuilder(this, None).build


kunalkhamar · 2017-03-27T22:06:11Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala

-        newSparkSession,
-        confCopy,
-        experimentalMethodsCopy))
+  def apply(session: SparkSession): SessionState = {


: HiveSessionState = {

kunalkhamar · 2017-03-27T22:12:33Z

sql/core/src/test/scala/org/apache/spark/sql/test/TestSQLContext.scala

@@ -19,7 +19,7 @@ package org.apache.spark.sql.test

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SparkSession
-import org.apache.spark.sql.internal.{SessionState, SQLConf}
+import org.apache.spark.sql.internal._


import org.apache.spark.sql.internal.{SessionState, SessionStateBuilder, SQLConf, WithTestConf}

kunalkhamar · 2017-03-27T22:14:18Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala

 import org.apache.spark.sql.execution.datasources._
 import org.apache.spark.sql.hive.client.HiveClient
-import org.apache.spark.sql.internal.{SessionState, SharedState, SQLConf}
+import org.apache.spark.sql.internal._


import org.apache.spark.sql.internal.{BaseSessionStateBuilder, SessionFunctionResourceLoader, SessionState, SharedState, SQLConf}

SparkQA · 2017-03-28T01:28:34Z

Test build #75282 has finished for PR 17433 at commit 4c5ed96.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class BaseSessionStateBuilder(

cloud-fan · 2017-03-28T02:07:39Z

LGTM, merging to master!

Refactor SessionState initialization.

711582f

gatorsmile reviewed Mar 27, 2017

cloud-fan reviewed Mar 27, 2017

gatorsmile reviewed Mar 27, 2017

cloud-fan reviewed Mar 27, 2017

gatorsmile reviewed Mar 27, 2017

Code review.

ecf7998

kunalkhamar reviewed Mar 27, 2017

Code review

4c5ed96

asfgit closed this in ea36116 Mar 28, 2017

kunalkhamar mentioned this pull request Mar 28, 2017

[SPARK-20048][SQL] Cloning SessionState does not clone query execution listeners #17379

Closed
