[SPARK-13923] [SQL] Implement SessionCatalog #11750
Conversation
```scala
   * If no such database is specified, create it in the current database.
   */
  def createTable(
      currentDb: String,
```
It is somewhat strange to pass in `currentDb` and then rely on some table definition's database. Have you thought about just figuring that part out in the caller, outside the session catalog? That is, the catalog itself doesn't need to handle `currentDb`.
As discussed offline, let's move the tracking of the current db into `SessionCatalog`.
As discussed offline, this is because we need to deal with temporary tables. An alternative, which I will implement here, is to keep track of the current database in this class so we don't need to pass it in everywhere.
Make sure we document that in the session catalog code and explain why we are tracking the current db in here.
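Sketched below is the shape this could take once the catalog owns the current database. This is a minimal illustration, not Spark's actual implementation: `SimpleSessionCatalog`, `setCurrentDatabase`, and the string-valued table definitions are all hypothetical.

```scala
import scala.collection.mutable

// Hypothetical sketch: the catalog tracks the current database itself,
// so callers no longer pass currentDb into every method.
class SimpleSessionCatalog {
  private var currentDb: String = "default"
  // (database, table) -> table definition; a plain String stands in
  // for the real metadata class.
  private val tables = mutable.HashMap.empty[(String, String), String]

  def setCurrentDatabase(db: String): Unit = { currentDb = db }

  // If the caller specifies no database, fall back to the tracked one.
  def createTable(table: String, definition: String, db: Option[String] = None): Unit = {
    val resolvedDb = db.getOrElse(currentDb)
    tables((resolvedDb, table)) = definition
  }
}
```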
Test build #53263 has finished for PR 11750 at commit
This is a standalone commit so that in the future we can split it out into a separate patch if preferable.
This allows us to avoid passing it into every single method, as we had to before this commit.
Force-pushed from 3b2e48a to ad43a5f (compare).
```diff
@@ -167,7 +170,7 @@ abstract class ExternalCatalog {
  * @param name name of the function
  * @param className fully qualified class name, e.g. "org.apache.spark.util.MyFunc"
  */
-case class CatalogFunction(name: String, className: String)
+case class CatalogFunction(name: FunctionIdentifier, className: String)
```
`catalogFunc.name.funcName` is kind of weird (we do not need to change it right now).
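For context, here is a stripped-down sketch of the shapes involved. The two-field `FunctionIdentifier` below is a simplification for illustration; the real Spark definitions carry more detail.

```scala
// Simplified sketch of identifier-based function naming.
case class FunctionIdentifier(funcName: String, database: Option[String] = None)

case class CatalogFunction(name: FunctionIdentifier, className: String)

val catalogFunc = CatalogFunction(
  FunctionIdentifier("myFunc", Some("db1")),
  "org.apache.spark.util.MyFunc")

// This double hop is the awkwardness noted above: the "name" of the
// function is itself a structure with its own name field.
val bareName = catalogFunc.name.funcName  // "myFunc"
```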
If there are no major problems, let's merge this as soon as tests pass. We can address minor comments as follow-ups. This will unblock a bunch of other stuff.
Yea, let's merge this once it passes tests.
```scala
if (tempTables.containsKey(name) && !ignoreIfExists) {
  throw new AnalysisException(s"Temporary table '$name' already exists.")
}
tempTables.put(name, tableDefinition)
```
This isn't really `ignoreIfExists` but `updateIfExists`?
`ignoreIfExists` is like "ignore the exception if the table already exists", but I guess it doesn't convey whether the table itself is overridden. I could rename this.
Let's make this name more explicit. I think in other places, `ignoreIfExists` means that if a table exists, we do nothing.
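To make the distinction concrete, here is a small sketch contrasting the two semantics. The class and parameter names (`TempTables`, `overrideIfExists`) are illustrative rather than Spark's actual API, and a generic exception stands in for `AnalysisException`.

```scala
import java.util.concurrent.ConcurrentHashMap

class TempTables[T] {
  private val tempTables = new ConcurrentHashMap[String, T]()

  // Semantics of the snippet above: when the flag is false, an existing
  // table is an error; when true, put() silently replaces the old entry.
  // The table is overwritten, not merely "ignored", hence the rename.
  def createTempTable(name: String, tableDefinition: T, overrideIfExists: Boolean): Unit = {
    if (tempTables.containsKey(name) && !overrideIfExists) {
      throw new IllegalStateException(s"Temporary table '$name' already exists.")
    }
    tempTables.put(name, tableDefinition)
  }

  // True ignore-if-exists semantics: do nothing when the table exists.
  def createIfNotExists(name: String, tableDefinition: T): Unit = {
    tempTables.putIfAbsent(name, tableDefinition)
  }
}
```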
Test build #53358 has finished for PR 11750 at commit
Test build #2647 has finished for PR 11750 at commit
Test build #2648 has finished for PR 11750 at commit
Test build #2646 has finished for PR 11750 at commit
OK. Let me merge this. Let's address comments in the follow-up PR.
## What changes were proposed in this pull request?

As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.

A recent patch, apache#11573, parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.

## How was this patch tested?

800+ lines of tests in `SessionCatalogSuite`.

Author: Andrew Or <[email protected]>

Closes apache#11750 from andrewor14/temp-catalog.
## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`. As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:

- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <[email protected]>
Author: Yin Huai <[email protected]>

Closes #11836 from andrewor14/use-session-catalog.
## What changes were proposed in this pull request?

This patch addresses the remaining comments left in apache#11750 and apache#11918 after they are merged. For a full list of changes in this patch, just trace the commits.

## How was this patch tested?

`SessionCatalogSuite` and `CatalogTestCases`

Author: Andrew Or <[email protected]>

Closes apache#12006 from andrewor14/session-catalog-followup.
## What changes were proposed in this pull request?

Session catalog was added in #11750. However, it doesn't really support temporary functions properly; right now we only store the metadata in the form of `CatalogFunction`, but this doesn't make sense for temporary functions because there is no class name. This patch moves the `FunctionRegistry` into the `SessionCatalog`. With this, the user can call `catalog.createTempFunction` and `catalog.lookupFunction` to use the function they registered previously. This is currently still dead code, however.

## How was this patch tested?

`SessionCatalogSuite`.

Author: Andrew Or <[email protected]>

Closes #11972 from andrewor14/temp-functions.
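A rough sketch of the registry-owned-by-the-catalog shape that commit message describes. The `Expression` stub and the method signatures below are guesses for illustration, not the actual Spark API.

```scala
import scala.collection.mutable

// Minimal stand-in so the sketch is self-contained.
trait Expression
case class Literal(value: Any) extends Expression

// Hypothetical sketch: temporary functions live in an in-memory registry
// owned by the session catalog, mapping a name to a builder that turns
// argument expressions into a result expression.
class SketchSessionCatalog {
  private val functionRegistry =
    mutable.HashMap.empty[String, Seq[Expression] => Expression]

  def createTempFunction(name: String, builder: Seq[Expression] => Expression): Unit = {
    functionRegistry(name) = builder
  }

  def lookupFunction(name: String, children: Seq[Expression]): Expression = {
    functionRegistry.getOrElse(
      name,
      throw new NoSuchElementException(s"Undefined function: '$name'")
    ).apply(children)
  }
}
```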
What changes were proposed in this pull request?

As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently, this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`.

A recent patch, #11573, parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.

How was this patch tested?

800+ lines of tests in `SessionCatalogSuite`.
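As a condensed illustration of the layering this description outlines, the sketch below shows session-scoped temporary tables shadowing a metastore-backed catalog. All names are illustrative stand-ins, not the real Spark classes.

```scala
import scala.collection.mutable

// Stand-in for the metastore-backed catalog interface.
trait ExternalCatalogLike {
  def createTable(db: String, table: String, definition: String): Unit
  def getTable(db: String, table: String): Option[String]
}

// Session-scoped layer: temporary tables are kept in memory and take
// precedence over metastore tables of the same name on lookup.
class SessionCatalogSketch(external: ExternalCatalogLike) {
  private var currentDb: String = "default"
  private val tempTables = mutable.HashMap.empty[String, String]

  def setCurrentDatabase(db: String): Unit = { currentDb = db }

  def createTempTable(name: String, definition: String): Unit =
    tempTables(name) = definition

  // Metastore operations are delegated to the external catalog.
  def createTable(name: String, definition: String): Unit =
    external.createTable(currentDb, name, definition)

  def lookupTable(name: String): Option[String] =
    tempTables.get(name).orElse(external.getTable(currentDb, name))
}
```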