Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14014] [SQL] Replace existing catalog with SessionCatalog #11918

Closed
wants to merge 32 commits into from
Closed

[SPARK-14014] [SQL] Replace existing catalog with SessionCatalog #11918

wants to merge 32 commits into from

Conversation

yhuai
Copy link
Contributor

@yhuai yhuai commented Mar 23, 2016

What changes were proposed in this pull request?

SessionCatalog, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to ExternalCatalog. This functionality overlaps a lot with the existing analysis.Catalog.

As of this commit, SessionCatalog and ExternalCatalog will no longer be dead code. There are still things that need to be done after this patch, namely:

  • SPARK-14013: Properly implement temporary functions in SessionCatalog
  • SPARK-13879: Decide which DDL/DML commands to support natively in Spark
  • SPARK-?????: Implement the ones we do want to support through SessionCatalog.
  • SPARK-?????: Merge SQL/HiveContext

How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are SessionCatalogSuite and ExternalCatalogSuite.

NOTE: This one has an extra commit on top of #11836 for fixing python tests.

Andrew Or added 30 commits March 16, 2016 15:50
commit ad43a5f
Author: Andrew Or <[email protected]>
Date:   Wed Mar 16 14:35:02 2016 -0700

    Expand test scope + clean up test code

commit 08969cd
Author: Andrew Or <[email protected]>
Date:   Wed Mar 16 13:21:50 2016 -0700

    Fix tests

commit 6d9fa2f
Author: Andrew Or <[email protected]>
Date:   Wed Mar 16 12:31:52 2016 -0700

    Keep track of current database in SessionCatalog

    This allows us to not pass it into every single method like
    we used to before this commit.

commit ff1c2c4
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 19:42:22 2016 -0700

    Add TODO

commit 8c84dd8
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 19:41:30 2016 -0700

    Implement tests for functions

commit 3da16fb
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 19:04:03 2016 -0700

    Implement tests for table partitions

commit 7947445
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 18:52:30 2016 -0700

    Implement tests for databases and tables

commit 2f5121b
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 16:59:38 2016 -0700

    Fix infinite loop (woops)

commit d3f252d
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 16:12:55 2016 -0700

    Refactor CatalogTestCases to make methods accessible

commit caa4013
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 15:44:23 2016 -0700

    Clean up duplicate code in Table/FunctionIdentifier

commit 90ccdbb
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 15:33:30 2016 -0700

    Fix style

commit 5587a49
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 15:32:38 2016 -0700

    Implement SessionCatalog using ExternalCatalog

commit 196f7ce
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 14:39:22 2016 -0700

    Document and clean up function methods

commit 6d530a9
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 14:38:50 2016 -0700

    Fix tests

commit 2118212
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 14:33:20 2016 -0700

    Refactor CatalogFunction to use FunctionIdentifier

commit dd1fbae
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 14:22:37 2016 -0700

    Refactor CatalogTable to use TableIdentifier

    This is a standalone commit such that in the future we can split
    it out into a separate patch if preferrable.

commit 39a153c
Author: Andrew Or <[email protected]>
Date:   Tue Mar 15 13:53:42 2016 -0700

    Take into account current database in table methods

commit 5bf695c
Author: Andrew Or <[email protected]>
Date:   Mon Mar 14 17:14:59 2016 -0700

    Do the same for functions and partitions

commit 1d12578
Author: Andrew Or <[email protected]>
Date:   Mon Mar 14 16:27:11 2016 -0700

    Clean up table method signatures + add comments

commit 98c8a3b
Author: Andrew Or <[email protected]>
Date:   Thu Mar 10 16:35:35 2016 -0800

    Merge in @yhuai's changes
We need to be able to pass in ExternalCatalog in the constructor
of SQLContext and subclasses because these should be persistent
across sessions. Unfortunately without significant refactoring
in the HiveContext and TestHive code we cannot make this simple
change happen.
This failed because SessionCatalog does not implement
refreshTable. This is a bigger problem because SessionCatalog
has no notion of caching tables in the first place and so it
doesn't really make sense to implement refreshTable. More
refactoring involving HiveMetastoreCatalog is required to
make this work.
This commit deletes the trait analysis.Catalog and all of its
subclasses, with one notable exception: HiveMetastoreCatalog
is kept because a lot of existing functionality (like caching
data source tables) are still needed. All other occurrences
are now replaced with SessionCatalog.

Unfortunately, because HiveMetastoreCatalog is a massive
sprawl of unmaintainable code, there is no clean way to
integrate it nicely with the new HiveCatalog. The path of
least resistance, then, route previous usages of
HiveMetastoreCatalog through HiveCatalog. This requires
some whacky initialization order hacks because HMC takes
in HiveContext but HiveContext takes in HiveCatalog.
The biggest change here is moving HiveMetastoreCatalog from
HiveCatalog (the external one) to HiveSessionCatalog (the session
specific one). This is needed because HMC depends on a lot of
session specific things for, e.g. creating data source tables.
This was failing tests that do things with multiple sessions,
i.e. HiveQuerySuite.
There were some issues with case sensitivity analysis and error
messages not being exactly as expected. The latter is now relaxed
where possible.
Note: This commit ignores a test in CliSuite. There a future
timed out and I investigated for like half an hour and could
not figure out why. It has something to do with the way we set
the current database and executing commands with "-e". This
will take a little longer to debug so I prefer to do that in
a separate patch.
The problem was that the metadataHive didn't get any of the
spark.sql.* confs, so the barrier prefixes weren't actually set.
Thanks to @yhuai for uncovering this.
The issue is that after each test we only set the current
database in Hive but not the one in SessionCatalog. This means
the next test will create a table in the default database (since
we just pass CREATE TABLE commands to hive currently) but try
to resolve it in a database left over from a previous test.
We were expecting an "OK" that never came. This test is way
to specific anyway and is super brittle. It's also better to
alawys set the current database through the catalog so we don't
end up with mismatched current databases between Spark and Hive.
Every time we called TestHive.reset() we created a new temp
directory for derby, and then we would go ahead and override
the old one in the same TestHiveContext. This fails tests that
use multiple sessions for some reason. Setting the same confs
in metadataHive whenever we call reset() seems unnecessary,
so I removed it.
@@ -554,7 +554,7 @@ def tableNames(self, dbName=None):
>>> sqlContext.registerDataFrameAsTable(df, "table1")
>>> "table1" in sqlContext.tableNames()
True
>>> "table1" in sqlContext.tableNames("db")
>>> "table1" in sqlContext.tableNames("default")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the follow-up PR, let's have a test for a non-existent db (we should expect an error).

@yhuai
Copy link
Contributor Author

yhuai commented Mar 23, 2016

test this please

@SparkQA
Copy link

SparkQA commented Mar 23, 2016

Test build #2668 has finished for PR 11918 at commit d3c40f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2016

Test build #53938 has finished for PR 11918 at commit d3c40f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai yhuai closed this Mar 23, 2016
@SparkQA
Copy link

SparkQA commented Mar 23, 2016

Test build #2669 has finished for PR 11918 at commit d3c40f7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 28, 2016
## What changes were proposed in this pull request?

This patch addresses the remaining comments left in apache#11750 and apache#11918 after they are merged. For a full list of changes in this patch, just trace the commits.

## How was this patch tested?

`SessionCatalogSuite` and `CatalogTestCases`

Author: Andrew Or <[email protected]>

Closes apache#12006 from andrewor14/session-catalog-followup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants