# [SPARK-13139][SQL] Parse Hive DDL commands ourselves #11573

## Conversation
Conflicts: `sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/bucket.scala`

Force-pushed from 3766f83 to a663b5c.
```scala
    buckets: Option[BucketSpec],
    // TODO: use `clustered` and `sorted` instead for simplicity
    noClustered: Boolean,
    noSorted: Boolean)(sql: String)
```
@viirya was there any reason why these have to be negative? It's much easier to understand if it's positive, i.e. `clustered` and `sorted`.
Just because the corresponding tokens are `TOK_NOT_CLUSTERED` and `TOK_NOT_SORTED`. Yeah, we can use positive names here.
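For illustration, here is a minimal, self-contained sketch of the positive-flag reading discussed above. The `TOK_NOT_*` token names come from the thread; everything else (the object, the helper name, and the plain-string token list) is assumed for the demo, not taken from the patch.

```scala
object FlagFlip {
  // Flip Hive's negative tokens into positive flags at parse time, so that
  // downstream code reads naturally: `clustered` instead of `noClustered`.
  def positiveFlags(tokens: Seq[String]): (Boolean, Boolean) = {
    val clustered = !tokens.contains("TOK_NOT_CLUSTERED")
    val sorted = !tokens.contains("TOK_NOT_SORTED")
    (clustered, sorted)
  }

  def main(args: Array[String]): Unit = {
    // TOK_NOT_SORTED flips only the `sorted` flag.
    val (clustered, sorted) = positiveFlags(Seq("TOK_NOT_SORTED"))
    assert(clustered && !sorted)
  }
}
```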
Note: the only changes I made on top of #11048 are addressing the outstanding comments in that patch and some minor cleanups. It's entirely possible that some things are still missing or incorrect, given that the original patch was never fully reviewed. @hvanhovell @yhuai PTAL.
Test build #52630 has finished for PR 11573.
```scala
// @@ -64,6 +85,53 @@ private[sql] class SparkQl(conf: ParserConf = SimpleParserConf()) extends Cataly
      val tableIdent = extractTableIdent(nameParts)
      RefreshTable(tableIdent)

    case Token("TOK_CREATEDATABASE", Token(databaseName, Nil) :: createDatabaseArgs) =>
      val Seq(
        allowExisting,
```
Isn't `allowExisting` the exact opposite of `IF NOT EXISTS`? I am asking because I have a similar case in my PR.
I don't quite understand your question.
When I see a parameter named `allowExisting`, I think the existence of the current table is allowed and, as a consequence, that the current CREATE TABLE command is going to overwrite the existing table. Why not name this `ifNotExists`?
I've renamed it.
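As a hedged illustration of why the rename reads better, here is a tiny sketch of extracting the flag from a node's child tokens. The `TOK_IFNOTEXISTS` token name and the helper around it are assumptions for the demo, not lifted from the patch.

```scala
object IfNotExistsDemo {
  // `ifNotExists = true` means "succeed silently if the database already
  // exists", which matches the SQL it came from; the old name `allowExisting`
  // instead suggested that an overwrite would happen.
  def parseIfNotExists(childTokens: Seq[String]): Boolean =
    childTokens.contains("TOK_IFNOTEXISTS")

  def main(args: Array[String]): Unit = {
    assert(parseIfNotExists(Seq("TOK_IFNOTEXISTS")))
    assert(!parseIfNotExists(Seq.empty[String]))
  }
}
```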
Force-pushed from f54501f to 6538829.
This commit cleans up the case class signatures representing ALTER TABLE commands, along with the code that parses them. It also adds a lot of missing documentation that was sorely needed to make `AlterTableCommandParser` readable. The cleanup in this commit concerns only ALTER TABLE commands, but is expected to spread to other parts of the PR in subsequent commits.
Force-pushed from 6538829 to 6c3f994.
@hvanhovell I've addressed most of your comments, except a couple that I said I would fix later. Separately, I've also significantly cleaned up the logic and signatures in the ALTER TABLE code, so you should have a look at that as well; it's probably quite different from what it looked like when you last reviewed this patch. I'll continue the cleanup in the rest of the patch shortly and am marking this as WIP in the meantime.
@hvanhovell I believe that, as of the latest commit, I've addressed all of your comments. This is ready from my side, so I've removed the WIP tag. PTAL.
Test build #52773 has finished for PR 11573.

Test build #52770 has finished for PR 11573.
```diff
  */
 private[sql] case class BucketSpec(
     numBuckets: Int,
     bucketColumnNames: Seq[String],
-    sortColumnNames: Seq[String])
+    sortColumns: Seq[(String, SortDirection)]) {
```
It is great to have `SortDirection` defined explicitly, so when we create it we know what the ordering will be. However, this change implies that we can use `descending` as the order, which is actually not allowed, because `UnsafeKVExternalSorter` only sorts keys in ascending order (this sorter is used in `DynamicPartitionWriterContainer` when we generate bucketed files). So I think it is better to revert this change for now. In the future, we can revisit it once we store the sort direction in the metastore and can actually sort rows in a bucket file in descending order.
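A sketch of what the reverted shape implies: sort columns are recorded by name only, with ascending order as the implicit (and only) contract. The doc comment and the `require` guard here are illustrative additions, not from the patch.

```scala
/**
 * Bucketing metadata for a table. Sort columns carry no direction because the
 * bucket writer (UnsafeKVExternalSorter, driven by
 * DynamicPartitionWriterContainer) only sorts keys in ascending order.
 */
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String]) {
  require(numBuckets > 0, s"Number of buckets must be positive, got $numBuckets.")
}
```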
Force-pushed from ca64636 to bd91b0f.
```scala
    }
    case _ => parseFailed("Invalid CREATE FUNCTION command", node)
  }.toMap
  CreateFunction(funcName, alias, resourcesMap, temp.isDefined)(node.source)
```
What will this map look like if I have `USING JAR 'jar1', JAR 'jar2', ...`?
Let's also have a test for this case.
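The concern here is that `.toMap` keys the resources by their type, so two `JAR` entries collide and only the last one survives. A minimal demonstration of the `Seq#toMap` behavior (the resource names are made up):

```scala
object ToMapDemo {
  def main(args: Array[String]): Unit = {
    // Parsing USING JAR 'jar1', JAR 'jar2' into (type -> path) pairs and then
    // calling .toMap keeps only the last value per key, silently dropping jar1.
    val resources = Seq("jar" -> "jar1", "jar" -> "jar2")
    println(resources.toMap) // Map(jar -> jar2)
  }
}
```

Keeping the resources as a `Seq[(String, String)]` (or a multimap) would preserve both entries, which is presumably what a test for this case would pin down.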
I have left a few comments. It is a good starting point. Thank you for working on this!
Test build #52937 has finished for PR 11573.
OK, let's merge this first (to avoid conflicts caused by other commits). We can address the remaining comments in a follow-up PR.
Addressing outstanding comments in apache#11573. Tested with Jenkins and a new test case in `DDLCommandSuite`.

Author: Andrew Or <[email protected]>

Closes apache#11667 from andrewor14/ddl-parser-followups.
## What changes were proposed in this pull request?

As part of the effort to merge `SQLContext` and `HiveContext`, this patch implements an internal catalog called `SessionCatalog` that handles temporary functions and tables and delegates metastore operations to `ExternalCatalog`. Currently this is still dead code, but in the future it will be part of `SessionState` and will replace `o.a.s.sql.catalyst.analysis.Catalog`. A recent patch, #11573, parses Hive commands ourselves in Spark, but still passes the entire query text to Hive. In a future patch, we will use `SessionCatalog` to implement the parsed commands.

## How was this patch tested?

800+ lines of tests in `SessionCatalogSuite`.

Author: Andrew Or <[email protected]>

Closes #11750 from andrewor14/temp-catalog.
## What changes were proposed in this pull request?

This patch is ported over from @viirya's changes in #11048. Currently, for most DDLs, we just pass the query text directly to Hive. Instead, we should parse these commands ourselves and, in the future (not as part of this patch), use the `HiveCatalog` to process these DDLs. This is a precursor to merging `SQLContext` and `HiveContext`.

Note: as of this patch we still pass the query text to Hive. The difference is that we now parse the commands ourselves, so in the future we can just use our own catalog.

## How was this patch tested?

Jenkins, plus the new `DDLCommandSuite`, which makes up about 40% of the changes here.
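The "parse ourselves but still pass the text to Hive" approach shows up in the `CreateFunction(...)(node.source)` call quoted earlier: parsed fields go in the first parameter list, and the raw SQL rides along in a second. A sketch of that shape, with field names inferred from the call site and the rest assumed:

```scala
// In a Scala case class, only the first parameter list participates in the
// generated equals/hashCode and pattern matching, so the raw SQL string tags
// along without affecting plan equality, and execution can still hand the
// untouched command text to Hive.
case class CreateFunction(
    functionName: String,
    alias: String,
    resources: Map[String, String],
    isTemp: Boolean)(val sql: String)
```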