
[SPARK-19129] [SQL] SessionCatalog: Disallow empty part col values in partition spec #16583

Closed

Conversation

gatorsmile (Member)

What changes were proposed in this pull request?

Empty partition column values are not valid in a partition specification. Before this PR, we accepted them, and the Hive metastore does not detect or disallow them either. As a result, users hit the following surprising behavior.

```Scala
val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
spark.sql("alter table partitionedTable drop partition(partCol1='')")
spark.table("partitionedTable").show()
```

In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with an empty value.

When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not match actual Hive behavior. This PR disallows such invalid partition specs in the SessionCatalog APIs.
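
For illustration, here is a minimal sketch of the kind of guard this change adds to the `SessionCatalog` APIs; the helper name and error message below are assumptions for illustration, not necessarily the merged code.

```Scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.catalog.CatalogTypes.TablePartitionSpec

// Illustrative guard (name and message assumed): reject any partition spec
// that contains a null or empty partition column value.
def requireNonEmptyValueInPartitionSpec(specs: Seq[TablePartitionSpec]): Unit = {
  specs.foreach { spec =>
    if (spec.values.exists(v => v == null || v.isEmpty)) {
      throw new AnalysisException(
        s"Partition spec is invalid. The spec ($spec) contains an empty partition column value")
    }
  }
}
```

Each partition-handling SessionCatalog API (create/drop/rename/list) can call such a guard on its input spec(s) before touching the metastore.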

How was this patch tested?

Added test cases

@SparkQA commented Jan 14, 2017

Test build #71358 has finished for PR 16583 at commit 1719286.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member Author)

retest this please

@gatorsmile (Member Author)

cc @cloud-fan @ericl @yhuai

```Scala
val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
val e = intercept[AnalysisException] {
  spark.sql("alter table partitionedTable drop partition(partCol1='')")
}
```
Contributor

What's the behavior of Hive here? Does it also throw an exception?

Contributor

Hive (v2.1.1) does not throw an exception or report an error here.

```
ALTER TABLE partitioned_table DROP PARTITION(ds = '');
OK
Time taken: 0.152 seconds
```

Given that creating / inserting / querying partitions with an empty string is not allowed, letting DROP PARTITION go through seems like inconsistent behavior to me. It might have made sense if regexes were supported, but per the Hive language specification a partition spec has to be a plain string. If there is no way to create a partition with an empty partition column value, allowing DROP seems weird. +1 for throwing an exception, unless the general consensus about Hive compatibility is to match its exact behavior (including such weirdness).

```
INSERT OVERWRITE TABLE partitioned_table PARTITION(ds = '') SELECT key AS user_id, value AS name FROM src;
FAILED: SemanticException [Error 10006]: Line 1:49 Partition not found ''''

ALTER TABLE partitioned_table ADD PARTITION(ds = '');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. partition spec is invalid; field ds does not exist or is empty

DESC FORMATTED partitioned_table PARTITION(ds = '');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. cannot find field null from [0:user_id, 1:name]

TRUNCATE TABLE partitioned_table PARTITION(ds = '');
FAILED: SemanticException [Error 10006]: Partition not found {ds=}
```

Member Author

@tejasapatil Thank you for your research

So far, we do not completely follow Hive in the partition-related DDL commands; DROP PARTITION is an example. If the user-specified spec does not exist, we throw an exception, whereas Hive just silently ignores it without any exception, although Hive always reports which partitions were dropped after the command. We can improve this in a future PR.

Thus, this PR follows the same approach to block invalid inputs: it throws an exception when the input partition spec is not valid.
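
As a side note on the DROP PARTITION difference: Spark already offers standard DDL to opt into lenient behavior for non-existent partitions, reusing the table from the example above:

```Scala
// Dropping a non-existent partition throws an AnalysisException in Spark
// unless IF EXISTS is specified.
spark.sql("ALTER TABLE partitionedTable DROP IF EXISTS PARTITION (partCol1='b')")
```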

@SparkQA commented Jan 14, 2017

Test build #71362 has finished for PR 16583 at commit 1719286.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
@@ -937,10 +985,22 @@ class SessionCatalogSuite extends PlanTest {

  test("list partitions with invalid partial partition spec") {
```
Member Author

The test above verifies catalog.listPartitionNames, and this one verifies catalog.listPartitions. Should we keep them separate?

```
@@ -568,7 +569,9 @@ private[hive] class HiveClientImpl(
     val hiveTable = toHiveTable(table)
     val parts = spec match {
       case None => shim.getAllPartitions(client, hiveTable).map(fromHivePartition)
-      case Some(s) => client.getPartitions(hiveTable, s.asJava).asScala.map(fromHivePartition)
+      case Some(s) =>
+        assert(s.values.forall(_.nonEmpty), s"partition spec '$s' is invalid")
```
Contributor

shall we also add the assert in getPartitionNames?

Member Author

Yeah, it has the same issue.
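
A sketch of the same guard applied before listing partition names; the method shape below is illustrative rather than the exact `HiveClientImpl` code:

```Scala
// Illustrative: validate the (possibly partial) spec before delegating
// to the Hive shim, mirroring the assert added to getPartitions above.
def getPartitionNames(partialSpec: Option[Map[String, String]]): Seq[String] = {
  partialSpec.foreach { s =>
    assert(s.values.forall(_.nonEmpty), s"partition spec '$s' is invalid")
  }
  Seq.empty // placeholder for the actual metastore call
}
```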

@SparkQA commented Jan 17, 2017

Test build #71479 has started for PR 16583 at commit f1b6fe0.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Jan 17, 2017

Test build #71505 has finished for PR 16583 at commit f1b6fe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

LGTM, merging to master/2.1!

asfgit pushed a commit that referenced this pull request Jan 17, 2017
…partition spec

Author: gatorsmile <[email protected]>

Closes #16583 from gatorsmile/disallowEmptyPartColValue.

(cherry picked from commit a23debd)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit closed this in a23debd Jan 17, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…partition spec

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…partition spec