
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail #16060

Closed
wants to merge 3 commits into apache:master from cloud-fan:varchar

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read a varchar column in a Hive table, which means we need to know the actual column type on the Hive side.

In Spark 2.1, after #14363, we parse the Hive type string into a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use a string `ObjectInspector` to read a varchar column, and fail.

This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when converting the field back to a Hive column.
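
In code, the idea looks roughly like this (a minimal sketch of the approach, not the exact patch; `HIVE_TYPE_STRING` is the metadata key this PR introduces, the remaining names and the example type are illustrative):

```scala
import org.apache.spark.sql.types._

// Record the raw Hive type (e.g. "varchar(100)") in the field's metadata
// so it survives the mapping to Spark SQL's StringType.
val metadata = new MetadataBuilder()
  .putString("HIVE_TYPE_STRING", "varchar(100)")
  .build()

// Spark SQL still sees the column as a plain string ...
val field = StructField("name", StringType, nullable = true, metadata)

// ... but when converting the field back to a Hive column we can recover
// the original type instead of the erased Catalyst type.
val hiveType =
  if (field.metadata.contains("HIVE_TYPE_STRING")) {
    field.metadata.getString("HIVE_TYPE_STRING")
  } else {
    field.dataType.simpleString // "string"
  }
```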

How was this patch tested?

A newly added regression test.
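
The shape of such a test might be the following (a hedged sketch, not the exact test added here; it assumes a Hive-enabled `spark` session and the `checkAnswer` helper from Spark's `QueryTest`):

```scala
import org.apache.spark.sql.Row

// Create an ORC table with a varchar column, then read it back through
// Spark SQL. Before this fix the read path could hand a string
// ObjectInspector to the varchar column and fail.
spark.sql("CREATE TABLE varchar_orc (c VARCHAR(10)) STORED AS ORC")
spark.sql("INSERT INTO TABLE varchar_orc SELECT 'hello'")
checkAnswer(spark.table("varchar_orc"), Row("hello"))
```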

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69325 has finished for PR 16060 at commit 71c9dea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan changed the title from [SPARK-17897][SQL] read Hive orc table with varchar column should not fail to [SPARK-18220][SQL] read Hive orc table with varchar column should not fail on Nov 29, 2016
@SparkQA

SparkQA commented Nov 29, 2016

Test build #69327 has finished for PR 16060 at commit 419fc79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -51,9 +51,12 @@ private[spark] object HiveUtils extends Logging {
    sc
  }

-  /** The version of hive used internally by Spark SQL. */
+  // The version of hive used internally by Spark SQL.

why did you change the comment style?

   val hiveExecutionVersion: String = "1.2.1"

+  // The property key that is used to store the raw hive type string in the metadata of StructField.

I'd add a bit more color here, e.g. by adding an example: "For example, in the case where the Hive type is varchar, the type gets mapped to a string type in Spark SQL, but we need to preserve the original type in order to invoke the correct object inspector in Hive"
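
Applied to the property-key comment under review, the suggestion could read something like this (the wording and the constant name are illustrative, not necessarily the merged code):

```scala
// The property key that is used to store the raw Hive type string in the
// metadata of StructField. For example, in the case where the Hive type is
// varchar, the type gets mapped to a string type in Spark SQL, but we need
// to preserve the original type in order to invoke the correct object
// inspector in Hive.
val hiveTypeString: String = "HIVE_TYPE_STRING"
```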

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69389 has started for PR 16060 at commit 8b697be.

@SparkQA

SparkQA commented Nov 30, 2016

Test build #3446 has finished for PR 16060 at commit 8b697be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 30, 2016

Merging in master/branch-2.1.

asfgit closed this in 3f03c90 on Nov 30, 2016
asfgit pushed a commit that referenced this pull request Nov 30, 2016
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.

(cherry picked from commit 3f03c90)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

asfgit pushed a commit that referenced this pull request Feb 10, 2017
…tadata

## What changes were proposed in this pull request?
Reading from an existing ORC table that contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column.

This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see #16060 for more details on how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes #16804 from hvanhovell/SPARK-19459.
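
On the read side, the preserved type string is what allows the correct `ObjectInspector` to be built. A hedged sketch using Hive's own type utilities (not the exact code in either patch):

```scala
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils

// Given the raw type string preserved under HIVE_TYPE_STRING, Hive's
// utilities yield the matching inspector, so a varchar(10) column is
// inspected as varchar rather than string, avoiding the ClassCastException.
val typeInfo = TypeInfoUtils.getTypeInfoFromTypeString("varchar(10)")
val inspector = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(typeInfo)
```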
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…tadata
