
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail #16060

Closed
wants to merge 3 commits into apache:master from cloud-fan:varchar

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use the varchar `ObjectInspector` to read a varchar column in a Hive table, which means we need to know the actual column type on the Hive side.

In Spark 2.1, after #14363, we parse the Hive type string into a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use a string `ObjectInspector` to read a varchar column, and fail.

This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when converting the field back to a Hive column.
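
In code, the idea looks roughly like this (a minimal sketch of the approach, not the exact patch; `HIVE_TYPE_STRING` is the metadata key this PR introduces, the remaining names and the example type are illustrative):

```scala
import org.apache.spark.sql.types._

// Record the raw Hive type (e.g. "varchar(100)") in the field's metadata
// so it survives the mapping to Spark SQL's StringType.
val metadata = new MetadataBuilder()
  .putString("HIVE_TYPE_STRING", "varchar(100)")
  .build()

// Spark SQL still sees the column as a plain string ...
val field = StructField("name", StringType, nullable = true, metadata)

// ... but when converting the field back to a Hive column we can recover
// the original type instead of the erased Catalyst type.
val hiveType =
  if (field.metadata.contains("HIVE_TYPE_STRING")) {
    field.metadata.getString("HIVE_TYPE_STRING")
  } else {
    field.dataType.simpleString // "string"
  }
```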

How was this patch tested?

A newly added regression test.
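
The shape of such a test might be the following (a hedged sketch, not the exact test added here; it assumes a Hive-enabled `spark` session and the `checkAnswer` helper from Spark's `QueryTest`):

```scala
import org.apache.spark.sql.Row

// Create an ORC table with a varchar column, then read it back through
// Spark SQL. Before this fix the read path could hand a string
// ObjectInspector to the varchar column and fail.
spark.sql("CREATE TABLE varchar_orc (c VARCHAR(10)) STORED AS ORC")
spark.sql("INSERT INTO TABLE varchar_orc SELECT 'hello'")
checkAnswer(spark.table("varchar_orc"), Row("hello"))
```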

@cloud-fan
Contributor Author

cc @yhuai @gatorsmile

@SparkQA

SparkQA commented Nov 29, 2016

Test build #69325 has finished for PR 16060 at commit 71c9dea.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan changed the title from [SPARK-17897][SQL] read Hive orc table with varchar column should not fail to [SPARK-18220][SQL] read Hive orc table with varchar column should not fail on Nov 29, 2016
@SparkQA

SparkQA commented Nov 29, 2016

Test build #69327 has finished for PR 16060 at commit 419fc79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -51,9 +51,12 @@ private[spark] object HiveUtils extends Logging {
    sc
  }

-  /** The version of hive used internally by Spark SQL. */
+  // The version of hive used internally by Spark SQL.

why did you change the comment style?

   val hiveExecutionVersion: String = "1.2.1"

+  // The property key that is used to store the raw hive type string in the metadata of StructField.

I'd add a bit more color here, e.g. by adding an example: "For example, in the case where the Hive type is varchar, the type gets mapped to a string type in Spark SQL, but we need to preserve the original type in order to invoke the correct object inspector in Hive"
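
Applied to the property-key comment under review, the suggestion could read something like this (the wording and the constant name are illustrative, not necessarily the merged code):

```scala
// The property key that is used to store the raw Hive type string in the
// metadata of StructField. For example, in the case where the Hive type is
// varchar, the type gets mapped to a string type in Spark SQL, but we need
// to preserve the original type in order to invoke the correct object
// inspector in Hive.
val hiveTypeString: String = "HIVE_TYPE_STRING"
```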

@SparkQA

SparkQA commented Nov 30, 2016

Test build #69389 has started for PR 16060 at commit 8b697be.

@SparkQA

SparkQA commented Nov 30, 2016

Test build #3446 has finished for PR 16060 at commit 8b697be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Nov 30, 2016

Merging in master/branch-2.1.

asfgit closed this in 3f03c90 on Nov 30, 2016
asfgit pushed a commit that referenced this pull request Nov 30, 2016
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.

(cherry picked from commit 3f03c90)
Signed-off-by: Reynold Xin <[email protected]>
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

asfgit pushed a commit that referenced this pull request Feb 10, 2017
…tadata

## What changes were proposed in this pull request?
Reading from an existing ORC table that contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column.

This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see #16060 for more details on how the metadata is used.

## How was this patch tested?
Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes #16804 from hvanhovell/SPARK-19459.
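
On the read side, the preserved type string is what allows the correct `ObjectInspector` to be built. A hedged sketch using Hive's own type utilities (not the exact code in either patch):

```scala
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils

// Given the raw type string preserved under HIVE_TYPE_STRING, Hive's
// utilities yield the matching inspector, so a varchar(10) column is
// inspected as varchar rather than string, avoiding the ClassCastException.
val typeInfo = TypeInfoUtils.getTypeInfoFromTypeString("varchar(10)")
val inspector = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(typeInfo)
```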
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
…tadata
