[SPARK-18220][SQL] read Hive orc table with varchar column should not fail #16060
Conversation
Test build #69325 has finished for PR 16060 at commit
Test build #69327 has finished for PR 16060 at commit
```diff
@@ -51,9 +51,12 @@ private[spark] object HiveUtils extends Logging {
     sc
   }

-  /** The version of hive used internally by Spark SQL. */
+  // The version of hive used internally by Spark SQL.
```
why did you change the comment style?
```diff
   val hiveExecutionVersion: String = "1.2.1"

+  // The property key that is used to store the raw hive type string in the metadata of StructField.
```
I'd add a bit more color here, e.g. by adding an example: "For example, in the case where the Hive type is varchar, the type gets mapped to a string type in Spark SQL, but we need to preserve the original type in order to invoke the correct object inspector in Hive"
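To illustrate the suggestion above, here is a minimal sketch in Scala (the `HIVE_TYPE_STRING` key comes from the follow-up commit below; the field name and varchar length are made up): a Hive `varchar` column is represented as `StringType` in Spark SQL, while the raw Hive type string is preserved in the field's metadata so the correct `ObjectInspector` can be chosen later.

```scala
import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField}

// A Hive column declared as varchar(100) has no direct Catalyst equivalent,
// so it is mapped to StringType. The raw Hive type string is stashed in the
// StructField's metadata so it can be recovered when talking to Hive.
val metadata = new MetadataBuilder()
  .putString("HIVE_TYPE_STRING", "varchar(100)")
  .build()
val field = StructField("name", StringType, nullable = true, metadata)
```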
Test build #69389 has started for PR 16060 at commit
Test build #3446 has finished for PR 16060 at commit
Merging in master/branch-2.1.
[SPARK-18220][SQL] read Hive orc table with varchar column should not fail

Author: Wenchen Fan <[email protected]>

Closes #16060 from cloud-fan/varchar.

(cherry picked from commit 3f03c90)
Signed-off-by: Reynold Xin <[email protected]>
[SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata

## What changes were proposed in this pull request?

Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column. This PR fixes this by adding the Hive type to the `StructField`'s metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader; see #16060 for more details on how the metadata is used.

## How was this patch tested?

Added a regression test to `OrcSourceSuite`.

Author: Herman van Hovell <[email protected]>

Closes #16804 from hvanhovell/SPARK-19459.
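As a hedged reproduction sketch of the failure mode these commits describe (table name and values are made up; assumes a Hive-enabled `SparkSession` named `spark`):

```scala
// Create a Hive ORC table with a varchar column and read it back. Before this
// fix, Spark could pick a string ObjectInspector for the varchar column and
// fail with a ClassCastException while reading.
spark.sql("CREATE TABLE varchar_repro (c VARCHAR(10)) STORED AS ORC")
spark.sql("INSERT INTO varchar_repro VALUES ('hello')")
spark.table("varchar_repro").show()
```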
## What changes were proposed in this pull request?

Spark SQL only has `StringType`, so when reading a Hive table with a varchar column, we read that column as `StringType`. However, we still need to use a varchar `ObjectInspector` to read a varchar column in a Hive table, which means we need to know the actual column type on the Hive side.

In Spark 2.1, after #14363, we parse the Hive type string into a Catalyst type, which means the actual column type on the Hive side is erased. Then we may use a string `ObjectInspector` to read a varchar column and fail.

This PR keeps the original Hive column type string in the metadata of `StructField`, and uses it when converting the field to a Hive column.

## How was this patch tested?

Newly added regression test.
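A minimal sketch of the retrieval side, reusing the `HIVE_TYPE_STRING` metadata key mentioned above (the helper is illustrative, not Spark's actual method): when converting a Catalyst field back to a Hive column, the preserved raw type string is preferred if present.

```scala
import org.apache.spark.sql.types.StructField

// Illustrative helper: recover the Hive-side type for a field, falling back
// to the Catalyst type's catalog string when no raw Hive type was preserved.
def hiveColumnType(field: StructField): String =
  if (field.metadata.contains("HIVE_TYPE_STRING")) {
    field.metadata.getString("HIVE_TYPE_STRING") // e.g. "varchar(100)"
  } else {
    field.dataType.catalogString // e.g. "string"
  }
```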