-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-10136] [SQL] A more robust fix for SPARK-10136 #8361
Conversation
Although #8341 is hacky, it's still a valid fix. I won't list this PR as a blocker for 1.5 at this time since RC1 is already cut. But I'd like to have this merged into branch-1.5 if there's RC2. |
(No more test cases added in this PR because those added in #8341 already cover this issue.) |
Test build #41376 has finished for PR 8361 at commit
|
Test build #41407 has finished for PR 8361 at commit
|
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <[email protected]> Closes #8361 from liancheng/spark-10136/proper-version. (cherry picked from commit bf03fe6) Signed-off-by: Cheng Lian <[email protected]>
Merged to master and branch-1.5. |
PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of
LIST
backwards-compatibility rules. Let me have a try to give an explanation here.The structure of the problematic Parquet schema generated by parquet-avro is something like this:
(The schema generated by parquet-thrift is structurally similar, just replace the
array
at level 2 withf_tuple
, and the other one at level 3 withf_tuple_tuple
.)This structure consists of two nested legacy 2-level
LIST
-like structures:The repeated group type at level 2 is the element type of the outer array defined at level 1
This group should map to an
CatalystArrayConverter.ElementConverter
when building converters.The repeated primitive type at level 3 is the element type of the inner array defined at level 2
This group should also map to an
CatalystArrayConverter.ElementConverter
.The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a
RepeatedPrimitiveConverter
instead of aCatalystArrayConverter.ElementConverter
is created for it.According to parquet-format spec, unannotated repeated type shouldn't appear in a
LIST
- orMAP
-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear inLIST
-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.)As for the reason why level 2 isn't recognized as a list element type, it's because of the following
LIST
backwards-compatibility rule defined in the parquet-format spec:(The
array
part is for parquet-avro compatibility, while the_tuple
part is for parquet-thrift.)This rule is implemented in
CatalystSchemaConverter.isElementType
, but neglected inCatalystRowConverter.isElementType
. This PR delivers a more robust fix by adding this rule in the latter method.Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at PARQUET-364.