-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6775] [SPARK-6776] [SQL] [WIP] Refactors converters for converting Parquet records to row objects #5422
Conversation
@@ -202,6 +202,7 @@ final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableR | |||
case DoubleType => new MutableDouble | |||
case BooleanType => new MutableBoolean | |||
case LongType => new MutableLong | |||
case DateType => new MutableInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dates are represented as integers inside Catalyst. Since SpecificMutableRow
is only used inside Catalyst, it should be safe to specialize DateType
here.
Test build #29858 has finished for PR 5422 at commit
|
Test build #29859 has finished for PR 5422 at commit
|
02d78c4
to
2529b76
Compare
Test build #29868 has finished for PR 5422 at commit
|
Test build #29869 has finished for PR 5422 at commit
|
Test build #31021 has finished for PR 5422 at commit
|
Closing this as it's superceded by #6617 and upcoming follow-up PRs. |
This PR simplifies the
CatalystConverter
class hierarchy, and implements backwards-compatibility rules for them. The new class hierarchy basically resemblesAvroIndexedRecordConverter
in parquet-avro.This is still in WIP status. I'm opening this PR to check whether it passes Jenkins.
TODO:
Major changes:
Replaces
CatalystConverter
trait with a much simplerParentContainerUpdater
.Now instead of extending the original
CatalystConverter
trait, every converter class accepts an updater whose's responsible to set the converted value to some parent container. For example, appending array elements to a parent array buffer, appending a key-value pairs to a parent mutable map, or setting a converted value to some specific field of a parent row.This simplifies the design since converters don't need to care about details of their parent converters anymore.
Replaces
CatalystRootConverter
,CatalystGroupConverter
andCatalystPrimitiveRowConverter
withCatalystRowConverter
Specifically, now all row objects are represented with
SpecificMutableRow
during conversion.Removes
CatalystArrayContainsNullConverter
andCatalystNativeArrayConverter
, revampedCatalystArrayConverter
CatalystNativeArrayConverter
should be designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.The new
CatalystArrayConverter
handles both nullable and non-nullable array elements.Implemented backwards-compatibility rules in
CatalystArrayConverter
When Parquet records are being converted, schema of Parquet files should have already been verified. So we only need to care about the structure rather than field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see backwards-compatibility rules for MAP), we only need to deal with LIST (namely array) here.