-
Notifications
You must be signed in to change notification settings - Fork 308
Alias and Doc support in Spark Schema #51
base: master
Are you sure you want to change the base?
Conversation
Can someone review this? Please give some suggestion. |
|
||
To include aliases column in scheme invoke the method `addAvroAliasColumns()` of DataFrame. | ||
With alias while saving `saveAsAvroFile`, alias columns in DataFrame schema will not be include in Avro file. | ||
Alias will be available in `aliases` of avro schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it stored as "aliases" or "_aliases"?
I love the idea! And it seems to be working just fine. A few suggestions...
|
@@ -158,7 +158,12 @@ object AvroSaver { | |||
|
|||
while (convertersIterator.hasNext) { | |||
val converter = convertersIterator.next() | |||
record.put(fieldNamesIterator.next(), converter(rowIterator.next())) | |||
val fieldName = fieldNamesIterator.next() | |||
if(schema.getField(fieldName) != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nit: space after if
.
Thanks for working on this.
|
@jendap Implemented review comments @JoshRosen Implemented review comments Build is failing due to some other reason. Kindly check |
@@ -29,6 +29,9 @@ import org.apache.avro.Schema.Type._ | |||
*/ | |||
private object SchemaConverters { | |||
|
|||
val METADATA_KEY_DOC = "doc"; | |||
val METADATA_KEY_ALIASES = "aliases"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good, you can put even the "_parent" here as a constant
@jendap Implemented your review comments. Please review. |
Current coverage is
|
I realize its optional, but implementing this feature as a scala function means that users in Python and R will not be able to use it. I think instead this should just be a parameter that is passed in the |
@marmbrus Implemented review comments. Added a option. Please review |
*/ | ||
implicit class AvroDataFrame(dataFrame: DataFrame) { | ||
def addAvroAliasColumns(): DataFrame = SchemaConverters.dataFrameWithAliasColumn(dataFrame) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get rid of this entirely? I'd prefer to only have to support a single unified way to do this.
|
||
Avro schema's field 'doc' will be preserved in DataFrame's schema metadata. | ||
|
||
To include aliases column in scheme invoke |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"To include extra columns when aliases are specified in the schema, set withAlias
to true
, as follows:"
I would also call the property withAliases
, since we handle more than one.
Thanks again for working on this! I did another review pass. |
@marmbrus implemented review comments. Except the usage of "withAliasesFields" of in buildScan(...) of AvroRelation.scala, I was getting: |
What happened with this PR? Are you planning include this in future releases? It would be useful support aliases. |
Preserves avro doc and aliases while save
Include avro alias column in DataFrame schema