Skip to content
This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Alias and Doc support in Spark Schema #51

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

vasanthrj
Copy link

Preserves avro doc and aliases while save
Include avro alias column in DataFrame schema

@vasanthrj
Copy link
Author

Can someone review this? Please give some suggestion.


To include aliases column in scheme invoke the method `addAvroAliasColumns()` of DataFrame.
With alias while saving `saveAsAvroFile`, alias columns in DataFrame schema will not be include in Avro file.
Alias will be available in `aliases` of avro schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it stored as "aliases" or "_aliases"?

@jendap
Copy link
Contributor

jendap commented Jul 19, 2015

I love the idea! And it seems to be working just fine.

A few suggestions...

  1. Let's ask @marmbrus if "_doc" and "_aliases" are a good naming convention. It would be nice to make it consistent with other formats supporting it.

  2. Is the implicit addAvroAliasColumns a good choice? It would be on every DataFrame - even after modifying columns and wiping out their metadata. Would it make sense to make it a property on SQLContext the same way as parquet settings are done currently?

  3. It would be nice to add an assert to the place where files are read after saving. Check that doc and aliases made the loop to disk and back.

  4. Would it not be enough and more readable to add the doc and alias on one field in the test only? It does not have to be tested on all the different fields, right? It would save some 100 lines of code in tests ;-)

@@ -158,7 +158,12 @@ object AvroSaver {

while (convertersIterator.hasNext) {
val converter = convertersIterator.next()
record.put(fieldNamesIterator.next(), converter(rowIterator.next()))
val fieldName = fieldNamesIterator.next()
if(schema.getField(fieldName) != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: space after if.

@marmbrus
Copy link
Contributor

Thanks for working on this.

  • I'd consider making the addition of the alias columns just an option to the datasource, so that it is usable in languages other than scala.
  • Other users of metadata don't use _ so i'd just go with doc and aliases

@vasanthrj
Copy link
Author

@jendap Implemented review comments
Suggestion reply:
2. I added to dataframe since dataframe know the schema and can store metadata.Where aliases/doc are belong to a column data. I think parquet properties are related to deserialization, data type handling.

@JoshRosen Implemented review comments
@marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()

Build is failing due to some other reason. Kindly check

@@ -29,6 +29,9 @@ import org.apache.avro.Schema.Type._
*/
private object SchemaConverters {

val METADATA_KEY_DOC = "doc";
val METADATA_KEY_ALIASES = "aliases";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is good, you can put even the "_parent" here as a constant

@vasanthrj
Copy link
Author

@jendap Implemented your review comments. Please review.

@codecov-io
Copy link

Current coverage is 93.79%

Merging #51 into master will increase coverage by +0.39% as of ccc433e

@@            master     #51   diff @@
======================================
  Files            6       6       
  Stmts          273     306    +33
  Branches        45      57    +12
  Methods          0       0       
======================================
+ Hit            255     287    +32
  Partial          0       0       
- Missed          18      19     +1

Review entire Coverage Diff as of ccc433e

Powered by Codecov. Updated on successful CI builds.

@marmbrus
Copy link
Contributor

@marmbrus adding alias column is optional where alias columns added after invoking addAvroAliasColumns()

I realize its optional, but implementing this feature as a scala function means that users in Python and R will not be able to use it. I think instead this should just be a parameter that is passed in the OPTIONS clause or as an option(...) so that it works across languages.

@vasanthrj
Copy link
Author

@marmbrus Implemented review comments. Added a option. Please review

*/
implicit class AvroDataFrame(dataFrame: DataFrame) {
def addAvroAliasColumns(): DataFrame = SchemaConverters.dataFrameWithAliasColumn(dataFrame)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of this entirely? I'd prefer to only have to support a single unified way to do this.


Avro schema's field 'doc' will be preserved in DataFrame's schema metadata.

To include aliases column in scheme invoke
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"To include extra columns when aliases are specified in the schema, set withAlias to true, as follows:"

I would also call the property withAliases, since we handle more than one.

@marmbrus
Copy link
Contributor

Thanks again for working on this! I did another review pass.

@vasanthrj
Copy link
Author

@marmbrus implemented review comments. Except the usage of "withAliasesFields" of in buildScan(...) of AvroRelation.scala, I was getting:
object not serializable (class: org.apache.avro.Schema$RecordSchema,

@gasparms
Copy link

gasparms commented Nov 22, 2017

What happened with this PR? Are you planning include this in future releases? It would be useful support aliases.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants