This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Add support for complex union types #117

Closed
wants to merge 2 commits into from

Conversation

Contributor

@jasonxh jasonxh commented Feb 10, 2016

A complex union will be converted to a struct type whose field names are member0, member1, etc., corresponding to the members of the union. This is consistent with the behavior when reading Parquet files. Field values are resolved following the existing schema translation rules; the test case is a good example of this translation.
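As a rough sketch of the naming scheme described above (illustrative only, not spark-avro's actual code; it assumes a `null` member only makes the column nullable and is not itself numbered):

```python
# Sketch of the complex-union naming scheme (not spark-avro's implementation).
# Each non-null union member becomes a struct field member0, member1, ...
def complex_union_field_names(members):
    """Return one struct field name per non-null union member, in order."""
    non_null = [m for m in members if m != "null"]
    return [f"member{i}" for i in range(len(non_null))]

# e.g. a hypothetical union(int, string, SomeRecord)
print(complex_union_field_names(["int", "string", "SomeRecord"]))
# ['member0', 'member1', 'member2']
```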


codecov-io commented Feb 10, 2016

Current coverage is 90.99% (diff: 86.66%)

Merging #117 into master will decrease coverage by 0.45%

@@             master       #117   diff @@
==========================================
  Files             5          5          
  Lines           304        322    +18   
  Methods         269        270     +1   
  Messages          0          0          
  Branches         35         52    +17   
==========================================
+ Hits            278        293    +15   
- Misses           26         29     +3   
  Partials          0          0          

Powered by Codecov. Last update 16ed688...b59c82a
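The headline numbers follow from the table (hits divided by lines, rounded to two decimals):

```python
# Reproducing the Codecov summary from the table above.
master = 278 / 304   # hits / lines on master
pr = 293 / 322       # hits / lines with this PR applied

print(round(pr * 100, 2))              # 90.99 (current coverage)
print(round((master - pr) * 100, 2))   # 0.45  (coverage decrease)
```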


sweb commented Apr 19, 2016

Hey!

Are there any plans to merge this pull request in the next release? Since we are working with complex unions, it would be a great help to us.

Thanks!
Florian

juyttenh pushed a commit to kingcontext/spark-avro that referenced this pull request Jun 9, 2016
Contributor Author

jasonxh commented Oct 15, 2016

Rebased on top of the current master. In the meantime, I'll be maintaining patched artifacts for the 2.x and 3.x versions in this Bintray repo:
https://bintray.com/jasonxh/maven/spark-avro

@JoshRosen JoshRosen closed this Nov 22, 2016
@JoshRosen JoshRosen reopened this Nov 22, 2016
@JoshRosen JoshRosen self-assigned this Nov 22, 2016
@JoshRosen JoshRosen added this to the 3.1.0 milestone Nov 22, 2016
3. `union(something, null)`, where `something` is one of the supported Avro types listed above or is one of the supported `union` types.
1. `union(int, long)` will be mapped to `LongType`.
2. `union(float, double)` will be mapped to `DoubleType`.
3. `union(something, null)`, where `something` is any supported Avro type. This will be mapped to the same Spark SQL type as that of `something`, with `nullable` set to `true`.
Contributor

This makes me curious about the union(union(int, long), null) case. Let me quickly double-check that there's a test covering this corner case, to make sure the behavior hasn't changed here.

Contributor Author

Yep, this should work as expected.

Contributor

Yep, I verified this as well.

Contributor

@JoshRosen JoshRosen left a comment

I had two minor style nits, but otherwise this looks good, so I think I'll merge it as soon as I run my own tests. I noticed that the old behavior doesn't seem to be fully specified by unit tests, so I'll submit a follow-up PR that takes each verifiable statement in the README and adds a test case for it.

@@ -22,6 +22,8 @@ import java.nio.file.Files
import java.sql.Timestamp
import java.util.UUID

import org.apache.avro.generic.GenericData.{EnumSymbol, Fixed}
Contributor

This import should be grouped with the other Avro imports further down in this file.

Contributor Author

Moved.

2. `union(float, double)` will be mapped to `DoubleType`.
3. `union(something, null)`, where `something` is any supported Avro type. This will be mapped to the same Spark SQL type as that of `something`, with `nullable` set to `true`.

All other `union` types are considered complex. They will be mapped to `StructType`, where field names are `member0`, `member1`, etc., corresponding to the members of the `union`. This is consistent with the behavior when reading Parquet files.
Contributor

Thanks for remembering to add docs, by the way.
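The README's translation rules quoted above can be sketched as follows (an illustrative model over type-name strings, not spark-avro's implementation):

```python
# Sketch of the README's union-translation rules (illustrative only).
# `members` are Avro type names; returns (sql_type, nullable).
def union_to_sql_type(members):
    nullable = "null" in members
    members = [m for m in members if m != "null"]
    if len(members) == 1:
        sql = members[0]  # single remaining member: translate it directly
    elif set(members) == {"int", "long"}:
        sql = "LongType"
    elif set(members) == {"float", "double"}:
        sql = "DoubleType"
    else:
        # complex union -> StructType with fields member0, member1, ...
        fields = ", ".join(f"member{i}" for i in range(len(members)))
        sql = f"StructType({fields})"
    return sql, nullable

print(union_to_sql_type(["int", "long"]))     # ('LongType', False)
print(union_to_sql_type(["string", "null"]))  # ('string', True)
print(union_to_sql_type(["int", "string"]))   # ('StructType(member0, member1)', False)
```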

case other =>
  sqlType match {
    case t: StructType if t.fields.length == avroSchema.getTypes.size =>
      val fieldConverters = t.fields zip avroSchema.getTypes map {
Contributor

Minor style nit: we tend to prefer .zip(..) to infix notation.

Contributor Author

Changed.

    s"This mix of union types is not supported (see README): $other")
case _ =>
  // Convert complex unions to struct types where field names are member0, member1, etc.
  // This is consistent with the behavior when reading Parquet files.
Contributor

By "reading Parquet files", I guess you mean reading Avro records as Parquet messages using parquet-avro?

Contributor Author

Yes, I meant the Avro-to-Parquet conversion. I've updated the comments.

}
case _ => throw new IncompatibleSchemaException(
  s"Cannot convert Avro schema to catalyst type because schema at path " +
    s"${path.mkString(".")} is not compatible " +
Contributor

Nit: I'd prefer "not supported" instead of "not compatible". "Compatible" is a term usually used in scenarios like schema evolution/merging, while here it means Spark SQL doesn't recognize some specific Avro schema.

Contributor Author

I was trying to keep the original wording (there's another reference right below). I believe it refers to the incompatibility between the source Avro schema and the target Catalyst type. I'm fine with either wording; let me know and I can change or keep both references.

Contributor

I chatted with @liancheng about this and we think it's fine to keep the current wording.

@liancheng
Contributor

Also LGTM. Just two minor comments. Thanks!

@JoshRosen
Contributor

LGTM, so I'm going to merge this in to master now. Thanks for your contributions and patience, @jasonxh!

@JoshRosen JoshRosen closed this in 7945751 Nov 24, 2016
@jasonxh jasonxh deleted the hao/complex-union branch October 4, 2017 23:20