[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling #2716
Conversation
QA tests have started for PR 2716 at commit
QA tests have started for PR 2716 at commit
QA tests have finished for PR 2716 at commit
Test PASSed.
QA tests have finished for PR 2716 at commit
Test PASSed.
@davies I believe this PR also relates to the features discussed in SPARK-2870. Since you are already doing schema inference over multiple rows with an optional sampling ratio in this PR, how far are we from being able to do full schema inference on RDDs of
@nchammas This PR only fixes the problem of having empty values in the first few rows; it cannot handle different types for one field (like what json() does). Maybe we could support optional fields.
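The failure mode being fixed here can be sketched in plain Python. This is not Spark's actual code; `infer_field_type` and `infer_schema` are hypothetical names used for illustration. Inferring from only the first row leaves None-valued fields unresolved, while scanning the first 100 rows usually resolves them:

```python
# Illustrative sketch (not Spark code): why inferring a schema from only the
# first row fails when that row contains empty values, and how scanning the
# first N rows fixes it.

def infer_field_type(value):
    """Map a Python value to a simple type name; None tells us nothing."""
    if value is None:
        return None            # analogous to NullType: type still unknown
    return type(value).__name__

def infer_schema(rows, n=100):
    """Merge types over the first n rows instead of trusting row 0 alone."""
    merged = {}
    for row in rows[:n]:
        for key, value in row.items():
            if merged.get(key) is None:
                merged[key] = infer_field_type(value)  # fill unknown slots
    return merged

rows = [{"a": None, "b": 1}, {"a": "x", "b": 2}]
print(infer_schema(rows, n=1))    # {'a': None, 'b': 'int'} -- 'a' unresolved
print(infer_schema(rows, n=100))  # {'a': 'str', 'b': 'int'} -- resolved
```

A field that is still unresolved after the first 100 rows is exactly the case where this PR warns the user to retry with a sampling ratio.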
QA tests have started for PR 2716 at commit
QA tests have finished for PR 2716 at commit
Test PASSed.
Test FAILed.
Test FAILed.
QA tests have started for PR 2716 at commit
That sounds good to me! Schema inference is a powerful feature, and it would be good to see it leverage this type of schema merging logic wherever possible.
QA tests have finished for PR 2716 at commit
QA tests have started for PR 2716 at commit
It would be cool to reuse the schema merging code from JsonRDD, but I have some concerns:
For example, if a user has two rows:
JsonRDD will infer a merged type for the field, but SchemaRDD expects that each column has the same type, so I'd like to raise an exception during inference, not at runtime. It's better to let the user deal with the dirty data types before inferring the schema.
If a user really wants the behavior that JsonRDD provides, it's easy to call
Do these seem reasonable?
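The design choice described above, failing at inference time rather than at runtime when sampled rows disagree on a column's type, can be sketched as follows. `merge_types` is an illustrative name, not Spark's implementation:

```python
# Hypothetical sketch: merge two inferred types for the same field.
# None plays the role of NullType (no information yet); a genuine
# conflict raises immediately, during inference, instead of surfacing
# later as a runtime conversion error.

def merge_types(t1, t2):
    if t1 is None:
        return t2              # NullType is absorbed by any concrete type
    if t2 is None:
        return t1
    if t1 != t2:
        raise TypeError("conflicting types for one field: %s vs %s" % (t1, t2))
    return t1

print(merge_types("int", None))   # -> 'int'
try:
    merge_types("int", "str")
except TypeError as e:
    print(e)                      # conflict surfaces at inference time
```

JsonRDD instead widens conflicting types (ultimately to string), which is the behavior a user could still opt into by going through json() explicitly.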
Ah yeah, I'm assuming that automatically converting objects to the inferred type is a good thing. When a user asks for the schema to be inferred, automatic type conversion should be expected behavior. I guess my perspective is that inferring schema is most useful when the schema is not consistent or known in advance. It should not just be a convenient way to get the schema for a dataset that has a consistent schema, though it can serve that purpose too.
Definitely. That's precisely the workaround I suggested in SPARK-2870, though it has the obvious JSON ser/de performance cost and seems "hacky".
QA tests have finished for PR 2716 at commit
Oh, I see. If inferring types and converting objects need to be done in Python, it seems it will be hard to reuse the code in
QA tests have started for PR 2716 at commit
QA tests have started for PR 2716 at commit
Test build #433 has finished for PR 2716 at commit
Test build #451 has started for PR 2716 at commit
Test build #451 has finished for PR 2716 at commit
Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
using dict is deprecated.
If some of rows has different types with inferred types, it may cause
runtime exceptions.
When samplingRatio is specified, the schema is inferred by looking at the types of each row in the sampled dataset. Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.
Each row could be L{pyspark.sql.Row} object or namedtuple or objects. Using top level dicts is deprecated, as this datatype is used to represent Maps.
If a single column has multiple distinct inferred types, it may cause runtime exceptions.
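As a loose illustration of the nested-collection support the docstring describes, a recursive walk can map nested Python values to nested type tags. The tuple-based tags below are simplified stand-ins for Spark SQL's ArrayType/MapType classes, not the real implementation:

```python
# Hedged sketch of nested type inference: dicts map to a Map-like type,
# sequences to an Array-like type, and everything else to its scalar
# type name. The tag names are illustrative, not Spark SQL's classes.

def infer_type(obj):
    if obj is None:
        return "NullType"
    if isinstance(obj, dict):            # dicts represent Maps
        v = next(iter(obj.values()), None)
        return ("MapType", infer_type(v))
    if isinstance(obj, (list, tuple)):   # sequences represent Arrays
        e = obj[0] if obj else None
        return ("ArrayType", infer_type(e))
    return type(obj).__name__

print(infer_type({"a": [1, 2]}))   # ('MapType', ('ArrayType', 'int'))
print(infer_type([]))              # ('ArrayType', 'NullType')
```

This also shows why top-level dicts are deprecated for rows: a dict is reserved for representing a Map value, so rows are better expressed as Row objects, namedtuples, or plain objects.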
Minor comment on documentation wording. Otherwise this LGTM!
@marmbrus fixed, thanks!
Test build #22273 has started for PR 2716 at commit
Test build #22273 has finished for PR 2716 at commit
Test PASSed.
Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
Test build #22496 has started for PR 2716 at commit
Test build #22496 has finished for PR 2716 at commit
Test FAILed.
Test build #495 has started for PR 2716 at commit
Test build #495 has finished for PR 2716 at commit
Ping dashboard
Sorry for the delay here. If you can merge I'll try to squeeze this into 1.2. Thanks!
Conflicts: python/pyspark/sql.py sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
@marmbrus waiting for Jenkins
Test build #22792 has started for PR 2716 at commit
Test build #508 has started for PR 2716 at commit
Test build #508 has finished for PR 2716 at commit
Test build #22792 has finished for PR 2716 at commit
Test PASSed.
@marmbrus It's ready to go
Thanks! Merged to master. |
This patch will try to infer a schema for an RDD which has empty values (None, [], {}) in the first row. It will try the first 100 rows and merge the types into a schema, also merging the fields of StructTypes together. If there is still a NullType in the schema, it will show a warning telling the user to try again with sampling. If sampling is used, it will infer the schema from all the rows after sampling. Also, add samplingRatio for jsonFile() and jsonRDD().

Author: Davies Liu <[email protected]>
Author: Davies Liu <[email protected]>

Closes #2716 from davies/infer and squashes the following commits:

e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD

(cherry picked from commit 24544fb)
Signed-off-by: Michael Armbrust <[email protected]>
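The sampling path in the description can be sketched as a Bernoulli sample over all rows; `sample_rows` is a hypothetical helper, and Spark's actual `RDD.sample` differs in detail (it operates per partition, without materializing the data):

```python
# Illustrative sketch of sampling-based inference: with a samplingRatio,
# a random subset of ALL rows feeds type inference, instead of only the
# first 100 rows. Seeded for reproducibility; names are hypothetical.
import random

def sample_rows(rows, ratio, seed=42):
    """Keep each row independently with probability `ratio`."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < ratio]

rows = list(range(1000))
sampled = sample_rows(rows, 0.1)
print(len(sampled))  # roughly 100 rows survive a 0.1 ratio
```

Scanning a sample of the whole dataset is what lets inference see a non-empty value for a field that happens to be None in every one of the first 100 rows.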