[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling #2716

davies · 2014-10-08T19:34:56Z

This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.

If sampling is presented, it will infer schema from all the rows after sampling.

Also, add samplingRatio for jsonFile() and jsonRDD()

SparkQA · 2014-10-08T19:39:42Z

QA tests have started for PR 2716 at commit 3603e00.

This patch merges cleanly.

SparkQA · 2014-10-08T19:44:32Z

QA tests have started for PR 2716 at commit f93fd84.

This patch merges cleanly.

SparkQA · 2014-10-08T20:43:27Z

QA tests have finished for PR 2716 at commit 3603e00.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(DataType):

AmplabJenkins · 2014-10-08T20:43:30Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21480/Test PASSed.

SparkQA · 2014-10-08T20:48:00Z

QA tests have finished for PR 2716 at commit f93fd84.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(DataType):

AmplabJenkins · 2014-10-08T20:48:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21481/Test PASSed.

nchammas · 2014-10-08T22:23:31Z

@davies I believe this PR also relates to the features discussed in SPARK-2870. Since you are already doing schema inference over multiple rows with an optional sampling ratio in this PR, how far are we from being able to do full schema inference on RDDs of dicts?

davies · 2014-10-08T22:47:16Z

@nchammas This PR only fix the problem of having empty values in first few rows, it can not handle different types for one field (like what json() had done).

Maybe we could support optional fields.

SparkQA · 2014-10-08T23:34:38Z

QA tests have started for PR 2716 at commit 540d1d5.

This patch merges cleanly.

SparkQA · 2014-10-09T00:37:01Z

QA tests have finished for PR 2716 at commit 540d1d5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(DataType):

AmplabJenkins · 2014-10-09T00:37:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21489/Test PASSed.

AmplabJenkins · 2014-10-09T18:37:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21542/Test FAILed.

AmplabJenkins · 2014-10-09T18:42:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21543/Test FAILed.

SparkQA · 2014-10-09T18:58:45Z

QA tests have started for PR 2716 at commit 29e94d5.

This patch merges cleanly.

marmbrus · 2014-10-09T19:02:58Z

@davies thanks for working on this! I'm always a little embarrassed when I have to explain how the current code works :)

@yhuai, can you take a look at this? Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.

nchammas · 2014-10-09T19:07:58Z

Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.

That sounds good to me! Schema inference is powerful feature, and it would be good to see it leverage this type of schema merging logic wherever possible.

SparkQA · 2014-10-09T19:50:50Z

QA tests have finished for PR 2716 at commit 29e94d5.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):

SparkQA · 2014-10-09T20:45:10Z

QA tests have started for PR 2716 at commit 29e94d5.

This patch merges cleanly.

davies · 2014-10-09T21:03:30Z

It will be cool to reuse the schema merging code from JsonRDD, some concerns:

the schema merging logic is special designed for JSON, such as it convert all the conflict types into StringType. This works for JSON, but not the best solution for Python RDD.

For example, if user have two rows:

{a: 2} 
{a: 'abc'}

JsonRDD will infer type the of a as StringType, then it will have runtime exception when trying to access it as IntType.

SchemaRDD expects that each column should have same type, so I'd like to raise exception during inferring, not runtime. It's better to let user deal with the dirty datetypes before inferring schema .

it can not handle MapType (this can be added).
Most of code in JsonRDD are inferring types and converting objects into inferred types, these two can only be done in Python. The merging part is not so huge, also we may expect different behavior in Python (more strictly).

If user really want the behavior that JsonRDD provided, it's easy to call

sqlContext.jsonRDD(rdd.map(json.dumps))

Do these seem reasonable?

nchammas · 2014-10-09T21:22:58Z

Most of code in JsonRDD are inferring types and converting objects into inferred types, these two can only be done in Python

Ah yeah, I'm assuming that automatically converting objects to the inferred type is a good thing. When a user asks for the schema to be inferred, automatic type conversion should be expected behavior.

I guess my perspective is that inferring schema is most useful when the schema is not consistent or known in advance. It should not just be a convenient way to get the schema for a dataset that has a consistent schema, though it can serve that purpose too.

If user really want the behavior that JsonRDD provided, it's easy to call
sqlContext.jsonRDD(rdd.map(json.dumps))
Do these seem reasonable?

Definitely. That's precisely the workaround I suggested in SPARK-2870, though it has the obvious JSON ser/de performance cost and seems "hacky".

SparkQA · 2014-10-09T21:40:45Z

QA tests have finished for PR 2716 at commit 29e94d5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):

yhuai · 2014-10-09T23:15:32Z

Oh, I see. If inferring types and converting objects need to be done in Python, seems it will be hard to reuse code in JsonRDD.

SparkQA · 2014-10-10T04:37:51Z

QA tests have started for PR 2716 at commit 29e94d5.

This patch merges cleanly.

SparkQA · 2014-10-10T04:39:36Z

QA tests have started for PR 2716 at commit e48d7fb.

This patch merges cleanly.

SparkQA · 2014-10-25T08:15:57Z

Test build #433 has finished for PR 2716 at commit 9767b27.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-10-25T18:08:50Z

Test build #451 has started for PR 2716 at commit 9767b27.

This patch merges cleanly.

SparkQA · 2014-10-25T19:52:36Z

Test build #451 has finished for PR 2716 at commit 9767b27.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):

marmbrus · 2014-10-26T21:57:59Z

python/pyspark/sql.py


        Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
        using dict is deprecated.

+        If some of rows has different types with inferred types, it may cause
+        runtime exceptions.


When samplingRatio is specified, the schema is inferred by looking at the types of each row in the sampled dataset. Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.

Each row could be L{pyspark.sql.Row} object or namedtuple or objects. Using top level dicts is deprecated, as this datatype is used to represent Maps.

If a single column has multiple distinct inferred types, it may cause runtime exceptions.

marmbrus · 2014-10-26T21:58:35Z

Minor comment on documentation wording. Otherwise this LGTM!

davies · 2014-10-27T04:56:52Z

@marmbrus fixed, thanks!

SparkQA · 2014-10-27T04:59:45Z

Test build #22273 has started for PR 2716 at commit 567dc60.

This patch merges cleanly.

SparkQA · 2014-10-27T06:33:11Z

Test build #22273 has finished for PR 2716 at commit 567dc60.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):

AmplabJenkins · 2014-10-27T06:33:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22273/
Test PASSed.

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala

SparkQA · 2014-10-30T00:07:33Z

Test build #22496 has started for PR 2716 at commit 34b5c63.

This patch merges cleanly.

SparkQA · 2014-10-30T01:29:48Z

Test build #22496 has finished for PR 2716 at commit 34b5c63.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)
- case class RequestExecutors(requestedTotal: Int) extends CoarseGrainedClusterMessage
- case class KillExecutors(executorIds: Seq[String]) extends CoarseGrainedClusterMessage
- class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)
- class NullType(PrimitiveType):

AmplabJenkins · 2014-10-30T01:29:52Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22496/
Test FAILed.

SparkQA · 2014-10-31T00:26:41Z

Test build #495 has started for PR 2716 at commit 34b5c63.

This patch merges cleanly.

SparkQA · 2014-10-31T01:57:48Z

Test build #495 has finished for PR 2716 at commit 34b5c63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

davies · 2014-10-31T19:13:02Z

Ping dashboard

marmbrus · 2014-11-02T23:52:09Z

Sorry, for the delay here. If you can merge I'll try to squeeze this into 1.2. Thanks!

Conflicts: python/pyspark/sql.py sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala

davies · 2014-11-03T02:33:13Z

@marmbrus waiting for jenkins

SparkQA · 2014-11-03T02:34:59Z

Test build #22792 has started for PR 2716 at commit e678f6d.

This patch merges cleanly.

SparkQA · 2014-11-03T02:35:55Z

Test build #508 has started for PR 2716 at commit 34b5c63.

This patch does not merge cleanly.

SparkQA · 2014-11-03T03:48:40Z

Test build #508 has finished for PR 2716 at commit 34b5c63.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):

SparkQA · 2014-11-03T04:22:02Z

Test build #22792 has finished for PR 2716 at commit e678f6d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class NullType(PrimitiveType):
- // in some cases, such as when a class is enclosed in an object (in which case
- abstract class UserDefinedType[UserType] extends DataType with Serializable
- public abstract class UserDefinedType<UserType> extends DataType implements Serializable

AmplabJenkins · 2014-11-03T04:22:05Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22792/
Test PASSed.

davies · 2014-11-03T05:07:00Z

@marmbrus It's ready to go

marmbrus · 2014-11-03T21:18:15Z

Thanks! Merged to master.

This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling. If sampling is presented, it will infer schema from all the rows after sampling. Also, add samplingRatio for jsonFile() and jsonRDD() Author: Davies Liu <[email protected]> Author: Davies Liu <[email protected]> Closes #2716 from davies/infer and squashes the following commits: e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 567dc60 [Davies Liu] update docs 9767b27 [Davies Liu] Merge branch 'master' into infer e48d7fb [Davies Liu] fix tests 29e94d5 [Davies Liu] let NullType inherit from PrimitiveType ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer 540d1d5 [Davies Liu] merge fields for StructType f93fd84 [Davies Liu] add more tests 3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD (cherry picked from commit 24544fb) Signed-off-by: Michael Armbrust <[email protected]>

take more rows to infer schema, or infer the schema by sampling the RDD

3603e00

add more tests

f93fd84

merge fields for StructType

540d1d5

davies added 2 commits October 9, 2014 11:25

Merge branch 'master' of github.com:apache/spark into infer

ee5d524

let NullType inherit from PrimitiveType

29e94d5

fix tests

e48d7fb

marmbrus reviewed Oct 26, 2014
View reviewed changes

update docs

567dc60

Merge branch 'master' of github.com:apache/spark into infer

34b5c63

Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala

Merge branch 'master' of github.com:apache/spark into infer

e678f6d

Conflicts: python/pyspark/sql.py sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala

asfgit closed this in 24544fb Nov 3, 2014

[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling #2716

[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling #2716

Conversation

davies commented Oct 8, 2014

SparkQA commented Oct 8, 2014

SparkQA commented Oct 8, 2014

SparkQA commented Oct 8, 2014

AmplabJenkins commented Oct 8, 2014

SparkQA commented Oct 8, 2014

AmplabJenkins commented Oct 8, 2014

nchammas commented Oct 8, 2014

davies commented Oct 8, 2014

SparkQA commented Oct 8, 2014

SparkQA commented Oct 9, 2014

AmplabJenkins commented Oct 9, 2014

AmplabJenkins commented Oct 9, 2014

AmplabJenkins commented Oct 9, 2014

SparkQA commented Oct 9, 2014

marmbrus commented Oct 9, 2014

nchammas commented Oct 9, 2014

SparkQA commented Oct 9, 2014

SparkQA commented Oct 9, 2014

davies commented Oct 9, 2014

nchammas commented Oct 9, 2014

SparkQA commented Oct 9, 2014

yhuai commented Oct 9, 2014

SparkQA commented Oct 10, 2014

SparkQA commented Oct 10, 2014

SparkQA commented Oct 25, 2014

SparkQA commented Oct 25, 2014

SparkQA commented Oct 25, 2014

marmbrus Oct 26, 2014

Choose a reason for hiding this comment

marmbrus commented Oct 26, 2014

davies commented Oct 27, 2014

SparkQA commented Oct 27, 2014

SparkQA commented Oct 27, 2014

AmplabJenkins commented Oct 27, 2014

SparkQA commented Oct 30, 2014

SparkQA commented Oct 30, 2014

AmplabJenkins commented Oct 30, 2014

SparkQA commented Oct 31, 2014

SparkQA commented Oct 31, 2014

davies commented Oct 31, 2014

marmbrus commented Nov 2, 2014

davies commented Nov 3, 2014

SparkQA commented Nov 3, 2014

SparkQA commented Nov 3, 2014

SparkQA commented Nov 3, 2014

SparkQA commented Nov 3, 2014

AmplabJenkins commented Nov 3, 2014

davies commented Nov 3, 2014

marmbrus commented Nov 3, 2014