Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling #2716

Closed
wants to merge 10 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Oct 8, 2014

This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.

If sampling is presented, it will infer schema from all the rows after sampling.

Also, add samplingRatio for jsonFile() and jsonRDD()

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2716 at commit 3603e00.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2716 at commit f93fd84.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2716 at commit 3603e00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(DataType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21480/Test PASSed.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have finished for PR 2716 at commit f93fd84.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(DataType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21481/Test PASSed.

@nchammas
Copy link
Contributor

nchammas commented Oct 8, 2014

@davies I believe this PR also relates to the features discussed in SPARK-2870. Since you are already doing schema inference over multiple rows with an optional sampling ratio in this PR, how far are we from being able to do full schema inference on RDDs of dicts?

@davies
Copy link
Contributor Author

davies commented Oct 8, 2014

@nchammas This PR only fix the problem of having empty values in first few rows, it can not handle different types for one field (like what json() had done).

Maybe we could support optional fields.

@SparkQA
Copy link

SparkQA commented Oct 8, 2014

QA tests have started for PR 2716 at commit 540d1d5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2716 at commit 540d1d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(DataType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21489/Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21542/Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21543/Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have started for PR 2716 at commit 29e94d5.

  • This patch merges cleanly.

@marmbrus
Copy link
Contributor

marmbrus commented Oct 9, 2014

@davies thanks for working on this! I'm always a little embarrassed when I have to explain how the current code works :)

@yhuai, can you take a look at this? Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.

@nchammas
Copy link
Contributor

nchammas commented Oct 9, 2014

Perhaps we can extract the schema merging code from JSON and use it here as well after inferring the per row schema in python.

That sounds good to me! Schema inference is powerful feature, and it would be good to see it leverage this type of schema merging logic wherever possible.

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2716 at commit 29e94d5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have started for PR 2716 at commit 29e94d5.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Oct 9, 2014

It will be cool to reuse the schema merging code from JsonRDD, some concerns:

  1. the schema merging logic is special designed for JSON, such as it convert all the conflict types into StringType. This works for JSON, but not the best solution for Python RDD.

For example, if user have two rows:

{a: 2} 
{a: 'abc'}

JsonRDD will infer type the of a as StringType, then it will have runtime exception when trying to access it as IntType.

SchemaRDD expects that each column should have same type, so I'd like to raise exception during inferring, not runtime. It's better to let user deal with the dirty datetypes before inferring schema .

  1. it can not handle MapType (this can be added).
  2. Most of code in JsonRDD are inferring types and converting objects into inferred types, these two can only be done in Python. The merging part is not so huge, also we may expect different behavior in Python (more strictly).

If user really want the behavior that JsonRDD provided, it's easy to call

sqlContext.jsonRDD(rdd.map(json.dumps))

Do these seem reasonable?

@nchammas
Copy link
Contributor

nchammas commented Oct 9, 2014

Most of code in JsonRDD are inferring types and converting objects into inferred types, these two can only be done in Python

Ah yeah, I'm assuming that automatically converting objects to the inferred type is a good thing. When a user asks for the schema to be inferred, automatic type conversion should be expected behavior.

I guess my perspective is that inferring schema is most useful when the schema is not consistent or known in advance. It should not just be a convenient way to get the schema for a dataset that has a consistent schema, though it can serve that purpose too.

If user really want the behavior that JsonRDD provided, it's easy to call
sqlContext.jsonRDD(rdd.map(json.dumps))
Do these seem reasonable?

Definitely. That's precisely the workaround I suggested in SPARK-2870, though it has the obvious JSON ser/de performance cost and seems "hacky".

@SparkQA
Copy link

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2716 at commit 29e94d5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):

@yhuai
Copy link
Contributor

yhuai commented Oct 9, 2014

Oh, I see. If inferring types and converting objects need to be done in Python, seems it will be hard to reuse code in JsonRDD.

@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have started for PR 2716 at commit 29e94d5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 10, 2014

QA tests have started for PR 2716 at commit e48d7fb.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #433 has finished for PR 2716 at commit 9767b27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #451 has started for PR 2716 at commit 9767b27.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #451 has finished for PR 2716 at commit 9767b27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):


Each row could be L{pyspark.sql.Row} object or namedtuple or objects,
using dict is deprecated.

If some of rows has different types with inferred types, it may cause
runtime exceptions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When samplingRatio is specified, the schema is inferred by looking at the types of each row in the sampled dataset. Otherwise, the first 100 rows of the RDD are inspected. Nested collections are supported, which can include array, dict, list, Row, tuple, namedtuple, or object.

Each row could be L{pyspark.sql.Row} object or namedtuple or objects. Using top level dicts is deprecated, as this datatype is used to represent Maps.

If a single column has multiple distinct inferred types, it may cause runtime exceptions.

@marmbrus
Copy link
Contributor

Minor comment on documentation wording. Otherwise this LGTM!

@davies
Copy link
Contributor Author

davies commented Oct 27, 2014

@marmbrus fixed, thanks!

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22273 has started for PR 2716 at commit 567dc60.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 27, 2014

Test build #22273 has finished for PR 2716 at commit 567dc60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22273/
Test PASSed.

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22496 has started for PR 2716 at commit 34b5c63.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22496 has finished for PR 2716 at commit 34b5c63.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AddWebUIFilter(filterName:String, filterParams: Map[String, String], proxyBase: String)
    • case class RequestExecutors(requestedTotal: Int) extends CoarseGrainedClusterMessage
    • case class KillExecutors(executorIds: Seq[String]) extends CoarseGrainedClusterMessage
    • class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val actorSystem: ActorSystem)
    • class NullType(PrimitiveType):

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22496/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #495 has started for PR 2716 at commit 34b5c63.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #495 has finished for PR 2716 at commit 34b5c63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor Author

davies commented Oct 31, 2014

Ping dashboard

@marmbrus
Copy link
Contributor

marmbrus commented Nov 2, 2014

Sorry, for the delay here. If you can merge I'll try to squeeze this into 1.2. Thanks!

Conflicts:
	python/pyspark/sql.py
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
@davies
Copy link
Contributor Author

davies commented Nov 3, 2014

@marmbrus waiting for jenkins

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22792 has started for PR 2716 at commit e678f6d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #508 has started for PR 2716 at commit 34b5c63.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #508 has finished for PR 2716 at commit 34b5c63.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22792 has finished for PR 2716 at commit e678f6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class NullType(PrimitiveType):
    • // in some cases, such as when a class is enclosed in an object (in which case
    • abstract class UserDefinedType[UserType] extends DataType with Serializable
    • public abstract class UserDefinedType<UserType> extends DataType implements Serializable

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22792/
Test PASSed.

@davies
Copy link
Contributor Author

davies commented Nov 3, 2014

@marmbrus It's ready to go

@marmbrus
Copy link
Contributor

marmbrus commented Nov 3, 2014

Thanks! Merged to master.

asfgit pushed a commit that referenced this pull request Nov 3, 2014
This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.

If sampling is presented, it will infer schema from all the rows after sampling.

Also, add samplingRatio for jsonFile() and jsonRDD()

Author: Davies Liu <[email protected]>
Author: Davies Liu <[email protected]>

Closes #2716 from davies/infer and squashes the following commits:

e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD

(cherry picked from commit 24544fb)
Signed-off-by: Michael Armbrust <[email protected]>
@asfgit asfgit closed this in 24544fb Nov 3, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants