[SPARK-5097][SQL] DataFrame #4173

Closed
rxin wants to merge 25 commits

Contributor

@rxin rxin commented Jan 23, 2015

This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.

  • Audit of the API
  • Documentation
  • More test cases to cover the new API
  • Python support
  • Type alias SchemaRDD

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26008 has started for PR 4173 at commit feb43ef.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26008 has finished for PR 4173 at commit feb43ef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26008/
Test FAILed.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26010 has started for PR 4173 at commit 1532e1e.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26010 has finished for PR 4173 at commit 1532e1e.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26010/
Test FAILed.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26011 has started for PR 4173 at commit bde6628.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26011 has finished for PR 4173 at commit bde6628.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26011/
Test FAILed.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26033 has started for PR 4173 at commit 23b2c2d.

  • This patch merges cleanly.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26032/
Test FAILed.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26034 has started for PR 4173 at commit 38df669.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26033 has finished for PR 4173 at commit 23b2c2d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26033/
Test FAILed.

@SparkQA
SparkQA commented Jan 23, 2015

Test build #26034 has finished for PR 4173 at commit 38df669.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26034/
Test FAILed.

Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr))
}

override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = {
Contributor

It's easier to do in Python/R if joinType is put at the end.
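
One way to read this suggestion, as a minimal sketch only: with `joinType` last, the two-argument form is just the three-argument form with the type defaulted, which maps onto an optional trailing argument in Python or R. `Frame` and `JoinSpec` below are hypothetical stand-ins, not Spark classes, and the join condition is a plain string rather than a `Column`.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark's classes.
object JoinTypeLastSketch {
  // Records what a join call asked for, so the result can be inspected.
  final case class JoinSpec(left: String, right: String, condition: String, joinType: String)

  final case class Frame(name: String) {
    // Two-argument form: falls back to an inner join.
    def join(right: Frame, joinExprs: String): JoinSpec =
      join(right, joinExprs, "inner")

    // Three-argument form: joinType comes last, so it reads as an
    // optional trailing argument.
    def join(right: Frame, joinExprs: String, joinType: String): JoinSpec =
      JoinSpec(name, right.name, joinExprs, joinType)
  }
}
```

Under these assumptions, `Frame("people").join(Frame("dept"), "people.deptId = dept.id")` and the same call with a trailing `"inner"` describe the same join.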

@SparkQA
SparkQA commented Jan 26, 2015

Test build #26102 has started for PR 4173 at commit d0ffd84.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 26, 2015

Test build #26102 has finished for PR 4173 at commit d0ffd84.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26102/
Test FAILed.

@SparkQA
SparkQA commented Jan 26, 2015

Test build #26104 has started for PR 4173 at commit a47e189.

  • This patch merges cleanly.

* // The following are equivalent:
* peopleDf.filter($"age" > 15)
* peopleDf.where($"age" > 15)
* peopleDf($"age > 15)
Contributor

missing closing quote?
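
For reference, with the quote closed the third form would presumably read `peopleDf($"age" > 15)`. The toy model below is a sketch of how the three spellings can be made equivalent. `Frame`, `Col`, and `Predicate` are hypothetical stand-ins rather than the classes in this patch, and rows are simplified to maps of integer columns.

```scala
// Toy model for illustration only; none of these types are Spark's.
object FilterSketch {
  type Row = Map[String, Int]

  // A predicate over rows, produced by comparing a column to a bound.
  final case class Predicate(test: Row => Boolean)

  final case class Col(name: String) {
    def >(bound: Int): Predicate = Predicate(row => row(name) > bound)
  }

  // Enables the $"age" syntax by adding a `$` method to StringContext.
  implicit class ColumnInterpolator(val sc: StringContext) {
    def $(args: Any*): Col = Col(sc.s(args: _*))
  }

  final case class Frame(rows: Seq[Row]) {
    def filter(p: Predicate): Frame = Frame(rows.filter(p.test))
    // where and apply are aliases that delegate to filter.
    def where(p: Predicate): Frame = filter(p)
    def apply(p: Predicate): Frame = filter(p)
  }
}
```

After `import FilterSketch._`, all three calls go through `filter`, so they return equal results by construction.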

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26154/
Test FAILed.

@SparkQA
SparkQA commented Jan 27, 2015

Test build #26156 has finished for PR 4173 at commit e971078.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26156/
Test FAILed.

@SparkQA
SparkQA commented Jan 27, 2015

Test build #26157 has finished for PR 4173 at commit e971078.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26157/
Test FAILed.

def toDF: DataFrame = this

/** Return the schema of this [[DataFrame]]. */
override def schema: StructType = queryExecution.analyzed.schema
Contributor

Can we add a new higher-level type for schema as well? It is painful as a user to dig into StructType etc. Similarly, when applying a schema to an RDD, it would be good to have a higher-level type or constructor.

Contributor Author

There is dtypes here, no?

Contributor

@shivaram Are you asking for something like RowType extends StructType?

Contributor

Yeah, dtypes is close to what I was talking about, and it is probably sufficient for getting the schema out. However, when applying a schema to an RDD, one still needs to construct a StructType etc. It would be great to have a lightweight way of saying something like DataFrame(rdd, colNames=c("age", "name"), colTypes=c("int", "character")).
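
A rough sketch of what such a lightweight constructor could look like on the Scala side. The `schemaOf` helper and the short type names (`"int"`, `"character"`, and so on) are assumptions made for illustration, and the `StructType`, `StructField`, and data type definitions here are local stand-ins that only mirror the names in the discussion; they are not Spark's classes.

```scala
// Local stand-ins mirroring the names discussed above; not Spark's own classes.
object SchemaSketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  case object DoubleType extends DataType

  final case class StructField(name: String, dataType: DataType, nullable: Boolean = true)
  final case class StructType(fields: Seq[StructField])

  // Hypothetical mapping from short type names to data types.
  private val typesByName: Map[String, DataType] = Map(
    "int" -> IntegerType,
    "character" -> StringType,
    "string" -> StringType,
    "double" -> DoubleType
  )

  // Builds a schema from parallel lists of column names and type names,
  // so callers never construct StructType or StructField by hand.
  def schemaOf(colNames: Seq[String], colTypes: Seq[String]): StructType = {
    require(colNames.length == colTypes.length,
      "colNames and colTypes must have the same length")
    val fields = colNames.zip(colTypes).map { case (name, typeName) =>
      val dataType = typesByName.getOrElse(
        typeName, throw new IllegalArgumentException(s"Unknown type name: $typeName"))
      StructField(name, dataType)
    }
    StructType(fields)
  }
}
```

With this helper, `SchemaSketch.schemaOf(Seq("age", "name"), Seq("int", "character"))` yields a two-field schema, and an unknown type name fails fast with an `IllegalArgumentException`.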

Davies Liu and others added 3 commits January 27, 2015 10:19
@SparkQA
SparkQA commented Jan 27, 2015

Test build #26175 has started for PR 4173 at commit 828f70d.

  • This patch merges cleanly.

@SparkQA
SparkQA commented Jan 27, 2015

Test build #26175 has finished for PR 4173 at commit 828f70d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26175/
Test FAILed.


override def getItem(ordinal: Column): Column = GetItem(expr, ordinal.expr)

override def getField(fieldName: String): Column = GetField(expr, fieldName)
Contributor

We might consider using apply instead of, or in addition to, this.
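
A small sketch of the "in addition to" option, assuming `apply` simply delegates to the named methods. The expression classes are simplified stand-ins for the ones in this patch, and `getItem` takes an `Int` here rather than a `Column` to keep the example short.

```scala
// Simplified stand-ins for illustration; not the classes from this patch.
object ApplySketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(child: Expr, fieldName: String) extends Expr
  final case class GetItem(child: Expr, ordinal: Int) extends Expr

  final case class Column(expr: Expr) {
    def getField(fieldName: String): Column = Column(GetField(expr, fieldName))
    def getItem(ordinal: Int): Column = Column(GetItem(expr, ordinal))

    // apply overloads delegate to the named methods, so col("city")
    // and col(0) become shorthand for getField and getItem.
    def apply(fieldName: String): Column = getField(fieldName)
    def apply(ordinal: Int): Column = getItem(ordinal)
  }
}
```

Keeping the named methods alongside `apply` leaves an explicit spelling available for callers, such as Java users, who cannot use the call syntax.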

@SparkQA
SparkQA commented Jan 27, 2015

Test build #26187 has started for PR 4173 at commit 0a1a73b.

  • This patch merges cleanly.

*
* @param right Right side of the join.
* @param joinExprs Join expression.
* @param joinType One of: `inner`, `outer`, `left_outer`, `right_outer`, `semijoin`.
Contributor

Doesn't a semi join have to specify left/right as well?
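
To make the concern concrete, here is a sketch of a join-type parser in which the semi join names its side explicitly and a bare `semijoin` is rejected. The accepted spellings (including `left_semi`) and the `parse` helper are assumptions for illustration, not what this patch implements.

```scala
// Illustrative only; the accepted names are assumptions, not this patch's behavior.
object JoinTypeSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object FullOuter extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object LeftSemi extends JoinType

  // Normalizes case and underscores, then maps the name to a join type.
  // A semi join must say which side it keeps, so "semijoin" alone fails.
  def parse(name: String): JoinType =
    name.toLowerCase.replace("_", "") match {
      case "inner"               => Inner
      case "outer" | "fullouter" => FullOuter
      case "leftouter"           => LeftOuter
      case "rightouter"          => RightOuter
      case "leftsemi"            => LeftSemi
      case other =>
        throw new IllegalArgumentException(s"Unsupported join type: $other")
    }
}
```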

@SparkQA
SparkQA commented Jan 27, 2015

Test build #26187 has finished for PR 4173 at commit 0a1a73b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26187/
Test PASSed.

@rxin
Contributor Author

rxin commented Jan 28, 2015

Alright, I'm going to merge this one since it touches too many moving parts. I will submit another PR later today to update the documentation and address Michael's comments. It will also add more tests.

9 participants