[SPARK-5097][SQL] DataFrame #4173

rxin · 2015-01-23T08:04:39Z

This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.

SparkQA · 2015-01-23T08:07:32Z

Test build #26008 has started for PR 4173 at commit feb43ef.

This patch merges cleanly.

SparkQA · 2015-01-23T08:08:46Z

Test build #26008 has finished for PR 4173 at commit feb43ef.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T08:08:47Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26008/
Test FAILed.

SparkQA · 2015-01-23T08:22:44Z

Test build #26010 has started for PR 4173 at commit 1532e1e.

This patch merges cleanly.

SparkQA · 2015-01-23T08:23:37Z

Test build #26010 has finished for PR 4173 at commit 1532e1e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T08:23:39Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26010/
Test FAILed.

SparkQA · 2015-01-23T08:32:43Z

Test build #26011 has started for PR 4173 at commit bde6628.

This patch merges cleanly.

SparkQA · 2015-01-23T08:35:57Z

Test build #26011 has finished for PR 4173 at commit bde6628.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T08:35:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26011/
Test FAILed.

SparkQA · 2015-01-23T18:57:45Z

Test build #26033 has started for PR 4173 at commit 23b2c2d.

This patch merges cleanly.

AmplabJenkins · 2015-01-23T19:07:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26032/
Test FAILed.

SparkQA · 2015-01-23T19:07:37Z

Test build #26034 has started for PR 4173 at commit 38df669.

This patch merges cleanly.

SparkQA · 2015-01-23T20:22:09Z

Test build #26033 has finished for PR 4173 at commit 23b2c2d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T20:22:13Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26033/
Test FAILed.

SparkQA · 2015-01-23T20:36:57Z

Test build #26034 has finished for PR 4173 at commit 38df669.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-23T20:37:01Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26034/
Test FAILed.

davies · 2015-01-23T23:22:34Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+    Join(logicalPlan, right.logicalPlan, Inner, Some(joinExprs.expr))
+  }
+
+  override def join(right: DataFrame, joinType: String, joinExprs: Column): DataFrame = {


It's easier to do in Python/R if joinType is put at the end.
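One likely reading of this suggestion (an assumption, since the comment does not spell out the reason) is that a trailing `joinType` can carry a default value, so the common inner join needs no third argument. A minimal Scala sketch of that ordering follows; `SketchColumn`, `SketchFrame`, and the `"inner"` default are hypothetical stand-ins, not the classes in this patch:

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's Column/DataFrame.
final case class SketchColumn(expr: String)

final case class SketchFrame(name: String) {
  // joinType comes last so it can take a default (assumed here to be "inner").
  // The result is just a description string so the sketch stays self-contained.
  def join(right: SketchFrame, joinExprs: SketchColumn, joinType: String = "inner"): String =
    s"$joinType join of $name and ${right.name} on ${joinExprs.expr}"
}

object JoinOrderSketch {
  def main(args: Array[String]): Unit = {
    val people = SketchFrame("people")
    val dept = SketchFrame("dept")
    val cond = SketchColumn("people.deptId = dept.id")

    // Common case: the join type is omitted and the default applies.
    println(people.join(dept, cond))
    // Explicit case: the join type is simply appended.
    println(people.join(dept, cond, "left_outer"))
  }
}
```

With `joinType` in the middle, as in the hunk above, every caller has to pass it in order to reach `joinExprs`.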

SparkQA · 2015-01-26T18:42:44Z

Test build #26102 has started for PR 4173 at commit d0ffd84.

This patch merges cleanly.

SparkQA · 2015-01-26T18:43:36Z

Test build #26102 has finished for PR 4173 at commit d0ffd84.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-26T18:43:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26102/
Test FAILed.

SparkQA · 2015-01-26T18:52:47Z

Test build #26104 has started for PR 4173 at commit a47e189.

This patch merges cleanly.

pwendell · 2015-01-26T19:28:22Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+   *   // The following are equivalent:
+   *   peopleDf.filter($"age" > 15)
+   *   peopleDf.where($"age" > 15)
+   *   peopleDf($"age > 15)


missing closing quote?

AmplabJenkins · 2015-01-27T09:49:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26154/
Test FAILed.

SparkQA · 2015-01-27T10:15:26Z

Test build #26156 has finished for PR 4173 at commit e971078.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-27T10:15:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26156/
Test FAILed.

SparkQA · 2015-01-27T10:45:45Z

Test build #26157 has finished for PR 4173 at commit e971078.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-27T10:45:49Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26157/
Test FAILed.

shivaram · 2015-01-27T18:00:25Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+  def toDF: DataFrame = this
+
+  /** Return the schema of this [[DataFrame]]. */
+  override def schema: StructType = queryExecution.analyzed.schema


Can we add a new higher-level type for schema as well? It is painful as a user to dig into StructType etc. Similarly, while applying a schema to an RDD it would be good to have a higher-level type / constructor.

There is dtypes here, no?

@shivaram Are you asking for something like RowType extends StructType?

Yeah, dtypes is close to what I was talking about and it is probably sufficient to get the schema out. However, while applying a schema to an RDD one still needs to construct a StructType etc. It'd be great to have a lightweight way of saying something like DataFrame(rdd, colNames=c("age", "name"), colTypes=c("int", "character"))
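To make the request concrete, here is a rough Scala sketch of what such a lightweight constructor could do: turn parallel lists of column names and type names into a schema value. Everything in it (`ColType`, `SketchField`, `SketchSchema`, `SchemaSketch.fromNames`, and the set of accepted type names) is hypothetical and is not part of this patch or of Spark's `StructType` API:

```scala
// Hypothetical miniature schema model, used only to illustrate the request above.
sealed trait ColType
case object IntCol extends ColType
case object StringCol extends ColType

final case class SketchField(name: String, colType: ColType)
final case class SketchSchema(fields: Seq[SketchField])

object SchemaSketch {
  // Assumed mapping from user-facing type names to column types.
  private val typeNames: Map[String, ColType] = Map(
    "int" -> IntCol,
    "character" -> StringCol,
    "string" -> StringCol
  )

  // Build a schema from parallel name/type lists, failing fast on bad input.
  def fromNames(colNames: Seq[String], colTypes: Seq[String]): SketchSchema = {
    require(colNames.length == colTypes.length, "colNames and colTypes must have the same length")
    val fields = colNames.zip(colTypes).map { case (name, typeName) =>
      val colType = typeNames.getOrElse(
        typeName,
        throw new IllegalArgumentException(s"unknown column type: $typeName"))
      SketchField(name, colType)
    }
    SketchSchema(fields)
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the colNames / colTypes example from the comment.
    println(fromNames(Seq("age", "name"), Seq("int", "character")))
  }
}
```

The caller only ever deals with strings, which is the point of the request; the structured schema type stays an internal detail.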


SparkQA · 2015-01-27T19:17:36Z

Test build #26175 has started for PR 4173 at commit 828f70d.

This patch merges cleanly.

SparkQA · 2015-01-27T20:59:54Z

Test build #26175 has finished for PR 4173 at commit 828f70d.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-27T20:59:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26175/
Test FAILed.

marmbrus · 2015-01-27T21:46:55Z

sql/core/src/main/scala/org/apache/spark/sql/Column.scala

+
+  override def getItem(ordinal: Column): Column = GetItem(expr, ordinal.expr)
+
+  override def getField(fieldName: String): Column = GetField(expr, fieldName)


We might consider using apply instead or in addition to this.
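A small sketch of what "in addition to" could look like: `apply` overloads that delegate to `getItem` and `getField`, so `col(0)` and `col("city")` read like ordinary indexing. The types below are simplified, hypothetical stand-ins (the ordinal is a plain `Int` here, whereas the hunk above takes a `Column`), not the classes in this patch:

```scala
// Hypothetical, simplified expression tree; names echo the diff but are NOT Spark's classes.
sealed trait SketchExpr
final case class Attr(name: String) extends SketchExpr
final case class ItemAt(child: SketchExpr, ordinal: Int) extends SketchExpr
final case class FieldOf(child: SketchExpr, fieldName: String) extends SketchExpr

final case class Col(expr: SketchExpr) {
  def getItem(ordinal: Int): Col = Col(ItemAt(expr, ordinal))
  def getField(fieldName: String): Col = Col(FieldOf(expr, fieldName))

  // apply overloads simply delegate, so both spellings build the same expression.
  def apply(ordinal: Int): Col = getItem(ordinal)
  def apply(fieldName: String): Col = getField(fieldName)
}

object ApplySketch {
  def main(args: Array[String]): Unit = {
    val tags = Col(Attr("tags"))
    val address = Col(Attr("address"))

    println(tags(0))          // same expression as tags.getItem(0)
    println(address("city"))  // same expression as address.getField("city")
  }
}
```

Keeping the named methods alongside `apply` leaves an explicit spelling available for Java callers, which cannot use Scala's `apply` sugar.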

SparkQA · 2015-01-27T21:47:43Z

Test build #26187 has started for PR 4173 at commit 0a1a73b.

This patch merges cleanly.

marmbrus · 2015-01-27T21:49:23Z

sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala

+   *
+   * @param right Right side of the join.
+   * @param joinExprs Join expression.
+   * @param joinType One of: `inner`, `outer`, `left_outer`, `right_outer`, `semijoin`.


Doesn't a semi join have to specify left/right as well?

SparkQA · 2015-01-27T23:25:19Z

Test build #26187 has finished for PR 4173 at commit 0a1a73b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-01-27T23:25:23Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26187/
Test PASSed.

rxin · 2015-01-28T00:08:04Z

Alright, I'm going to merge this one since it touches too many moving parts. I will submit another PR later today to update the documentation and address Michael's comments. It will also add more tests.

rxin force-pushed the df1 branch from feb43ef to 1532e1e on January 23, 2015 08:19

davies reviewed Jan 23, 2015

pwendell reviewed Jan 26, 2015

shivaram reviewed Jan 27, 2015

Davies Liu and others added 3 commits January 27, 2015 10:19

fix collect with UDT and tests

6bf2b73

add repartition

257b9e6

Merge pull request #7 from davies/df

828f70d


rxin added 2 commits January 27, 2015 13:44

Mima.

23b4427

Merge branch 'df1' of github.com:rxin/spark into df1

0a1a73b

marmbrus reviewed Jan 27, 2015

asfgit closed this in 119f45d Jan 28, 2015

liancheng mentioned this pull request Feb 2, 2015

[SPARK-5461] [graphx] Add isCheckpointed, getCheckpointedFiles methods to Graph #4253

Closed

rxin deleted the df1 branch April 2, 2015 00:20


