[SPARK-13553][SPARK-13554][SQL] Migrates basic inspection and typed relational operations from DataFrame to Dataset #11431

liancheng · 2016-02-29T12:38:35Z

What changes were proposed in this pull request?

This PR migrates basic inspection and typed relational operations from DataFrame to Dataset. This is the first step toward unifying the DataFrame and Dataset APIs.

TODO

Migrate explode operations.

How was this patch tested?

Corresponding test cases are migrated from DataFrameSuite to DatasetSuite. These newly added test cases all share the same "df-to-ds" prefix so that we can easily execute them under SBT using:

sql/test-only *.DatasetSuite -- -z "df-to-ds"

This prefix will be removed after migrating all the DataFrame operations.
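
For readers unfamiliar with the filter: ScalaTest's `-z` argument selects tests whose names contain the given substring, which is why a shared prefix is enough to run only the migrated cases. The sketch below mimics that selection in plain Scala. The test names are hypothetical and are not taken from DatasetSuite; only the "df-to-ds" prefix comes from the description above.

```scala
object TestFilterSketch {
  // Hypothetical test names; only the "df-to-ds" prefix comes from the PR description.
  val testNames: List[String] = List(
    "df-to-ds: select",
    "df-to-ds: filter",
    "map and group by with class data"
  )

  // Mimics ScalaTest's -z selection: keep tests whose name contains the substring.
  def selectBySubstring(names: List[String], substring: String): List[String] =
    names.filter(_.contains(substring))

  def main(args: Array[String]): Unit =
    selectBySubstring(testNames, "df-to-ds").foreach(println)
}
```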

liancheng · 2016-02-29T12:41:08Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+          if n.outerPointer.isEmpty &&
+             n.cls.isMemberClass &&
+             !Modifier.isStatic(n.cls.getModifiers) =>
+          n.cls.getEnclosingClass


This change is included in PR #11421. Without this fix, we can't use case classes defined in SQLTestData as Dataset element types.
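
To make the guard in the diff concrete, here is a minimal sketch of the same reflection check outside Catalyst. It assumes only the JDK reflection API; `OuterPointerSketch`, `Outer`, `Inner`, and `needsOuterPointer` are illustrative names, and the `outerPointer.isEmpty` part of the real condition is omitted because it depends on Catalyst's `NewInstance`.

```scala
import java.lang.reflect.Modifier

object OuterPointerSketch {
  // A class nested in another class: creating an Inner requires an Outer instance.
  class Outer { class Inner }

  // Same predicate as the guard in the diff, minus the NewInstance-specific
  // `outerPointer.isEmpty` check (hypothetical helper, not Spark API).
  def needsOuterPointer(cls: Class[_]): Boolean =
    cls.isMemberClass && !Modifier.isStatic(cls.getModifiers)

  def main(args: Array[String]): Unit = {
    val inner = classOf[Outer#Inner]
    println(s"${inner.getName}: needsOuterPointer = ${needsOuterPointer(inner)}")
    println(s"enclosing class: ${inner.getEnclosingClass.getName}")
    println(s"java.lang.String: needsOuterPointer = ${needsOuterPointer(classOf[String])}")
  }
}
```

Classes nested directly in a Scala object are the separate case that PR #11421 addresses; this sketch does not try to reproduce it.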

SparkQA · 2016-02-29T14:25:20Z

Test build #52187 has finished for PR 11431 at commit f38c016.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-29T14:31:57Z

Test build #52188 has finished for PR 11431 at commit ee9c432.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-02-29T17:41:23Z

Can you rebase to get rid of the merged commit?

yhuai · 2016-02-29T18:37:17Z

@liancheng I just pushed two commits to resolve the conflicts.

SparkQA · 2016-02-29T20:18:29Z

Test build #52191 has finished for PR 11431 at commit 6cb945f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-02-29T21:14:54Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    val sum = weights.sum
+    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
+    normalizedCumWeights.sliding(2).map { x =>
+      new Dataset(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted)())


Do we need to pass an encoder into the newly created Datasets here?

liancheng (2016-03-01): No, there's an implicit encoder defined in the constructor of Dataset.
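
The two points in this thread can be sketched together without Spark. `ToyEncoder` and `ToyDataset` below are hypothetical stand-ins, not Spark classes; the weight arithmetic is copied from the diff, and each new instance keeps only its sampling bounds instead of wrapping a `Sample` plan. Because the encoder is an implicit constructor parameter, it is in implicit scope inside the class body, so `new ToyDataset[T](...)` needs no explicit encoder argument.

```scala
object RandomSplitSketch {
  // Hypothetical stand-in for an encoder; only used to show implicit resolution.
  trait ToyEncoder[T] { def name: String }
  implicit val intEncoder: ToyEncoder[Int] = new ToyEncoder[Int] { val name = "int" }

  // Toy dataset that records the [lower, upper) sampling bounds it was built with.
  class ToyDataset[T](val lower: Double, val upper: Double)(implicit val encoder: ToyEncoder[T]) {

    def randomSplitBounds(weights: Array[Double]): List[ToyDataset[T]] = {
      val sum = weights.sum
      // Normalize the weights, then accumulate them into cumulative boundaries.
      val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
      // Each adjacent pair of boundaries becomes one split.
      normalizedCumWeights.sliding(2).map { x =>
        new ToyDataset[T](x(0), x(1)) // encoder is picked up implicitly from this instance
      }.toList
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = new ToyDataset[Int](0.0, 1.0).randomSplitBounds(Array(1.0, 2.0, 1.0))
    parts.foreach(p => println(s"[${p.lower}, ${p.upper}) encoder=${p.encoder.name}"))
  }
}
```

With weights 1, 2, 1 the normalized values are 0.25, 0.5, 0.25, so the cumulative boundaries are 0.0, 0.25, 0.75, 1.0 and the three splits cover [0.0, 0.25), [0.25, 0.75), and [0.75, 1.0).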

davies · 2016-03-02T20:07:09Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+
+  /**
+   * Returns all column names and their data types as an array.
+   * @since 2.0.0


If the API is moved from DataFrame, should we also copy the @since? cc @rxin

rxin (2016-03-02): I'd put 2.0 since it didn't exist on Dataset before.

davies (2016-03-02): Since DataFrame will be an alias of Dataset, what will the doc for DataFrame look like?

davies (2016-03-02): If we copy the versions, it's also weird to see a Dataset method marked as introduced in 1.3, before Dataset itself was introduced in 1.6.

rxin (2016-03-02): Yeah, I'd just have it as 2.0.
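
The alias question can be illustrated with a toy sketch. A Scala type alias declares no members of its own, so any method called through the alias is the one defined, and documented, on the aliased class. All names below (`ToyRow`, `ToyDataset`, `ToyDataFrame`) are hypothetical, and the `dtypes` body is a placeholder rather than Spark's implementation.

```scala
object AliasSketch {
  // Hypothetical stand-in for a row type.
  final case class ToyRow(values: Seq[Any])

  class ToyDataset[T](val columns: Seq[(String, String)]) {
    /**
     * Returns all column names and their data types as an array.
     * @since 2.0.0
     */
    def dtypes: Array[(String, String)] = columns.toArray
  }

  // The alias adds no members; `dtypes` and its Scaladoc live on ToyDataset only.
  type ToyDataFrame = ToyDataset[ToyRow]

  def main(args: Array[String]): Unit = {
    val df: ToyDataFrame = new ToyDataset[ToyRow](Seq("id" -> "IntegerType"))
    df.dtypes.foreach { case (name, tpe) => println(s"$name: $tpe") }
  }
}
```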

liancheng · 2016-03-11T01:21:08Z

Closing this in favor of #11443.

Migrates basic DataFrame inspection methods to Dataset

3f59569

Migrates basic inspection and typed relational DF operations to DS

ee9c432

liancheng force-pushed the df-to-ds-typed-relational branch from f38c016 to ee9c432 Compare February 29, 2016 12:48

liancheng mentioned this pull request Feb 29, 2016

[SPARK-13540][SQL] Supports using nested classes within Scala objects as Dataset element type #11421

Closed

yhuai added 2 commits February 29, 2016 10:34

Merge remote-tracking branch 'upstream/master' into df-to-ds-typed-re…

2d2a3da

…lational Conflicts: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

Remove unnecessary changes and a minor comment change in DatasetSuite.

6cb945f

liancheng mentioned this pull request Mar 1, 2016

[SPARK-13244][SQL] Migrates DataFrame to Dataset #11443

Closed

3 tasks

liancheng closed this Mar 11, 2016

liancheng deleted the df-to-ds-typed-relational branch March 11, 2016 01:21
