[SPARK-13553][SPARK-13554][SQL] Migrates basic inspection and typed relational operations from DataFrame to Dataset #11431

Closed

liancheng
Contributor

What changes were proposed in this pull request?

This PR migrates basic inspection and typed relational operations from DataFrame to Dataset. This is the first step toward unifying the DataFrame and Dataset APIs.

TODO

  • Migrate explode operations.

How was this patch tested?

Corresponding test cases are migrated from DataFrameSuite to DatasetSuite. These newly added test cases all share the same "df-to-ds" prefix so that we can easily execute them under SBT using:

sql/test-only *.DatasetSuite -- -z "df-to-ds"

This prefix will be removed after migrating all the DataFrame operations.

  if n.outerPointer.isEmpty &&
     n.cls.isMemberClass &&
     !Modifier.isStatic(n.cls.getModifiers) =>
  n.cls.getEnclosingClass
Contributor Author

This change is included in PR #11421. Without this fix, we can't use case classes defined in SQLTestData as Dataset element types.

@liancheng liancheng force-pushed the df-to-ds-typed-relational branch from f38c016 to ee9c432 Compare February 29, 2016 12:48
@SparkQA

SparkQA commented Feb 29, 2016

Test build #52187 has finished for PR 11431 at commit f38c016.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52188 has finished for PR 11431 at commit ee9c432.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 29, 2016

Can you rebase to get rid of the merge commit?

…lational

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
	sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@yhuai
Contributor

yhuai commented Feb 29, 2016

@liancheng I just pushed two commits to resolve the conflicts.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52191 has finished for PR 11431 at commit 6cb945f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val sum = weights.sum
val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
normalizedCumWeights.sliding(2).map { x =>
  new Dataset(sqlContext, Sample(x(0), x(1), withReplacement = false, seed, sorted)())
Contributor

Do we need to pass an encoder to the newly created Datasets here?

Contributor Author

No, there's an implicit encoder defined in the constructor of Dataset.
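
For readers unfamiliar with the mechanism behind this reply, here is a minimal, self-contained sketch of how an implicit constructor parameter can be picked up automatically when a class creates new instances of itself. It is only an illustration under stated assumptions: `Enc` and `MiniDataset` are hypothetical stand-ins, not Spark's `Encoder` or `Dataset`, and the real constructor in this PR may be wired differently.

```scala
object ImplicitEncoderSketch {
  // Hypothetical stand-in for an encoder type class; NOT Spark's Encoder.
  trait Enc[T] { def name: String }

  object Enc {
    // An instance in the companion object is found through the implicit scope of Enc[Int].
    implicit val intEnc: Enc[Int] = new Enc[Int] { val name = "int" }
  }

  // Hypothetical stand-in for Dataset. The encoder arrives as an implicit
  // constructor parameter, which also makes it an implicit value inside the class body.
  class MiniDataset[T](val rows: Seq[T])(implicit val enc: Enc[T]) {
    // No encoder is passed explicitly: the compiler resolves Enc[T] from `enc`.
    def firstHalf: MiniDataset[T] = new MiniDataset[T](rows.take(rows.length / 2))
  }

  def main(args: Array[String]): Unit = {
    val ds = new MiniDataset(Seq(1, 2, 3, 4)) // Enc[Int] comes from Enc's companion
    val half = ds.firstHalf                   // reuses the same encoder instance
    println(half.rows)
    println(half.enc.name)
  }
}
```

In this sketch, `firstHalf` plays the role of the `new Dataset(...)` call in the hunk above: the call site names no encoder, yet the new instance carries the one already held by the enclosing instance.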


/**
 * Returns all column names and their data types as an array.
 * @since 2.0.0
Contributor

If the API is moved from DataFrame, should we also copy the @since? cc @rxin

Contributor

I'd put 2.0 since it didn't exist on Dataset before.

Contributor

Since DataFrame will be an alias of Dataset, what will the doc for DataFrame look like?

Contributor

If we copy the versions, it's also weird to see a Dataset method (1.3) introduced before Dataset itself was introduced (1.6).

Contributor

Yea I'd just have it as 2.0.

@liancheng
Contributor Author

Closing this in favor of #11443.

@liancheng liancheng closed this Mar 11, 2016
@liancheng liancheng deleted the df-to-ds-typed-relational branch March 11, 2016 01:21
5 participants