Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-2407: Added internal implementation of SQL SUBSTR() #1359

Closed
wants to merge 4 commits into from

Conversation

willb
Copy link
Contributor

@willb willb commented Jul 10, 2014

This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.

@AmplabJenkins
Copy link

Merged build started.

@AmplabJenkins
Copy link

Merged build triggered.

@SparkQA
Copy link

SparkQA commented Jul 10, 2014

QA tests have started for PR 1359. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull


def nullable: Boolean = true
def dataType: DataType = {
if (str.dataType == BinaryType) str.dataType else StringType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Will -- I think we need to check if resolved is true here to not break Catalyst's contract. Similar to here.

@concretevitamin
Copy link
Contributor

Hey @willb - thanks for working on this, which is going to be very useful for Spark SQL! I left a couple minor comments. Another general concern is the performance of eval(). If there are ways to reduce branchings, or reduce the function call (not sure if @inline will help), we should probably do it.

@willb
Copy link
Contributor Author

willb commented Jul 10, 2014

Thanks for the comments, @concretevitamin. I've made the changes from your comments and will think about reducing branch overhead before pushing an update.

@willb
Copy link
Contributor Author

willb commented Jul 10, 2014

@concretevitamin Inlining Substring.slice seems to make a nontrivial difference (~11.5% speedup) on a very simple Substring.eval() microbenchmark.

@concretevitamin
Copy link
Contributor

This sounds pretty cool!

@SparkQA
Copy link

SparkQA commented Jul 10, 2014

QA results for PR 1359:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/consoleFull

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16506/

@AmplabJenkins
Copy link

Merged build triggered.

@AmplabJenkins
Copy link

Merged build started.

@SparkQA
Copy link

SparkQA commented Jul 10, 2014

QA tests have started for PR 1359. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 10, 2014

QA results for PR 1359:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/consoleFull

@AmplabJenkins
Copy link

Merged build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16513/

override def children = str :: pos :: len :: Nil

@inline
def slice[T, C <% IndexedSeqOptimized[T,_]](str: C, startPos: Int, sliceLen: Int): Any = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to do this without view bounds? I think those are going to disappear soon: https://issues.scala-lang.org/browse/SI-7629

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think I can do it with implicit parameters. I'll push an update once I've run the Catalyst suite against my change.

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA tests have started for PR 1359. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA results for PR 1359:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16543/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA tests have started for PR 1359. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16544/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 11, 2014

QA results for PR 1359:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16544/consoleFull

@marmbrus
Copy link
Contributor

Hey Will, this looks great. Thanks for implementing! Mind fixing the conflict with master so I can merge?

willb added 2 commits July 14, 2014 17:55
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.

Squashes 8969d038 and e6419b4e.
* orders imports in stringOperations.scala
* Substring.dataType throws exception if children are unresolved
* inlines Substring.slice (~11.5% performance improvement on microbenchmark runs)
* adds a special `toString` case for two-argument SUBSTR expressions
* removes spurious I_ prefix to SUBSTR(ING) in HiveQL.scala

Thanks to @concretevitamin for prompt and useful feedback!
@marmbrus
Copy link
Contributor

add to whitelist

@SparkQA
Copy link

SparkQA commented Jul 15, 2014

QA tests have started for PR 1359. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 15, 2014

QA results for PR 1359:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
case class Substring(str: Expression, pos: Expression, len: Expression) extends Expression {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16652/consoleFull

@marmbrus
Copy link
Contributor

I'm going to go ahead and merge this. Someone can make a follow-up with nullability narrowing.

@marmbrus
Copy link
Contributor

Thanks! Merged into master and 1.0.

@asfgit asfgit closed this in 61de65b Jul 15, 2014
asfgit pushed a commit that referenced this pull request Jul 15, 2014
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.

Author: William Benton <[email protected]>

Closes #1359 from willb/internalSqlSubstring and squashes the following commits:

ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()

(cherry picked from commit 61de65b)
Signed-off-by: Michael Armbrust <[email protected]>
asfgit pushed a commit that referenced this pull request Jul 16, 2014
This is a follow-up of #1359 with nullability narrowing.

Author: Takuya UESHIN <[email protected]>

Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:

5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.

(cherry picked from commit 632fb3d)
Signed-off-by: Reynold Xin <[email protected]>
asfgit pushed a commit that referenced this pull request Jul 16, 2014
This is a follow-up of #1359 with nullability narrowing.

Author: Takuya UESHIN <[email protected]>

Closes #1426 from ueshin/issues/SPARK-2504 and squashes the following commits:

5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
@chutium
Copy link
Contributor

chutium commented Jul 16, 2014

hi, it is really very useful for us, i tried this implementation from @willb , in spark-shell, i still got java.lang.UnsupportedOperationException by Query Plan, i made some change in SqlParser: chutium@1de83a7

@marmbrus
Copy link
Contributor

Awesome! Please submit a pull request with that addition.
On Jul 16, 2014 7:53 AM, "Teng Qiu" [email protected] wrote:

hi, it is really very useful for us, i tried this implementation from
@willb https://github.com/willb , in spark-shell, i still got
java.lang.UnsupportedOperationException by Query Plan, i made some change
in SqlParser: chutium/spark@1de83a7
chutium@1de83a7


Reply to this email directly or view it on GitHub
#1359 (comment).

@chutium
Copy link
Contributor

chutium commented Jul 16, 2014

PR submitted #1442

asfgit pushed a commit that referenced this pull request Jul 19, 2014
follow-up of #1359

Author: chutium <[email protected]>

Closes #1442 from chutium/master and squashes the following commits:

b49cc8a [chutium] SPARK-2407: Added Parser of SQL SUBSTRING() #1442
9a60ccf [chutium] SPARK-2407: Added Parser of SQL SUBSTR() #1442
06e933b [chutium] Merge https://github.com/apache/spark
c870172 [chutium] Merge https://github.com/apache/spark
094f773 [chutium] Merge https://github.com/apache/spark
88cb37d [chutium] Merge https://github.com/apache/spark
1de83a7 [chutium] SPARK-2407: Added Parse of SQL SUBSTR()
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This replaces the Hive UDF for SUBSTR(ING) with an implementation in Catalyst
and adds tests to verify correct operation.

Author: William Benton <[email protected]>

Closes apache#1359 from willb/internalSqlSubstring and squashes the following commits:

ccedc47 [William Benton] Fixed too-long line.
a30a037 [William Benton] replace view bounds with implicit parameters
ec35c80 [William Benton] Adds fixes from review:
4f3bfdb [William Benton] Added internal implementation of SQL SUBSTR()
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
This is a follow-up of apache#1359 with nullability narrowing.

Author: Takuya UESHIN <[email protected]>

Closes apache#1426 from ueshin/issues/SPARK-2504 and squashes the following commits:

5157832 [Takuya UESHIN] Remove unnecessary white spaces.
80958ac [Takuya UESHIN] Fix nullability of Substring expression.
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
follow-up of apache#1359

Author: chutium <[email protected]>

Closes apache#1442 from chutium/master and squashes the following commits:

b49cc8a [chutium] SPARK-2407: Added Parser of SQL SUBSTRING() apache#1442
9a60ccf [chutium] SPARK-2407: Added Parser of SQL SUBSTR() apache#1442
06e933b [chutium] Merge https://github.com/apache/spark
c870172 [chutium] Merge https://github.com/apache/spark
094f773 [chutium] Merge https://github.com/apache/spark
88cb37d [chutium] Merge https://github.com/apache/spark
1de83a7 [chutium] SPARK-2407: Added Parse of SQL SUBSTR()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants