Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection #5279

Closed
wants to merge 9 commits into from

Conversation

vlyubin
Copy link
Contributor

@vlyubin vlyubin commented Mar 31, 2015

i += 1
}

mutableRow
new GenericRowWithSchema(ar, schema)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't quite related to this PR, but I don't think it was necessary to use GenericMutableRow here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The original version uses a mutable row mostly because of the updates in the while loop I guess.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29441 has started for PR 5279 at commit ab7585b.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29441 has finished for PR 5279 at commit ab7585b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29441/
Test PASSed.

@rxin
Copy link
Contributor

rxin commented Mar 31, 2015

I know you are probably still working on this - any benchmark numbers?

@rxin
Copy link
Contributor

rxin commented Mar 31, 2015

cc @davies since you guys are both changing this part of the code lately.


dataType match {
// Check UDT first since UDTs can override other types
case udt: UserDefinedType[_] => (item) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please break the line after the first =>.

@liancheng
Copy link
Contributor

Went though very quickly for the first time, left some styling comments.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29499 has started for PR 5279 at commit 6a35425.

@SparkQA
Copy link

SparkQA commented Mar 31, 2015

Test build #29499 has finished for PR 5279 at commit 6a35425.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29499/
Test PASSed.

@vlyubin
Copy link
Contributor Author

vlyubin commented Apr 1, 2015

Here are the benchmark numbers: http://pastie.org/private/6vg7kg2yfwop2ov5zu2eq
TLDR: On real cluster the new implementation made rdd() 1.9 times faster and toDF() 2.2 times faster.

@SparkQA
Copy link

SparkQA commented Apr 1, 2015

Test build #29514 has started for PR 5279 at commit 99d333a.

@SparkQA
Copy link

SparkQA commented Apr 1, 2015

Test build #29514 has finished for PR 5279 at commit 99d333a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29514/
Test PASSed.

@vlyubin
Copy link
Contributor Author

vlyubin commented Apr 3, 2015

ping

@rxin
Copy link
Contributor

rxin commented Apr 3, 2015

@vlyubin who are you pinging? is this still "WIP"?

@vlyubin vlyubin changed the title [WIP] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection Apr 3, 2015
@vlyubin
Copy link
Contributor Author

vlyubin commented Apr 3, 2015

@rxin Sorry, I just removed the WIP tag. There isn't anything more to add, as it turned out that there are no places here where we could use SpecificMutableRow to speed things up.
So I'm pinging whoever wants to add more comments on the code ...

@rxin
Copy link
Contributor

rxin commented Apr 3, 2015

Jenkins, retest this please.

override def executeTake(limit: Int): Array[Row] =
rows.map(ScalaReflection.convertRowToScala(_, schema)).take(limit).toArray
override def executeCollect(): Array[Row] = {
val converters = schema.fields.map {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this pattern cannot be

val converter = CatalystTypeConverters.createToScalaConverter(schema)
rows.map(converter).toArray

?

It looks like this pretty efficiently handles this situation in the same way that you've extracted it here (even with calling convertRowWithConverters).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, I'll update these.

@SparkQA
Copy link

SparkQA commented Apr 8, 2015

Test build #29896 has started for PR 5279 at commit dec6802.

@SparkQA
Copy link

SparkQA commented Apr 8, 2015

Test build #29897 has started for PR 5279 at commit c327bc9.

@SparkQA
Copy link

SparkQA commented Apr 9, 2015

Test build #29896 has finished for PR 5279 at commit dec6802.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29896/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Apr 9, 2015

Test build #29897 has finished for PR 5279 at commit c327bc9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29897/
Test PASSed.

@liancheng
Copy link
Contributor

@vlyubin Would you mind to add [SQL] to the PR title? Thanks!

@vlyubin vlyubin changed the title [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection [SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing converters in ScalaReflection Apr 9, 2015
val convertedMap: HashMap[Any, Any] = HashMap()
while (iter.hasNext) {
val entry = iter.next()
convertedMap += Tuple2(keyConverter(entry.getKey), valueConverter(entry.getValue))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use
convertedMap(keyConverter(entry.getKey)) = valueConverter(entry.getValue)
to avoid creating a tuple.

@aarondav
Copy link
Contributor

aarondav commented Apr 9, 2015

LGTM, @marmbrus would you mind doing a final pass?

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #29988 has started for PR 5279 at commit 11a20ec.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #29988 has finished for PR 5279 at commit 11a20ec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29988/
Test PASSed.

}
}

/** Converts Catalyst types used internally in rows to standard Scala types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong comment style.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30051 has started for PR 5279 at commit e75a387.

@SparkQA
Copy link

SparkQA commented Apr 10, 2015

Test build #30051 has finished for PR 5279 at commit e75a387.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30051/
Test PASSed.

case 1 =>
val func = function.asInstanceOf[(Any) => Any]
val child0 = children(0)
lazy val converter0 = CatalystTypeConverters.createToScalaConverter(child0.dataType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use lazy val here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants