Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow #4527

Closed

Conversation

yanboliang
Copy link
Contributor

When serialize json string to row, reuse a mutable row for both each record and inner nested structure instead of creating a new one each time.

@SparkQA
Copy link

SparkQA commented Feb 11, 2015

Test build #27284 has started for PR 4527 at commit b0c2b14.

  • This patch merges cleanly.

@yanboliang
Copy link
Contributor Author

@SparkQA
Copy link

SparkQA commented Feb 11, 2015

Test build #27284 has finished for PR 4527 at commit b0c2b14.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27284/
Test FAILed.

@yanboliang
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Feb 11, 2015

Test build #27288 has started for PR 4527 at commit b0c2b14.

  • This patch merges cleanly.

@yanboliang yanboliang changed the title [SQL] Reuse mutable row for each record at jsonStringToRow [SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow Feb 11, 2015
@SparkQA
Copy link

SparkQA commented Feb 11, 2015

Test build #27288 has finished for PR 4527 at commit b0c2b14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27288/
Test PASSed.

parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
// Reuse the mutable row for each record, however we still need to
// create a new row for every nested struct type in each record
val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this inside of mapPartitions, to reduce the closure serialization overhead. And I didn't see any benefit when using the SpecificMutableRow, why not just use the GenericMutableRow instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, it's not appropriate to use SpecificMutableRow here. I will change back to GenericMutableRow.

@yhuai
Copy link
Contributor

yhuai commented Feb 11, 2015

Thank you for working on it.

Seems new SpecificMutableRow(schema.fields.map(_.dataType)) cannot handle nested structure. I think we need to use the schema to create the top level mutable row and all inner rows (for inner StructType).

@yhuai
Copy link
Contributor

yhuai commented Feb 11, 2015

Also, can you add performance numbers?

@yhuai
Copy link
Contributor

yhuai commented Feb 11, 2015

Oh, enforceCorrectType will take care inner structures by calling asRow.

It will be great if we can use mutable rows for inner structures as well.

@yanboliang yanboliang force-pushed the jsonStringToRowOptimization branch from b0c2b14 to c30a358 Compare February 12, 2015 11:53
@yanboliang
Copy link
Contributor Author

@chenghao-intel @yhuai
Thank you for your advice and it's very useful.
We can use mutable rows for both top level records and inner structures at present.

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27351 has started for PR 4527 at commit c30a358.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 12, 2015

Test build #27351 has finished for PR 4527 at commit c30a358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27351/
Test PASSed.

@yanboliang yanboliang force-pushed the jsonStringToRowOptimization branch from c30a358 to 6cd26fe Compare February 15, 2015 09:14
@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27513 has started for PR 4527 at commit 6cd26fe.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27513 has finished for PR 4527 at commit 6cd26fe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27513/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27514 has started for PR 4527 at commit 7039fa7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27514 has finished for PR 4527 at commit 7039fa7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27514/
Test FAILed.

@yanboliang yanboliang force-pushed the jsonStringToRowOptimization branch from 7039fa7 to 2d45c68 Compare February 15, 2015 15:16
@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27522 has started for PR 4527 at commit 2d45c68.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27524 has started for PR 4527 at commit 2286ac5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27524 has finished for PR 4527 at commit 2286ac5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27524/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Feb 15, 2015

Test build #27522 has finished for PR 4527 at commit 2d45c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27522/
Test PASSed.

@yanboliang
Copy link
Contributor Author

This improvement is very similar to #758, so I have run the similar performance test.
The benchmark suggests this optimization made the optimized version about 1.5x faster when scanning JSON table, but it depends on the JSON schema especially for whether different records have different schema.
For a JSON file with 188010 lines, the build scan consumed time is:
original: Takes 15598 ms
optimized: Takes 10152 ms

@yanboliang
Copy link
Contributor Author

@liancheng @rxin @marmbrus , can you review it ?

@rxin
Copy link
Contributor

rxin commented Mar 31, 2015

Thanks - sorry for not having looked at this earlier. Do you see any performance gains with this change? My understanding is that JSON is already very slow, and thus the code path is hard to optimize.

@yanboliang yanboliang closed this Apr 24, 2015
@yanboliang yanboliang deleted the jsonStringToRowOptimization branch April 24, 2015 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants