[SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow #4527

-    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
+    // Reuse the mutable row for each record, however we still need to 
+    // create a new row for every nested struct type in each record
+    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))


chenghao-intel · 2015-02-11

Move this inside mapPartitions to reduce the closure serialization overhead. Also, I don't see any benefit from using SpecificMutableRow here; why not just use GenericMutableRow instead?

yanboliang · 2015-02-12

You are right, it's not appropriate to use SpecificMutableRow here. I will change it back to GenericMutableRow.
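The suggestion above can be illustrated with a minimal sketch in plain Scala. It is not the code from this patch: SimpleMutableRow and convertPartition are hypothetical names, and a plain Iterator stands in for the partition iterator that mapPartitions would supply. The point is only that the row is allocated once per partition, inside the function, and then overwritten for every record.

```scala
object RowReuseSketch {
  // Hypothetical stand-in for a generic mutable row: one untyped slot per column.
  final class SimpleMutableRow(size: Int) {
    private val values = new Array[Any](size)
    def update(i: Int, v: Any): Unit = values(i) = v
    def apply(i: Int): Any = values(i)
    // Snapshot of the current contents, for callers that need to retain a record.
    def copy(): Vector[Any] = values.toVector
  }

  // Models the body of a mapPartitions call: the row is created here, once per
  // partition, rather than outside the function where a closure would capture it.
  def convertPartition(
      records: Iterator[Map[String, Any]],
      fieldNames: IndexedSeq[String]): Iterator[SimpleMutableRow] = {
    val row = new SimpleMutableRow(fieldNames.length)
    records.map { record =>
      var i = 0
      while (i < fieldNames.length) {
        // Missing fields become null, overwriting the previous record's value.
        row(i) = record.getOrElse(fieldNames(i), null)
        i += 1
      }
      row // the same instance is returned for every record
    }
  }
}
```

Because every element of the resulting iterator is the same object, a consumer that buffers rows (for example by collecting them into a list) has to call copy() first; otherwise it would only see the last record.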

yhuai · 2015-02-11T17:31:09Z

Thank you for working on it.

It seems that new SpecificMutableRow(schema.fields.map(_.dataType)) cannot handle nested structures. I think we need to use the schema to create the top-level mutable row and all inner rows (one for each inner StructType).

yhuai · 2015-02-11T17:34:57Z

Also, can you add performance numbers?

yhuai · 2015-02-11T17:47:41Z

Oh, enforceCorrectType will take care of inner structures by calling asRow.

It would be great if we could use mutable rows for inner structures as well.
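One way to picture mutable rows for inner structures is to build a tree of reusable buffers from the schema once, and then refill that tree for each record. The sketch below is a simplified model under stated assumptions: FieldType, Struct, and RowBuffer are hypothetical names, not Spark's DataType hierarchy or the classes touched by this patch, and records are modeled as nested Scala Maps.

```scala
object NestedRowReuseSketch {
  // Hypothetical miniature schema model, not Spark's DataType hierarchy.
  sealed trait FieldType
  case object Primitive extends FieldType
  final case class Struct(fields: Vector[(String, FieldType)]) extends FieldType

  // A reusable buffer for one struct level. Each nested struct field gets its
  // own child buffer, allocated once when the tree is built from the schema.
  final class RowBuffer(val schema: Struct) {
    val values = new Array[Any](schema.fields.length)
    private val children: Array[RowBuffer] = schema.fields.map {
      case (_, s: Struct) => new RowBuffer(s)
      case _ => null
    }.toArray

    // Overwrites every slot from the record; no row objects are allocated here.
    def fill(record: Map[String, Any]): RowBuffer = {
      var i = 0
      while (i < values.length) {
        val name = schema.fields(i)._1
        values(i) = (record.get(name), children(i)) match {
          // Struct field with a nested record: refill the existing child buffer.
          case (Some(m: Map[_, _]), child) if child != null =>
            child.fill(m.asInstanceOf[Map[String, Any]])
          // Primitive field: store the value as-is.
          case (Some(v), null) => v
          // Missing field, or a struct field holding a non-struct value.
          case _ => null
        }
        i += 1
      }
      this
    }
  }
}
```

The same caveat as for the top-level row applies at every level: the inner buffers are overwritten by the next record, so anything that must outlive one iteration needs to be copied out first.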

yanboliang · 2015-02-12T11:54:35Z

@chenghao-intel @yhuai
Thank you for your advice; it's very useful.
The patch now uses mutable rows for both top-level records and inner structures.

SparkQA · 2015-02-12T11:57:24Z

Test build #27351 has started for PR 4527 at commit c30a358.

This patch merges cleanly.

SparkQA · 2015-02-12T13:03:17Z

Test build #27351 has finished for PR 4527 at commit c30a358.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-12T13:03:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27351/

SparkQA · 2015-02-15T09:17:28Z

Test build #27513 has started for PR 4527 at commit 6cd26fe.

This patch merges cleanly.

SparkQA · 2015-02-15T09:18:27Z

Test build #27513 has finished for PR 4527 at commit 6cd26fe.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T09:18:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27513/

SparkQA · 2015-02-15T09:22:24Z

Test build #27514 has started for PR 4527 at commit 7039fa7.

This patch merges cleanly.

SparkQA · 2015-02-15T09:34:24Z

Test build #27514 has finished for PR 4527 at commit 7039fa7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T09:34:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27514/

SparkQA · 2015-02-15T15:17:56Z

Test build #27522 has started for PR 4527 at commit 2d45c68.

This patch merges cleanly.

SparkQA · 2015-02-15T15:22:27Z

Test build #27524 has started for PR 4527 at commit 2286ac5.

This patch merges cleanly.

SparkQA · 2015-02-15T16:33:32Z

Test build #27524 has finished for PR 4527 at commit 2286ac5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T16:33:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27524/

SparkQA · 2015-02-15T16:36:59Z

Test build #27522 has finished for PR 4527 at commit 2d45c68.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-02-15T16:37:03Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27522/

yanboliang · 2015-02-15T16:44:45Z

This improvement is very similar to #758, so I ran a similar performance test.
The benchmark suggests the optimized version is about 1.5x faster when scanning a JSON table, although the gain depends on the JSON schema, especially on whether different records have different schemas.
For a JSON file with 188010 lines, the time consumed by the build scan is:
original: Takes 15598 ms
optimized: Takes 10152 ms
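
As a quick check of the "about 1.5x" figure, the two timings above work out to a speedup of roughly 1.54x, or about 35% less scan time. The snippet below only redoes that arithmetic from the reported numbers; SpeedupCheck is a hypothetical name and nothing here re-runs the benchmark.

```scala
object SpeedupCheck {
  // Figures copied from the benchmark comment above.
  val lines = 188010
  val originalMs = 15598.0
  val optimizedMs = 10152.0

  // Ratio of the two scan times, and the fraction of time saved.
  val speedup: Double = originalMs / optimizedMs
  val timeSaved: Double = 1.0 - optimizedMs / originalMs

  // Average cost per input line, in milliseconds.
  val originalPerLineMs: Double = originalMs / lines
  val optimizedPerLineMs: Double = optimizedMs / lines

  def main(args: Array[String]): Unit = {
    println(f"speedup: $speedup%.2fx, time saved: ${timeSaved * 100}%.1f%%")
    println(f"per line: $originalPerLineMs%.4f ms -> $optimizedPerLineMs%.4f ms")
  }
}
```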

yanboliang · 2015-02-15T16:45:45Z

@liancheng @rxin @marmbrus, can you review it?

rxin · 2015-03-31T07:19:16Z

Thanks - sorry for not having looked at this earlier. Do you see any performance gains with this change? My understanding is that JSON is already very slow, and thus the code path is hard to optimize.

yanboliang changed the title ~~[SQL] Reuse mutable row for each record at jsonStringToRow~~ [SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow Feb 11, 2015

chenghao-intel reviewed Feb 11, 2015

yanboliang force-pushed the jsonStringToRowOptimization branch from b0c2b14 to c30a358 Compare February 12, 2015 11:53

yanboliang force-pushed the jsonStringToRowOptimization branch from c30a358 to 6cd26fe Compare February 15, 2015 09:14

Yanbo Liang added 4 commits February 15, 2015 23:15

[SQL] Reuse mutable row for each record at jsonStringToRow

18e4ddc

Use mutable rows for inner structures

2f001ba

Use mutable row arrays for inner arrays

837785a

keep scala style

d97d7db

fix array reuse issue

2d45c68

yanboliang force-pushed the jsonStringToRowOptimization branch from 7039fa7 to 2d45c68 Compare February 15, 2015 15:16

keep scala style

2286ac5

yanboliang closed this Apr 24, 2015

yanboliang deleted the jsonStringToRowOptimization branch April 24, 2015 10:02

