-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-5738] [SQL] Reuse mutable row for each record at jsonStringToRow #4527
Conversation
Test build #27284 has started for PR 4527 at commit
|
Test build #27284 has finished for PR 4527 at commit
|
Test FAILed. |
retest this please |
Test build #27288 has started for PR 4527 at commit
|
Test build #27288 has finished for PR 4527 at commit
|
Test PASSed. |
parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema)) | ||
// Reuse the mutable row for each record, however we still need to | ||
// create a new row for every nested struct type in each record | ||
val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this inside of mapPartitions
, to reduce the closure serialization overhead. And I didn't see any benefit when using the SpecificMutableRow
, why not just use the GenericMutableRow
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, it's not appropriate to use SpecificMutableRow here. I will change back to GenericMutableRow.
Thank you for working on it. Seems |
Also, can you add performance numbers? |
Oh, It will be great if we can use mutable rows for inner structures as well. |
b0c2b14
to
c30a358
Compare
@chenghao-intel @yhuai |
Test build #27351 has started for PR 4527 at commit
|
Test build #27351 has finished for PR 4527 at commit
|
Test PASSed. |
c30a358
to
6cd26fe
Compare
Test build #27513 has started for PR 4527 at commit
|
Test build #27513 has finished for PR 4527 at commit
|
Test FAILed. |
Test build #27514 has started for PR 4527 at commit
|
Test build #27514 has finished for PR 4527 at commit
|
Test FAILed. |
7039fa7
to
2d45c68
Compare
Test build #27522 has started for PR 4527 at commit
|
Test build #27524 has started for PR 4527 at commit
|
Test build #27524 has finished for PR 4527 at commit
|
Test PASSed. |
Test build #27522 has finished for PR 4527 at commit
|
Test PASSed. |
This improvement is very similar to #758, so I have run the similar performance test. |
@liancheng @rxin @marmbrus , can you review it ? |
Thanks - sorry for not having looked at this earlier. Do you see any performance gains with this change? My understanding is that JSON is already very slow, and thus the code path is hard to optimize. |
When serialize json string to row, reuse a mutable row for both each record and inner nested structure instead of creating a new one each time.