[SPARK-26837][SQL] Pruning nested fields from object serializers #663

Merged 1 commit on Mar 9, 2020


@rshkv rshkv commented Mar 1, 2020

Proposed change

Picking up SPARK-26837 via apache#23740.

The BLUF is that this speeds up serialization by skipping unused nested fields. The example provided upstream was:

val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))
val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")

which would only select a, b, and c.

If I understand correctly, we already prune serialization for non-nested fields (the outer 1, 2, and 3) because we took SPARK-26619. After this we'll also prune serializers for the inner 1, 2, and 3.

How was this patch tested?

Upstream added tests.

## What changes were proposed in this pull request?

In SPARK-26619, we made a change to prune unnecessary individual serializers when serializing objects. This PR is an extension of SPARK-26619: we can further prune nested fields from object serializers if they are not used.

For example, in the following query, we only use one field in a struct column:

```scala
val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))
val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")
```

So, instead of having a serializer that creates a two-field struct, we can prune the unnecessary field from it. This is what this PR proposes to do.
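As a rough mental model of the pruning described above, the sketch below treats a serializer as a tree of named fields and keeps only the branches reached by a set of used field paths. The types `FieldSerializer`, `Leaf`, and `Struct` and the `prune` function are hypothetical names invented for illustration. They are not Catalyst classes, and this is not how the optimizer rule is actually implemented.

```scala
// Toy model only: these types are hypothetical and are NOT Catalyst classes.
sealed trait FieldSerializer { def name: String }
final case class Leaf(name: String) extends FieldSerializer
final case class Struct(name: String, fields: List[FieldSerializer]) extends FieldSerializer

object SerializerPruning {
  /** Keep only the parts of `ser` that some path in `used` refers to.
    * A path is the list of field names from the top-level column downwards.
    */
  def prune(ser: FieldSerializer, used: Set[List[String]]): Option[FieldSerializer] = {
    // Paths that go through this field, with this field's name stripped off.
    val rest = used.collect { case head :: tail if head == ser.name => tail }
    if (rest.isEmpty) None                   // nothing references this field
    else if (rest.contains(Nil)) Some(ser)   // the whole field is referenced
    else ser match {
      case Struct(n, fields) =>
        // Only nested fields are referenced: recurse and keep the survivors.
        val kept = fields.flatMap(f => prune(f, rest))
        if (kept.isEmpty) None else Some(Struct(n, kept))
      case _: Leaf => None                   // a path cannot descend into a leaf
    }
  }
}

object PruningDemo {
  def main(args: Array[String]): Unit = {
    // Shape of the rows in the example query: ((String, Int), Int).
    val row = List(Struct("_1", List(Leaf("_1"), Leaf("_2"))), Leaf("_2"))
    // select("_1._1") uses a single nested path.
    val used = Set(List("_1", "_1"))
    val pruned = row.flatMap(f => SerializerPruning.prune(f, used))
    // Only the struct `_1` with its inner field `_1` survives.
    println(pruned)
  }
}
```

In this model, dropping the top-level `_2` corresponds to the pruning from SPARK-26619, and dropping the inner `_2` corresponds to the nested pruning added here.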

In order to make this change conservative and safer, a SQL config is added to control it. It is disabled by default.
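A minimal sketch of how one might compare plans with the flag off and on. The config key used here is an assumption: it is the name used in upstream Spark's `SQLConf` and is not stated in this PR text, so check it against `SQLConf` in this fork before relying on it. The object name and the local master setting are also illustrative.

```scala
import org.apache.spark.sql.SparkSession

object NestedSerializerPruningDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("nested-serializer-pruning-demo")
      .getOrCreate()
    import spark.implicits._

    // ASSUMPTION: key name taken from upstream Spark, not from this PR description.
    val key = "spark.sql.optimizer.serializer.nestedSchemaPruning.enabled"

    val data = Seq((("a", 1), 1), (("b", 2), 2), (("c", 3), 3))

    for (enabled <- Seq("false", "true")) {
      spark.conf.set(key, enabled)
      // Rebuild the query so it is planned under the current setting.
      val df = data.toDS().map(t => (t._1, t._2 + 1)).select("_1._1")
      println(s"--- $key = $enabled ---")
      // Prints the parsed, analyzed, optimized, and physical plans.
      df.explain(true)
    }

    spark.stop()
  }
}
```

Comparing the `SerializeFromObject` node in the two optimized plans should show whether the struct serializer still builds both inner fields.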

TODO: Support pruning nested fields inside MapType's key and value.

## How was this patch tested?

Added tests.

Closes apache#23740 from viirya/nested-pruning-serializer-2.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@rshkv rshkv requested a review from robert3005 March 1, 2020 17:34
@robert3005

👍

@robert3005 robert3005 merged commit f0be092 into master Mar 9, 2020
@robert3005 robert3005 deleted the wr/spark-23740 branch March 9, 2020 11:43