[SPARK-23773][SQL] JacksonGenerator does not include keys that have null value for StructTypes #20884

makagonov · 2018-03-22T21:47:26Z

What changes were proposed in this pull request?

As stated in the Jira ticket, when toJSON is called on a Dataset, the resulting JSON string omits the keys of StructType fields whose value is null. This PR fixes the issue by writing such fields with a null value.

How was this patch tested?

Added a unit test to JsonSuite.scala

sameeragarwal · 2018-03-22T21:51:47Z

ok to test

SparkQA · 2018-03-23T00:59:25Z

Test build #4143 has finished for PR 20884 at commit 9faf853.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-23T12:11:36Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala

@@ -56,7 +56,7 @@ class JacksonGeneratorSuite extends SparkFunSuite {
    val gen = new JacksonGenerator(dataType, writer, option)
    gen.write(input)
    gen.flush()
-    assert(writer.toString === """[{}]""")
+    assert(writer.toString === """[{"a":null}]""")


I think the previous result was a valid test case.

@HyukjinKwon actually, it looks like the result should be [null] rather than [{}].
Look at the following repro from spark-shell (downloaded binaries):

scala> val df = sqlContext.sql(""" select array(cast(null as struct<k:string>)) as my_array""")
df: org.apache.spark.sql.DataFrame = [my_array: array<struct<k:string>>]

scala> df.printSchema
root
 |-- my_array: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- k: string (nullable = true)

scala> df.toJSON.collect().foreach(println)
{"my_array":[null]}

scala> df.select(to_json($"my_array")).collect().foreach(x => println(x(0)))
[null]

In the older version of JacksonGenerator, we filtered on the element value: if it was null, gen.writeNull() was called regardless of the type (old implementation). Currently, we call gen.writeStartObject()...gen.writeEndObject() whether or not the value is null.

I couldn't repro this with a query, but when StructsToJson is called from this unit test, it goes through JacksonGenerator.arrElementWriter, which has these lines:

case st: StructType =>
  (arr: SpecializedGetters, i: Int) => {
    writeObject(writeFields(arr.getStruct(i, st.length), st, rootFieldWriters))
  }

which makes it print a JSON object even when the value is null.

I'll look into this later and try to find an easy workaround.

I think you should compare this:

scala> sql(""" select array(cast(null as struct<k:string>)) as my_array""").toJSON.collect().foreach(println)
{"my_array":[null]}

scala> sql(""" select array(struct(cast(null as string))) as my_array""").toJSON.collect().foreach(println)
{"my_array":[{}]}
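
The difference between the two queries above (a null struct versus a struct whose only field is null) can be modelled without Spark. The following is a minimal sketch in plain Scala, not the JacksonGenerator code: the object name, the Struct alias, and all helper names are assumptions made for illustration. A struct is modelled as an ordered list of (name, optional value) pairs, and a null struct element as None.

```scala
// Toy model only; none of these names exist in Spark.
object NullStructSketch {
  // Ordered fields of a struct; None models a null field value.
  type Struct = Seq[(String, Option[String])]

  // Struct writer that skips null-valued fields (the behaviour discussed in this PR).
  def writeStruct(s: Struct): String =
    s.collect { case (k, Some(v)) => "\"" + k + "\":\"" + v + "\"" }
      .mkString("{", ",", "}")

  // Element writer with a null guard: a null struct is written as JSON null.
  def writeElementGuarded(e: Option[Struct]): String =
    e.map(writeStruct).getOrElse("null")

  // Element writer without a guard: always opens and closes an object.
  def writeElementUnguarded(e: Option[Struct]): String =
    writeStruct(e.getOrElse(Seq.empty))

  def writeArray(elems: Seq[Option[Struct]], w: Option[Struct] => String): String =
    elems.map(w).mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    val nullStruct: Seq[Option[Struct]] = Seq(None)
    val structWithNullField: Seq[Option[Struct]] = Seq(Some(Seq("k" -> None)))

    println(writeArray(nullStruct, writeElementGuarded))          // [null]
    println(writeArray(nullStruct, writeElementUnguarded))        // [{}]
    println(writeArray(structWithNullField, writeElementGuarded)) // [{}]
  }
}
```

In this model, the guarded writer reproduces the [null] output of the first query, while a struct with a null field yields [{}] either way because the field writer drops null values.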

HyukjinKwon · 2018-03-23T12:11:55Z

Shall we add a configuration or an option to control its behaviour if this is something we need to support?

HyukjinKwon · 2018-03-24T02:41:42Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala

@@ -1229,7 +1229,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
    val df2 = df1.toDF
    val result = df2.toJSON.collect()
    // scalastyle:off
-    assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"]}")
+    assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"],\"f5\":null}")


If we go with the current approach, it would write out every null field in every record:

{"a":null,"b":null,"c":null}
{"a":null,"b":null,"c":1}
{"a":1,"b":null,"c":1}
{"a":1,"b":2,"c":3}

which I think is quite inefficient. To be clear, does this fix address an actual use case?
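
To make the trade-off concrete, here is a minimal sketch in plain Scala of a field writer whose null handling is controlled by a flag, along the lines of the configuration or option suggested earlier in this thread. It is not Spark code: the object name, the Row alias, and the ignoreNullFields parameter are all assumptions made for this illustration.

```scala
// Toy model only; the flag name is hypothetical and not taken from this thread.
object NullFieldSketch {
  // A row is an ordered list of (field name, optional value); None models SQL NULL.
  type Row = Seq[(String, Option[Int])]

  // When ignoreNullFields is true, null-valued fields are dropped;
  // when false, they are written out as "name":null.
  def toJson(row: Row, ignoreNullFields: Boolean): String =
    row.collect {
      case (k, Some(v))                   => "\"" + k + "\":" + v
      case (k, None) if !ignoreNullFields => "\"" + k + "\":null"
    }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      Seq("a" -> None, "b" -> None, "c" -> None),
      Seq("a" -> None, "b" -> None, "c" -> Some(1)),
      Seq("a" -> Some(1), "b" -> None, "c" -> Some(1)),
      Seq("a" -> Some(1), "b" -> Some(2), "c" -> Some(3)))

    // Keeping nulls reproduces the four records listed above.
    rows.foreach(r => println(toJson(r, ignoreNullFields = false)))
    // Dropping nulls gives shorter records, but the first one collapses to {}.
    rows.foreach(r => println(toJson(r, ignoreNullFields = true)))
  }
}
```

A flag like this would let callers that need stable keys across engines opt in to explicit nulls, while the default output stays compact.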

AmplabJenkins · 2018-06-09T00:12:21Z

Can one of the admins verify this patch?

HyukjinKwon · 2018-06-26T08:10:15Z

@sameeragarwal, do you see some value in this? FWIW, I don't have a preference. If you see value in it, we could probably go with a configuration; otherwise I would suggest closing it, if you feel the same way.

sameeragarwal · 2018-06-29T16:14:03Z

@HyukjinKwon I think it should be okay to close this, at least for now. Just to add a little context behind this change: Facebook relies on the toJSON method for cross-engine (Hive, Presto, etc.) unit testing, and slightly diverging semantics between engines (such as these) often caused problems. However, in the end, as things stacked up, we ended up implementing a custom JacksonGenerator.

HyukjinKwon · 2018-06-29T17:51:22Z

Thank you so much for sharing it.

jackylee-ch · 2019-10-12T00:47:45Z

Can we reopen this issue, or shall I open a new one?
toJSON is used in sparkmagic, which is widely used in Jupyter, to fetch SQL query results. With toJSON, sparkmagic may return empty results.
Maybe adding a config is the best choice.

HyukjinKwon · 2019-10-23T10:57:18Z

Ah, I just saw this. Okay, thanks for fixing it.

[SPARK-23773][SQL] JacksonGenerator does not include keys that have n…

9faf853

…ull value for StructTypes

fixing broken unit tests

559c201

HyukjinKwon reviewed Mar 23, 2018

HyukjinKwon reviewed Mar 24, 2018

makagonov closed this Jun 29, 2018

jackylee-ch mentioned this pull request Oct 12, 2019

[SPARK-29444] Add configuration to support JacksonGenrator to keep fields with null values #26098

Closed
