Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23773][SQL] JacksonGenerator does not include keys that have null value for StructTypes #20884

Closed
wants to merge 2 commits into from

Conversation

makagonov
Copy link

What changes were proposed in this pull request?

As stated in Jira, when toJSON is called on a dataset, the result JSON string will not have keys displayed for StructTypes that have null value. This PR fixes the issue and writes field with "null" value.

How was this patch tested?

Added a unit test to JsonSuite.scala

@sameeragarwal
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Test build #4143 has finished for PR 20884 at commit 9faf853.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -56,7 +56,7 @@ class JacksonGeneratorSuite extends SparkFunSuite {
val gen = new JacksonGenerator(dataType, writer, option)
gen.write(input)
gen.flush()
assert(writer.toString === """[{}]""")
assert(writer.toString === """[{"a":null}]""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think previous result was a valid test case ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon actually, it looks like the result should be [null] rather than [{}].
Look at the following repro from spark-shell (downloaded binaries):

scala> val df = sqlContext.sql(""" select array(cast(null as struct<k:string>)) as my_array""")
df: org.apache.spark.sql.DataFrame = [my_array: array<struct<k:string>>]

scala> df.printSchema
root
 |-- my_array: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- k: string (nullable = true)
scala> df.toJSON.collect().foreach(println)
{"my_array":[null]}
scala> df.select(to_json($"my_array")).collect().foreach(x => println(x(0)))
[null]

In older version of JacksonGenerator, we had a filter by element value, and if it was null, gen.writeNull() was called no matter what the type was (old implementation). But currently, we're calling gen.writeStartObject()...gen.writeEndObject() no matter if the value is null.

I couldn't repro this with a query, but when StructsToJson is called from this unit test, it goes through JacksonGenerator.arrElementWriter which has lines

case st: StructType =>
(arr: SpecializedGetters, i: Int) => {
  writeObject(writeFields(arr.getStruct(i, st.length), st, rootFieldWriters))
}

that makes it print json object even there is null.

I'll look into this later and will try to find the easy workaround.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should compare this:

scala> sql(""" select array(cast(null as struct<k:string>)) as my_array""").toJSON.collect().foreach(println)
{"my_array":[null]}

scala> sql(""" select array(struct(cast(null as string))) as my_array""").toJSON.collect().foreach(println)
{"my_array":[{}]}

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Mar 23, 2018

Shall we add a configuration or an option to control its behaviour if this is something we need to support?

@@ -1229,7 +1229,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
val df2 = df1.toDF
val result = df2.toJSON.collect()
// scalastyle:off
assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"]}")
assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"],\"f5\":null}")
Copy link
Member

@HyukjinKwon HyukjinKwon Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go the current way, it'd write out every null with every field:

{"a":null,"b":null,"c":null}
{"a":null,"b":null,"c":1}
{"a":1,"b":null,"c":1}
{"a":1,"b":2,"c":3}

which I think's quite inefficient. Does that fix actually use case to be clear?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@HyukjinKwon
Copy link
Member

@sameeragarwal, do you see some values on this? FWIW, for me don't have preference. If you see some values on this, probably we could go with a configuration .. otherwise I would just like to suggest to close it if you feel in the same way with me.

@sameeragarwal
Copy link
Member

@HyukjinKwon I think it should be okay to close this at least for now. Just to add a little context behind this change, Facebook relies on the toJSON method for cross engine (hive/presto etc.) unit testing and the slightly diverging semantics between engines (such as these) often caused problems. However, in the end, as things stacked up, we ended up implementing a custom JacksonGenerator.

@makagonov makagonov closed this Jun 29, 2018
@HyukjinKwon
Copy link
Member

Thank you so much for sharing it.

@jackylee-ch
Copy link
Contributor

Can we reopen this issue or let me open a new one?
toJSON is used in sparkmagic, which is widely used in jupyter, to get sql return results. with toJSON sparkmagic may return empty results.
Maybe adding a config is the best choice.

@HyukjinKwon
Copy link
Member

Ah, I just saw this. Okay, thanks for fixing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants