Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23773][SQL] JacksonGenerator does not include keys that have null value for StructTypes #20884

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -171,8 +171,10 @@ private[sql] class JacksonGenerator(
var i = 0
while (i < row.numFields) {
val field = schema(i)
if (!row.isNullAt(i)) {
gen.writeFieldName(field.name)
gen.writeFieldName(field.name)
if (row.isNullAt(i)) {
gen.writeNull()
} else {
fieldWriters(i).apply(row, i)
}
i += 1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -556,7 +556,7 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
test("to_json - array with single empty row") {
val inputSchema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val input = new GenericArrayData(InternalRow(null) :: Nil)
val output = """[{}]"""
val output = """[{"a":null}]"""
checkEvaluation(
StructsToJson(Map.empty, Literal.create(input, inputSchema), gmtId),
output)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ class JacksonGeneratorSuite extends SparkFunSuite {
val gen = new JacksonGenerator(dataType, writer, option)
gen.write(input)
gen.flush()
assert(writer.toString === """[{}]""")
assert(writer.toString === """[{"a":null}]""")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think previous result was a valid test case ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HyukjinKwon actually, it looks like the result should be [null] rather than [{}].
Look at the following repro from spark-shell (downloaded binaries):

scala> val df = sqlContext.sql(""" select array(cast(null as struct<k:string>)) as my_array""")
df: org.apache.spark.sql.DataFrame = [my_array: array<struct<k:string>>]

scala> df.printSchema
root
 |-- my_array: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- k: string (nullable = true)
scala> df.toJSON.collect().foreach(println)
{"my_array":[null]}
scala> df.select(to_json($"my_array")).collect().foreach(x => println(x(0)))
[null]

In older version of JacksonGenerator, we had a filter by element value, and if it was null, gen.writeNull() was called no matter what the type was (old implementation). But currently, we're calling gen.writeStartObject()...gen.writeEndObject() no matter if the value is null.

I couldn't repro this with a query, but when StructsToJson is called from this unit test, it goes through JacksonGenerator.arrElementWriter which has lines

case st: StructType =>
(arr: SpecializedGetters, i: Int) => {
  writeObject(writeFields(arr.getStruct(i, st.length), st, rootFieldWriters))
}

that makes it print json object even there is null.

I'll look into this later and will try to find the easy workaround.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should compare this:

scala> sql(""" select array(cast(null as struct<k:string>)) as my_array""").toJSON.collect().foreach(println)
{"my_array":[null]}

scala> sql(""" select array(struct(cast(null as string))) as my_array""").toJSON.collect().foreach(println)
{"my_array":[{}]}

}

test("initial with StructType and write out an empty array") {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1229,7 +1229,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
val df2 = df1.toDF
val result = df2.toJSON.collect()
// scalastyle:off
assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"]}")
assert(result(0) === "{\"f1\":1,\"f2\":\"A1\",\"f3\":true,\"f4\":[\"1\",\" A1\",\" true\",\" null\"],\"f5\":null}")
Copy link
Member

@HyukjinKwon HyukjinKwon Mar 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go the current way, it'd write out every null with every field:

{"a":null,"b":null,"c":null}
{"a":null,"b":null,"c":1}
{"a":1,"b":null,"c":1}
{"a":1,"b":2,"c":3}

which I think's quite inefficient. Does that fix actually use case to be clear?

assert(result(3) === "{\"f1\":4,\"f2\":\"D4\",\"f3\":true,\"f4\":[\"4\",\" D4\",\" true\",\" 2147483644\"],\"f5\":2147483644}")
// scalastyle:on

Expand Down Expand Up @@ -1265,6 +1265,7 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
1.7976931348623157E308,
10,
21474836470L,
null,
"this is a simple string.")
)

Expand Down Expand Up @@ -2063,4 +2064,11 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
)
}
}

test("SPARK-23773: JacksonGenerator does not include keys " +
"that have null value for StructTypes") {
val df = sql("select NAMED_STRUCT('f1', 10, 'f2', null, 'f3', 15) as my_struct")
val result = df.toJSON.collect()(0)
assert(result === "{\"my_struct\":{\"f1\":10,\"f2\":null,\"f3\":15}}")
}
}