[SPARK-29444] Add configuration to support JacksonGenrator to keep fields with null values #26098
Conversation
Previous discussion about this: SPARK-23773
cc @cloud-fan
Hi, @stczwd. Thank you for making a PR.
@@ -39,6 +39,18 @@ class JacksonGeneratorSuite extends SparkFunSuite {
assert(writer.toString === """{"a":1}""")
}

test("initial with StructType and write out an empty row with allowStructIncludeNull=true") {
This looks like a test case for a bug. We need a JIRA issue ID prefix in the test case name.
ok
Yes, this PR can resolve this problem.
cc @dongjoon-hyun @xuanyuanking @cloud-fan @rdblue
Sounds fine to me.
@@ -76,6 +76,9 @@ private[sql] class JSONOptions(
// Whether to ignore column of all null values or empty array/struct during schema inference
val dropFieldIfAllNull = parameters.get("dropFieldIfAllNull").map(_.toBoolean).getOrElse(false)

// Whether to ignore column of all null during json generating
val structIngoreNull = parameters.getOrElse("structIngoreNull", "true").toBoolean
Is it specific to struct-type columns? If not, how about naming it ignoreNullFields?
It works on StructType, including struct fields and nested struct data.
how about top-level columns?
Yep, it also works on those.
then shall we pick a better name for this config?
Okay, ignoreNullFields is much better than structIgnoreNull; I'll change it.
The change looks reasonable. Do you know why the JSON data source ignores null fields in the first place?
val gen = new JacksonGenerator(dataType, writer, allowNullOption)
gen.write(input)
gen.flush()
assert(writer.toString === """{"a":null}""")
Can we also test a null inner field? e.g. {"a": {"b": null}}
Sure, I have added a test for this
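To illustrate the behavior under discussion independently of Spark, here is a plain-Python sketch (the function and its structure are hypothetical illustrations, not Spark's JacksonGenerator implementation): with null-field dropping enabled, null fields disappear at every nesting level; with it disabled, both the top-level case ({"a":null}) and the nested case ({"a":{"b":null}}) are preserved.

```python
import json

def generate(row, ignore_null_fields=True):
    """Serialize a nested dict to JSON, optionally dropping null fields
    at every nesting level (a sketch of the semantics being tested)."""
    def prune(value):
        if isinstance(value, dict):
            return {k: prune(v) for k, v in value.items()
                    if not (ignore_null_fields and v is None)}
        return value
    return json.dumps(prune(row), separators=(",", ":"))

# Top-level null field: dropped by default, kept when disabled.
print(generate({"a": None, "b": 1}))         # {"b":1}
print(generate({"a": None, "b": 1}, False))  # {"a":null,"b":1}
# Nested null field, as in the {"a": {"b": null}} test case.
print(generate({"a": {"b": None}}, False))   # {"a":{"b":null}}
```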
@cloud-fan I don't know about this. Any ideas?
@cloud-fan @dongjoon-hyun
@stczwd I have a question. After this change, when we write to a JSON file, we will be able to preserve null values, right?

@dilipbiswal Yes, but not by default. If you want to preserve null values, you must write like this to disable ignoreNullFields.

@stczwd Thanks, sorry, I missed the newly added option.
@@ -1153,6 +1153,12 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val JSON_GENERATOR_IGNORE_NULL_FIELDS =
buildConf("spark.sql.jsonGenerator.nullFields.ignore")
Config names have namespaces, and we'd better avoid creating unnecessary ones. How about spark.sql.jsonGenerator.ignoreNullFields?
Okay, I will change it.
@@ -76,6 +76,9 @@ private[sql] class JSONOptions(
// Whether to ignore column of all null values or empty array/struct during schema inference
val dropFieldIfAllNull = parameters.get("dropFieldIfAllNull").map(_.toBoolean).getOrElse(false)

// Whether to ignore null fields during json generating
val ignoreNullFields = parameters.getOrElse("ignoreNullFields", "true").toBoolean
SQLConf should also take effect here. How about:

parameters.get("ignoreNullFields").map(_.toBoolean).getOrElse {
  SQLConf.get.jsonGeneratorIgnoreNullFields
}
Perfect. Then the code in DataSet.scala is not necessary now.
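The precedence being agreed on above, where the per-write option wins and the session configuration is only the fallback, can be sketched in plain Python (the function name and the dict standing in for SQLConf are illustrative, not Spark's API):

```python
def resolve_ignore_null_fields(options, session_conf):
    """Per-source option takes precedence; otherwise fall back to the
    session-level configuration (a plain dict standing in for SQLConf)."""
    if "ignoreNullFields" in options:
        return options["ignoreNullFields"].lower() == "true"
    return session_conf.get("spark.sql.jsonGenerator.ignoreNullFields", True)

conf = {"spark.sql.jsonGenerator.ignoreNullFields": False}
print(resolve_ignore_null_fields({}, conf))                            # False
print(resolve_ignore_null_fields({"ignoreNullFields": "true"}, conf))  # True
```

With this fallback in place, no extra wiring is needed at the call site, which is why the extra code in Dataset.scala becomes unnecessary.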
@cloud-fan Any more questions?
ok to test
The code changes LGTM.
Test build #112218 has finished for PR 26098 at commit
@@ -76,6 +77,10 @@ private[sql] class JSONOptions(
// Whether to ignore column of all null values or empty array/struct during schema inference
val dropFieldIfAllNull = parameters.get("dropFieldIfAllNull").map(_.toBoolean).getOrElse(false)

// Whether to ignore null fields during json generating
val ignoreNullFields = parameters.getOrElse("ignoreNullFields",
Hey, you should document this in DataFrameWriter, DataStreamWriter, and readwriter.py.
Okay, I'll follow up with a PR and add this to the documentation.
@@ -1153,6 +1153,12 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val JSON_GENERATOR_IGNORE_NULL_FIELDS =
buildConf("spark.sql.jsonGenerator.ignoreNullFields")
.doc("If false, JacksonGenerator will generate null for null fields in Struct.")
If I were a user, I would have no idea what JacksonGenerator is...
Maybe I can describe this in a better way.
val JSON_GENERATOR_IGNORE_NULL_FIELDS =
  buildConf("spark.sql.jsonGenerator.ignoreNullFields")
  .doc("If false, JacksonGenerator will generate null for null fields in Struct.")
  .stringConf
Why is it a string conf? Shouldn't it be a boolean conf?
It is only used in JSONOptions, where we need a string value and call toBoolean to get the boolean conf.
val gen = new JacksonGenerator(dataType, writer, allowNullOption)
gen.write(input)
gen.flush()
assert(writer.toString === """{"a":{"b":null}}""")
Nit, but I would call close with a try-catch as a best practice, in a follow-up.
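The nit above can be sketched as follows (plain Python, with a hypothetical stand-in class rather than Spark's JacksonGenerator): closing the generator in a finally block guarantees the underlying writer's resources are released even if write raises.

```python
import io

class Generator:
    """Hypothetical stand-in for JacksonGenerator: writes JSON text to a writer."""
    def __init__(self, writer):
        self.writer = writer
        self.closed = False
    def write(self, text):
        self.writer.write(text)
    def close(self):
        self.closed = True

writer = io.StringIO()
gen = Generator(writer)
try:
    gen.write('{"a":null}')
finally:
    gen.close()  # runs even if write() raised

print(gen.closed)         # True
print(writer.getvalue())  # {"a":null}
```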
cc @sameeragarwal (I vaguely remember that, a long time ago, your colleague opened a PR to support this case.)
@stczwd can you make a follow-up to address the comments above?
Yeah, I'll make it.
…elds in json generating

# What changes were proposed in this pull request?
Add description for ignoreNullFields, which was committed in #26098, in DataFrameWriter and readwriter.py. Enable users to use ignoreNullFields in pyspark.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run unit tests

Closes #26227 from stczwd/json-generator-doc.
Authored-by: stczwd <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Hi,
To make sure, I triggered the snapshot publishing one minute ago.
Thank you @dongjoon-hyun. For instance, I am looking for this version: https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-catalyst_2.11/ I tried 2.11 v3.0.0, and that is an unreleased version. I also do not understand how. Best Regards,
Can you please also release 3.0.0 for 2.11, as I am stuck on the 2.11 version?
AFAIK Scala 2.11 has been dropped in Spark 3.0, cc @srowen
This isn't quite the place to discuss, but yes, there is no Scala 2.11 support in Spark 3.0.
Ok, so keeping context: is there any chance this "issue" can be fixed in Spark 2? I don't know who is in the lead on this. However, I know that there are currently not many modules supporting Scala 2.12 (spark-streaming-kafka_2.11, spark-csv_2.11, lots of connectors, ...), and that makes the transition between Spark 2 and 3 difficult. This situation throws me into the category of "Spark self-workaround developers" to keep Sparking in our project. Is there any plan, recommendation, or thread on this topic you can redirect me to? Thank you all for your time.
All of Spark has supported 2.12 since 2.4.0, not sure what you mean.
For instance, https://github.com/databricks/spark-csv doesn't seem to support the 2.12 version yet. Maybe I just don't understand the concept, but I suppose that we cannot combine 2.11 and 2.12 modules. Another example is https://github.com/memsql/memsql-spark-connector
Per the repo, that was long ago pushed into Spark 2.x. CSV parsing is part of Spark, and Spark supports 2.12. Third-party packages -- who knows. That memsql one maybe doesn't, but it also hasn't been updated since 2.0.x, so I don't even know if it works on 2.4 + Scala 2.11, all the less on Spark 3.0, likely. As we'll support Scala 2.13 in Spark 3.x at some nearish point, we really can't keep supporting 2.11, which is EOL anyway. (Any further discussion -> use [email protected])
* [SPARK-29444] Add configuration to support JacksonGenrator to keep fields with null values

As mentioned in jira, sometimes we need to be able to support the retention of null columns when writing JSON.
For example, sparkmagic (used widely in jupyter with livy) will generate sql query results based on DataSet.toJSON and parse the JSON to a pandas DataFrame for display. If there is a null column, it is easy to end up with some columns missing or even an empty query result. The loss of the null column in the first row may cause parsing exceptions or loss of entire column data.

Example in spark-shell:
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}
scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}

Add new test to JacksonGeneratorSuite

Lead-authored-by: stczwd <[email protected]>
Co-authored-by: Jackey Lee <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 78b0cbe)

* [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating

# What changes were proposed in this pull request?
Add description for ignoreNullFields, which was committed in apache#26098, in DataFrameWriter and readwriter.py. Enable users to use ignoreNullFields in pyspark.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
run unit tests

Closes apache#26227 from stczwd/json-generator-doc.
Authored-by: stczwd <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Co-authored-by: stczwd <[email protected]>
Why are the changes needed?
As mentioned in jira, sometimes we need to be able to support the retention of null columns when writing JSON.
For example, sparkmagic (used widely in jupyter with livy) will generate sql query results based on DataSet.toJSON and parse the JSON to a pandas DataFrame for display. If there is a null column, it is easy to end up with some columns missing or even an empty query result. The loss of the null column in the first row may cause parsing exceptions or loss of entire column data.
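The failure mode described above can be shown with the standard library alone (no Spark or pandas required; the column-inference function is a deliberately naive sketch, not sparkmagic's actual parser): when null fields are dropped, a consumer that infers the schema from the first JSON line never sees column a at all.

```python
import json

# JSON lines as produced with null fields dropped vs. kept.
dropped = ['{"b":1}', '{"a":2,"b":3}']
kept = ['{"a":null,"b":1}', '{"a":2,"b":3}']

def infer_columns(lines):
    """Naive consumer that takes the schema from the first record,
    as some JSON-to-table parsers effectively do."""
    return sorted(json.loads(lines[0]).keys())

print(infer_columns(dropped))  # ['b']      -- column "a" is silently lost
print(infer_columns(kept))     # ['a', 'b'] -- full schema preserved
```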
Does this PR introduce any user-facing change?
Example in spark-shell.
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}
scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
How was this patch tested?
Add new test to JacksonGeneratorSuite