[SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating #26227

Closed
wants to merge 2 commits into from

jackylee-ch
Contributor

@jackylee-ch jackylee-ch commented Oct 23, 2019

What changes were proposed in this pull request?

Add a description of ignoreNullFields, which was committed in #26098, to DataFrameWriter and readwriter.py.
Enable users to use ignoreNullFields in PySpark.
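
A minimal sketch of how the new Python parameter might be used, assuming a Spark build that includes this change. The helper name `write_json_keeping_nulls` and any output path passed to it are hypothetical; `ignoreNullFields` is the keyword this PR adds to `DataFrameWriter.json` in readwriter.py.

```python
def write_json_keeping_nulls(df, path):
    """Write `df` as JSON, emitting explicit nulls instead of dropping null fields.

    `df` is expected to be a pyspark.sql.DataFrame. The helper itself is
    hypothetical and only wraps the writer call for illustration.
    """
    # ignoreNullFields defaults to true on the JVM side; passing False asks the
    # writer to keep fields whose value is null.
    df.write.json(path, ignoreNullFields=False)
```

With a real SparkSession this would be called as `write_json_keeping_nulls(spark.sql("select null as a, 1 as b"), some_path)`, where `some_path` is whatever output location is appropriate.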

Does this PR introduce any user-facing change?

No

How was this patch tested?

Ran the unit tests.

@jackylee-ch
Contributor Author

cc @HyukjinKwon @cloud-fan

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Oct 23, 2019

Test build #112543 has finished for PR 26227 at commit 41384d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jackylee-ch jackylee-ch changed the title [SPARK-29444][FOLLOWUP] add doc for ignoreNullFields in json generating [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating Oct 24, 2019
.doc("If false, JacksonGenerator will generate null for null fields in Struct.")
.stringConf
.createWithDefault("true")
.doc("Whether to ignore null fields in column/struct during json generating. " +
Member

I would just write like this:

Whether to ignore null fields when generating JSON objects in JSON data source and 
JSON functions such as to_json.
If false, it generates null for null fields in JSON objects.

Contributor Author

Good, it's better.

@@ -687,6 +687,8 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* <li>`encoding` (by default it is not set): specifies encoding (charset) of saved json
* files. If it is not set, the UTF-8 charset will be used. </li>
* <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
* <li>`ignoreNullFields` (default `true`): whether to ignore null fields in column/struct
* during json generating. </li>
Member

here too

Member

@HyukjinKwon HyukjinKwon left a comment

Looks good otherwise.

@SparkQA

SparkQA commented Oct 24, 2019

Test build #112600 has finished for PR 26227 at commit 40bb515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Merged to master.

@jackylee-ch jackylee-ch deleted the json-generator-doc branch October 25, 2019 00:21
agirish pushed a commit to HPEEzmeral/apache-spark that referenced this pull request May 5, 2022
* [SPARK-29444] Add configuration to support JacksonGenerator to keep fields with null values

As mentioned in the JIRA, we sometimes need to retain null columns when writing JSON.
For example, sparkmagic (widely used in Jupyter with Livy) generates SQL query results with Dataset.toJSON and parses the JSON into a pandas DataFrame for display. If a column is null, it is easy for columns to go missing or even for the query result to be empty. Losing a null column in the first row may cause parsing exceptions or the loss of the entire column's data.

Example in spark-shell.
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}

scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
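
The behaviour in the spark-shell session above can be imitated in plain Python to make the two modes concrete. This is only an emulation under stated assumptions: `to_json_line` and its `ignore_null_fields` flag are hypothetical names, it does not call Spark, and it handles only top-level fields of a dict rather than nested structs.

```python
import json


def to_json_line(row, ignore_null_fields=True):
    """Serialize a dict-like row to compact JSON.

    When ignore_null_fields is True, top-level fields whose value is None are
    dropped before serialization; otherwise they are written as JSON null.
    """
    if ignore_null_fields:
        row = {key: value for key, value in row.items() if value is not None}
    # Compact separators match the spacing of the output shown in the session.
    return json.dumps(row, separators=(",", ":"))


row = {"a": None, "b": 1}
print(to_json_line(row))                            # {"b":1}
print(to_json_line(row, ignore_null_fields=False))  # {"a":null,"b":1}
```

The second form is the one a consumer such as a pandas-based parser would need in order to see every column in every row.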

Add new test to JacksonGeneratorSuite

Lead-authored-by: stczwd <[email protected]>
Co-authored-by: Jackey Lee <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

(cherry picked from commit with id 78b0cbe)

* [SPARK-29444][FOLLOWUP] add doc and python parameter for ignoreNullFields in json generating

### What changes were proposed in this pull request?
Add a description of ignoreNullFields, which was committed in apache#26098, to DataFrameWriter and readwriter.py.
Enable users to use ignoreNullFields in PySpark.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Ran the unit tests.

Closes apache#26227 from stczwd/json-generator-doc.

Authored-by: stczwd <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

Co-authored-by: stczwd <[email protected]>
udaynpusa pushed a commit to mapr/spark that referenced this pull request Jan 30, 2024