
[SPARK-13764][SQL] Parse modes in JSON data source #11756

Closed
wants to merge 13 commits

Conversation

HyukjinKwon
Member

What changes were proposed in this pull request?

Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records.

This PR adds support for parse modes, just like the CSV data source. There are three modes, listed below:

  • PERMISSIVE: When it fails to parse, this sets null to the field. This is the default mode.
  • DROPMALFORMED: When it fails to parse, this drops the whole record.
  • FAILFAST: When it fails to parse, it just throws an exception.

This PR also makes the JSON data source share the ParseModes utility with the CSV data source.
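A minimal usage sketch (Spark 1.6-era reader API; the file path is a placeholder) of how one of these modes is selected:

```scala
// Choose a parse mode when reading JSON; "records.json" is a hypothetical path.
val df = sqlContext.read
  .option("mode", "DROPMALFORMED") // or "PERMISSIVE" (the default) / "FAILFAST"
  .json("records.json")
```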

How was this patch tested?

Unit tests were added, and ./dev/run_tests was run for code style checks.

@HyukjinKwon HyukjinKwon changed the title Parse modes in JSON data source [SPARK-13764][SQL] Parse modes in JSON data source Mar 16, 2016
@rxin
Contributor

rxin commented Mar 16, 2016

cc @cloud-fan for review

def isPermissiveMode(mode: String): Boolean = if (isValidMode(mode)) {
  mode.toUpperCase == PERMISSIVE_MODE
} else {
  true // We default to permissive if the mode string is not valid
}
Contributor

should we log a warning for this case?
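A hypothetical sketch of the suggested warning, assuming the enclosing object mixes in Spark's Logging trait (the message wording is invented):

```scala
def isPermissiveMode(mode: String): Boolean = if (isValidMode(mode)) {
  mode.toUpperCase == PERMISSIVE_MODE
} else {
  // Hypothetical: warn before silently falling back to the permissive default.
  logWarning(s"$mode is not a valid parse mode. Falling back to PERMISSIVE.")
  true
}
```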

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53286 has finished for PR 11756 at commit 4c46f4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -288,6 +288,9 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing. When it fails to parse, `PERMISSIVE` mode sets `null`, `DROPMALFORMED` drops the
* record and `FAILFAST` throws an exception.</li>
Contributor

I think we need to say more about these 3 modes. From the tests, it looks to me that:

  • PERMISSIVE mode will set the other fields to null when it meets a corrupted record, and put the malformed string into a new field configured by spark.sql.columnNameOfCorruptRecord.
  • DROPMALFORMED mode will ignore corrupted records and append a new field, which is always null, to the output.
  • FAILFAST mode will throw an exception.

It would be better if you could expand this doc and add some examples.
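A hedged sketch of the PERMISSIVE behaviour described above; the file name and contents are invented for illustration:

```scala
// records.json (hypothetical):
//   {"a": 1}
//   {"a":        <- malformed
sqlContext.setConf("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")
val df = sqlContext.read.option("mode", "PERMISSIVE").json("records.json")
// Expected: the valid record has _corrupt_record = null; the malformed record
// has a = null and its raw text stored in the _corrupt_record column.
```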

@HyukjinKwon
Member Author

Could I maybe edit this without adding examples? It is becoming a bit messy...

@cloud-fan
Contributor

I'm not familiar with the CSV part. What if users set the schema directly before reading data and the mode is PERMISSIVE? Will we add the extra field?

@HyukjinKwon
Member Author

For example, the data below:

1,2,3,4
3,2,1

will produce the records below:

  • PERMISSIVE
Row(1,2,3,4)
Row(3,2,1,null)
  • PERMISSIVE with user schema
Schema("field1", "field2", "field3")
Row(1,2,3)
Row(3,2,1)
  • PERMISSIVE with user schema
Schema("field1", "field2", "field3", "field4", "field5")
Row(1,2,3,4,null)
Row(3,2,1,null,null)
  • DROPMALFORMED
Row(1,2,3,4)
  • DROPMALFORMED with user schema
Schema("field1", "field2", "field3")
Drops all records.
  • DROPMALFORMED with user schema
Schema("field1", "field2", "field3", "field4", "field5")
Drops all records.
  • FAILFAST
Throws an exception
  • FAILFAST with user schema
Schema("field1", "field2", "field3")
Throws an exception.
  • FAILFAST with user schema
Schema("field1", "field2", "field3", "field4", "field5")
Throws an exception.

@cloud-fan I just added some more cases here.
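A sketch reproducing the "PERMISSIVE with user schema" case above. It assumes the external spark-csv package (CSV support was not built into Spark before 2.0) and a hypothetical file data.csv holding the two lines shown:

```scala
import org.apache.spark.sql.types._

// A five-column user schema; columns beyond a row's actual fields become null.
val schema = StructType(
  Seq("field1", "field2", "field3", "field4", "field5")
    .map(StructField(_, StringType)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .load("data.csv")
// Expected (values as strings under this schema):
//   Row(1,2,3,4,null) and Row(3,2,1,null,null)
```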

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53303 has finished for PR 11756 at commit 3675fae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

The commit I submitted includes comment changes and avoids adding a _corrupt_record field during type inference when the mode is DROPMALFORMED.
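A hedged sketch of that behaviour; the file name is assumed, and the file is taken to contain at least one malformed line:

```scala
val df = sqlContext.read.option("mode", "DROPMALFORMED").json("records.json")
df.printSchema() // no _corrupt_record field should be inferred in this mode
```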

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53313 has finished for PR 11756 at commit 32ae8b2.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2016

Test build #53312 has finished for PR 11756 at commit 4440a55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

ah, thanks for the detailed explanation and examples!

record and puts the malformed string into a new field configured by \
``spark.sql.columnNameOfCorruptRecord``. When a schema is set by user, it sets \
``null`` for extra fields.
* ``DROPMALFORMED`` : ignores the whole corrupted records and append.
Contributor

should we remove "and append"?

@cloud-fan
Contributor

Overall LGTM, thanks for working on it!

@HyukjinKwon
Member Author

@cloud-fan Actually, I have a question.
In the JSON data source, I thought the JSON format itself can have a flexible schema, so records do not necessarily share the same schema, unlike CSV data.

So, I thought the range of "malformed" rows does not include rows that have a different schema for the JSON data source (whereas for CSV, the range of "malformed" rows does include rows that have a different schema).

These differences lead to different actions for each parse mode compared to the CSV data source:

  • CSV
    • FAILFAST: It throws an exception if any row does not have the same schema or if any row could not be converted into the user-given schema.
    • DROPMALFORMED: It drops every row that does not have the same schema or could not be converted into the user-given schema.
  • JSON
    • FAILFAST: It throws an exception if any row has a corrupted format or if any row could not be converted into the user-given schema.
    • DROPMALFORMED: It drops every row that has a corrupted format or could not be converted into the user-given schema.

Do you think it is acceptable?

@cloud-fan
Contributor

This makes sense to me. Actually for CSV, when a row does not have the same schema, it just means a corrupted format, as CSV has a very simple format and can always be parsed.
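A hedged illustration of the distinction for JSON (file contents invented): rows with differing schemas are not malformed, since inference takes the union of the fields and absent values become null:

```scala
// mixed.json (hypothetical):
//   {"a": 1}
//   {"a": 2, "b": "x"}
val df = sqlContext.read.option("mode", "DROPMALFORMED").json("mixed.json")
// Both rows are kept: Row(1, null) and Row(2, "x")
```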

@cloud-fan
Contributor

LGTM, cc @davies for another look.

@rxin
Contributor

rxin commented Mar 17, 2016

But since we had it, I'd say we should keep it to avoid breaking compatibility. We can have the per-read option override the global option.

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53379 has finished for PR 11756 at commit de8d291.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

Filed in SPARK-13953.
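A sketch of the override rxin describes. The per-read option landed in the follow-up (SPARK-13953), so the option name here is an assumption relative to this thread:

```scala
// Global default for the corrupt-record column name...
sqlContext.setConf("spark.sql.columnNameOfCorruptRecord", "_corrupt_record")
// ...which the per-read option overrides for this one read only (assumed name).
val df = sqlContext.read
  .option("columnNameOfCorruptRecord", "_bad_record")
  .json("records.json")
```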

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53382 has finished for PR 11756 at commit 29a8f68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 17, 2016

Test build #53384 has finished for PR 11756 at commit 551593a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53504 has finished for PR 11756 at commit bfc0405.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53506 has finished for PR 11756 at commit 59e7214.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2016

Test build #53507 has finished for PR 11756 at commit 3ff900e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest it please.

@HyukjinKwon
Member Author

@cloud-fan Isn't this "it" a typo maybe :)?

@HyukjinKwon
Member Author

retest this please

Row("str_a_4") :: Nil)
assert(jsonDFTwo.schema === schemaTwo)
}

test("Corrupt records") {
Contributor

we can change this to: "Corrupt records: PERMISSIVE mode"

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53652 has finished for PR 11756 at commit 3ff900e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Last 2 comments, otherwise LGTM

@cloud-fan
Contributor

LGTM, pending tests

@SparkQA

SparkQA commented Mar 21, 2016

Test build #53659 has finished for PR 11756 at commit dec3d81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

Thanks! Merging to master!

@asfgit asfgit closed this in e474088 Mar 21, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Author: hyukjinkwon <[email protected]>

Closes apache#11756 from HyukjinKwon/SPARK-13764.
HyukjinKwon added a commit to databricks/spark-xml that referenced this pull request Sep 10, 2016
#105

Currently, this library does not support the `PERMISSIVE` parse mode. As with the JSON data source, this can be done in the same way with `_corrupt_record`.

This PR adds support for `PERMISSIVE` mode and makes this behaviour consistent with the other data sources that support parse modes (the JSON and CSV data sources).

Also, this PR adds support for `_corrupt_record`.

This PR is similar with apache/spark#11756 and apache/spark#11881.

Author: hyukjinkwon <[email protected]>

Closes #107 from HyukjinKwon/ISSUE-105-permissive.
@HyukjinKwon HyukjinKwon deleted the SPARK-13764 branch January 2, 2018 03:42