[SPARK-13764][SQL] Parse modes in JSON data source #11756
Conversation
cc @cloud-fan for review
```scala
def isPermissiveMode(mode: String): Boolean = if (isValidMode(mode)) {
  mode.toUpperCase == PERMISSIVE_MODE
} else {
  true // We default to permissive if the mode string is not valid
}
```
should we log a warning for this case?
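The fallback the reviewer suggests warning about could look like this; a pure-Python sketch of the logic (the names `VALID_MODES` and `is_permissive_mode` are illustrative, not Spark's actual code):

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical mirror of Spark's ParseModes constants.
VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}

def is_permissive_mode(mode: str) -> bool:
    """Unknown mode strings fall back to permissive, but now with a warning."""
    if mode.upper() in VALID_MODES:
        return mode.upper() == "PERMISSIVE"
    logger.warning("%s is not a valid parse mode; falling back to PERMISSIVE.", mode)
    return True
```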
Test build #53286 has finished for PR 11756 at commit
```diff
@@ -288,6 +288,9 @@ class DataFrameReader private[sql](sqlContext: SQLContext) extends Logging {
   * </li>
   * <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
   * (e.g. 00012)</li>
+  * <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
+  * during parsing. When it fails to parse, `PERMISSIVE` mode sets `null`, `DROPMALFORMED`
+  * drops the record and `FAILFAST` throws an exception.</li>
```
I think we need to say more about these 3 modes. From the tests, it looks to me that:
- `PERMISSIVE` mode will set other fields to null when it meets a corrupted record, and put the malformed string into a new field configured by `spark.sql.columnNameOfCorruptRecord`.
- `DROPMALFORMED` mode will ignore corrupted records and append a new field which is always null to the output.
- `FAILFAST` mode will throw an exception.
It will be better if you can expand this doc and add some examples.
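The three modes described above can be illustrated with a minimal pure-Python sketch over line-delimited JSON (a toy model of the semantics only; `parse_lines` is a hypothetical name, not Spark's implementation):

```python
import json

PERMISSIVE, DROPMALFORMED, FAILFAST = "PERMISSIVE", "DROPMALFORMED", "FAILFAST"

def parse_lines(lines, fields, mode=PERMISSIVE, corrupt_col="_corrupt_record"):
    """Toy model of the JSON parse modes over line-delimited JSON."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({f: record.get(f) for f in fields})
        except ValueError:
            if mode == FAILFAST:
                raise            # FAILFAST: surface the parse error
            if mode == DROPMALFORMED:
                continue         # DROPMALFORMED: drop the whole record
            # PERMISSIVE: null out the fields, keep the raw text
            # in the corrupt-record column.
            row = {f: None for f in fields}
            row[corrupt_col] = line
            rows.append(row)
    return rows
```

For example, `parse_lines(['{"a": 1}', '{bad'], ["a"])` keeps both rows under `PERMISSIVE`, with the malformed line preserved in `_corrupt_record`, while `DROPMALFORMED` keeps only the first row.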
Could I maybe edit this without adding examples? It is becoming a bit messy.
I'm not familiar with the CSV part; what if users set the schema directly before reading the data and the mode is
For example, the data below:
will produce the records below for each of the schemas:
- `Schema("field1", "field2", "field3")`
- `Schema("field1", "field2", "field3", "field4", "field5")`
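To illustrate the schema discussion in pure Python (a toy model with hypothetical names, not Spark itself): projecting a record onto a user-set schema fills fields missing from the data with nulls.

```python
import json

def apply_schema(line: str, fields: list) -> dict:
    """Project a JSON record onto a user-set schema; fields absent
    from the data come back as None (Spark's null)."""
    record = json.loads(line)
    return {f: record.get(f) for f in fields}

schema_five = ["field1", "field2", "field3", "field4", "field5"]
row = apply_schema('{"field1": 1, "field2": 2, "field3": 3}', schema_five)
```

Here `row["field4"]` and `row["field5"]` are `None`, matching the documented behaviour that extra fields in a user-set schema are set to null.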
@cloud-fan I just added some more cases here.
Test build #53303 has finished for PR 11756 at commit
The commit I submitted includes comment changes and avoids adding a
Test build #53313 has finished for PR 11756 at commit
Test build #53312 has finished for PR 11756 at commit
ah, thanks for the detailed explanation and examples!
```
    record and puts the malformed string into a new field configured by \
    ``spark.sql.columnNameOfCorruptRecord``. When a schema is set by user, it sets \
    ``null`` for extra fields.
* ``DROPMALFORMED`` : ignores the whole corrupted records and append.
```
should we remove `and append`?
Overall LGTM, thanks for working on it!
@cloud-fan Actually, I have a question. I thought the range of "malformed" rows does not include rows with a different schema for the JSON data source (whereas for CSV the range of "malformed" rows does include rows with a different schema). These differences lead to some different actions for each parse mode compared to the CSV data source.
Do you think it is acceptable?
This makes sense to me. Actually for CSV, when
LGTM, cc @davies for another look.
But since we had it, I'd say we should keep it to avoid breaking compatibility. We can have the per-read option override the global option.
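The precedence described here, where the per-read option overrides the global option, can be sketched as follows (pure Python; `GLOBAL_DEFAULTS` and `effective_options` are hypothetical names, not Spark's API):

```python
# Stand-in for session-wide configuration such as
# spark.sql.columnNameOfCorruptRecord.
GLOBAL_DEFAULTS = {
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt_record",
}

def effective_options(read_options: dict) -> dict:
    """Per-read options win; global defaults fill any gaps."""
    merged = dict(GLOBAL_DEFAULTS)
    merged.update({k: v for k, v in read_options.items() if v is not None})
    return merged
```

So a read issued with `{"mode": "FAILFAST"}` uses `FAILFAST`, while a read with no options falls back to the global `PERMISSIVE`.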
Test build #53379 has finished for PR 11756 at commit
Filed in SPARK-13953.
Test build #53382 has finished for PR 11756 at commit
Test build #53384 has finished for PR 11756 at commit
Test build #53504 has finished for PR 11756 at commit
Test build #53506 has finished for PR 11756 at commit
Test build #53507 has finished for PR 11756 at commit
retest it please.
@cloud-fan Isn't this "it" a typo maybe :)?
retest this please |
```scala
      Row("str_a_4") :: Nil)
    assert(jsonDFTwo.schema === schemaTwo)
  }

  test("Corrupt records") {
```
we can change this to: "Corrupt records: PERMISSIVE mode"
Test build #53652 has finished for PR 11756 at commit
Last 2 comments, otherwise LGTM
LGTM, pending tests
Test build #53659 has finished for PR 11756 at commit
Thanks! Merging to master!
## What changes were proposed in this pull request?

Currently, there is no way to control the behaviour when the JSON data source fails to parse corrupt records. This PR adds support for parse modes, just like the CSV data source. There are three modes:

- `PERMISSIVE`: When it fails to parse, this sets `null` to the field. This is the default mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.

This PR also makes the JSON data source share the `ParseModes` with the CSV data source.

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style tests.

Author: hyukjinkwon <[email protected]>

Closes apache#11756 from HyukjinKwon/SPARK-13764.
databricks/spark-xml#105: Currently, this library does not support the `PERMISSIVE` parse mode. Similar to the JSON data source, this can be done in the same way with `_corrupt_record`. This PR adds support for `PERMISSIVE` mode and makes this behaviour consistent with the other data sources supporting parse modes (the JSON and CSV data sources). Also, this PR adds support for `_corrupt_record`. This PR is similar to apache/spark#11756 and apache/spark#11881.

Author: hyukjinkwon <[email protected]>

Closes #107 from HyukjinKwon/ISSUE-105-permissive.