[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389
Conversation
Test build #90934 has finished for PR 21389 at commit
Test build #90937 has finished for PR 21389 at commit
retest this please
Test build #90944 has finished for PR 21389 at commit
/**
 * Verify if the schema is supported in JSON datasource.
 */
def verifySchema(schema: StructType): Unit = {
The function verifySchema is very similar to the ones in Orc/Parquet, except for the exception message. Should we put it into a util object?
Hmm... but wouldn't the supported types be very specific to each data source?
Since supported types are specific to data sources, I think we need to verify a schema in each file format implementation. But, yes, these built-in formats (ORC and Parquet) have the same supported types, so it might be better to move the verifySchema code somewhere shared (e.g., DataSourceUtils or something) to avoid code duplication.
cc: @dongjoon-hyun
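For illustration, a minimal sketch of the shared helper being proposed. The object name follows the DataSourceUtils suggestion above, and the type list and message wording are assumptions pieced together from the diff excerpts in this thread, not the merged code:

```scala
import org.apache.spark.sql.types._

object DataSourceUtils {

  /**
   * Verify that every field in `schema` has a type supported by the given
   * built-in file format, recursing into nested and user-defined types.
   */
  def verifySchema(format: String, schema: StructType): Unit = {
    def verifyType(dataType: DataType): Unit = dataType match {
      // Atomic types shared by the built-in formats.
      case BooleanType | ByteType | ShortType | IntegerType | LongType |
           FloatType | DoubleType | StringType | BinaryType | DateType |
           TimestampType | _: DecimalType =>

      // Recurse into nested types.
      case st: StructType => st.foreach(f => verifyType(f.dataType))
      case ArrayType(elementType, _) => verifyType(elementType)
      case MapType(keyType, valueType, _) =>
        verifyType(keyType)
        verifyType(valueType)

      // A UDT is supported iff its underlying SQL type is supported.
      case udt: UserDefinedType[_] => verifyType(udt.sqlType)

      // For backward-compatibility: JSON can represent null-only columns.
      case NullType if format == "JSON" =>

      case _ =>
        throw new UnsupportedOperationException(
          s"$format data source does not support ${dataType.simpleString} data type.")
    }
    schema.foreach(field => verifyType(field.dataType))
  }
}
```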
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
      spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
      Row(badJson))
  }

  test("SPARK-24204 error handling for unsupported data types") {
Thank you for pinging me, @maropu. Since this is all about file-based data sources, can we have all these test cases in FileBasedDataSourceSuite?
(Probably, the suggestion is related to the one above.) Essentially, the supported types are specific to data source implementations, so I'm not 100% sure it's best to put these tests in FileBasedDataSourceSuite.
The suite doesn't assume that all file-based data sources have the same capabilities. In this PR, the test code is almost the same across formats and the only differences are the mapping tables. For example:
- json -> Interval
- orc -> Interval, Null
- parquet -> Interval, Null
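That mapping lends itself to one table-driven test. A hypothetical sketch, assuming it lives inside a ScalaTest suite such as FileBasedDataSourceSuite where `test` is in scope (the test body is elided):

```scala
import org.apache.spark.sql.types._

// Format -> types that format rejects, mirroring the mapping above.
val unsupportedTypes: Map[String, Seq[DataType]] = Map(
  "json" -> Seq(CalendarIntervalType),
  "orc" -> Seq(CalendarIntervalType, NullType),
  "parquet" -> Seq(CalendarIntervalType, NullType))

for ((format, types) <- unsupportedTypes; dataType <- types) {
  test(s"SPARK-24204 $format does not support ${dataType.simpleString}") {
    // Build a single-column DataFrame of `dataType`, try to write it with
    // `format`, and assert that an UnsupportedOperationException is thrown.
  }
}
```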
ok, I'll try to brush up the tests.
retest this please.
Test build #91010 has finished for PR 21389 at commit
@@ -89,6 +89,8 @@ class OrcFileFormat
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    DataSourceUtils.verifySchema("ORC", dataSchema)
Thank you for refactoring the PR, @maropu! What about using shortName instead of the string literal "ORC" here? Then we can have the same line in every format, like the following:
DataSourceUtils.verifySchema(shortName, dataSchema)
yea, that's smart. I will update.
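For context, a stripped-down sketch of where shortName comes from: the built-in formats mix in DataSourceRegister, so the registered name resolves to "orc", "json", or "parquet" in the respective class (the real classes implement many more members):

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Simplified stand-in for OrcFileFormat; only the registration is shown.
class OrcShortName extends DataSourceRegister {
  override def shortName(): String = "orc"
}
```

Inside prepareWrite, the per-format string literal then becomes the same call, DataSourceUtils.verifySchema(shortName, dataSchema), everywhere.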
Test build #91011 has finished for PR 21389 at commit
Test build #91016 has finished for PR 21389 at commit
retest this please
Test build #91035 has finished for PR 21389 at commit
retest this please
Test build #91048 has finished for PR 21389 at commit
retest this please
Test build #91059 has finished for PR 21389 at commit
    case udt: UserDefinedType[_] => verifyType(udt.sqlType)

    // For backward-compatibility
Do we have any test case for this?
ok, I will. Also, should we merge this function with CSVUtils.verifySchema in this PR?
Yes, as long as it does not break anything.
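A hypothetical sketch of a test for the UDT branch, with invented names: a UDT whose underlying sqlType is supported should pass, since verifyType recurses into udt.sqlType:

```scala
import org.apache.spark.sql.types._

// Invented UDT whose storage type (DoubleType) is supported, so the
// `case udt: UserDefinedType[_]` branch should unwrap and accept it.
class ExampleDoubleUDT extends UserDefinedType[java.lang.Double] {
  override def sqlType: DataType = DoubleType
  override def serialize(obj: java.lang.Double): Any = obj
  override def deserialize(datum: Any): java.lang.Double =
    datum.asInstanceOf[java.lang.Double]
  override def userClass: Class[java.lang.Double] = classOf[java.lang.Double]
}

// Expected to pass verification rather than throw.
DataSourceUtils.verifySchema("JSON",
  StructType(StructField("a", new ExampleDoubleUDT) :: Nil))
```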
    case NullType if format == "JSON" =>

    case _ =>
      throw new UnsupportedOperationException(
Basically, for such a PR, we need to check all the data types that we block and ensure no behavior change is introduced by this PR.
ok
Test build #91208 has finished for PR 21389 at commit
retest this please
Test build #91806 has finished for PR 21389 at commit
object DataSourceUtils {

  /**
   * Verify if the schema is supported in datasource.
Please improve the description, and document which built-in file formats are covered by this function. Also, could you document which data types are not supported for each data source?
ok
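One possible shape for the requested documentation; the wording here is an editorial illustration, not the merged comment:

```scala
/**
 * Verifies that `schema` contains only types supported by the built-in
 * file-based data source identified by `format` (e.g. "csv", "json",
 * "orc", "parquet"). For example, none of these formats can store a
 * CalendarIntervalType column.
 *
 * @throws UnsupportedOperationException if any field, including nested
 *         fields, has an unsupported type.
 */
def verifySchema(format: String, schema: StructType): Unit = {
  // recursive per-field type check, as sketched earlier in this thread
}
```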
@maropu Just want to double-check: were all these data types already unsupported before this PR? Have you run these test cases without the code changes? After this PR, are the error messages more readable and caught earlier?
Test build #91895 has finished for PR 21389 at commit
retest this please
Test build #92359 has finished for PR 21389 at commit
Test build #92363 has finished for PR 21389 at commit
Test build #92368 has finished for PR 21389 at commit
retest this please
Test build #92382 has finished for PR 21389 at commit
retest this please
LGTM
retest this please
Test build #92390 has finished for PR 21389 at commit
Thanks! Merged to master.
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.
2. Disallow reading `NullType` for the ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and did some tests: ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR adds code to verify a schema in Json/Orc/ParquetFileFormat, along with CSVFileFormat.
How was this patch tested?
Added verification tests in FileBasedDataSourceSuite and HiveOrcSourceSuite.
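To illustrate the end-to-end effect, a sketch assuming an active SparkSession named spark; the output path is arbitrary and the error message wording is paraphrased:

```scala
// With the verification in place, writing an unsupported column type
// fails fast instead of producing an unreadable file.
spark.range(1)
  .selectExpr("null AS n")    // column of NullType
  .write.format("orc")
  .save("/tmp/orc-null-out")  // arbitrary output path
// => UnsupportedOperationException:
//    "ORC data source does not support null data type."
```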