[SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFileFormat #21389
Conversation
Test build #90934 has finished for PR 21389 at commit
Test build #90937 has finished for PR 21389 at commit
retest this please
Test build #90944 has finished for PR 21389 at commit
/**
 * Verify if the schema is supported in JSON datasource.
 */
def verifySchema(schema: StructType): Unit = {
The function verifySchema is very similar to the ones in Orc/Parquet, except for the exception message. Should we put it into a util object?
Hmm... but wouldn't the supported types be very specific to each data source?
Since supported types are specific to data sources, I think we need to verify a schema in each file format implementation. But, yes, these built-in formats (ORC and Parquet) have the same supported types, so it might be better to move the verifySchema code somewhere shared (e.g., DataSourceUtils or something) to avoid code duplication.
cc: @dongjoon-hyun
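For illustration, a minimal sketch of the shared helper being proposed. The object name follows the DataSourceUtils suggestion above, and the type list and message wording are assumptions pieced together from the diff excerpts in this thread, not the merged code:

```scala
import org.apache.spark.sql.types._

object DataSourceUtils {

  /**
   * Verify that every field in `schema` has a type supported by the given
   * built-in file format, recursing into nested and user-defined types.
   */
  def verifySchema(format: String, schema: StructType): Unit = {
    def verifyType(dataType: DataType): Unit = dataType match {
      // Atomic types shared by the built-in formats.
      case BooleanType | ByteType | ShortType | IntegerType | LongType |
           FloatType | DoubleType | StringType | BinaryType | DateType |
           TimestampType | _: DecimalType =>

      // Recurse into nested types.
      case st: StructType => st.foreach(f => verifyType(f.dataType))
      case ArrayType(elementType, _) => verifyType(elementType)
      case MapType(keyType, valueType, _) =>
        verifyType(keyType)
        verifyType(valueType)

      // A UDT is supported iff its underlying SQL type is supported.
      case udt: UserDefinedType[_] => verifyType(udt.sqlType)

      // For backward-compatibility: JSON can represent null-only columns.
      case NullType if format == "JSON" =>

      case _ =>
        throw new UnsupportedOperationException(
          s"$format data source does not support ${dataType.simpleString} data type.")
    }
    schema.foreach(field => verifyType(field.dataType))
  }
}
```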
@@ -2408,4 +2409,53 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
      spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
      Row(badJson))
  }

  test("SPARK-24204 error handling for unsupported data types") {
Thank you for pinging me, @maropu. Since this is all about file-based data sources, can we have all these test cases in FileBasedDataSourceSuite?
(Probably, the suggestion is related to the one above.) Essentially, the supported types are specific to data source implementations, so I'm not 100% sure it's best to put these tests in FileBasedDataSourceSuite.
The suite doesn't assume that all file-based data sources have the same capabilities. In this PR, the test code is almost the same across formats and the only differences are the mapping tables. For example:
- json -> Interval
- orc -> Interval, Null
- parquet -> Interval, Null
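That mapping lends itself to one table-driven test. A hypothetical sketch, assuming it lives inside a ScalaTest suite such as FileBasedDataSourceSuite where `test` is in scope (the test body is elided):

```scala
import org.apache.spark.sql.types._

// Format -> types that format rejects, mirroring the mapping above.
val unsupportedTypes: Map[String, Seq[DataType]] = Map(
  "json" -> Seq(CalendarIntervalType),
  "orc" -> Seq(CalendarIntervalType, NullType),
  "parquet" -> Seq(CalendarIntervalType, NullType))

for ((format, types) <- unsupportedTypes; dataType <- types) {
  test(s"SPARK-24204 $format does not support ${dataType.simpleString}") {
    // Build a single-column DataFrame of `dataType`, try to write it with
    // `format`, and assert that an UnsupportedOperationException is thrown.
  }
}
```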
ok, I'll try to brush up the tests.
retest this please.
Test build #91010 has finished for PR 21389 at commit
@@ -89,6 +89,8 @@ class OrcFileFormat
      job: Job,
      options: Map[String, String],
      dataSchema: StructType): OutputWriterFactory = {
    DataSourceUtils.verifySchema("ORC", dataSchema)
Thank you for refactoring the PR, @maropu! What about using shortName instead of the string literal "ORC" here? Then we can have the same line in every format, like the following:
DataSourceUtils.verifySchema(shortName, dataSchema)
yea, that's smart. I will update.
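For context, a stripped-down sketch of where shortName comes from: the built-in formats mix in DataSourceRegister, so the registered name resolves to "orc", "json", or "parquet" in the respective class (the real classes implement many more members):

```scala
import org.apache.spark.sql.sources.DataSourceRegister

// Simplified stand-in for OrcFileFormat; only the registration is shown.
class OrcShortName extends DataSourceRegister {
  override def shortName(): String = "orc"
}
```

Inside prepareWrite, the per-format string literal then becomes the same call, DataSourceUtils.verifySchema(shortName, dataSchema), everywhere.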
Test build #91011 has finished for PR 21389 at commit
Test build #91016 has finished for PR 21389 at commit
retest this please
Test build #91035 has finished for PR 21389 at commit
retest this please
Test build #91048 has finished for PR 21389 at commit
retest this please
Test build #91059 has finished for PR 21389 at commit
    case udt: UserDefinedType[_] => verifyType(udt.sqlType)

    // For backward-compatibility
Do we have any test case for this?
ok, I will. Also, should we merge this function with CSVUtils.verifySchema in this PR?
Yes, as long as it does not break anything.
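A hypothetical sketch of a test for the UDT branch, with invented names: a UDT whose underlying sqlType is supported should pass, since verifyType recurses into udt.sqlType:

```scala
import org.apache.spark.sql.types._

// Invented UDT whose storage type (DoubleType) is supported, so the
// `case udt: UserDefinedType[_]` branch should unwrap and accept it.
class ExampleDoubleUDT extends UserDefinedType[java.lang.Double] {
  override def sqlType: DataType = DoubleType
  override def serialize(obj: java.lang.Double): Any = obj
  override def deserialize(datum: Any): java.lang.Double =
    datum.asInstanceOf[java.lang.Double]
  override def userClass: Class[java.lang.Double] = classOf[java.lang.Double]
}

// Expected to pass verification rather than throw.
DataSourceUtils.verifySchema("JSON",
  StructType(StructField("a", new ExampleDoubleUDT) :: Nil))
```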
    case NullType if format == "JSON" =>

    case _ =>
      throw new UnsupportedOperationException(
Basically, for such a PR, we need to check all the data types that we block and ensure no behavior change is introduced by this PR.
ok
Test build #91208 has finished for PR 21389 at commit
retest this please
Test build #91806 has finished for PR 21389 at commit
object DataSourceUtils {

  /**
   * Verify if the schema is supported in datasource.
Please improve the description, and document which built-in file formats are covered by this function. Also, could you document which data types are not supported for each data source?
ok
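One possible shape for the requested documentation; the wording here is an editorial illustration, not the merged comment:

```scala
/**
 * Verifies that `schema` contains only types supported by the built-in
 * file-based data source identified by `format` (e.g. "csv", "json",
 * "orc", "parquet"). For example, none of these formats can store a
 * CalendarIntervalType column.
 *
 * @throws UnsupportedOperationException if any field, including nested
 *         fields, has an unsupported type.
 */
def verifySchema(format: String, schema: StructType): Unit = {
  // recursive per-field type check, as sketched earlier in this thread
}
```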
@maropu Just want to double-check: were all these data types already unsupported before this PR? Have you run these test cases without the code changes? After this PR, are the error messages more readable and caught earlier?
Test build #91895 has finished for PR 21389 at commit
retest this please
Test build #92359 has finished for PR 21389 at commit
Test build #92363 has finished for PR 21389 at commit
Test build #92368 has finished for PR 21389 at commit
retest this please
Test build #92382 has finished for PR 21389 at commit
retest this please
LGTM
retest this please
Test build #92390 has finished for PR 21389 at commit
Thanks! Merged to master.
…ld be consistent

## What changes were proposed in this pull request?

1. Remove parameter `isReadPath`. The supported types of read/write should be the same.
2. Disallow reading `NullType` for the ORC data source. In #21667 and #21389, it was supposed that ORC supports reading `NullType` but can't write it. This doesn't make sense. I read the docs and did some tests: ORC doesn't support `NullType`.

## How was this patch tested?

Unit test

Closes #23639 from gengliangwang/supportDataType.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
This PR adds code to verify a schema in Json/Orc/ParquetFileFormat, along with CSVFileFormat.
How was this patch tested?
Added verification tests in FileBasedDataSourceSuite and HiveOrcSourceSuite.
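To illustrate the end-to-end effect, a sketch assuming an active SparkSession named spark; the output path is arbitrary and the error message wording is paraphrased:

```scala
// With the verification in place, writing an unsupported column type
// fails fast instead of producing an unreadable file.
spark.range(1)
  .selectExpr("null AS n")    // column of NullType
  .write.format("orc")
  .save("/tmp/orc-null-out")  // arbitrary output path
// => UnsupportedOperationException:
//    "ORC data source does not support null data type."
```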