
[SPARK-16472][SQL] Force user specified schema to the nullable one #14124

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Jul 10, 2016

What changes were proposed in this pull request?

This PR proposes to force the user-specified schema to the nullable one, as Spark SQL can't validate it.
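For illustration, a minimal sketch of the kind of read this change affects (the path, column name, and schema below are hypothetical, not from this PR):

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// A user-specified schema that claims column "a" is non-nullable; Spark SQL
// cannot verify this claim against the underlying files.
val userSchema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

// With this change, the schema applied by the reader is treated as nullable.
val df = spark.read.schema(userSchema).json("/path/to/data.json")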

How was this patch tested?

Unit tests added in FileStreamSourceSuite.scala and DataFrameReaderWriterSuite.scala.

@HyukjinKwon
Member Author

Hi @gatorsmile and @marmbrus, I saw the discussion and it seems you are involved with this one. Could you please review this?

@@ -879,6 +879,19 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
}
}
}

test("Check if fields in the schema are nullable") {
Member Author

@HyukjinKwon HyukjinKwon Jul 10, 2016


This one forces the schema to nullable but has no tests, so tests were added.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62049 has finished for PR 14124 at commit a917678.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62053 has finished for PR 14124 at commit adae8de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 10, 2016

Test build #62054 has finished for PR 14124 at commit 3980681.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Jul 10, 2016

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.show()

When user-specified schemas are not nullable and the data contains null, the null value in the result becomes 0. This looks like a bug, right?

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 10, 2016

Ah, yes, it seems like a bug to me. I thought it would throw an exception in that case or work fine after this PR. I will test this before/after this PR. Thanks!

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 11, 2016

Oh, I see. Before this patch:

+---+
|  a|
+---+
|  1|
|  0|
+---+

After this patch:

+----+
|   a|
+----+
|   1|
|null|
+----+

FYI, currently (before this patch) the code below with StringType

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", StringType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.show()

fails with the exception below:

Error while decoding: java.lang.NullPointerException
createexternalrow(input[0, string, false].toString, StructField(a,StringType,false))
+- input[0, string, false].toString
   +- input[0, string, false]

java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException
createexternalrow(input[0, string, false].toString, StructField(a,StringType,false))
+- input[0, string, false].toString
   +- input[0, string, false]

    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(ExpressionEncoder.scala:292)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
    at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1$$anonfun$apply$15.apply(Dataset.scala:2218)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
...

It seems like unexpected behaviour anyway (I think we should at least fix the error message). I will submit a patch if this one is decided not to be worth adding. Thanks again, @gatorsmile!

@gatorsmile
Member

@HyukjinKwon No matter whether this PR is merged or not, I still think we should fix the above issue. Silent conversion does not look good to me.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 11, 2016

@gatorsmile I am a bit confused about whether we are allowed to read JSON (via the json(jsonRDD: RDD[String]) API) with a schema whose fields have nullable set to false.
If it is meant to be disallowed, this PR will prevent the case above.

But, yeah, I think I agree that it is a potential problem anyway (even if the case above is not allowed).

@viirya
Member

viirya commented Jul 11, 2016

@HyukjinKwon Your patch solves this inconsistency by forcing the schema to be nullable in all cases. However, it looks like the Parquet case is for compatibility; is this the same for JSON? If not, why do we want to do this?

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 11, 2016

@viirya Thanks for your comment! Actually, that's what I want feedback on from @marmbrus.

It seems that forcing the schema to nullable is already happening (see here) for all data sources implementing FileFormat when you read data via the read.format(...).load(...) API (but not for structured streaming or the other JSON API, which this PR deals with).

So, actually, the purpose of this PR is to make all read APIs consistent. The reason to make them consistent by forcing the schema to nullable is what @marmbrus said on the mailing list:

Sure, but a traditional RDBMS has the opportunity to do validation before
loading data in. That's not really an option when you are reading random
files from S3. This is why Hive and many other systems in this space treat
all columns as nullable.

Actually, apparently, Parquet also reads and writes the schema with the correct nullability if we get rid of asNullable here (I tested this with a write/read round trip before), but it seems that's prevented as a safeguard due to (I assume) the reason above.
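A hedged sketch of that write/read round trip (the path and column name below are illustrative, not taken from this PR):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Write a DataFrame whose schema marks column "a" as non-nullable.
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val data = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2))), schema)
data.write.mode("overwrite").parquet("/tmp/nullability-roundtrip")

// Read it back: with the asNullable safeguard in place, the field comes back
// as nullable = true even though it was written as non-nullable.
val readBack = spark.read.parquet("/tmp/nullability-roundtrip")
println(readBack.schema("a").nullable)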

@marmbrus Would you mind clarifying this, please?

I think we might have to deal with this as a datasource-specific problem.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16472][SQL] Inconsistent nullability in schema after being read in SQL API [SPARK-16472][SQL] Inconsistent nullability in schema after being read by data sources implementing FileFormat Jul 12, 2016
@HyukjinKwon
Member Author

Could you take a look please @marmbrus ?

@HyukjinKwon
Member Author

gentle ping @marmbrus

@marmbrus
Contributor

@cloud-fan

@cloud-fan
Contributor

What will happen if the given schema is wrong? It seems weird that we allow users to provide a schema while reading the data but don't validate it.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 27, 2016

Thanks for the feedback, @cloud-fan!

If the user-given schema is wrong, it is handled differently for each datasource.

Should we disallow specifying schemas for these (maybe ORC and Parquet)?

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 27, 2016

BTW, actually, this PR is not only about the user-given schema.

Currently, data sources based on FileFormat always read data into a DataFrame ignoring the nullability in the schema (for both the user-given schema and the inferred/read schema).

However, this does not happen when reading for streaming with those data sources (FileFormat) (or with the other JSON API).

So, this PR tries to make them all ignore the nullability in the schema, for consistency.

@HyukjinKwon
Member Author

HyukjinKwon commented Jul 28, 2016

@cloud-fan If nullability should not be ignored, then I can fix this PR to make them consistently not ignore it (and of course I will try to identify the related problems). In that case, I will work on what @gatorsmile pointed out in #14124 (comment) about JSON (and will check the other data sources as well).

I will follow your decision.

To summarize the comments above (for other reviewers):

  • The question in this PR is whether it should force all schemas to be nullable or not.
  • Forcing the schema to nullable already happens for normal reading and writing with data sources based on FileFormat, but not for structured streaming or the json(rdd: RDD[String]) API.
  • This applies to both the inferred/read schema and the user-given schema.

@SparkQA

SparkQA commented Aug 19, 2016

Test build #64066 has finished for PR 14124 at commit 079aae2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 30, 2016

Test build #64631 has finished for PR 14124 at commit ffacb55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 14, 2016

Test build #65360 has finished for PR 14124 at commit f6be52b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2016

Test build #65747 has finished for PR 14124 at commit 0bc06c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 16, 2016

Test build #67029 has finished for PR 14124 at commit 3f153a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Nov 10, 2016

Sorry for the delay. After thinking about it again, I think it doesn't make sense to allow users to specify the nullability when reading a data source, as Spark SQL can't validate it. How about we turn the schema to nullable in DataFrameReader.schema?
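A minimal sketch of what that could look like, assuming a hypothetical recursive helper (this is not the actual Spark implementation):

import org.apache.spark.sql.types._

// Recursively mark every struct field, array element, and map value as nullable.
def forceNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = forceNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(forceNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(forceNullable(keyType), forceNullable(valueType), valueContainsNull = true)
  case other => other
}

// DataFrameReader.schema(...) could then store the result of
// forceNullable(userSchema) (cast back to StructType) instead of the schema
// exactly as given by the user.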

@HyukjinKwon
Member Author

Thanks @cloud-fan, sure, that sounds great.

@HyukjinKwon
Member Author

Oh wait, @cloud-fan, it seems that Parquet files, at least, could possibly be written with non-nullable fields. So, reading them without a user-specified schema might also cause an inconsistency between the schema read from structured streaming and the one read from file sources.

If you are not sure about this, I am fine with turning the schema into nullable in DataFrameReader.schema for now. Let me rebase this one first.

@HyukjinKwon
Member Author

Actually, never mind. I think handling this in DataFrameReader.schema will deal with most of the general cases.

@HyukjinKwon HyukjinKwon changed the title [SPARK-16472][SQL] Inconsistent nullability in schema after being read by data sources implementing FileFormat [SPARK-16472][SQL] Force user specified schema to the nullable one Nov 10, 2016
@SparkQA

SparkQA commented Nov 10, 2016

Test build #68474 has finished for PR 14124 at commit 7306937.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 10, 2016

Test build #68477 has finished for PR 14124 at commit d240c0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 5, 2017

Test build #70922 has finished for PR 14124 at commit 1abaf1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cc @liancheng, what do you think about the nullability change?

@kiszk
Member

kiszk commented Mar 20, 2017

#17293 added data validation using schema information for the Parquet reader, as @gatorsmile suggested in https://www.mail-archive.com/[email protected]/msg39233.html.

@HyukjinKwon
Member Author

Let me close this for a while. I will reopen it if it looks worthwhile. I think the same subject could be discussed in @kiszk's PR.
