[SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens #15767
Conversation
Test build #68131 has finished for PR 15767 at commit
Test build #68132 has finished for PR 15767 at commit
Test build #68133 has finished for PR 15767 at commit
```diff
@@ -232,7 +232,7 @@ private[csv] object CSVTypeCast {
       nullable: Boolean = true,
       options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    if (datum == null || nullable && datum == options.nullValue) {
```
This might be clearer with parentheses, though I think it's correct. Is this right?

```scala
datum == null || (nullable && datum == options.nullValue)
```
Sure, makes sense.
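For reference, in Scala `&&` binds more tightly than `||`, so the unparenthesized condition already parses the way the parenthesized one reads. A quick standalone sketch (not Spark code) to illustrate:

```scala
// `&&` has higher precedence than `||` in Scala, so
// a || b && c parses as a || (b && c).
val datum: String = null
val nullable = true
val nullValue = ""

val unparenthesized = datum == null || nullable && datum == nullValue
val parenthesized   = datum == null || (nullable && datum == nullValue)
// Both expressions agree for every combination of inputs; the
// parentheses only make the grouping explicit for the reader.
assert(unparenthesized == parenthesized)
```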
Test build #68182 has finished for PR 15767 at commit
Can you get rid of those links from the PR description? They become stale immediately after merging this.
```diff
@@ -232,7 +232,8 @@ private[csv] object CSVTypeCast {
       nullable: Boolean = true,
       options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    val isNull = datum == options.nullValue || datum == null
+    if (nullable && isNull) {
```
Isn't this clearer if you do:

```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    // throw some exception saying the field is not nullable but a null value was found
  }
  null
} else {
  ...
}
```
I'd also add a null validation test.
Sure, I will.
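For what it's worth, such a null validation test might look roughly like the following. This is only a sketch: it assumes `castTo`'s first two arguments are the raw string token and the target `DataType` (the diff hunks above only show the trailing parameters), that `CSVOptions`'s default `nullValue` is the empty string, and it uses ScalaTest's `intercept`:

```scala
import org.apache.spark.sql.types._

// A nullable field should yield null for a missing token or the configured null value.
assert(CSVTypeCast.castTo(null, IntegerType, nullable = true) == null)
assert(CSVTypeCast.castTo("", IntegerType, nullable = true) == null)

// A non-nullable field should fail loudly instead of producing a bogus value.
intercept[RuntimeException] {
  CSVTypeCast.castTo(null, IntegerType, nullable = false)
}
```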
Test build #68184 has finished for PR 15767 at commit
```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw new RuntimeException("null value found but the field is not nullable.")
```
Nit: This could be `require(nullable, ...)`, which would throw a better exception, `IllegalArgumentException`, too. (Even NPE would be reasonable.) But I don't feel strongly about it.
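For context (not part of the patch), Scala's built-in `Predef.require` throws `IllegalArgumentException` when its predicate is false, prefixing the message with "requirement failed: ", so the suggestion amounts to:

```scala
// Predef.require throws IllegalArgumentException when the condition is false.
def checkNullable(nullable: Boolean): Unit =
  require(nullable, "null value found but the field is not nullable.")

checkNullable(true)   // no-op
try {
  checkNullable(false)
} catch {
  case e: IllegalArgumentException =>
    // message is "requirement failed: null value found but the field is not nullable."
    println(e.getMessage)
}
```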
Sure, I think this sounds good.
Sorry, I thought more about this - the current error message doesn't give the user a way to know which field is causing the problem. Can you add at least the field name to the error message?
Sure, I can. I will try to neaten this up.
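One hypothetical way to include the field name in the message (assuming the caller threads the `StructField` name through to the cast, which the snippets here don't show):

```scala
// Hypothetical sketch: `fieldName` would need to be passed in by the caller;
// it is not a parameter of castTo in the diffs above.
def castTo(datum: String, fieldName: String, nullable: Boolean): Any = {
  if (datum == null) {
    if (!nullable) {
      throw new RuntimeException(
        s"null value found but field $fieldName is not nullable.")
    }
    null
  } else {
    datum  // ... actual casting logic elided
  }
}
```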
Test build #68198 has finished for PR 15767 at commit
retest this please
Test build #68205 has finished for PR 15767 at commit
Test build #68206 has finished for PR 15767 at commit
```diff
-    if (nullable && datum == options.nullValue) {
+    // datum can be null if the number of fields found is less than the length of the schema
+    if (datum == options.nullValue || datum == null) {
+      require(nullable, "null value found but the field is not nullable.")
```
Sorry, it doesn't make sense to throw `IllegalArgumentException` at runtime here. What's the illegal argument? It is usually used when some parameter is out of bounds or invalid. What's happening here is that we found, at runtime, data that invalidates the assumption the user made (that the field is nullable).
cc @marmbrus
related to some other topic we talked about - for csv we do throw errors if a field is defined non-nullable by the user, but the data ended up containing nulls.
Oh, I thought the user set a non-nullable field in the schema inappropriately for data that has null values, and that is the illegal argument. FWIW, `IllegalArgumentException` extends `RuntimeException` too. Let me revert it back; I don't feel strongly about this.
(Other related PRs with the comment, #15767 (comment), are #15329 and #14124)
@rxin well, the user input is the illegal argument, I guess, but I don't feel strongly about it. I was really arguing against `RuntimeException`, which is never really the right thing to throw. `NullPointerException`? `IllegalStateException`? Take your pick of anything more specific.
0b15484 to e5146e3
Test build #68222 has finished for PR 15767 at commit
retest this please
Test build #68225 has finished for PR 15767 at commit
Test build #68236 has finished for PR 15767 at commit
Merging in master/branch-2.1. Thanks.
[SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens

## What changes were proposed in this pull request?

Currently, there are three cases when reading CSV with the datasource in `PERMISSIVE` parse mode:

- schema == parsed tokens (from each line): no problem casting the values in the tokens to the fields in the schema, as they match.
- schema < parsed tokens (from each line): it slices the tokens down to the number of fields in the schema.
- schema > parsed tokens (from each line): it appends `null` to the parsed tokens so that values can be cast safely against the schema.

However, when `null` is appended in the third case, we should take `null` into account when casting the values. In the case of `StringType`, it is fine, as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, this case happens only when a schema is explicitly given and it includes data types other than `StringType`.

The code below:

```scala
val path = "/tmp/a"
Seq("1").toDF().write.text(path)
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) :: Nil)
spark.read.schema(schema).option("header", "false").csv(path).show()
```

prints

**Before**

```
java.lang.NumberFormatException: null
  at java.lang.Integer.parseInt(Integer.java:542)
  at java.lang.Integer.parseInt(Integer.java:615)
  at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
  at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24)
```

**After**

```
+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+
```

## How was this patch tested?

Unit tests in `CSVSuite.scala` and `CSVTypeCastSuite.scala`

Author: hyukjinkwon <[email protected]>

Closes #15767 from HyukjinKwon/SPARK-18269.

(cherry picked from commit 556a3b7)
Signed-off-by: Reynold Xin <[email protected]>
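The token padding and slicing the description refers to (cases two and three) can be sketched as follows. This is a standalone illustration, not the actual parser code:

```scala
// When fewer tokens are parsed than the schema has fields, pad with null so
// each field position has something to cast; when there are more, slice the
// extras away (sketch only, not Spark's implementation).
val schemaLength = 2
val tokens = Array("1")  // one token parsed, but the schema has two fields

val safeTokens: Array[String] =
  if (tokens.length < schemaLength) {
    tokens ++ Array.fill[String](schemaLength - tokens.length)(null)
  } else {
    tokens.take(schemaLength)
  }

assert(safeTokens.sameElements(Array("1", null)))
// Each element of safeTokens is then cast to its field's data type; the
// appended null is what castTo must now handle explicitly.
```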
Thank you @rxin!