[SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens #15767
Conversation
Test build #68131 has finished for PR 15767 at commit
Test build #68132 has finished for PR 15767 at commit
Test build #68133 has finished for PR 15767 at commit
```diff
@@ -232,7 +232,7 @@ private[csv] object CSVTypeCast {
       nullable: Boolean = true,
       options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    if (datum == null || nullable && datum == options.nullValue) {
```
This might be clearer with parentheses, though I think it's correct. Is this right?

```scala
datum == null || (nullable && datum == options.nullValue)
```
Sure, makes sense.
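For reference, in Scala `&&` binds more tightly than `||`, so the unparenthesized condition already parses the way the parenthesized one reads. A quick standalone sketch (not Spark code) to illustrate:

```scala
// `&&` has higher precedence than `||` in Scala, so
// a || b && c parses as a || (b && c).
val datum: String = null
val nullable = true
val nullValue = ""

val unparenthesized = datum == null || nullable && datum == nullValue
val parenthesized   = datum == null || (nullable && datum == nullValue)
// Both expressions agree for every combination of inputs; the
// parentheses only make the grouping explicit for the reader.
assert(unparenthesized == parenthesized)
```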
Test build #68182 has finished for PR 15767 at commit
Can you get rid of those links from the PR description? They become stale immediately after merging this.
```diff
@@ -232,7 +232,8 @@ private[csv] object CSVTypeCast {
       nullable: Boolean = true,
       options: CSVOptions = CSVOptions()): Any = {

-    if (nullable && datum == options.nullValue) {
+    val isNull = datum == options.nullValue || datum == null
+    if (nullable && isNull) {
```
Isn't this clearer if you do:

```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    // throw some exception saying the field is not nullable but a null value was found
  }
  null
} else {
  ...
}
```
I'd also add a null validation test.
Sure, I will.
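For what it's worth, such a null validation test might look roughly like the following. This is only a sketch: it assumes `castTo`'s first two arguments are the raw string token and the target `DataType` (the diff hunks above only show the trailing parameters), that `CSVOptions`'s default `nullValue` is the empty string, and it uses ScalaTest's `intercept`:

```scala
import org.apache.spark.sql.types._

// A nullable field should yield null for a missing token or the configured null value.
assert(CSVTypeCast.castTo(null, IntegerType, nullable = true) == null)
assert(CSVTypeCast.castTo("", IntegerType, nullable = true) == null)

// A non-nullable field should fail loudly instead of producing a bogus value.
intercept[RuntimeException] {
  CSVTypeCast.castTo(null, IntegerType, nullable = false)
}
```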
Test build #68184 has finished for PR 15767 at commit
```scala
// datum can be null if the number of fields found is less than the length of the schema
if (datum == options.nullValue || datum == null) {
  if (!nullable) {
    throw new RuntimeException("null value found but the field is not nullable.")
```
Nit: This could be `require(nullable, ...)`, which would throw a better exception, `IllegalArgumentException`, too. (Even NPE would be reasonable.) But I don't feel strongly about it.
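For context (not part of the patch), Scala's built-in `Predef.require` throws `IllegalArgumentException` when its predicate is false, prefixing the message with "requirement failed: ", so the suggestion amounts to:

```scala
// Predef.require throws IllegalArgumentException when the condition is false.
def checkNullable(nullable: Boolean): Unit =
  require(nullable, "null value found but the field is not nullable.")

checkNullable(true)   // no-op
try {
  checkNullable(false)
} catch {
  case e: IllegalArgumentException =>
    // message is "requirement failed: null value found but the field is not nullable."
    println(e.getMessage)
}
```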
Sure, I think this sounds good.
Sorry, I thought more about this - the current error message doesn't give the user a way to know which field is causing the problem. Can you add at least the field name to the error message?
Sure, I can. I will try to neaten this up.
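One hypothetical way to include the field name in the message (assuming the caller threads the `StructField` name through to the cast, which the snippets here don't show):

```scala
// Hypothetical sketch: `fieldName` would need to be passed in by the caller;
// it is not a parameter of castTo in the diffs above.
def castTo(datum: String, fieldName: String, nullable: Boolean): Any = {
  if (datum == null) {
    if (!nullable) {
      throw new RuntimeException(
        s"null value found but field $fieldName is not nullable.")
    }
    null
  } else {
    datum  // ... actual casting logic elided
  }
}
```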
Test build #68198 has finished for PR 15767 at commit
retest this please
Test build #68205 has finished for PR 15767 at commit
Test build #68206 has finished for PR 15767 at commit
```diff
-    if (nullable && datum == options.nullValue) {
+    // datum can be null if the number of fields found is less than the length of the schema
+    if (datum == options.nullValue || datum == null) {
+      require(nullable, "null value found but the field is not nullable.")
```
Sorry, it doesn't make sense to throw `IllegalArgumentException` at runtime here. What's the illegal argument? It is usually used when some parameter is out of bounds or invalid. What's happening here is that we found, at runtime, data that invalidates the assumption the user made (that the field is nullable).
cc @marmbrus
related to some other topic we talked about - for csv we do throw errors if a field is defined non-nullable by the user, but the data ended up containing nulls.
Oh, I thought the user set a non-nullable field in the schema inappropriately for data that has null values, and that is the illegal argument. FWIW, `IllegalArgumentException` extends `RuntimeException` too. Let me revert it back; I don't feel strongly about this.
(Other related PRs with the comment, #15767 (comment), are #15329 and #14124)
@rxin well, the user input is the illegal argument, I guess, but I don't feel strongly about it. I was really arguing against `RuntimeException`, which is never really the right thing to throw. `NullPointerException`? `IllegalStateException`? Take your pick of anything more specific.
0b15484 to e5146e3
Test build #68222 has finished for PR 15767 at commit
retest this please
Test build #68225 has finished for PR 15767 at commit
Test build #68236 has finished for PR 15767 at commit
Merging in master/branch-2.1. Thanks.
[SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens

## What changes were proposed in this pull request?

Currently, there are three cases when reading CSV with the datasource in `PERMISSIVE` parse mode:

- schema == parsed tokens (from each line): no problem casting the values in the tokens to the fields in the schema, as they match.
- schema < parsed tokens (from each line): it slices the tokens down to the number of fields in the schema.
- schema > parsed tokens (from each line): it appends `null` to the parsed tokens so that values can be cast safely against the schema.

However, when `null` is appended in the third case, we should take `null` into account when casting the values. In the case of `StringType`, it is fine, as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, this case happens only when a schema is explicitly given and it includes data types other than `StringType`.

The code below:

```scala
val path = "/tmp/a"
Seq("1").toDF().write.text(path)
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) :: Nil)
spark.read.schema(schema).option("header", "false").csv(path).show()
```

prints

**Before**

```
java.lang.NumberFormatException: null
  at java.lang.Integer.parseInt(Integer.java:542)
  at java.lang.Integer.parseInt(Integer.java:615)
  at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
  at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
  at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24)
```

**After**

```
+---+----+
|  a|   b|
+---+----+
|  1|null|
+---+----+
```

## How was this patch tested?

Unit tests in `CSVSuite.scala` and `CSVTypeCastSuite.scala`

Author: hyukjinkwon <[email protected]>

Closes #15767 from HyukjinKwon/SPARK-18269.

(cherry picked from commit 556a3b7)
Signed-off-by: Reynold Xin <[email protected]>
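The token padding and slicing the description refers to (cases two and three) can be sketched as follows. This is a standalone illustration, not the actual parser code:

```scala
// When fewer tokens are parsed than the schema has fields, pad with null so
// each field position has something to cast; when there are more, slice the
// extras away (sketch only, not Spark's implementation).
val schemaLength = 2
val tokens = Array("1")  // one token parsed, but the schema has two fields

val safeTokens: Array[String] =
  if (tokens.length < schemaLength) {
    tokens ++ Array.fill[String](schemaLength - tokens.length)(null)
  } else {
    tokens.take(schemaLength)
  }

assert(safeTokens.sameElements(Array("1", null)))
// Each element of safeTokens is then cast to its field's data type; the
// appended null is what castTo must now handle explicitly.
```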
Thank you @rxin!