Should read null properly when schema is larger than parsed tokens in PrunedScan #420
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #420      +/-   ##
==========================================
+ Coverage   87.01%   87.12%   +0.11%
==========================================
  Files          12       12
  Lines         562      567       +5
  Branches      136      144       +8
==========================================
+ Hits          489      494       +5
  Misses         73       73
Continue to review full report at Codecov.
// If the given column is not nullable, we simply fall back to a normal string
// rather than returning null, for backwards compatibility. Note that this case is
// different from Spark's internal CSV datasource.
val isNullValueMatched = datum == nullValue && nullable
I think this case currently differs from Spark, but I did not change the behaviour here.
In Spark, this case throws an exception rather than falling back to string, as below:
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)
spark.read.schema(schema)
  .option("header", "false")
  .option("nullValue", "2")
  .option("mode", "FAILFAST")
  .csv(Seq("1,2").toDS())
  .show()
prints
java.lang.RuntimeException: null value found but field b is not nullable.
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:104)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:103)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
In the case of this library,
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)
new CsvParser()
  .withSchema(schema)
  .withUseHeader(false)
  .withParseMode(ParseModes.FAIL_FAST_MODE)
  .withNullValue("2")
  .csvRdd(sqlContext, sqlContext.sparkContext.parallelize(Seq("1,2")))
  .show()
prints:
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
I intended not to break this behaviour in this library, but I am willing to match it to Spark's internal CSV behaviour if you think we should.
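For reference, a minimal sketch of what aligning with Spark's behaviour could look like for the nullValue check; the function name, parameters, and the convert callback are assumptions for illustration, not the actual change:
// Illustrative only: throw for a non-nullable field whose token matches nullValue,
// as Spark's internal CSV datasource does, instead of falling back to string.
def sparkLikeNullValueCheck(
    datum: String,
    fieldName: String,
    nullable: Boolean,
    nullValue: String)(convert: String => Any): Any = {
  if (datum == nullValue) {
    if (nullable) null
    else throw new RuntimeException(s"null value found but field $fieldName is not nullable.")
  } else {
    // Fall through to the normal per-type conversion.
    convert(datum)
  }
}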
val exception3 = intercept[RuntimeException] {
  TypeCast.castTo(null, "_c", StringType, nullable = false)
}
assert(exception3.getMessage.contains("null value found but field _c is not nullable"))
Here are the actual test cases added.
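For context, a hedged sketch of the companion case for a nullable field; it mirrors the argument order of the test above, and anything beyond that is an assumption:
// A null datum on a nullable field should now simply yield null
// instead of an NPE or NumberFormatException.
assert(TypeCast.castTo(null, "_c", StringType, nullable = true) == null)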
// `datum` can be null when some tokens are inserted in permissive mode via `PrunedScan`.
// In this case, they are treated as nulls.
val isNullDatum = datum == null
Here, the actual cases added are (see the sketch after this list):

- `datum == null && nullable`: returns `null` (it was an NPE or `java.lang.NumberFormatException: null` before).
- `datum == null && !nullable`: throws an exception (it was an NPE or `java.lang.NumberFormatException: null` before).
- `treatEmptyValuesAsNulls && datum == "" && !nullable`: throws an exception (it was trying to set `null` to a non-nullable field).

These should not be behaviour changes.
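To make the combined logic concrete, here is a minimal self-contained sketch of the null handling described by these cases. The function name, parameter list, and `convert` callback are assumptions; only the three conditions mirror the comments above:
// Illustrative sketch, not the actual TypeCast code.
def nullSafeCast(
    datum: String,
    name: String,
    nullable: Boolean,
    treatEmptyValuesAsNulls: Boolean,
    nullValue: String)(convert: String => Any): Any = {
  // `datum` can be null when PrunedScan pads missing tokens in permissive mode.
  val isNullDatum = datum == null
  // Matching nullValue only yields null for nullable fields; non-nullable fields
  // fall back to the plain string for backwards compatibility.
  val isNullValueMatched = datum == nullValue && nullable
  val isEmptyTreatedAsNull = treatEmptyValuesAsNulls && datum == ""

  if (isNullValueMatched || ((isNullDatum || isEmptyTreatedAsNull) && nullable)) {
    null
  } else if (isNullDatum || isEmptyTreatedAsNull) {
    // Previously this surfaced as an NPE or `java.lang.NumberFormatException: null`.
    throw new RuntimeException(s"null value found but field $name is not nullable.")
  } else {
    convert(datum)
  }
}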
(Two comments were folded by the last commit but are still valid to explain the changes here.)
Currently, running the code below:
with the data below:
throws an exception as below:
This is related to apache/spark#15767. Please refer to the description there.
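As a purely hypothetical illustration of the scenario this PR targets, a permissive read where a row has fewer tokens than the schema, read through PrunedScan by selecting a subset of columns; the schema, data, and column names below are assumptions, not the original snippets:
// Hypothetical example: the schema has more columns than the row "1,2" provides,
// so the missing token for "c" is padded in permissive mode; selecting a subset
// of columns goes through PrunedScan, where the padded token used to break casting.
val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, true) ::
  StructField("c", IntegerType, true) :: Nil)
new CsvParser()
  .withSchema(schema)
  .withUseHeader(false)
  .withParseMode(ParseModes.PERMISSIVE_MODE)
  .csvRdd(sqlContext, sqlContext.sparkContext.parallelize(Seq("1,2")))
  .select("a", "c")
  .show()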