
Should read null properly when schema is larger than parsed tokens in PrunedScan #420

Closed

Conversation

@HyukjinKwon (Member) commented Mar 28, 2017

Currently, running the code below:

val schema = StructType(
  StructField("bool", BooleanType, true) ::
  StructField("nullcol", IntegerType, true) ::
  StructField("nullcol1", IntegerType, true) :: Nil)

new CsvParser()
  .withSchema(schema)
  .withUseHeader(true)
  .withParserLib(parserLib)
  .withParseMode(ParseModes.PERMISSIVE_MODE)
  .csvFile(sqlContext, boolFile)
  .select("bool", "nullcol")
  .collect()

with the data below:

bool
"True"
"False"

"true"

throws an exception as below:

java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:542)
	at java.lang.Integer.parseInt(Integer.java:615)
	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
	at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
	at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:57)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:196)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:174)

This is related to apache/spark#15767. Please refer to the description there.
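For context, the failure mode can be illustrated without Spark. This is a hypothetical sketch (the names are illustrative, not library code) of how a pruned scan ends up handing a null token to the cast path when the schema has more fields than the parsed row produced tokens:

```scala
// Illustrative sketch: when the requested schema has more fields than the
// row yielded tokens, padding the token array leaves trailing nulls, which
// the old cast path fed straight into Integer.parseInt.
val schemaFields = Seq("bool", "nullcol", "nullcol1")
val tokens = Array("True") // this row only produced one token
val padded = tokens ++ Array.fill[String](schemaFields.length - tokens.length)(null)
// padded is Array("True", null, null); calling .toInt on padded(1) is what
// produced the java.lang.NumberFormatException: null in the stack trace above.
```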

@codecov-io commented Mar 28, 2017

Codecov Report

Merging #420 into master will increase coverage by 0.11%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #420      +/-   ##
==========================================
+ Coverage   87.01%   87.12%   +0.11%     
==========================================
  Files          12       12              
  Lines         562      567       +5     
  Branches      136      144       +8     
==========================================
+ Hits          489      494       +5     
  Misses         73       73
Impacted Files Coverage Δ
...scala/com/databricks/spark/csv/util/TypeCast.scala 94.87% <100%> (+0.58%) ⬆️
...n/scala/com/databricks/spark/csv/CsvRelation.scala 89.02% <100%> (+0.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

// If the given column is not nullable, we simply fall back to normal string
// rather than returning null for backwards compatibility. Note that this case is
// different with Spark's internal CSV datasource.
val isNullValueMatched = datum == nullValue && nullable
@HyukjinKwon (author) commented on the diff:

I think this case currently differs from Spark, but I did not change the behaviour here.
In this case, Spark throws an exception rather than falling back to string, as below:

val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)

spark.read.schema(schema)
  .option("header", "false")
  .option("nullValue", "2")
  .option("mode", "FAILFAST")
  .csv(Seq("1,2").toDS())
  .show()

throws

java.lang.RuntimeException: null value found but field b is not nullable.
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:104)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:103)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)

In the case of this library,

val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)

new CsvParser()
  .withSchema(schema)
  .withUseHeader(false)
  .withParseMode(ParseModes.FAIL_FAST_MODE)
  .withNullValue("2")
  .csvRdd(sqlContext, sqlContext.sparkContext.parallelize(Seq("1,2")))
  .show()

prints the following:

+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
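The fallback described in the code comment above boils down to a small rule, sketched here under the assumption that only the token/null decision matters (`handleNullValue` is a hypothetical name, not the library's API):

```scala
// Illustrative sketch of the nullValue handling: a token equal to nullValue
// becomes null only when the field is nullable; for a non-nullable field it
// falls back to the plain string for backwards compatibility.
def handleNullValue(datum: String, nullValue: String, nullable: Boolean): Any =
  if (datum == nullValue && nullable) null else datum
```

This is why the library example above shows `2` in column `b` instead of raising an error: with `nullable = false`, the matched null value falls back to the original string.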

@HyukjinKwon (author) commented:

I intended not to break this behaviour in this library, but I am willing to match Spark's internal CSV behaviour if you prefer.

val exception3 = intercept[RuntimeException] {
  TypeCast.castTo(null, "_c", StringType, nullable = false)
}
assert(exception3.getMessage.contains("null value found but field _c is not nullable"))
@HyukjinKwon (author) commented:

Here are the actual test cases added.


// `datum` can be null when tokens are inserted in permissive mode via `PrunedScan`.
// In this case, they are treated as nulls.
val isNullDatum = datum == null
@HyukjinKwon (author) commented:

Here are the actual cases added:

  • datum == null && nullable: returns null (previously an NPE or java.lang.NumberFormatException: null).
  • datum == null && !nullable: throws an exception (previously an NPE or java.lang.NumberFormatException: null).
  • treatEmptyValuesAsNulls && datum == "" && !nullable: throws an exception (previously it tried to set null on a non-nullable field).

These should not amount to behaviour changes.
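Putting the cases together, the null-handling rules can be sketched as follows. This is a minimal, self-contained illustration, not the library's actual `TypeCast.castTo` (the name `castToSketch` is hypothetical, and `toInt` stands in for the real per-type conversion):

```scala
// Illustrative sketch of the null-handling rules above; the real
// TypeCast.castTo also dispatches on the target data type.
def castToSketch(datum: String, name: String, nullable: Boolean,
                 treatEmptyValuesAsNulls: Boolean = false): Any = {
  val isNullDatum = datum == null || (treatEmptyValuesAsNulls && datum == "")
  if (isNullDatum) {
    if (!nullable) {
      throw new RuntimeException(s"null value found but field $name is not nullable.")
    }
    null // nullable field: the missing token becomes a proper null
  } else {
    datum.toInt // stand-in for the actual per-type conversion
  }
}
```

Under these rules, the PrunedScan-inserted null tokens from the issue description become nulls for nullable fields instead of reaching `Integer.parseInt`.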

@HyukjinKwon

(Two comments were folded by the last commit but still valid to explain the changes here)
