
Should read null properly when schema is larger than parsed tokens in PrunedScan #420

Closed

Conversation

@HyukjinKwon (Member) commented Mar 28, 2017

Currently, running the code below:

val schema = StructType(
  StructField("bool", BooleanType, true) ::
  StructField("nullcol", IntegerType, true) ::
  StructField("nullcol1", IntegerType, true) :: Nil)

new CsvParser()
  .withSchema(schema)
  .withUseHeader(true)
  .withParserLib(parserLib)
  .withParseMode(ParseModes.PERMISSIVE_MODE)
  .csvFile(sqlContext, boolFile)
  .select("bool", "nullcol")
  .collect()

with the data below:

bool
"True"
"False"

"true"

throws an exception as below:

java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:542)
	at java.lang.Integer.parseInt(Integer.java:615)
	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
	at scala.collection.immutable.StringOps.toInt(StringOps.scala:30)
	at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:57)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:196)
	at com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:174)

This is related to apache/spark#15767. Please refer to the description there.
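For context, the failure mode can be illustrated without Spark. This is a hypothetical sketch (the names are illustrative, not library code) of how a pruned scan ends up handing a null token to the cast path when the schema has more fields than the parsed row produced tokens:

```scala
// Illustrative sketch: when the requested schema has more fields than the
// row yielded tokens, padding the token array leaves trailing nulls, which
// the old cast path fed straight into Integer.parseInt.
val schemaFields = Seq("bool", "nullcol", "nullcol1")
val tokens = Array("True") // this row only produced one token
val padded = tokens ++ Array.fill[String](schemaFields.length - tokens.length)(null)
// padded is Array("True", null, null); calling .toInt on padded(1) is what
// produced the java.lang.NumberFormatException: null in the stack trace above.
```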

@codecov-io commented Mar 28, 2017

Codecov Report

Merging #420 into master will increase coverage by 0.11%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #420      +/-   ##
==========================================
+ Coverage   87.01%   87.12%   +0.11%     
==========================================
  Files          12       12              
  Lines         562      567       +5     
  Branches      136      144       +8     
==========================================
+ Hits          489      494       +5     
  Misses         73       73
Impacted Files Coverage Δ
...scala/com/databricks/spark/csv/util/TypeCast.scala 94.87% <100%> (+0.58%) ⬆️
...n/scala/com/databricks/spark/csv/CsvRelation.scala 89.02% <100%> (+0.06%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

// If the given column is not nullable, we simply fall back to normal string
// rather than returning null for backwards compatibility. Note that this case is
// different with Spark's internal CSV datasource.
val isNullValueMatched = datum == nullValue && nullable
@HyukjinKwon (author) commented on the diff:

I think this case currently differs from Spark, but I did not change the behaviour here.
In this case, Spark throws an exception rather than falling back to string, as below:

val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)

spark.read.schema(schema)
  .option("header", "false")
  .option("nullValue", "2")
  .option("mode", "FAILFAST")
  .csv(Seq("1,2").toDS())
  .show()

throws

java.lang.RuntimeException: null value found but field b is not nullable.
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$nullSafeDatum(UnivocityParser.scala:179)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:104)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$3.apply(UnivocityParser.scala:103)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:219)
	at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)

In the case of this library,

val schema = StructType(
  StructField("a", IntegerType, true) ::
  StructField("b", IntegerType, false) :: Nil)

new CsvParser()
  .withSchema(schema)
  .withUseHeader(false)
  .withParseMode(ParseModes.FAIL_FAST_MODE)
  .withNullValue("2")
  .csvRdd(sqlContext, sqlContext.sparkContext.parallelize(Seq("1,2")))
  .show()

prints the following:

+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
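The fallback described in the code comment above boils down to a small rule, sketched here under the assumption that only the token/null decision matters (`handleNullValue` is a hypothetical name, not the library's API):

```scala
// Illustrative sketch of the nullValue handling: a token equal to nullValue
// becomes null only when the field is nullable; for a non-nullable field it
// falls back to the plain string for backwards compatibility.
def handleNullValue(datum: String, nullValue: String, nullable: Boolean): Any =
  if (datum == nullValue && nullable) null else datum
```

This is why the library example above shows `2` in column `b` instead of raising an error: with `nullable = false`, the matched null value falls back to the original string.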

@HyukjinKwon (author) commented:

I intended not to break this behaviour in this library, but I am willing to match Spark's internal CSV behaviour if you prefer.

val exception3 = intercept[RuntimeException] {
  TypeCast.castTo(null, "_c", StringType, nullable = false)
}
assert(exception3.getMessage.contains("null value found but field _c is not nullable"))
@HyukjinKwon (author) commented:

Here are the actual test cases added.


// `datum` can be null when tokens are inserted in permissive mode via `PrunedScan`.
// In this case, they are treated as nulls.
val isNullDatum = datum == null
@HyukjinKwon (author) commented:

Here are the actual cases added:

  • datum == null && nullable: returns null (previously an NPE or java.lang.NumberFormatException: null).
  • datum == null && !nullable: throws an exception (previously an NPE or java.lang.NumberFormatException: null).
  • treatEmptyValuesAsNulls && datum == "" && !nullable: throws an exception (previously it tried to set null on a non-nullable field).

These should not amount to behaviour changes.
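Putting the cases together, the null-handling rules can be sketched as follows. This is a minimal, self-contained illustration, not the library's actual `TypeCast.castTo` (the name `castToSketch` is hypothetical, and `toInt` stands in for the real per-type conversion):

```scala
// Illustrative sketch of the null-handling rules above; the real
// TypeCast.castTo also dispatches on the target data type.
def castToSketch(datum: String, name: String, nullable: Boolean,
                 treatEmptyValuesAsNulls: Boolean = false): Any = {
  val isNullDatum = datum == null || (treatEmptyValuesAsNulls && datum == "")
  if (isNullDatum) {
    if (!nullable) {
      throw new RuntimeException(s"null value found but field $name is not nullable.")
    }
    null // nullable field: the missing token becomes a proper null
  } else {
    datum.toInt // stand-in for the actual per-type conversion
  }
}
```

Under these rules, the PrunedScan-inserted null tokens from the issue description become nulls for nullable fields instead of reaching `Integer.parseInt`.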

@HyukjinKwon

(Two comments were folded by the last commit but still valid to explain the changes here)
