[SPARK-4856] [SQL] NullType instead of StringType when sampling against empty string or nul... #3708

Closed
wants to merge 2 commits into from
@@ -263,6 +263,8 @@ private[sql] object JsonRDD extends Logging {
val elementType = typeOfArray(array)
buildKeyPathForInnerStructs(array, elementType) :+ (key, elementType)
}
// we couldn't tell what the type is if the value is null or empty string
case (key: String, value) if value == "" || value == null => (key, NullType) :: Nil
Contributor

The null case makes sense to me, but why "" as well? That seems to be unequivocally a String.

Contributor Author

I've updated the unit test, which probably makes the intent clearer.

In some cases (as shown in the unit test), "" stands in for null in a struct-typed field, so we shouldn't conclude that a field must be StringType just because we encounter an empty string.

Also, NullType is the minimum data type: JsonRDD can promote it to any other data type (e.g. to StructType), whereas it's impossible to promote a StringType to StructType.

It's safe to infer NullType here, because the last promotion rule converts any remaining NullType to StringType; see https://github.com/chenghao-intel/spark/blob/json/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L231
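
To illustrate the argument, here is a minimal, self-contained sketch of the promotion idea. It does not use Spark's classes: `PromotionSketch`, `merge`, and `finalizeType` are hypothetical names, the data types are simplified stand-ins, and the fallback to StringType on conflicting types is an assumption about how the real merge behaves.

```scala
// Simplified stand-ins for Spark SQL's data types; NOT the real classes.
object PromotionSketch {
  sealed trait DataType
  case object NullType extends DataType
  case object StringType extends DataType
  final case class StructType(fields: Map[String, DataType]) extends DataType

  // Merge the types inferred for the same key in two sampled records.
  def merge(a: DataType, b: DataType): DataType = (a, b) match {
    case (x, y) if x == y  => x
    case (NullType, other) => other // NullType promotes to anything
    case (other, NullType) => other
    case (StructType(f1), StructType(f2)) =>
      // Union of the fields; a field missing on one side counts as NullType.
      StructType((f1.keySet ++ f2.keySet).map { k =>
        k -> merge(f1.getOrElse(k, NullType), f2.getOrElse(k, NullType))
      }.toMap)
    case _ => StringType // assumed fallback for conflicting types
  }

  // Final pass: whatever is still NullType becomes StringType.
  def finalizeType(t: DataType): DataType = t match {
    case NullType       => StringType
    case StructType(fs) => StructType(fs.map { case (k, v) => k -> finalizeType(v) })
    case other          => other
  }
}
```

Under these assumptions, treating `""` as NullType lets a later struct value win the merge, while treating it as StringType would collapse the whole field to a string. A field that is `""` or null in every sampled record still ends up as StringType after the final pass.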

case (key: String, value) => (key, typeOfPrimitiveValue(value)) :: Nil
}
}
@@ -400,13 +402,13 @@ private[sql] object JsonRDD extends Logging {
} else {
desiredType match {
case StringType => toString(value)
case _ if value == null || value == "" => null // guard the non string type
case IntegerType => value.asInstanceOf[IntegerType.JvmType]
case LongType => toLong(value)
case DoubleType => toDouble(value)
case DecimalType() => toDecimal(value)
case BooleanType => value.asInstanceOf[BooleanType.JvmType]
case NullType => null

case ArrayType(elementType, _) =>
value.asInstanceOf[Seq[Any]].map(enforceCorrectType(_, elementType))
case struct: StructType => asRow(value.asInstanceOf[Map[String, Any]], struct)
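
The order of the cases in this hunk matters: the StringType case comes before the null/empty-string guard, so `""` survives as a string but becomes null for every other type. A minimal sketch of that ordering, using hypothetical stand-in types and a hypothetical `enforce` function rather than the actual `enforceCorrectType`:

```scala
// Stand-in types for illustration only; NOT Spark's classes.
object CoercionSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object BooleanType extends DataType

  def enforce(value: Any, desired: DataType): Any = desired match {
    // Checked first, so an empty string stays an empty string.
    case StringType => if (value == null) null else value.toString
    // Guard for the non-string types: null and "" both become null.
    case _ if value == null || value == "" => null
    case IntegerType => value.asInstanceOf[Int]
    case BooleanType => value.asInstanceOf[Boolean]
  }
}
```

With this ordering, `enforce("", StringType)` keeps the empty string, while `enforce("", IntegerType)` yields null instead of failing on the cast.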
19 changes: 19 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala
@@ -193,6 +193,25 @@ class JsonSuite extends QueryTest {
StringType)
}

test("Complex field and type inferring with null in sampling") {
val jsonSchemaRDD = jsonRDD(jsonNullStruct)
val expectedSchema = StructType(
StructField("headers", StructType(
StructField("Charset", StringType, true) ::
StructField("Host", StringType, true) :: Nil)
, true) ::
StructField("ip", StringType, true) ::
StructField("nullstr", StringType, true):: Nil)

assert(expectedSchema === jsonSchemaRDD.schema)
jsonSchemaRDD.registerTempTable("jsonTable")

checkAnswer(
sql("select nullstr, headers.Host from jsonTable"),
Seq(Row("", "1.abc.com"), Row("", null), Row("", null), Row(null, null))
)
}

test("Primitive field and type inferring") {
val jsonSchemaRDD = jsonRDD(primitiveFieldAndType)

@@ -43,6 +43,13 @@ object TestJsonData {
"""{"num_num_1":21474836570, "num_num_2":1.1, "num_num_3": 21474836470,
"num_bool":null, "num_str":92233720368547758070, "str_bool":null}""" :: Nil)

val jsonNullStruct =
TestSQLContext.sparkContext.parallelize(
"""{"nullstr":"","ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
"""{"nullstr":"","ip":"27.31.100.29","headers":{}}""" ::
"""{"nullstr":"","ip":"27.31.100.29","headers":""}""" ::
"""{"nullstr":null,"ip":"27.31.100.29","headers":null}""" :: Nil)

val complexFieldValueTypeConflict =
TestSQLContext.sparkContext.parallelize(
"""{"num_struct":11, "str_array":[1, 2, 3],