[SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes #36883

bersprockets · 2022-06-15T21:47:41Z

What changes were proposed in this pull request?

Change Inline#elementSchema to make each struct field nullable when the containing array has a null element.

Why are the changes needed?

This query returns incorrect results (the last row should be NULL NULL):

spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null));
1	2
-1	-1
Time taken: 4.053 seconds, Fetched 2 row(s)
spark-sql>

And this query gets a NullPointerException:

spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null));
22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.NullPointerException: null
	at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?]
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?]
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere

When an array of structs is created by CreateArray, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable.

When the output attributes for Inline are non-nullable, GenerateUnsafeProjection#writeExpressionsToBuffer generates incorrect code for null structs.

In more detail, the issue is this: GenerateExec#codeGenCollection generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, GenerateUnsafeProjection#writeExpressionsToBuffer will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null).

Arrays of structs from file sources do not have this issue. In that case, each StructField will have nullable=true due to this.

(Note: the eval path for Inline has a different bug with null array elements that occurs even when nullable is set correctly in the schema, but I will address that in a separate PR).

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala

HyukjinKwon · 2022-06-16T00:38:40Z

Merged to master, branch-3.3 and branch-3.2.

### What changes were proposed in this pull request? Change `Inline#elementSchema` to make each struct field nullable when the containing array has a null element. ### Why are the changes needed? This query returns incorrect results (the last row should be `NULL NULL`): ``` spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 -1 -1 Time taken: 4.053 seconds, Fetched 2 row(s) spark-sql> ``` And this query gets a NullPointerException: ``` spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null)); 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere ``` When an array of structs is created by `CreateArray`, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable. When the output attributes for `Inline` are non-nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code for null structs. In more detail, the issue is this: `GenerateExec#codeGenCollection` generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null). Arrays of structs from file sources do not have this issue. In that case, each `StructField` will have nullable=true due to [this](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417). (Note: the eval path for `Inline` has a different bug with null array elements that occurs even when `nullable` is set correctly in the schema, but I will address that in a separate PR). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes #36883 from bersprockets/inline_struct_nullability_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit fc385da) Signed-off-by: Hyukjin Kwon <[email protected]>

Change `Inline#elementSchema` to make each struct field nullable when the containing array has a null element. This query returns incorrect results (the last row should be `NULL NULL`): ``` spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 -1 -1 Time taken: 4.053 seconds, Fetched 2 row(s) spark-sql> ``` And this query gets a NullPointerException: ``` spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null)); 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere ``` When an array of structs is created by `CreateArray`, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable. When the output attributes for `Inline` are non-nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code for null structs. In more detail, the issue is this: `GenerateExec#codeGenCollection` generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null). Arrays of structs from file sources do not have this issue. In that case, each `StructField` will have nullable=true due to [this](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417). (Note: the eval path for `Inline` has a different bug with null array elements that occurs even when `nullable` is set correctly in the schema, but I will address that in a separate PR). No. New unit test. Closes #36883 from bersprockets/inline_struct_nullability_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit fc385da) Signed-off-by: Hyukjin Kwon <[email protected]>

HyukjinKwon · 2022-06-20T01:41:15Z

sql/core/src/test/scala/org/apache/spark/sql/GeneratorFunctionSuite.scala

@@ -389,6 +389,31 @@ class GeneratorFunctionSuite extends QueryTest with SharedSparkSession {
        Row(0, 1) :: Row(3, 4) :: Row(6, 7) :: Row(null, null) :: Row(null, null) :: Nil)
    }
  }
+
+  test("SPARK-39061: inline should handle null struct") {


Seems like this fails in Scala 2.13 for some reasons. I filed a JIRA at https://issues.apache.org/jira/browse/SPARK-39524

@HyukjinKwon I will take a look.

I resolved that Jira (and a few others) as a dup of SPARK-39520 ("ExpressionSetSuite test failure with Scala 2.13") since those failures are fixed by SPARK-39520's proposed fix.

Change `Inline#elementSchema` to make each struct field nullable when the containing array has a null element. This query returns incorrect results (the last row should be `NULL NULL`): ``` spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 -1 -1 Time taken: 4.053 seconds, Fetched 2 row(s) spark-sql> ``` And this query gets a NullPointerException: ``` spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null)); 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere ``` When an array of structs is created by `CreateArray`, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable. When the output attributes for `Inline` are non-nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code for null structs. In more detail, the issue is this: `GenerateExec#codeGenCollection` generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null). Arrays of structs from file sources do not have this issue. In that case, each `StructField` will have nullable=true due to [this](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417). (Note: the eval path for `Inline` has a different bug with null array elements that occurs even when `nullable` is set correctly in the schema, but I will address that in a separate PR). No. New unit test. Closes apache#36883 from bersprockets/inline_struct_nullability_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]> (cherry picked from commit fc385da) Signed-off-by: Hyukjin Kwon <[email protected]>

bersprockets added 2 commits June 14, 2022 08:47

Add failing test

4949ea9

Potential fix

aed6288

github-actions bot added the SQL label Jun 15, 2022

HyukjinKwon reviewed Jun 16, 2022

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/generators.scala Show resolved Hide resolved

HyukjinKwon approved these changes Jun 16, 2022

View reviewed changes

HyukjinKwon closed this in fc385da Jun 16, 2022

HyukjinKwon reviewed Jun 20, 2022

View reviewed changes

bersprockets deleted the inline_struct_nullability_issue branch August 10, 2022 18:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes #36883

[SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes #36883

bersprockets commented Jun 15, 2022

HyukjinKwon commented Jun 16, 2022

HyukjinKwon Jun 20, 2022

bersprockets Jun 20, 2022

bersprockets Jun 20, 2022

HyukjinKwon Jun 20, 2022

[SPARK-39061][SQL] Set nullable correctly for Inline output attributes #36883

[SPARK-39061][SQL] Set nullable correctly for Inline output attributes #36883

Conversation

bersprockets commented Jun 15, 2022

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

HyukjinKwon commented Jun 16, 2022

HyukjinKwon Jun 20, 2022

Choose a reason for hiding this comment

bersprockets Jun 20, 2022

Choose a reason for hiding this comment

bersprockets Jun 20, 2022

Choose a reason for hiding this comment

HyukjinKwon Jun 20, 2022

Choose a reason for hiding this comment

[SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes #36883

[SPARK-39061][SQL] Set nullable correctly for `Inline` output attributes #36883