Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-39061][SQL] Set nullable correctly for
Inline
output attributes
### What changes were proposed in this pull request? Change `Inline#elementSchema` to make each struct field nullable when the containing array has a null element. ### Why are the changes needed? This query returns incorrect results (the last row should be `NULL NULL`): ``` spark-sql> select inline(array(named_struct('a', 1, 'b', 2), null)); 1 2 -1 -1 Time taken: 4.053 seconds, Fetched 2 row(s) spark-sql> ``` And this query gets a NullPointerException: ``` spark-sql> select inline(array(named_struct('a', '1', 'b', '2'), null)); 22/04/28 16:51:54 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.NullPointerException: null at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) ~[spark-catalyst_2.12-3.4.0-SNAPSHOT.jar:3.4.0-SNAPSHOT] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.generate_doConsume_0$(Unknown Source) ~[?:?] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) ~[?:?] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(Buffere ``` When an array of structs is created by `CreateArray`, and no struct field contains a literal null value, the schema for the struct will have non-nullable fields, even if the array itself has a null entry (as in the example above). As a result, the output attributes for the generator will be non-nullable. When the output attributes for `Inline` are non-nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` generates incorrect code for null structs. In more detail, the issue is this: `GenerateExec#codeGenCollection` generates code that will check if the struct instance (i.e., array element) is null and, if so, set a boolean for each struct field to indicate that the field contains a null. However, unless the generator's output attributes are nullable, `GenerateUnsafeProjection#writeExpressionsToBuffer` will not generate any code to check those booleans. Instead it will generate code to write out whatever is in the variables that normally hold the struct values (which will be garbage if the array element is null). Arrays of structs from file sources do not have this issue. In that case, each `StructField` will have nullable=true due to [this](https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L417). (Note: the eval path for `Inline` has a different bug with null array elements that occurs even when `nullable` is set correctly in the schema, but I will address that in a separate PR). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New unit test. Closes apache#36883 from bersprockets/inline_struct_nullability_issue. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
- Loading branch information