-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validation to RecordBatch
for non-nullable fields containing null values
#1890
Add validation to RecordBatch
for non-nullable fields containing null values
#1890
Conversation
@alamb I think this PR has exposed a bug in dictionary IPC encoding. The checked-in |
RecordBatch
for non-nullable fields containing null valuesRecordBatch
for non-nullable fields containing null values
arrow/src/record_batch.rs
Outdated
// hacky workaround for known issue with dictionary IPC encoding | ||
// https://github.com/apache/arrow-rs/issues/1892 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1892
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd rather an if !condition
over an if else
with an empty if body
Codecov Report
@@ Coverage Diff @@
## master #1890 +/- ##
=======================================
Coverage 83.44% 83.44%
=======================================
Files 202 202
Lines 57135 57146 +11
=======================================
+ Hits 47678 47688 +10
- Misses 9457 9458 +1
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, minor nits, big fan of this
arrow/src/record_batch.rs
Outdated
@@ -138,6 +138,20 @@ impl RecordBatch { | |||
) | |||
})?; | |||
|
|||
for (c, f) in columns.iter().zip(&schema.fields) { | |||
if !f.is_nullable() && c.null_count() > 0 { | |||
if f.name().len() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_empty() is normally preferred
arrow/src/record_batch.rs
Outdated
// hacky workaround for known issue with dictionary IPC encoding | ||
// https://github.com/apache/arrow-rs/issues/1892 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd rather an if !condition
over an if else
with an empty if body
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @andygrove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Integration tests failed with:
|
Failed with a Rust / C# test this time
|
Other builds are failing with the same issue. I wonder if it is related to apache/arrow#13279 ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
arrow/src/record_batch.rs
Outdated
// hacky workaround for known issue with dictionary IPC encoding | ||
// https://github.com/apache/arrow-rs/issues/1892 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#1893 should fix this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @viirya I have upmerged and removed this workaround
I think a rebase should clear the test failures related to IPC dictionaries |
🎉 |
Marking this as an API change because it caused many errors in the datafusion tests -- see apache/datafusion#2778 -- and I think it should be highlighted prominently in the release notes |
Which issue does this PR close?
Closes #1888
Rationale for this change
Allowing null values in non-nullable fields can lead to incorrect results and missed optimizations in downstream crates such as DataFusion
What changes are included in this PR?
Add validation check when creating a RecordBatch
Are there any user-facing changes?
Yes. Users may start getting errors if their schemas are not currently defined correctly.