Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String and BinaryArray created from iterators that don't accurately report size can lead to undefined behavior #197

Closed
alamb opened this issue Apr 26, 2021 · 2 comments
Labels
arrow Changes to the arrow crate bug security

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11862

As [~jorgecarleitao] says on apache/arrow#9588 (comment)

The (Rust) Iterator spec recommends, but does not require, that the iterator reports a correct length. Consumer that lead to undefined behavior from an incorrect size_hint are the causers of said undefined behavior.

The only case where consumers can trust the iterators' length is when the interator implement unsafe trait TrustedLen. Unfortunately, TrustedLen is still in unstable. For that reason, we have been exposing unsafe Buffer::from_trusted_len_iter and the like for those cases.

So the code should be updated to handle the case where the reported size_hint turns out to be incorrect

@alamb
Copy link
Contributor Author

alamb commented Jan 9, 2022

An update: manually audited all the uses of the size_hint in the arrow codebase:

rg -n -H --no-heading -e 'size_hint' $(git rev-parse --show-toplevel || pwd)
/Users/alamb/Software/arrow-rs/arrow/src/buffer/immutable.rs:308:                let (lower, _) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:387:        let (lower, _) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:433:        let (_, upper) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:475:        let (_, upper) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:524:        let (_, upper) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:621:            let byte_capacity: usize = iterator.size_hint().0.saturating_add(7) / 8;
/Users/alamb/Software/arrow-rs/arrow/src/buffer/mutable.rs:653:                    iterator.size_hint().0.saturating_add(7) / 8, //convert bit count to byte count, rounding up
/Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:338:        let (lower, _) = iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:376:    /// I.e. that `size_hint().1` correctly reports its length.
/Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:384:        let (_, upper) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_primitive.rs:959:        let (_, upper_size_bound) = value_iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_list.rs:153:        let (lower, _) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:236:        let (_, data_len) = iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_binary.rs:1125:        let (_, upper_size_bound) = value_iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:271:            .size_hint()
/Users/alamb/Software/arrow-rs/arrow/src/array/builder.rs:777:            .size_hint()
/Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:68:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:140:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:214:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:293:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/array/iterator.rs:370:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/util/trusted_len.rs:36:    let (_, upper) = iterator.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/util/bit_chunk_iterator.rs:160:    fn size_hint(&self) -> (usize, Option<usize>) {
/Users/alamb/Software/arrow-rs/arrow/src/array/array_boolean.rs:197:        let (_, data_len) = iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:179:        let (lower, _) = it.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_dictionary.rs:221:        let (lower, _) = it.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:174:        let (_, data_len) = iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:221:        let (_, data_len) = iter.size_hint();
/Users/alamb/Software/arrow-rs/arrow/src/array/array_string.rs:548:        let (_, upper_size_bound) = string_iter.size_hint();

My analysis is that any place where upper is used to size a buffer, it is either:

  1. Advisory for performance to allocate capacity (like Vec::with_capacity) but not required for correctness
  2. Used in an unsafe block which clearly states it is undefined behavior to call the function with an iterator that does not correctly report its size
  3. I did find one issue: Undefined behavior for GenericStringArray::from_iter_values if reported iterator upper bound is incorrect #1144

For 2. above, theunsafe code will panic if the iterator's size was inaccurate (potentially after writing past valid memory) as an additional safeguard

The only thing remaining for me to do on this issue is to verify the uses of the unsafe from_trusted_iter calls.

@alamb
Copy link
Contributor Author

alamb commented Jan 11, 2022

The only thing remaining for me to do on this issue is to verify the uses of the unsafe from_trusted_iter calls.

As a final follow up, I did verify the uses of from_trusted_iter in the crate and concluded they are all safe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug security
Projects
None yet
Development

No branches or pull requests

2 participants