Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error) #3215

Closed
aarashy opened this issue Nov 28, 2022 · 9 comments
Labels
arrow Changes to the arrow crate bug

Comments

@aarashy
Copy link
Contributor

aarashy commented Nov 28, 2022

Describe the bug

When reading Arrow Bytes, it seems boolean array inputs can trigger panics under certain conditions.
If my input bytes are bad, I want the arrow API to throw an Err, not panic.

In this case, I don't think there's something wrong with my input bytes - rather, it seems like there's an off-by-one error internally within arrow-rs which is causing invariants to be broken.

The error is being thrown in the validate routine:
https://github.com/apache/arrow-rs/blob/master/arrow-data/src/data.rs#L667-L673

And it's being unwrapped in the caller in create_primitive_array, which DOES NOT return a Result.
https://github.com/apache/arrow-rs/blob/master/arrow-ipc/src/reader.rs#L487

Judging by the fact that unwrap is used, this function presupposes certain invariants that make it safe to use unwrap, but for some inputs, these invariants seem to not be met.

Since the problem appears to be with the length of a buffer, I'm tracing the origin of that buffer (per my stack trace) to https://github.com/apache/arrow-rs/blob/master/arrow-ipc/src/reader.rs#L1140 - I might be wrong.

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Need at least 321 bytes for bitmap in buffers[0] in array of type Boolean, but got 320")', /usr/local/cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/arrow-23.0.0/src/ipc/reader.rs:486:14
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: core::result::unwrap_failed
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
   3: arrow::ipc::reader::create_primitive_array
   4: arrow::ipc::reader::create_array
   5: arrow::ipc::reader::read_record_batch
   6: arrow::ipc::reader::StreamReader<R>::maybe_next
   7: arrow::ipc::reader::StreamReader<R>::maybe_next
   8: arrow::ipc::reader::StreamReader<R>::maybe_next
   9: arrow::ipc::reader::StreamReader<R>::maybe_next
  10: arrow::ipc::reader::StreamReader<R>::maybe_next
  11: arrow::ipc::reader::StreamReader<R>::maybe_next
  12: arrow::ipc::reader::StreamReader<R>::maybe_next
  13: arrow::ipc::reader::StreamReader<R>::maybe_next
  14: arrow::ipc::reader::StreamReader<R>::maybe_next
  15: arrow::ipc::reader::StreamReader<R>::maybe_next
  16: arrow::ipc::reader::StreamReader<R>::maybe_next
  17: arrow::ipc::reader::StreamReader<R>::maybe_next
  18: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
  19: core::iter::adapters::try_process
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

To Reproduce

I am using the following routine to read IPC bytes. I don't currently have an input bytes which triggers this, but I can find one if you think it's critical for you to debug this properly.

pub fn from_ipc_bytes(bytes: &[u8]) -> Result<Vec<RecordBatch>, anyhow::Error> {
    let cursor: Cursor<&[u8]> = Cursor::new(bytes);
    let reader = arrow::ipc::reader::StreamReader::try_new(cursor, None)?;
    let record_batches = reader.collect::<Result<Vec<RecordBatch>, arrow::error::ArrowError>>()?;
    Ok(record_batches)
}

Expected behavior

Either throw an error if there's a problem with my input bytes, or succeed if there is no problem with my input bytes, but do not panic.

Additional context

@alamb - You seem to have written these validation checks, so I wonder if you would understand what might be happening for me. Let me know if it's crucial for me to provide bytes inputs which trigger the panic, but I think the stacktrace and error message here might be enough to go off of.

@aarashy aarashy added the bug label Nov 28, 2022
@alamb
Copy link
Contributor

alamb commented Nov 28, 2022

Hi @aarashy -- thank you for the report.

The validation routine you point to is basically saying that it expects that the size of the underlying buffer for the BooleanArray is at least 321 bytes, but for some reason the data has only 320.

I agree that the IPC reader should return the error rather than panicing in this case.

Not sure if you are up for proposing a PR to fix the issue (propagate the error) but if you are I would be most appreciative

@alamb alamb changed the title Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes Off-by-one buffer size error triggers Panic when constructing RecordBatch from IPC bytes (should return an Error) Nov 28, 2022
@aarashy
Copy link
Contributor Author

aarashy commented Nov 28, 2022

@alamb - I appreciate the prompt response.

Do you think it would be crazy for create_primitive_array to return a Result so that I can question-mark the build result and remove the unwraps in that function entirely? It seems the only caller of create_primitive_array is create_array which returns a Result, so I think there wouldn't be too much refactoring there.

However, there's a bigger question of why the data has only 320 instead of 321 in the first place.
That, I am less sure about.
I'll try to get a bytes input that triggers this panic to enable a full repro and maybe it would be more obvious then.

@alamb
Copy link
Contributor

alamb commented Nov 29, 2022

Do you think it would be crazy for create_primitive_array to return a Result so that I can question-mark the build result and remove the unwraps in that function entirely?

I don't think that would be crazy at all -- sounds like a good idea to me

However, there's a bigger question of why the data has only 320 instead of 321 in the first place.

👍

@aarashy
Copy link
Contributor Author

aarashy commented Nov 30, 2022

I removed the unwraps here #3232

I have some bytes which reproduce this error, but the data is private. The bytes were the result of the apache-arrow NPM package function tableToIPC - https://github.com/apache/arrow/blob/5611f2bd0d6136b005d137a84b50709fc5c813bb/js/src/ipc/serialization.ts#L61

More info about the error stacktrace above

Let me share a little bit more about the error in ArrayData::validate above.

  1. The value of len_plus_offset is 2565, and offset is 0. Thus, 2565 is the number of rows in the the record batch from higher in the callstack (read_record_batch).
  2. 2565 is not divisible by 8 (2565/8 = 320.625) - so it seems correct for the validate function to be expecting 321 bytes if we need to fit 2565 bits within byte-boundaries (there is a ceiling function being applied on the division, which seems correct).

Thus, I really want to understand why the buffer is given only 320 bytes. It seems like instead of correctly zero padding the 2565 bits to reach the next byte-boundary, the bits have truncated at the previous byte-boundary and the byte buffer is 1 byte too small.

I'm struggling to really find the origin of the length of the Array data byte buffers here which would be performing this truncation... I suppose it would be Message::header_as_dictionary_batch -> Message::buffers?

Another similar error stacktrace

I have another callstack which is throwing a similar panic on a different input bytes set. This one is from indexing into a slice out of bounds, again off-by-one.

thread 'tokio-runtime-worker' panicked at 'range end index 65 out of range for slice of length 64', /usr/local/cargo/registry/src/jackfan.us.kg-1ecc6299db9ec823/arrow-buffer-28.0.0/src/util/bit_chunk_iterator.rs:57:23
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: core::slice::index::slice_end_index_len_fail_rt
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/slice/index.rs:76:5
   3: core::slice::index::slice_end_index_len_fail
             at ./rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/slice/index.rs:69:9
   4: arrow_buffer::util::bit_chunk_iterator::UnalignedBitChunk::new
   5: arrow_buffer::buffer::immutable::Buffer::count_set_bits_offset
   6: arrow_data::data::ArrayData::new_unchecked
   7: arrow_data::data::ArrayDataBuilder::build_unchecked
   8: arrow_ipc::reader::create_dictionary_array
   9: arrow_ipc::reader::create_array
  10: arrow_ipc::reader::read_record_batch

In this case:

  1. The record batch is of length 518, which is again not divisible by 8 (64.75).
  2. The panic is coming from https://github.com/apache/arrow-rs/blob/master/arrow-buffer/src/util/bit_chunk_iterator.rs#L57 - the UnalignedBitChunk is being created downstream of the count_nulls routine in ArrayData::new_unchecked. This callstack doesn't perform validations or return Result, so it's awkward for me to add a slice-out-bounds check and return an error. To me this indicates, again, that the wherever the null_bit_buffer length is being initialized within the arrow-rs callstack, it's not being padded correctly and instead being truncated.
  3. The data_type of the array being created here is Dictionary(Int32, Utf8).

I wonder if @tustvold, @alamb, or @HaoYang670 would know anything about improperly truncated / padded byte buffers during IPC byte reading. It's possible that my input is missing some padding, but perhaps this crate should generously perform padding which the Javascript apache-arrow crate is neglecting. It's also possible that this is a Javascript arrow package bug, but if we have an opportunity to be more permissive and flexible here, we should take it (and we at least shouldn't have panics),

@viirya
Copy link
Member

viirya commented Nov 30, 2022

If your IPC payload is generated by apache-arrow NPM package function tableToIPC, the size of buffers is produced by that, not from arrow-rs. IPC reader just reads provided buffer with provided offset/length. So if the buffer size is wrong, most likely it is because tableToIPC or the NPM package produces incorrect IPC payload.

@aarashy
Copy link
Contributor Author

aarashy commented Nov 30, 2022

@viirya , thanks for your helpful reply. I've created this issue in the arrow repo to track it there. (apache/arrow#14791)

Now my last question is, are we okay with the fact that the UnalignedBitChunk indexes into the slice without checking the bounds first, causing a panic?

It seems like, in situations where the input arrow bytes are the correct shape but have some subtle bug like this, arrow-rs is vulnerable to panic. Do you think it's important to add appropriate checks here instead of using functions like new_unchecked? By its name, this function seems to believe that essential invariants like buffer size have already been verified, but in this case they are not.

@tustvold
Copy link
Contributor

Now my last question is, are we okay with the fact that the UnalignedBitChunk indexes into the slice without checking the bounds first, causing a panic?

I think the issue here is that currently create_dictionary_array isn't performing validation. I think this is incorrect and should be fixed.

UnalignedBitChunk expects that its inputs are already bounds checked, and panics as an invariant has been violated. This I think is fine.

@tustvold
Copy link
Contributor

tustvold commented Dec 6, 2022

I don't think there are further action items for this ticket, so closing. Feel free to reopen if I have missed something

@tustvold tustvold closed this as completed Dec 6, 2022
@alamb
Copy link
Contributor

alamb commented Dec 9, 2022

label_issue.py automatically added labels {'arrow'} from #3247

@alamb alamb added the arrow Changes to the arrow crate label Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

No branches or pull requests

4 participants