Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure there is a single zero in the offsets buffer for an empty ListArray. #1620

Closed
alamb opened this issue Apr 25, 2022 · 13 comments
Closed

Comments

@alamb
Copy link
Contributor

alamb commented Apr 25, 2022

Good idea checking the C++ version. The docs for binary layout also mention

The offsets buffer contains length + 1 signed integers
...
and the last slot is the length of the values array

which would mean there has to be a single zero in the offsets buffer for an empty ListArray.

If this is a requirements it would be better to validate it when creating ArrayData. The code in ArrayData::validate_each_offset explicitly allow this case though:

        // An empty binary-like array can have 0 offsets
        if self.len == 0 && offset_buffer.is_empty() {
            return Ok(());
        }

I'm more hesitant to change this now, maybe let's wait for some more eyes on this.

Originally posted by @jhorstmann in #1602 (comment)

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 8, 2022

I got this error:

---- ipc::reader::tests::read_generated_files_014 stdout ----
---- ipc::reader::tests::read_generated_files_014 stdout ----
thread 'ipc::reader::tests::read_generated_files_014' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Offsets buffer size (bytes): 0 isn't large enough for Binary. Length 0 needs 1")', arrow/src/ipc/reader.rs:276:29

---- ipc::reader::tests::read_generated_streams_014 stdout ----
thread 'ipc::reader::tests::read_generated_streams_014' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Offsets buffer size (bytes): 0 isn't large enough for Binary. Length 0 needs 1")', arrow/src/ipc/reader.rs:276:29

---- ipc::writer::tests::read_and_rewrite_generated_files_014 stdout ----
thread 'ipc::writer::tests::read_and_rewrite_generated_files_014' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Offsets buffer size (bytes): 0 isn't large enough for Binary. Length 0 needs 1")', arrow/src/ipc/reader.rs:276:29

---- ipc::writer::tests::read_and_rewrite_generated_streams_014 stdout ----
thread 'ipc::writer::tests::read_and_rewrite_generated_streams_014' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Offsets buffer size (bytes): 0 isn't large enough for Binary. Length 0 needs 1")', arrow/src/ipc/reader.rs:276:29

after deleting
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L737-L740
and
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1052-L1055

I don't know how data is generated in arrow-testing.

@HaoYang670
Copy link
Contributor

@alamb Could you please help to recheck that whether empty array has empty offsets buffer in arrow-testing?

@alamb
Copy link
Contributor Author

alamb commented May 9, 2022

Hi @HaoYang670. I looked at the C++ implementation and I think it does allow an empty offsets buffer for an empty array -- namely the validation code skips doing any checks if data.len == 0:

https://github.com/apache/arrow/blob/c715bebbd89089f385c9996560866da23ea1ddda/cpp/src/arrow/array/validate.cc#L550

I don't know how data is generated in arrow-testing.

I also don't know how this data was generated

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 9, 2022

Does it mean that for empty (list array, binary array, and string array), both an empty offsets buffer and an offsets buffer with a single zero are allowed?

If this is true, how could we implement this issue?

@HaoYang670
Copy link
Contributor

HaoYang670 commented May 9, 2022

@alamb
Copy link
Contributor Author

alamb commented May 9, 2022

But it seems like list array does not allow empty offsets:

I think that code is for constructing a new array programatically. The validation code I think may be more permissive to support arrays that come in from other arrow implementations.

Maybe it is time to ask on the [email protected] mailing list 🤔 I am out of ideas

@HaoYang670
Copy link
Contributor

I will send an email

@jorgecarleitao
Copy link
Member

Hey, there is one file in the test files that has no offset - it is the file for version 0.14 or something. My understanding is that previous versions of the (IPC) spec did not require offset and recent versions require them.

I have a comment about this in arrow2.

(I am curious about the mailing list discussion though; this is not very authoritative ^^)

@HaoYang670
Copy link
Contributor

It seems like Arrow tends to be ambiguous in this case, and an empty buffer or a buffer with just a 0 in it both are valid.

@HaoYang670
Copy link
Contributor

Personally, I prefer keeping the current implementation.

@alamb
Copy link
Contributor Author

alamb commented May 10, 2022

Email link for cross reference: https://lists.apache.org/thread/w7g1zfqrjxx0bvrct0mt5zwxvdnc9nob

@HaoYang670
Copy link
Contributor

Could we close this issue, or open it for more discussions?

@alamb
Copy link
Contributor Author

alamb commented May 11, 2022

I think the conclusion is "we should keep the current implementation, as imperfect as it may be, for backwards compatibility reasons"

Agree with closing -- anyone who disagrees please let me know and I'll reopen it

Thanks for the effort @HaoYang670

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants