Add Append Column API (#4155) #4269

Merged (3 commits into apache:master, May 24, 2023)

Conversation

tustvold (Contributor):

Which issue does this PR close?

Closes #4155


The github-actions bot added the parquet label (Changes to the parquet crate) on May 23, 2023.
@@ -104,12 +104,12 @@ impl<'a> ByteArrayWriter<'a> {
     /// Returns a new [`ByteArrayWriter`]
     pub fn new(
         descr: ColumnDescPtr,
-        props: &'a WriterPropertiesPtr,
+        props: WriterPropertiesPtr,
tustvold (Contributor, Author), May 23, 2023:

This was a drive-by cleanup. Given it is going to be cloned anyway, might as well just pass the cloned type (this API is crate private)
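A minimal standalone sketch of the rationale (not the PR's code: `WriterPropertiesPtr` is an `Arc` alias in the parquet crate, and the function names here are hypothetical stand-ins):

```rust
use std::sync::Arc;

struct WriterProperties; // stand-in for the real struct
type WriterPropertiesPtr = Arc<WriterProperties>;

// Taking &WriterPropertiesPtr forces a clone inside the function even when
// the caller could simply have handed over its Arc.
fn new_by_ref(props: &WriterPropertiesPtr) -> WriterPropertiesPtr {
    props.clone() // Arc clone: a cheap reference-count bump
}

// Taking the Arc by value lets the caller decide: clone at the call site,
// or move the last reference in for free.
fn new_by_value(props: WriterPropertiesPtr) -> WriterPropertiesPtr {
    props
}

fn main() {
    let props = Arc::new(WriterProperties);
    let _a = new_by_ref(&props);
    let _b = new_by_value(props.clone()); // explicit clone, visible to the caller
    let _c = new_by_value(props);         // or just move it
}
```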

let mut column_state_slice = column_state.as_mut_slice();
let mut column_writers = Vec::with_capacity(columns.len());
for c in columns {
    let ((buf, out), tail) = column_state_slice.split_first_mut().unwrap();
tustvold (Contributor, Author):

This is somewhat obtuse, but it is necessary to make the lifetimes work.
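A standalone sketch (not the PR's code) of the lifetime problem this pattern solves: indexing the slice inside the loop re-borrows the whole slice on every iteration, so the collected `&mut` references could not all remain live at once, whereas `split_first_mut` peels off disjoint mutable borrows that can outlive the loop:

```rust
// Collect one long-lived `&mut T` per element. Writing `&mut state[i]` in the
// loop would mutably borrow the whole slice each time and fail to compile once
// the references are retained; split_first_mut instead returns a disjoint head
// reference plus the remaining tail.
fn collect_mut<T>(mut rest: &mut [T]) -> Vec<&mut T> {
    let mut out = Vec::with_capacity(rest.len());
    while let Some((head, tail)) = rest.split_first_mut() {
        out.push(head);
        rest = tail; // shrink the window instead of re-borrowing the whole slice
    }
    out
}

fn main() {
    let mut state = vec![1, 2, 3];
    for r in collect_mut(&mut state) {
        *r += 10;
    }
    assert_eq!(state, [11, 12, 13]);
}
```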

@@ -1540,4 +1609,83 @@ mod tests {
        assert_eq!(s.min_value.as_deref(), Some(1_i32.to_le_bytes().as_ref()));
        assert_eq!(s.max_value.as_deref(), Some(3_i32.to_le_bytes().as_ref()));
    }

    #[test]
    fn test_spliced_write() {
tustvold (Contributor, Author):

This is modeled on what #3871 will need to do within the ArrowWriter

///
/// This can be used for efficiently concatenating or projecting parquet data,
/// or encoding parquet data to temporary in-memory buffers
pub fn splice_column<R: ChunkReader>(
tustvold (Contributor, Author):

It is perhaps worth highlighting that if the reader's contents don't correspond to the ColumnCloseResult, the resulting parquet file will contain gibberish. Ultimately there is no way to prevent this; after all, if users really wanted to, they could just write whatever they felt like to the underlying file anyway, so I don't think this is actually an issue. The onus is ultimately on the read side to tolerate broken files.

alamb (Contributor):

I have some comments on this API:

pub?

Are we happy enough with this API to mark it pub? Maybe we should leave it crate private until there is an example showing how to use it (see the Example section below).

Naming

I don't understand the use of the name splice given this API appears to append a column; the only difference is that the column comes from some other source.

Given this, I suggest an alternate name like append_column or append_column_from_reader.

Example

I also think it would be super helpful to write an example program in parquet/examples that shows how to append data to an existing file (e.g. #4150) and link it from this doc comment. Perhaps you plan to do that as a follow-on PR.

Documentation

I recommend the following additions:

  1. Add the caveat from the PR description that the data has to match, or else invalid parquet will be written
  2. Add a note that the next column from the reader is appended to the next column of the writer (the state is stored in the reader)
  3. Explain that close is the result of closing the previous column in this writer

Reduced Foot Guns 🦶 🔫

While I agree that providing users unlimited protection is probably not reasonable, I do think we should provide basic error checking to help users avoid silly errors.

For example, perhaps we can at least make sure metadata.column_descr_ptr() matches the target column (to make sure the column name and type match)?

tustvold (Contributor, Author):

> For example, perhaps we can at least make sure metadata.column_descr_ptr() matches the target column

This check already exists: https://github.com/apache/arrow-rs/pull/4269/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR522

> Explain that the close is the result from closing the previous column in this writer

It need not be; ColumnCloseResult is just a struct of column data. There are various ways a user could conceivably construct it.

> Perhaps you plan to do that as a follow on PR

I have a PR almost ready that adds a parquet-concat binary that will show how to use this

> Are we happy enough with this API to mark it pub

In this case I would rather expose it so that people can explore the various use-cases it unlocks. I also have a PR lined up that uses it to efficiently concatenate parquet files, and it will need to be public for that.
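For a sense of what that concatenation looks like, here is a rough sketch of vertically concatenating files with this API. It is hedged: written against the crate roughly as of this PR, using the merged name append_column rather than splice_column; schema validation and error handling are elided, and the exact ColumnCloseResult field list may differ between versions.

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::writer::ColumnCloseResult;
use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

/// Concatenate `inputs` (which must share a schema) into `output` without
/// decoding any pages: each encoded column chunk is copied verbatim.
fn concat(inputs: Vec<File>, output: File) -> Result<()> {
    // Read the footers up front; the writer needs the schema before starting.
    let inputs = inputs
        .into_iter()
        .map(|f| {
            let metadata = parse_metadata(&f)?;
            Ok((f, metadata))
        })
        .collect::<Result<Vec<_>>>()?;

    let schema = inputs[0].1.file_metadata().schema_descr().root_schema_ptr();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for (input, metadata) in inputs {
        for rg in metadata.row_groups() {
            let mut rg_out = writer.next_row_group()?;
            for column in rg.columns() {
                // Describe the existing chunk so the writer can splice its
                // bytes in and re-encode only the file-level metadata.
                let close = ColumnCloseResult {
                    bytes_written: column.compressed_size() as _,
                    rows_written: rg.num_rows() as _,
                    metadata: column.clone(),
                    bloom_filter: None,
                    column_index: None,
                    offset_index: None,
                };
                rg_out.append_column(&input, close)?;
            }
            rg_out.close()?;
        }
    }
    writer.close()?;
    Ok(())
}
```

Note how the ColumnCloseResult is assembled directly from the source file's ColumnChunkMetaData; as discussed above, that is only one of the ways it can be constructed.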

alamb (Contributor) left a review:

The code looks good to me, thank you @tustvold. While I am not an expert in this area, I read all the code and tests and they make sense to me; since this API isn't used by existing code, I think the impact is minimal.

I left a bunch of API improvement suggestions inline, but nothing that I think couldn't be improved or fixed as a follow-on.


tustvold merged commit 58e2c1c into apache:master on May 24, 2023.
tustvold mentioned this pull request on May 24, 2023.
tustvold (Contributor, Author):

Example in #4274

tustvold changed the title from "Add splice column API (#4155)" to "Add Append Column API (#4155)" on May 24, 2023.
alamb pushed a commit to alamb/arrow-rs that referenced this pull request on May 30, 2023:
* Add splice column API (apache#4155)

* Review feedback

* Re-encode offset index
alippai (Contributor), Jun 1, 2023:

Is append_column() a good name? I'd expect it to add a new column to an existing set of columns (i.e. adding a new field "horizontally" to the row group).

tustvold (Contributor, Author), Jun 1, 2023:

> I'd expect that it adds a new column to an existing set of columns (ie adding a new field "horizontally" to the row group)

That is what it does?

alippai (Contributor), Jun 1, 2023:

Oh sorry, silly me. Based on the other discussion, I thought this was the one that concatenates vertically, increasing the number of rows rather than the number of columns.

tustvold (Contributor, Author), Jun 1, 2023:

> this is the one which concats vertically

They're two sides of the same coin: to concatenate vertically, you simply concatenate all the columns in each row group 😄

Labels: parquet (Changes to the parquet crate)
Linked issue: Splice Parquet Data (#4155)
3 participants