Add Append Column API (#4155) #4269
Conversation
```diff
@@ -104,12 +104,12 @@ impl<'a> ByteArrayWriter<'a> {
     /// Returns a new [`ByteArrayWriter`]
     pub fn new(
         descr: ColumnDescPtr,
-        props: &'a WriterPropertiesPtr,
+        props: WriterPropertiesPtr,
```
This was a drive-by cleanup. Given it is going to be cloned anyway, might as well just pass the cloned type (this API is crate private)
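For anyone skimming, a minimal standalone sketch of the trade-off (the type and function names here are hypothetical stand-ins; in the parquet crate `WriterPropertiesPtr` is an alias for `Arc<WriterProperties>`):

```rust
use std::sync::Arc;

// Stand-ins for illustration only
struct WriterProperties;
type WriterPropertiesPtr = Arc<WriterProperties>;

struct Writer {
    props: WriterPropertiesPtr,
}

// Before: the callee borrows the Arc, then clones it internally anyway
fn new_borrowed(props: &WriterPropertiesPtr) -> Writer {
    Writer { props: props.clone() }
}

// After: the (cheap) refcount bump happens at the call site instead,
// and the signature sheds its borrow entirely
fn new_owned(props: WriterPropertiesPtr) -> Writer {
    Writer { props }
}

fn main() {
    let props: WriterPropertiesPtr = Arc::new(WriterProperties);
    let _a = new_borrowed(&props);
    let _b = new_owned(props);
}
```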
```rust
let mut column_state_slice = column_state.as_mut_slice();
let mut column_writers = Vec::with_capacity(columns.len());
for c in columns {
    let ((buf, out), tail) = column_state_slice.split_first_mut().unwrap();
```
This is somewhat obtuse, but is necessary to make the lifetimes work
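For readers unfamiliar with the pattern, a hedged, self-contained sketch (the element types are simplified stand-ins for the real per-column state): splitting the head off the slice each iteration hands out a mutable borrow that is provably disjoint from the tail, so all the borrows can be kept alive at once.

```rust
// Simplified per-column state: (scratch buffer, output buffer)
fn per_column_outputs(
    column_state: &mut [(Vec<u8>, Vec<u8>)],
) -> Vec<&mut Vec<u8>> {
    let mut column_state_slice = column_state;
    let mut outputs = Vec::with_capacity(column_state_slice.len());
    while !column_state_slice.is_empty() {
        // The head borrow is disjoint from `tail`, so it may be retained
        // while the loop keeps carving up the remainder
        let ((_buf, out), tail) = column_state_slice.split_first_mut().unwrap();
        outputs.push(out);
        column_state_slice = tail;
    }
    // A plain indexed loop (`&mut column_state[i].1`) would instead
    // re-borrow the whole slice mutably each iteration and not compile
    outputs
}
```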
```diff
@@ -1540,4 +1609,83 @@ mod tests {
         assert_eq!(s.min_value.as_deref(), Some(1_i32.to_le_bytes().as_ref()));
         assert_eq!(s.max_value.as_deref(), Some(3_i32.to_le_bytes().as_ref()));
     }
+
+    #[test]
+    fn test_spliced_write() {
```
This is modeled on what #3871 will need to do within the ArrowWriter
parquet/src/file/writer.rs (Outdated)
```rust
///
/// This can be used for efficiently concatenating or projecting parquet data,
/// or encoding parquet data to temporary in-memory buffers
pub fn splice_column<R: ChunkReader>(
```
It is perhaps worth highlighting that if the reader doesn't correspond to the `ColumnCloseResult`, the resulting parquet file will contain gibberish. Ultimately there is no way to prevent this; after all, if the user really wanted to, they could just write whatever they felt like to the underlying file anyway, so I don't think this is actually an issue. The onus is ultimately on the read side to tolerate broken files.
I have some comments on this API:

pub?

Are we happy enough with this API to mark it pub? Maybe we should leave it crate private until there is an example showing how to use it (see the Example section below).

Naming

I don't understand the use of the name `splice` given this API appears to append a column -- the only difference is that the column comes from some other source. Given this, I suggest an alternate name like `append_column` or `append_column_from_reader`.

Example

I also think it would be super helpful to write an example program in `parquet/examples` that shows how to append data to an existing file (e.g. #4150) and link that to this doc comment. Perhaps you plan to do that as a follow-on PR.

Documentation

I recommend the following additions:

- Add the caveat from the PR description that the data has to match or else invalid parquet will be written
- Add a note that the next column from `reader` is appended to the next column from `writer` (the state is stored in the reader)
- Explain that `close` is the result from closing the previous column in this writer

Reduced Foot Guns 🦶 🔫

While I agree providing users unlimited protection is probably not reasonable, I do think we should provide basic error checking to help users avoid silly errors. For example, perhaps we can at least make sure `metadata.column_descr_ptr()` matches the target column (to make sure the column name and type match)?
> For example, perhaps we can at least make sure metadata.column_descr_ptr() matches the target column

This check already exists - https://github.com/apache/arrow-rs/pull/4269/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR522

> Explain that the close is the result from closing the previous column in this writer

It need not be, `ColumnCloseResult` is just a struct of column data. There are various ways a user could conceivably construct it.

> Perhaps you plan to do that as a follow-on PR

I have a PR almost ready that adds a parquet-concat binary that will show how to use this.

> Are we happy enough with this API to mark it pub

In this case I would rather expose it so that people can explore the various use-cases it unlocks. I also have a PR lined up that uses it to efficiently concatenate parquet files, and it will need to be public for that.
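To make the concatenation use-case concrete, here is a hedged sketch of what such a parquet-concat flow could look like, building each `ColumnCloseResult` purely from the source file's metadata (names follow the API as proposed in this PR, using the suggested `append_column` rename; `concat` itself is a hypothetical helper, and all inputs are assumed to share an identical schema):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::writer::ColumnCloseResult;
use parquet::errors::Result;
use parquet::file::properties::WriterProperties;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use parquet::file::writer::SerializedFileWriter;

/// Concatenate `inputs` into `output` without decoding any pages
fn concat(inputs: &[String], output: File) -> Result<()> {
    // Take the schema from the first input (assumed identical across inputs)
    let first = SerializedFileReader::new(File::open(&inputs[0])?)?;
    let schema = first
        .metadata()
        .file_metadata()
        .schema_descr()
        .root_schema_ptr();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for path in inputs {
        let file = File::open(path)?;
        let reader = SerializedFileReader::new(file.try_clone()?)?;
        for rg in reader.metadata().row_groups() {
            let mut rg_writer = writer.next_row_group()?;
            for column in rg.columns() {
                // Rebuild the close result from existing metadata; the raw
                // column chunk bytes are then copied over verbatim
                let close = ColumnCloseResult {
                    bytes_written: column.compressed_size() as _,
                    rows_written: rg.num_rows() as _,
                    metadata: column.clone(),
                    bloom_filter: None,
                    column_index: None,
                    offset_index: None,
                };
                rg_writer.append_column(&file, close)?;
            }
            rg_writer.close()?;
        }
    }
    writer.close()?;
    Ok(())
}
```

If a close result doesn't actually describe the bytes in `file`, this writes gibberish, which is exactly the caveat discussed above.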
I think the code looks good to me, thank you @tustvold -- while I am not an expert in this area, I read all the code and tests and it makes sense to me, and since this API isn't used by existing code I think the impact is minimal.

I have a bunch of API improvement suggestions which I left inline, but nothing that I think couldn't be improved / fixed as a follow-on.
> I also think it would be super helpful to write an example program in parquet/examples

Example in #4274
* Add splice column API (apache#4155)
* Review feedback
* Re-encode offset index
Is
That is what it does?
Oh sorry, silly me. Based on the other discussion I thought this was the one which concats vertically, increasing the number of rows, not the columns.
They're two sides of the same coin: to concatenate vertically, you simply concatenate all the columns in a row group 😄
Which issue does this PR close?
Closes #4155
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?