Add Append Column API (#4155) #4269

Merged (3 commits into apache:master, May 24, 2023)

Conversation

tustvold (Contributor):

Which issue does this PR close?

Closes #4155


The github-actions bot added the parquet label (Changes to the parquet crate) on May 23, 2023.
@@ -104,12 +104,12 @@ impl<'a> ByteArrayWriter<'a> {
     /// Returns a new [`ByteArrayWriter`]
     pub fn new(
         descr: ColumnDescPtr,
-        props: &'a WriterPropertiesPtr,
+        props: WriterPropertiesPtr,
tustvold (Contributor, Author), May 23, 2023:

This was a drive-by cleanup. Given it is going to be cloned anyway, might as well just pass the cloned type (this API is crate private)
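A minimal standalone sketch of the rationale (not the PR's code: `WriterPropertiesPtr` is an `Arc` alias in the parquet crate, and the function names here are hypothetical stand-ins):

```rust
use std::sync::Arc;

struct WriterProperties; // stand-in for the real struct
type WriterPropertiesPtr = Arc<WriterProperties>;

// Taking &WriterPropertiesPtr forces a clone inside the function even when
// the caller could simply have handed over its Arc.
fn new_by_ref(props: &WriterPropertiesPtr) -> WriterPropertiesPtr {
    props.clone() // Arc clone: a cheap reference-count bump
}

// Taking the Arc by value lets the caller decide: clone at the call site,
// or move the last reference in for free.
fn new_by_value(props: WriterPropertiesPtr) -> WriterPropertiesPtr {
    props
}

fn main() {
    let props = Arc::new(WriterProperties);
    let _a = new_by_ref(&props);
    let _b = new_by_value(props.clone()); // explicit clone, visible to the caller
    let _c = new_by_value(props);         // or just move it
}
```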

let mut column_state_slice = column_state.as_mut_slice();
let mut column_writers = Vec::with_capacity(columns.len());
for c in columns {
    let ((buf, out), tail) = column_state_slice.split_first_mut().unwrap();
tustvold (Contributor, Author):

This is somewhat obtuse, but it is necessary to make the lifetimes work.
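A standalone sketch (not the PR's code) of the lifetime problem this pattern solves: indexing the slice inside the loop re-borrows the whole slice on every iteration, so the collected `&mut` references could not all remain live at once, whereas `split_first_mut` peels off disjoint mutable borrows that can outlive the loop:

```rust
// Collect one long-lived `&mut T` per element. Writing `&mut state[i]` in the
// loop would mutably borrow the whole slice each time and fail to compile once
// the references are retained; split_first_mut instead returns a disjoint head
// reference plus the remaining tail.
fn collect_mut<T>(mut rest: &mut [T]) -> Vec<&mut T> {
    let mut out = Vec::with_capacity(rest.len());
    while let Some((head, tail)) = rest.split_first_mut() {
        out.push(head);
        rest = tail; // shrink the window instead of re-borrowing the whole slice
    }
    out
}

fn main() {
    let mut state = vec![1, 2, 3];
    for r in collect_mut(&mut state) {
        *r += 10;
    }
    assert_eq!(state, [11, 12, 13]);
}
```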

@@ -1540,4 +1609,83 @@ mod tests {
        assert_eq!(s.min_value.as_deref(), Some(1_i32.to_le_bytes().as_ref()));
        assert_eq!(s.max_value.as_deref(), Some(3_i32.to_le_bytes().as_ref()));
    }

    #[test]
    fn test_spliced_write() {
tustvold (Contributor, Author):

This is modeled on what #3871 will need to do within the ArrowWriter

///
/// This can be used for efficiently concatenating or projecting parquet data,
/// or encoding parquet data to temporary in-memory buffers
pub fn splice_column<R: ChunkReader>(
tustvold (Contributor, Author):

It is perhaps worth highlighting that if the reader's contents don't correspond to the ColumnCloseResult, the resulting parquet file will contain gibberish. Ultimately there is no way to prevent this; after all, if users really wanted to, they could just write whatever they felt like to the underlying file anyway, so I don't think this is actually an issue. The onus is ultimately on the read side to tolerate broken files.

alamb (Contributor):

I have some comments on this API:

pub?

Are we happy enough with this API to mark it pub? Maybe we should leave it crate private until there is an example showing how to use it (see the Example section below).

Naming

I don't understand the use of the name splice given this API appears to append a column; the only difference is that the column comes from some other source.

Given this, I suggest an alternate name like append_column or append_column_from_reader.

Example

I also think it would be super helpful to write an example program in parquet/examples that shows how to append data to an existing file (e.g. #4150) and link it from this doc comment. Perhaps you plan to do that as a follow-on PR.

Documentation

I recommend the following additions:

  1. Add the caveat from the PR description that the data has to match, or else invalid parquet will be written
  2. Add a note that the next column from the reader is appended to the next column of the writer (the state is stored in the reader)
  3. Explain that close is the result of closing the previous column in this writer

Reduced Foot Guns 🦶 🔫

While I agree that providing users unlimited protection is probably not reasonable, I do think we should provide basic error checking to help users avoid silly errors.

For example, perhaps we can at least make sure metadata.column_descr_ptr() matches the target column (to make sure the column name and type match)?

tustvold (Contributor, Author):

> For example, perhaps we can at least make sure metadata.column_descr_ptr() matches the target column

This check already exists: https://github.com/apache/arrow-rs/pull/4269/files#diff-3b307348aabe465890fa39973e9fda0243bd2344cb7cb9cdf02ac2d39521d7caR522

> Explain that the close is the result from closing the previous column in this writer

It need not be; ColumnCloseResult is just a struct of column data. There are various ways a user could conceivably construct it.

> Perhaps you plan to do that as a follow on PR

I have a PR almost ready that adds a parquet-concat binary that will show how to use this

> Are we happy enough with this API to mark it pub

In this case I would rather expose it so that people can explore the various use-cases it unlocks. I also have a PR lined up that uses it to efficiently concatenate parquet files, and it will need to be public for that.
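For a sense of what that concatenation looks like, here is a rough sketch of vertically concatenating files with this API. It is hedged: written against the crate roughly as of this PR, using the merged name append_column rather than splice_column; schema validation and error handling are elided, and the exact ColumnCloseResult field list may differ between versions.

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::writer::ColumnCloseResult;
use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

/// Concatenate `inputs` (which must share a schema) into `output` without
/// decoding any pages: each encoded column chunk is copied verbatim.
fn concat(inputs: Vec<File>, output: File) -> Result<()> {
    // Read the footers up front; the writer needs the schema before starting.
    let inputs = inputs
        .into_iter()
        .map(|f| {
            let metadata = parse_metadata(&f)?;
            Ok((f, metadata))
        })
        .collect::<Result<Vec<_>>>()?;

    let schema = inputs[0].1.file_metadata().schema_descr().root_schema_ptr();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for (input, metadata) in inputs {
        for rg in metadata.row_groups() {
            let mut rg_out = writer.next_row_group()?;
            for column in rg.columns() {
                // Describe the existing chunk so the writer can splice its
                // bytes in and re-encode only the file-level metadata.
                let close = ColumnCloseResult {
                    bytes_written: column.compressed_size() as _,
                    rows_written: rg.num_rows() as _,
                    metadata: column.clone(),
                    bloom_filter: None,
                    column_index: None,
                    offset_index: None,
                };
                rg_out.append_column(&input, close)?;
            }
            rg_out.close()?;
        }
    }
    writer.close()?;
    Ok(())
}
```

Note how the ColumnCloseResult is assembled directly from the source file's ColumnChunkMetaData; as discussed above, that is only one of the ways it can be constructed.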

alamb (Contributor) left a review:

The code looks good to me, thank you @tustvold. While I am not an expert in this area, I read all the code and tests and they make sense to me; since this API isn't used by existing code, I think the impact is minimal.

I left a bunch of API improvement suggestions inline, but nothing that I think couldn't be improved or fixed as a follow-on.


tustvold merged commit 58e2c1c into apache:master on May 24, 2023.
tustvold mentioned this pull request on May 24, 2023.
tustvold (Contributor, Author):

Example in #4274

tustvold changed the title from "Add splice column API (#4155)" to "Add Append Column API (#4155)" on May 24, 2023.
alamb pushed a commit to alamb/arrow-rs that referenced this pull request on May 30, 2023:
* Add splice column API (apache#4155)

* Review feedback

* Re-encode offset index
alippai (Contributor), Jun 1, 2023:

Is append_column() a good name? I'd expect it to add a new column to an existing set of columns (i.e. adding a new field "horizontally" to the row group).

tustvold (Contributor, Author), Jun 1, 2023:

> I'd expect that it adds a new column to an existing set of columns (ie adding a new field "horizontally" to the row group)

That is what it does?

alippai (Contributor), Jun 1, 2023:

Oh sorry, silly me. Based on the other discussion, I thought this was the one that concatenates vertically, increasing the number of rows rather than the number of columns.

tustvold (Contributor, Author), Jun 1, 2023:

> this is the one which concats vertically

They're two sides of the same coin: to concatenate vertically, you simply concatenate all the columns in each row group 😄

Labels: parquet (Changes to the parquet crate)
Linked issue: Splice Parquet Data (#4155)
3 participants