[C++][Parquet] Process parquet rowgroups without Arrow conversion #35638

Open
alippai opened this issue May 17, 2023 · 25 comments

Comments

@alippai
Contributor

alippai commented May 17, 2023

Describe the usage question you have. Please include as many useful details as possible.

I'd like to read a Parquet file and append an Arrow table to the new Parquet file created based on the old file and the new table added as a new row group.
Can I read the Parquet rowgroup by rowgroup, decide to drop any or use them and assemble a new Parquet file without doing the (de)serialization to Arrow?

Component(s)

C++, Parquet, Python

@mapleFU
Member

mapleFU commented May 17, 2023

It seems you want an "append" operation and want to avoid the read -> convert to Arrow -> write back round trip?

I guess the current Parquet code cannot support this :-(

@alippai
Contributor Author

alippai commented May 17, 2023

Yes, something like that. My use case is writing data to a small Parquet file daily, changing the last 3 days. I don’t have exact numbers to justify this extra API yet, but wanted to ask first.

I can imagine that keeping/dropping row groups based on the stats, or appending new row groups, is not a common case. Feel free to close the issue.

@alippai
Contributor Author

alippai commented May 17, 2023

Speaking of this... is it good practice to use row groups instead of Hive partitions, or is that considered an anti-pattern in Parquet? Would it be a good addition to pyarrow dataset to optionally ensure that each Parquet row group contains only one partition?

@westonpace
Member

> My usecase is writing data to a small parquet file daily, changing the last 3 days. I don’t have exact numbers to support this extra api yet, but wanted to ask first.
>
> I can imagine this is not a common case to keep/drop row groups based on the stats or append new row groups - feel free to close the issue, please

I would say that it is a very common thing for users to want to do. However, parquet is often not the correct layer of abstraction to introduce this capability. For example, the table formats like Iceberg, Delta Lake, and Hudi have all come up with ways to handle this.

Appending data to parquet groups has been asked for several times. I've seen arguments that it is simply not possible without rewriting the file (because thrift uses a lot of absolute file offsets and those offsets, in the portions of the file you are not changing, would become invalid) but I have not investigated it thoroughly enough myself.

> Speaking of this... is it a good practice to use row groups instead of hive partitions or is that considered an anti-pattern when speaking of parquet?

There are pros and cons to both. Row groups can be more flexible than hive partitions (e.g. each row group contains statistics for ALL columns and not just some and row group filters can include things like bloom filters). However, hive partitions support append operations (you can always add more files to the month=July folder but you can't add more data to an existing row group).

@westonpace
Member

> Would that be a good addition to pyarrow dataset to optionally ensure the parquet rowgroups contains only one partition?

I'm not sure I understand what you are suggesting.

@alippai
Contributor Author

alippai commented May 18, 2023

Setting something like partitionby='rowgroups' on write_table so it would write:

Rowgroup 1:
  ...
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
  - date: 20230517, value: x
Rowgroup 2:
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  - date: 20230518, value: x
  ...

Instead of the current (based on the row count limit):

Rowgroup 1:
 ...
 - date: 20230517, value: x
 - date: 20230517, value: x
 - date: 20230517, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
Rowgroup 2:
 - date: 20230518, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
 - date: 20230518, value: x
 ...

@alippai
Contributor Author

alippai commented May 18, 2023

@westonpace reading the Parquet Thrift doc, the naive approach would be to keep only the buffers and statistics and recreate everything else. I didn't know Parquet works like this, thanks for the insight!

My goal is slightly different from Delta Lake and others (and I'm also not a fan of JVM-based setups for this kind of workload). My idea was to rely less on the traditional FS and more on the internal structure of Parquet, for the very reason you mentioned (filters, statistics). Architecturally, Skyhook would be closer to this, or "simply" storing all the metadata + statistics in TiKV or another KV store.

@mapleFU
Member

mapleFU commented May 19, 2023

@alippai I guess it "can" be a better solution, because splitting partitions into different row groups lets the reader prune unnecessary row groups. But I don't know whether the current implementation supports it.

@westonpace
Member

Ok, I think I understand better now. I misread this request originally and didn't fully realize that you want to create a new parquet file. I thought you were trying to modify the existing parquet file.

Yes, this makes sense. No, I'm not sure the capability is really there but some of it might be.

The parquet library always decodes its data, as best I can tell. There are some underlying structures like the PageReader which might not. However, there is nothing at the level of "read this row group and append it to another file without decoding".

@alippai
Contributor Author

alippai commented Jun 1, 2023

If I’m right, @tustvold created similar low-level interfaces. I'm still looking for the exact MR, but maybe he can share what level of abstraction worked well in the Rust implementation.

@tustvold
Contributor

tustvold commented Jun 1, 2023

apache/arrow-rs#4269 is the PR. Not sure how transferable it is to C++, it is somewhat coupled with the way the write path works, but the basic idea is to allow appending an entire column chunk to a row group.

apache/arrow-rs#4274 contains an example of how to use this to efficiently concatenate files

@alippai
Contributor Author

alippai commented Jun 1, 2023

@tustvold @mapleFU @westonpace (and many others): the speed at which you are adding new Parquet features is amazing. Maybe we should start adding a matrix for arrow, arrow-rs, arrow2 (rs), parquet-mr, duckdb to https://arrow.apache.org/docs/status.html so we know which statistics and bloom filters are read and written, and which operations are available.

Would you be supportive, or is it not the right time now? I can start the MR.

@tustvold
Contributor

tustvold commented Jun 1, 2023

I think adding documentation of the Parquet support within the various Arrow projects makes sense. https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/ might serve as some inspiration for further features beyond the obvious support for encoding X or data type Y.

I'm less sure that we should endeavor to maintain up to date feature support for readers outside the arrow umbrella, e.g. parquet-mr, duckdb, arrow2, etc...

@alippai
Contributor Author

alippai commented Jun 1, 2023

Indeed, I didn't realize that's not covered by the current docs. I also favor less work and more consistency.

@westonpace
Member

+1 to adding this table somewhere (also, yes, big thanks to @mapleFU and @wgtmac for the recent work). A good first pass would be for each implementation to document what they support locally (e.g. arrow-c++ to add to https://arrow.apache.org/docs/cpp/parquet.html#supported-parquet-features and arrow-rs to add to somewhere in https://docs.rs/parquet/latest/parquet/arrow/index.html)

If we are going to combine them in a table somewhere then maybe we could add to somewhere on https://parquet.apache.org/docs/overview/

That would allow other parquet implementations to contribute their feature list if they chose and might be more appropriate than https://arrow.apache.org/docs/status.html

Although I have no write privileges over there 🤷 so if we want something more local it would probably be ok.

@wgtmac
Member

wgtmac commented Jun 10, 2023

I did similar work in the parquet-mr repo to merge row groups of different parquet files without decompression and decoding into a single parquet file (with some supported transformation like re-compression, encryption or dropping columns).

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java

Is that what you would like to have in parquet-cpp? @alippai

@alippai
Contributor Author

alippai commented Jun 11, 2023

@wgtmac In this issue I was looking for a simpler function: appending a new RowGroup (or copying a row group) without merging, or deleting/replacing row groups without materializing the whole file as an Arrow Table.

Overall I think a public RowGroup level (what you have in parquet-mr) and page level API (what @tustvold created for rust) makes sense (without decoding, decompression, statistics and bloom filter re-calculation etc).

@wgtmac
Member

wgtmac commented Jun 12, 2023

> @wgtmac In this issue I was looking for a more simple function, appending a new RowGroup (or copying a rowgroup) without merging. Or deleting/replacing rowgroups without materializing the whole file as an Arrow Table.
>
> Overall I think a public RowGroup level (what you have in parquet-mr) and page level API (what @tustvold created for rust) makes sense (without decoding, decompression, statistics and bloom filter re-calculation etc).

Yes, I understand your use case. Appending to or modifying a parquet file would require the file system to support mutation or append operation, which is not a typical use case. So merging several parquet files directly on row groups seems to be more generic and can be an alternative solution in your case.

@vinothchandar
Member

@wgtmac For the rewriting, is there any advantage to using Arrow over parquet-mr? IIUC, you decode the pages there lazily and write them back (with or without modifications). Maybe for vectorized transformation of an entire page, e.g. x = x + 1 on column x?

@westonpace westonpace changed the title Process parquet rowgroups without Arrow conversion [C++][Parquet] Process parquet rowgroups without Arrow conversion Jul 13, 2023
@wgtmac
Member

wgtmac commented Jul 13, 2023

> @wgtmac For the rewriting, is there any advantage of using Arrow over parquet-mr. IIUC, you decode the pages there lazily and write back (w or w/o modifications). Maybe for vector processing transformation of the entire page perhaps? e.g x = x + 1 on column x.

I don't think there is significant difference between Arrow and parquet-mr if pages do not need any modification. When re-compression and/or re-encoding is applied, it would be more performant to go with Arrow.

@vinothchandar
Member

Thanks. @westonpace Any guidance/pointers for someone wanting to take this forward? Does it make sense to add this to Arrow?

@wgtmac
Member

wgtmac commented Jul 13, 2023

> Thanks. @westonpace Any guidance/pointers from someone wanting to take this forward? Does that make sense to add to Arrow?.

Just curious: is there any plan to add similar optimization to Apache Hudi? Our old friends at Uber have done a great job: https://www.uber.com/en-HK/blog/fast-copy-on-write-within-apache-parquet/. @vinothchandar

@westonpace
Member

> Thanks. @westonpace Any guidance/pointers from someone wanting to take this forward? Does that make sense to add to Arrow?.

I am not familiar enough with the code in parquet-c++ to be able to give much advice going forwards (@wgtmac and @mapleFU may have an opinion). I think it makes sense as a parquet-c++ feature but probably not as an arrow feature (as you wouldn't need any arrow arrays)

@vinothchandar
Member

@wgtmac We have an implementation using parquet-mr in the community. I am trying to consolidate all these efforts - ours, parquet-mr and understand plans in Arrow, as we'd like to embrace Arrow (in place of Avro in Hudi 1.0). We can jam more on Hudi Slack if the parquet-mr piece interests you. cc @yihua

Thanks @westonpace. I'll wait to hear more opinions.

@wgtmac
Member

wgtmac commented Jul 16, 2023

Sure, that sounds interesting! Let's discuss more about that @vinothchandar


6 participants