Which part is this question about
The parquet file writer usage
Describe your question
Hi, I'm looking into whether parquet and arrow could fit a use case of mine, but I've run into a strange issue for which I can find no answer in the documentation. I have two input files in txt format, where each record spans 4 lines. I have a parser that reads that format just fine, and I want to convert it to a parquet file. The two input files are around 600MB combined, but when I write them to a parquet file, the resulting file is nearly 5GB. Writing the file also consumes around 6 to 7GB of memory. I have turned on compression.
Disable dictionary compression for columns that don't have repeated values
Use writer version 2, which has better string encoding
Represent the id / sequence as an integral type instead of a variable length string
Try without snappy, as compression may not always yield benefits
Maybe try writing the data using something like pyarrow to determine if this is something specific to the Rust implementation
Without the data it is hard to say for sure what is going on, but even ignoring compression, parquet will have at least a 4-byte overhead per string, and so in the case of lots of small strings...
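To make the per-string overhead concrete, here is a rough sketch in plain Rust (standard library only, no parquet crate). It assumes PLAIN encoding of a `BYTE_ARRAY` column, where each value is stored as a 4-byte length prefix followed by its bytes. The sample values and the function names are made up for illustration, and the estimate ignores page headers, definition levels, dictionary encoding, and compression, so treat it as a lower-bound intuition rather than a prediction of the real file size.

```rust
/// Total payload bytes of the strings themselves.
fn raw_size(values: &[&str]) -> usize {
    values.iter().map(|v| v.len()).sum()
}

/// Estimated size under PLAIN encoding for BYTE_ARRAY:
/// a 4-byte length prefix plus the bytes of each value.
fn plain_encoded_size(values: &[&str]) -> usize {
    const LENGTH_PREFIX_BYTES: usize = 4;
    values.iter().map(|v| LENGTH_PREFIX_BYTES + v.len()).sum()
}

fn main() {
    // Hypothetical column of very short strings (not taken from the issue).
    let values = ["A", "C", "G", "T", "AC", "GT"];

    let raw = raw_size(&values);
    let encoded = plain_encoded_size(&values);

    println!("raw payload:   {raw} bytes");
    println!("plain encoded: {encoded} bytes");
    println!("overhead:      {} bytes", encoded - raw);
}
```

For values this short, the length prefixes outweigh the data itself, which is why storing ids or sequence numbers as an integral type, or letting an encoding collapse repeated values, can matter more than the choice of compression codec.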
My Rust configuration for the writer.