Support compressed CSV/TSV files in parquet-fromcsv #3721
Comments
I would like to try to fix this 👀
Unfortunately, I've found that this feature can't be implemented elegantly. I want to stream compressed files, but arrow_csv::reader::ReaderBuilder requires the reader to implement Seek:
pub fn build<R>(self, reader: R) -> Result<Reader<R>, ArrowError>
where
    R: Read + Seek,
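A minimal sketch of why that bound is a problem, assuming the flate2 crate for decompression: a streaming gzip decoder only implements Read, not Seek, so it cannot be handed to build. The open_gzip helper below is hypothetical, purely for illustration.

```rust
use std::fs::File;
use std::io::Read;

use flate2::read::GzDecoder; // implements Read, but not Seek

// Hypothetical helper: open a gzip file as a streaming reader.
// The result can be read sequentially, but it cannot seek, so it
// does not satisfy the `R: Read + Seek` bound on `build` above.
fn open_gzip(path: &str) -> std::io::Result<impl Read> {
    Ok(GzDecoder::new(File::open(path)?))
}
```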
That is a pity.
Great!
#4130 has removed the Seek requirement from all the CSV reader APIs, so it should now be possible to support compressed CSV, in addition to stdin input, within parquet-fromcsv.
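As a rough sketch of what this enables, assuming the flate2 crate and an arrow_csv release after #4130 where ReaderBuilder::new takes the schema and build only needs Read; the file name and the two-column schema are illustrative, not the actual parquet-fromcsv code:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow_csv::reader::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use flate2::read::MultiGzDecoder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Illustrative schema; parquet-fromcsv derives its Arrow schema from the
    // parquet schema file instead.
    let schema = Arc::new(Schema::new(vec![
        Field::new("Chromosome", DataType::Utf8, true),
        Field::new("Start", DataType::Int64, true),
    ]));

    // A streaming decompressor only implements Read, which is now enough.
    let decoder = MultiGzDecoder::new(File::open("test.tsv.gz")?);

    let reader = ReaderBuilder::new(schema)
        .with_delimiter(b'\t')
        .build(decoder)?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```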
I'll fix this now 🤓
@suxiaogang223 Thanks for the effort. There is still an issue in some cases. For example, the following fails for me:
# Create a compressed TSV file with bgzip (multiple gzip chunks)
❯ yes $(printf 'chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n') | tr ' ' '\t' | head -n 4000 | dd if=/dev/stdin of=/dev/stdout bs=4k | bgzip > test.tsv.bgzip.gz
43+1 records in
43+1 records out
180000 bytes (180 kB, 176 KiB) copied, 0.0034506 s, 52.2 MB/s
$ cat test_parquet.schema
message root {
  OPTIONAL BYTE_ARRAY Chromosome (STRING);
  OPTIONAL INT64 Start;
  OPTIONAL INT64 End;
  OPTIONAL BYTE_ARRAY Name (STRING);
  OPTIONAL INT64 Count;
}
parquet-fromcsv \
--schema test_parquet.schema \
--delimiter $'\t' \
--csv-compression gzip \
--input-file test.tsv.bgzip.gz \
--output-file test.parquet
Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 1451, expected 5 got 4")))

The same failure, but made with plain gzip, where each 4096 bytes is compressed as a separate gzip chunk:
$ rm test.gzip_4096_chunked.tsv.gz; final_file_size=$(yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | wc -c); for i in $(seq 1 4096 "${final_file_size}") ; do echo $i; yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +${i} | head -c 4096 | gzip >> gzip_4096_chunked.tsv.gz; done
1
4097
8193
12289
16385
20481
24577
28673
32769
36865
40961
# Check if each line contains the same content:
$ zcat gzip_4096_chunked.tsv.gz | uniq -c
1000 chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
$ parquet-fromcsv --schema /staging/leuven/stg_00002/lcb/ghuls/fragments_parquet.schema --delimiter $'\t' --csv-compression gzip --input-file gzip_4096_chunked.tsv.gz --output-file gzip_4096_chunked.parquet
Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 92, expected 5 got 1")))
# Last 10 lines of the first chunk:
❯ yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +1 | head -c 4096 | tail
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
c
# First 10 lines of second chunk:
$ yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +4097 | head -c 4096 | head
hr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1
@ghuls Thank you for your feedback. I really didn't think about multi-chunk compressed files 🙃. Could you please open a new issue so I can continue to track this problem and fix it?
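If it helps when this gets picked up again: the truncation pattern above (roughly one 4 KiB chunk decoded, line 92 cut short) is consistent with the input being decoded as a single gzip member. With flate2, GzDecoder stops at the end of the first gzip member, while MultiGzDecoder keeps reading concatenated members, which is what bgzip and the chunked-gzip repro produce. A minimal sketch of the difference (the file name is taken from the repro above; whether parquet-fromcsv actually goes through flate2 this way is an assumption):

```rust
use std::fs::File;
use std::io::Read;

use flate2::read::{GzDecoder, MultiGzDecoder};

fn main() -> std::io::Result<()> {
    // Decodes only the first gzip member, then reports EOF: with the
    // chunked file this yields ~4096 bytes and a truncated last line.
    let mut single = String::new();
    GzDecoder::new(File::open("gzip_4096_chunked.tsv.gz")?).read_to_string(&mut single)?;

    // Decodes all concatenated gzip members, recovering the full 1000 lines.
    let mut multi = String::new();
    MultiGzDecoder::new(File::open("gzip_4096_chunked.tsv.gz")?).read_to_string(&mut multi)?;

    println!("single-member decode: {} bytes", single.len());
    println!("multi-member decode:  {} bytes", multi.len());
    Ok(())
}
```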
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Reading a gzip-compressed TSV file with parquet-fromcsv.

Describe the solution you'd like
Support compressed CSV/TSV files in parquet-fromcsv.
Also, it would be nice if there was a link to the parquet schema text format, or some example schemas.
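For reference, the schema file shown in the repro above uses the parquet message-type text format. A minimal sketch of parsing it, assuming the parquet crate's parse_message_type and a version of the crate whose text parser accepts the STRING annotation (the schema text is the one from the repro, not a documented example):

```rust
use parquet::schema::parser::parse_message_type;

fn main() -> Result<(), parquet::errors::ParquetError> {
    // Same message-type text format as the test_parquet.schema file above.
    let schema_text = "
        message root {
            OPTIONAL BYTE_ARRAY Chromosome (STRING);
            OPTIONAL INT64 Start;
            OPTIONAL INT64 End;
            OPTIONAL BYTE_ARRAY Name (STRING);
            OPTIONAL INT64 Count;
        }";
    let parsed = parse_message_type(schema_text)?;
    println!("parsed {} columns", parsed.get_fields().len());
    Ok(())
}
```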