Support compressed CSV/TSV files in parquet-fromcsv. #3721

Closed
ghuls opened this issue Feb 14, 2023 · 9 comments · Fixed by #4160
Labels
enhancement (Any new improvement worthy of an entry in the changelog) · good first issue (Good for newcomers) · help wanted · parquet (Changes to the parquet crate)

Comments

@ghuls
Contributor

ghuls commented Feb 14, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Reading a gzip compressed TSV file with parquet-fromcsv.

Describe the solution you'd like

Support compressed CSV/TSV files in parquet-fromcsv.

Also it would be nice if there was a link to the parquet schema text format, or some example schemas.

ghuls added the enhancement label Feb 14, 2023
@suxiaogang223
Contributor

I would like to try to fix this 👀

@suxiaogang223
Contributor

Sadly, I've found that this feature cannot be implemented elegantly. I want to stream compressed files, and although I can use the flate2 library to read compressed files like normal files, the arrow-csv reader requires the Seek trait:

arrow_csv::reader::ReaderBuilder
pub fn build<R>(self, reader: R) -> Result<Reader<R>, ArrowError>
where
    R: Read + Seek,
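
To illustrate the mismatch (my sketch, not from the thread; the file name is just an example): flate2's streaming decoder implements Read but not Seek, so it could not be handed to this builder:

use std::fs::File;
use std::io::Read;

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // GzDecoder wraps any `Read` and streams decompressed bytes, but it
    // does not implement `Seek`, so it cannot satisfy the
    // `R: Read + Seek` bound on `ReaderBuilder::build` above.
    let mut decoder = GzDecoder::new(File::open("test.tsv.gz")?);
    let mut text = String::new();
    decoder.read_to_string(&mut text)?;
    println!("decompressed {} bytes", text.len());
    Ok(())
}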

@ghuls
Contributor Author

ghuls commented Mar 16, 2023

That is a pity.

@suxiaogang223
Contributor

great

@tustvold
Contributor

#4130 has removed the Seek requirement from all the CSV reader APIs, so it should now be possible to support compressed CSV, in addition to stdin input, within parquet-fromcsv
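
For example, something along these lines should now compile (a sketch on my part, assuming the post-#4130 arrow-csv API where ReaderBuilder::new takes a SchemaRef and build only requires Read; the file name and schema mirror the example later in this thread):

use std::fs::File;
use std::sync::Arc;

use arrow_csv::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use flate2::read::GzDecoder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Schema matching the TSV columns used in this issue.
    let schema = Arc::new(Schema::new(vec![
        Field::new("Chromosome", DataType::Utf8, true),
        Field::new("Start", DataType::Int64, true),
        Field::new("End", DataType::Int64, true),
        Field::new("Name", DataType::Utf8, true),
        Field::new("Count", DataType::Int64, true),
    ]));

    // A streaming decompressor only implements `Read`, which now suffices.
    let decoder = GzDecoder::new(File::open("test.tsv.gz")?);
    let reader = ReaderBuilder::new(schema)
        .with_delimiter(b'\t')
        .build(decoder)?;

    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}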

@suxiaogang223
Contributor

I will fix this now 🤓

@ghuls
Contributor Author

ghuls commented May 4, 2023

@suxiaogang223 Thanks for the effort. There is still an issue in some cases.
Most likely it is related to gzipped files that contain multiple gzip members (especially when the decompressed chunks end in the middle of a line), as happens when compressing files with bgzip (a tool from HTSlib, used a lot in bioinformatics).

For example, the following fails for me:

# Create a compressed TSV file with bgzip (multiple gzip chunks)
❯  yes $(printf 'chr1\t123456\t123465\tABCDEFGHIJKLMNOPQRSTUVX\t1\n') | tr ' ' '\t' |  head -n 4000 | dd if=/dev/stdin of=/dev/stdout bs=4k | bgzip > test.tsv.bgzip.gz
43+1 records in
43+1 records out
180000 bytes (180 kB, 176 KiB) copied, 0.0034506 s, 52.2 MB/s

$ cat test_parquet.schema
message root {
  OPTIONAL BYTE_ARRAY Chromosome (STRING);
  OPTIONAL INT64 Start;
  OPTIONAL INT64 End;
  OPTIONAL BYTE_ARRAY Name (STRING);
  OPTIONAL INT64 Count;
}


parquet-fromcsv \
    --schema test_parquet.schema \
    --delimiter $'\t' \
    --csv-compression gzip \
    --input-file test.tsv.bgzip.gz \
    --output-file test.parquet

Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 1451, expected 5 got 4")))

The same failure, reproduced with plain gzip by compressing each 4096-byte chunk as a separate gzip member:

$ rm -f gzip_4096_chunked.tsv.gz
$ final_file_size=$(yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | wc -c)
$ for i in $(seq 1 4096 "${final_file_size}"); do
    echo ${i}
    yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' | head -n 1000 | tail -c +${i} | head -c 4096 | gzip >> gzip_4096_chunked.tsv.gz
  done
1
4097
8193
12289
16385
20481
24577
28673
32769
36865
40961

# Check if each line contains the same content:
$ zcat gzip_4096_chunked.tsv.gz | uniq -c
   1000 chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1


$ parquet-fromcsv \
    --schema /staging/leuven/stg_00002/lcb/ghuls/fragments_parquet.schema \
    --delimiter $'\t' \
    --csv-compression gzip \
    --input-file gzip_4096_chunked.tsv.gz \
    --output-file gzip_4096_chunked.parquet

Error: WithContext("Failed to read RecordBatch from CSV", ArrowError(CsvError("incorrect number of fields for line 92, expected 5 got 1")))

# Last 10 lines of the first chunk:
❯  yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | tail -c +1 | head -c 4096 | tail
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
c

# First 10 lines of second chunk:
$  yes 'chr1 123456 123465 ABCDEFGHIJKLMNOPQRSTUVX 1' | tr ' ' '\t' |  head -n 1000 | tail -c +4097 | head -c 4096 | head
hr1     123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
chr1    123456  123465  ABCDEFGHIJKLMNOPQRSTUVX 1
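
For what it's worth, a guess at the root cause (my speculation, not confirmed in this thread): flate2's plain GzDecoder reports EOF at the end of the first gzip member, which would truncate a bgzip file mid-line exactly as shown above, whereas flate2::read::MultiGzDecoder keeps decoding across member boundaries:

use std::fs::File;
use std::io::Read;

use flate2::read::MultiGzDecoder;

fn main() -> std::io::Result<()> {
    // MultiGzDecoder continues into the next gzip member instead of
    // stopping after the first one, so multi-member (bgzip-style)
    // files decode fully.
    let mut decoder = MultiGzDecoder::new(File::open("test.tsv.bgzip.gz")?);
    let mut text = String::new();
    decoder.read_to_string(&mut text)?;
    println!("decompressed {} bytes across all members", text.len());
    Ok(())
}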

@suxiaogang223
Contributor

@ghuls Thank you for your feedback. I really didn't think about multi-member compressed files 🙃. Could you please open a new issue so I can continue to track and fix this problem?

@tustvold
Contributor

label_issue.py automatically added labels {'parquet'} from #4160
