Support gzipped (and/or) other compressions for CSV input files. #116

Closed
ghuls opened this issue Dec 15, 2024 · 3 comments · Fixed by #128
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@ghuls

ghuls commented Dec 15, 2024

Support for reading compressed CSV files was added to `parquet-fromcsv.rs` in `arrow-rs`:

apache/arrow-rs#3721

For gzip-compressed files, support for reading files with multiple gzip headers is a must, because BGZF-compressed files are very common in bioinformatics:
apache/arrow-rs@993a7cc
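
To illustrate why multi-header support matters, here is a minimal Python sketch. It is not the arrow-rs or csv2parquet code, and the helper names are hypothetical. A BGZF file is a series of concatenated gzip members, so a decoder that stops after the first member returns only part of the data without raising an error:

```python
import gzip
import zlib


def make_multi_member(chunks):
    """Concatenate independently compressed gzip members (BGZF-like layout)."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def read_first_member_only(data):
    """Decode with a single-member decoder: stops at the end of member one."""
    decoder = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    return decoder.decompress(data)


def read_all_members(data):
    """Decode every concatenated member, as a multi-member reader would."""
    return gzip.decompress(data)


data = make_multi_member([b"chr1\t10\t20\n", b"chr1\t30\t40\n"])
print(read_first_member_only(data))  # only the first record
print(read_all_members(data))        # both records
```

Any reader used for compressed input should behave like `read_all_members` here, otherwise BGZF inputs would be truncated after the first block.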

@domoritz
Owner

Thanks for the issue and the pointer. What do we need to do here to support it? Can you send a pull request?

@ghuls
Author

ghuls commented Dec 15, 2024

For now, when reading a compressed CSV/TSV input file, you get this error:

$ ./csv2parquet --delimiter $'\t' --comment '#' --header 'false' --max-read-records 1000 fragments.tsv.gz  fragments.csv2parquet.parquet
Error: General("Error inferring schema: Csv error: Encountered UTF-8 error while reading CSV file: invalid utf-8: invalid UTF-8 in field 0 near byte index 1 at line 1")
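
The error is consistent with the compressed bytes being parsed as text: a gzip stream starts with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, which would explain "near byte index 1". A minimal Python sketch of that check follows. The helper names are hypothetical and this is not csv2parquet's actual code:

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def looks_like_gzip(prefix: bytes) -> bool:
    """Return True if the leading bytes match the gzip magic number."""
    return prefix[:2] == GZIP_MAGIC


def first_invalid_utf8_index(data: bytes):
    """Return the index of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


header = b"\x1f\x8b\x08\x00"  # typical start of a gzip file
print(looks_like_gzip(header))
print(first_invalid_utf8_index(header))
```

Sniffing the first two bytes like this is one way to decide whether to wrap the input in a decompressor; choosing by file extension (`.gz`) is the other common option.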

Reading the input file probably needs something like this:
https://github.com/apache/arrow-rs/pull/4160/files

@domoritz
Owner

That would be great. Please send a pull request similar to https://github.com/apache/arrow-rs/pull/4160/files.
