A utility for converting WARC to Parquet.
The binary may be installed via cargo:
$ cargo install warc-parquet
To use the crate in your project, add the following to your Cargo.toml file:
[dependencies]
warc-parquet = "0.6.1"
Once installed, the warc-parquet utility can be used to transform WARC into Parquet:
$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet
warc-parquet is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:
$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet
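Concatenating compressed files with cat is valid at the gzip layer, because the gzip format permits several compressed members back to back in one stream. The sketch below demonstrates only that property using plain gzip. The file names a.gz and b.gz are placeholders, and it makes no claim about how warc-parquet handles such input internally.

```shell
# Work in a scratch directory so no existing files are touched.
tmp=$(mktemp -d)
printf 'first\n'  | gzip > "$tmp/a.gz"
printf 'second\n' | gzip > "$tmp/b.gz"

# The concatenated stream decompresses to the concatenation
# of the two original inputs.
combined=$(cat "$tmp/a.gz" "$tmp/b.gz" | gzip -d)
printf '%s\n' "$combined"

rm -r "$tmp"
```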
It's also simple to preprocess via standard UNIX piping:
$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet
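The same piping style can be wrapped in a loop when one Parquet file per WARC is preferred over a combined file. This is a minimal sketch that uses only the invocation shown above. The helper parquet_name is hypothetical and not part of warc-parquet, and the output naming scheme is an assumption.

```shell
# Hypothetical helper: map example.warc.gz to example.zstd.parquet.
parquet_name() {
    printf '%s.zstd.parquet\n' "${1%.warc.gz}"
}

for f in *.warc.gz; do
    # Skip the literal pattern when no files match.
    [ -e "$f" ] || continue
    # Decompress, convert, and write one output file per input.
    gzip -d < "$f" | warc-parquet > "$(parquet_name "$f")"
done
```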
Various compression options, including the option to forgo compression altogether, are also available:
$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet
warc-parquet --help displays complete options and usage information.
Refer to the docs for more details about how to use the Reader within your own programs.
There are any number of ways to consume Parquet once you have it. However, a natural fit might be DuckDB:
$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│ type     │ id                                              │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
│ request  │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
│ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
│ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
│ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
│ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
└──────────┴─────────────────────────────────────────────────┘
D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name             │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
This crate uses #![forbid(unsafe_code)] to ensure everything is implemented in 100% safe Rust.
We appreciate all kinds of contributions. Thank you!