Implement reading gzipped WARC files #10

Merged 1 commit on Jul 11, 2020
10 changes: 9 additions & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "warc"
-version = "0.2.0"
+version = "0.2.1"
description = "A Rust library for reading and writing WARC files."
readme = "README.md"
repository = "https://github.com/jedireza/warc"
@@ -14,3 +14,11 @@ edition = "2018"
chrono = "0.4.11"
nom = "5.1.1"
uuid = { version = "0.8.1", features = ["v4"] }

[dependencies.libflate]
version = "1"
optional = true

[features]
default = ["gzip"]
gzip = ["libflate"]
31 changes: 31 additions & 0 deletions examples/read_gzip.rs
@@ -0,0 +1,31 @@
use warc::header::{WARC_DATE, WARC_RECORD_ID};
use warc::WarcReader;

fn main() -> Result<(), std::io::Error> {
    let file = WarcReader::from_path_gzip("warc_example.warc.gz")?;

    let mut count = 0;
    for record in file {
        count += 1;
        match record {
            Err(err) => println!("ERROR: {}\r\n", err),
            Ok(record) => {
                println!(
                    "{}: {}",
                    WARC_RECORD_ID,
                    String::from_utf8_lossy(record.headers.get(WARC_RECORD_ID).unwrap())
                );
                println!(
                    "{}: {}",
                    WARC_DATE,
                    String::from_utf8_lossy(record.headers.get(WARC_DATE).unwrap())
                );
                println!();
            }
        }
    }

    println!("Total records: {}", count);

    Ok(())
}
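
The example above calls `unwrap()` on each header lookup, which panics if a record lacks that header. As a minimal sketch of a non-panicking alternative, the helper below falls back to a placeholder. It assumes the header map behaves like a `HashMap<String, Vec<u8>>`; that type, the `header_text` name, and the sample values are illustrative stand-ins, not the crate's API.

```rust
use std::collections::HashMap;

/// Returns the header value as text, or a placeholder when the header is absent.
/// `headers` is a plain `HashMap` standing in for a record's header map (assumption).
fn header_text(headers: &HashMap<String, Vec<u8>>, name: &str) -> String {
    match headers.get(name) {
        // Invalid UTF-8 bytes are replaced rather than causing an error.
        Some(value) => String::from_utf8_lossy(value).into_owned(),
        None => "<missing>".to_string(),
    }
}

fn main() {
    // Sample data for illustration only.
    let mut headers = HashMap::new();
    headers.insert("WARC-Date".to_string(), b"2020-07-11T00:00:00Z".to_vec());

    println!("WARC-Date: {}", header_text(&headers, "WARC-Date"));
    println!("WARC-Record-ID: {}", header_text(&headers, "WARC-Record-ID"));
}
```

Whether a missing header should be reported, skipped, or treated as an error depends on the caller; the placeholder is just one option.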
19 changes: 19 additions & 0 deletions src/warc_types.rs
@@ -1,10 +1,14 @@
use crate::parser;
use crate::{Error, Record};

use std::fs;
use std::io;
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::path::Path;

#[cfg(feature = "gzip")]
use libflate::gzip::Decoder as GzipReader;

const KB: usize = 1_024;
const MB: usize = 1_048_576;

@@ -76,6 +80,21 @@ impl WarcReader<BufReader<fs::File>> {
}
}

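#[cfg(feature = "gzip")]
impl WarcReader<BufReader<GzipReader<fs::File>>> {
    /// Opens a gzip-compressed WARC file for reading.
    pub fn from_path_gzip<P: AsRef<Path>>(path: P) -> io::Result<Self> {
        // Open read-only: a reader does not need write access, and
        // `create(true)` would silently create an empty file at a missing path.
        let file = fs::File::open(&path)?;
        let gzip_stream = GzipReader::new(file)?;
        // Buffer the decompressed stream so records can be parsed via `BufRead`.
        let reader = BufReader::with_capacity(MB, gzip_stream);

        Ok(WarcReader::new(reader))
    }
}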
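
A std-only sketch of how the choice of open flags affects a reader when the path does not exist. The helper names and the scratch path are hypothetical and exist only for this demonstration; nothing here is part of the crate.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Opens `path` strictly for reading; a missing file is reported as an error.
fn open_read_only(path: &Path) -> io::Result<fs::File> {
    fs::File::open(path)
}

/// Opens `path` with the read, write and create flags set.
/// A missing file is created empty instead of producing an error.
fn open_read_write_create(path: &Path) -> io::Result<fs::File> {
    fs::OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)
}

fn main() -> io::Result<()> {
    // Hypothetical scratch path used only for this demonstration.
    let path = std::env::temp_dir().join("warc_open_mode_demo.warc.gz");
    let _ = fs::remove_file(&path); // ignore the error if it does not exist yet

    // Read-only open fails and leaves the filesystem untouched.
    let err = open_read_only(&path).unwrap_err();
    println!("read-only open: {:?}", err.kind());
    println!("exists afterwards: {}", path.exists());

    // With the create flag, the same path is created as an empty file.
    open_read_write_create(&path)?;
    println!("exists after create(true): {}", path.exists());

    // Clean up the scratch file.
    fs::remove_file(&path)
}
```

For a reader, the first behaviour is usually preferable: a mistyped path surfaces as a `NotFound` error right away instead of as an empty file that later fails gzip header parsing.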

impl<R: BufRead> Iterator for WarcReader<R> {
type Item = Result<Record, Error>;
