Cannot skip bad record while reading warc file #267
Comments
Hi @akshedu, thanks for the report. Can you let us know a little bit more? There should have been a template to help tease out more information. Can you update this ticket and provide more context? It will help us get to the root of the issue. This also sounds like it could be a duplicate of #246 and #258.
Hi @ruebot, updated with more details. If you need the WARC file, I can share it as well.
From where? We don't push snapshot builds. Did you build aut locally, and use the fatjar from that build? As an aside, the issue template is there to capture a lot of the context that is missing here. At the very least, can you provide your exact steps in this format:

Steps to reproduce:
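Only the first line of that format survives here; a hypothetical sketch of the rest of such a checklist (these fields are a guess, not the project's actual issue template):

```
1. ...
2. ...
3. ...

Expected behavior:
Actual behavior:

Environment:
- OS and Java version
- Apache Spark version
- aut version and how it was built
```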
@akshedu can you try to reproduce with Apache Spark 2.1.3? The 0.16.0 release doesn't officially support Apache Spark 2.3.1.
I just ran this on Spark 2.1.1. The following script:
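The script didn't survive in this copy of the thread; given the top-ten-domains results mentioned later, it was presumably the standard aut recipe, something like this (path illustrative):

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// tally the ten most frequent domains among the archive's valid pages
RecordLoader.loadArchives("/path/to/file.warc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)
```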
led to an error message which, FWIW, is tripped by a check in WARCReaderFactory.
Same here with Spark 2.1.3.
@ruebot and I did a bit of digging into this, using JWAT-Tools and then manually looking at the WARC records themselves. There are some issues with the WARC file itself. Here are the test results:
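The results themselves aren't preserved here; for reference, the JWAT-Tools validation pass would have been an invocation along these lines (file name illustrative):

```bash
# validate the records and gzip envelope of the WARC with JWAT-Tools
./jwattools.sh test broken.warc.gz
```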
We looked into the headers, and here's the WARC header for the broken file:
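The header itself isn't reproduced here; an illustrative reconstruction of a Nutch-written WARC/0.18 header of the shape being described, with bare LF line endings (all field values are made up):

```
WARC/0.18
WARC-Type: response
WARC-Target-URI: http://example.com/page
WARC-Date: 2018-07-02T10:15:30.123Z
WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>
Content-Type: application/http; msgtype=response
Content-Length: 12345
```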
and here's a working header:
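Again illustrative rather than the original bytes, with the carriage returns of the CRLF line endings rendered as ^M, the way the next comment describes:

```
WARC/1.0^M
WARC-Type: response^M
WARC-Target-URI: http://example.com/page^M
WARC-Date: 2018-07-02T10:15:30Z^M
WARC-Record-ID: <urn:uuid:11111111-1111-1111-1111-111111111111>^M
Content-Type: application/http; msgtype=response^M
Content-Length: 12345^M
```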
I have carriage returns turned on here, so we can see how (a) the line endings differ and (b) WARC-Date is different, etc. There are similar mismatches throughout the headers. I'm not an expert on WARCs: I'm not sure if the specification changed dramatically between 0.18 and 1.0, or whether this is an artefact of Nutch or of being compressed/decompressed at some stage. But since we rely on the webarchive-commons library, it might be worth opening an issue there if you want to continue poking at this; it's probably out of scope for AUT. I did see a similar issue there that might be of help.
So, I did get this to work. Broken WARCs stick in my craw! See the results of the top ten domains here:
The WARC is basically all screwed up, with line endings, etc. (see above). If you do need to get it to work, however, I used JWAT-Tools to decompress and recompress it. The recompressed warc.gz file is correct and now works with AUT. See the set of commands here:
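A sketch of that round trip with the JWAT-Tools CLI, assuming its decompress/compress commands and an illustrative file name:

```bash
# unpack the gzipped WARC into a plain .warc file
./jwattools.sh decompress broken.warc.gz

# re-gzip it; the freshly written compression envelope is what fixes the reader error
./jwattools.sh compress broken.warc
```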
Loading it then works, and the re-compression process has fixed the file. Not ideal, but I don't think this dataset is ideal from a WARC-compliance standpoint. 😄
Trying to read a WARC file which has an info header results in a read failure. The steps I followed:
Using Spark 2.3.1 with the Scala shell. Downloaded aut-0.16.1-SNAPSHOT-fatjar.jar and used the --jars option with spark-shell to load the additional functions.
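That invocation would have looked something like this (jar path illustrative):

```bash
spark-shell --jars /path/to/aut-0.16.1-SNAPSHOT-fatjar.jar
```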
Loaded the required modules:
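Assuming the aut 0.16.x API, the imports in question would be along these lines:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
```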
First tried the compressed file:
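The call itself isn't preserved; a minimal sketch with the aut RecordLoader, assuming an illustrative path (the uncompressed attempt below would be the same call pointed at the .warc file):

```scala
// force a read of the gzipped WARC; the failure surfaces as soon as records are materialized
RecordLoader.loadArchives("/path/to/file.warc.gz", sc)
  .keepValidPages()
  .count()
```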
Got the following error:
Then tried the uncompressed file:
Got the following error:
Checked the WARC file and it looked like this: