Cannot skip bad record while reading warc file #267

Closed
akshedu opened this issue Sep 5, 2018 · 10 comments

@akshedu

akshedu commented Sep 5, 2018

Trying to read a WARC file that starts with a warcinfo header results in a read failure. These are the steps I followed:

Using Spark 2.3.1 with the Scala shell. Downloaded aut-0.16.1-SNAPSHOT-fatjar.jar and passed it to spark-shell with the --jars option to load the additional functions.

spark-shell --jars ~/Workspace/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar

Loaded the required modules:

scala> import io.archivesunleashed._
import io.archivesunleashed._

scala> import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox._

First tried the compressed file:

scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/00.warc.gz", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[11] at map at package.scala:54

Got the following error:

scala> r.take(1)
2018-09-05 18:21:36 ERROR Executor:91 - Exception in task 0.0 in stage 2.0 (TID 2)
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@21ccdded)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@7f369f4c)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:21:36 ERROR TaskSetManager:70 - Task 0.0 in stage 2.0 (TID 2) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@21ccdded)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@7f369f4c)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1); not retrying
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 2.0 (TID 2) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@21ccdded)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@7f369f4c)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
  ... 57 elided
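
The serialization stack above names a `warcRecord` field of type `WARCRecord` inside `ArchiveRecordImpl`, and reports that it cannot be serialized when `take(1)` returns its result. The following is a minimal sketch of the same failure mode in plain Java. `RawRecord`, `RecordHolder` and `SerializationDemo` are hypothetical stand-ins, not the real aut or webarchive-commons classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a library type that does not implement Serializable.
final class RawRecord {
    final String url;
    RawRecord(String url) { this.url = url; }
}

// Hypothetical stand-in for a wrapper that is Serializable itself
// but keeps a field whose type is not.
final class RecordHolder implements Serializable {
    private static final long serialVersionUID = 1L;
    final RawRecord raw;
    RecordHolder(RawRecord raw) { this.raw = raw; }
}

final class SerializationDemo {
    // Returns true if Java serialization can write the object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RecordHolder holder = new RecordHolder(new RawRecord("http://example.com/"));
        System.out.println(canSerialize(holder));         // wrapper with the raw field
        System.out.println(canSerialize(holder.raw.url)); // plain String taken from it
    }
}
```

In this sketch the wrapper fails to serialize while a plain value extracted from it succeeds. Whether extracting plain values first avoids the error for this particular aut and Spark combination is not confirmed in this thread.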

Then tried the uncompressed file:

scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/01.warc", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[14] at map at package.scala:54

Got the following error:

scala> r.take(1)
2018-09-05 18:23:15 WARN  ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/01.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character a(Expecting d)
java.io.IOException: Unexpected character a(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:175)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:23:15 WARN  ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/01.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character 41(Expecting d)
java.io.IOException: Unexpected character 41(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:186)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:23:15 WARN  WARCReaderFactory$UncompressedWARCReader:502 - Bad Record. Trying skip (Record start 409): Unexpected character 57(Expecting d)
res4: Array[io.archivesunleashed.ArchiveRecord] = Array()
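
The codes in the `Unexpected character ...(Expecting d)` messages above are easier to read once decoded. This small sketch assumes, without confirmation from this thread, that the reader prints both values in hexadecimal. `HexCharDecoder` is a made-up name for illustration.

```java
final class HexCharDecoder {
    // Interprets a hex code such as "a" or "57" and returns a readable name.
    static String describe(String hex) {
        int code = Integer.parseInt(hex, 16);
        switch (code) {
            case 0x0a: return "LF";
            case 0x0d: return "CR";
            default:   return String.valueOf((char) code);
        }
    }

    public static void main(String[] args) {
        // The codes that appear in the log messages above.
        for (String hex : new String[] {"d", "a", "41", "57"}) {
            System.out.println(hex + " -> " + describe(hex));
        }
    }
}
```

Under that assumption, the reader expected a carriage return and found a line feed, then the letters `A` and `W`, which would be consistent with a file written with LF-only line endings.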

Checked the warc file and it looked like this:

WARC/0.18
WARC-Type: warcinfo
WARC-Date: 2009-03-65T08:43:19-0800
WARC-Record-ID: <urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>
Content-Type: application/warc-fields
Content-Length: 219

software: Nutch 1.0-dev (modified for clueweb09)
isPartOf: clueweb09-en
description: clueweb09 crawl with WARC output
format: WARC file version 0.18
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.18.html

WARC/0.18
WARC-Type: response
WARC-Target-URI: http://00000-nrt-realestate.homepagestartup.com/
WARC-Warcinfo-ID: 993d3969-9643-4934-b1c6-68d4dbe55b83
WARC-Date: 2009-03-65T08:43:19-0800
WARC-Record-ID: <urn:uuid:67f7cabd-146c-41cf-bd01-04f5fa7d5229>
WARC-TREC-ID: clueweb09-en0000-00-00000
Content-Type: application/http;msgtype=response
WARC-Identified-Payload-Type: 
Content-Length: 16558

HTTP/1.1 200 OK
Content-Type: text/html
Date: Tue, 13 Jan 2009 18:05:10 GMT
Pragma: no-cache
Cache-Control: no-cache, must-revalidate
X-Powered-By: PHP/4.4.8
Server: WebServerX
Connection: close
Last-Modified: Tue, 13 Jan 2009 18:05:10 GMT
Expires: Mon, 20 Dec 1998 01:00:00 GMT
Content-Length: 16254

<head> <meta http-equiv="Content-Language" content="en-gb"> <meta http-equiv="Content-Type" 
@ruebot
Member

ruebot commented Sep 5, 2018

Hi @akshedu, thanks for the report. Can you let us know a little bit more? There should have been a template to help tease out some more information. Can you update this ticket, and provide more context? It will help us get to the root of the issue.

This also sounds like it could be a duplicate of #246 and #258.

@akshedu
Author

akshedu commented Sep 5, 2018

Hi @ruebot, updated with more details. If you need the warc file I can share it as well.

@ruebot
Member

ruebot commented Sep 5, 2018

> Downloaded the aut-0.16.1-SNAPSHOT-fatjar.jar

From where? We don't push snapshot builds. Did you build aut locally, and use the --jars option?

As an aside, the template is there to capture a lot of the context that is lost here. At the very least, can you provide your exact steps in this format:

**To Reproduce**
Steps to reproduce the behavior (e.g.):
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

@akshedu
Author

akshedu commented Sep 5, 2018

Steps to reproduce:

  1. Download the aut jar -
    https://github.com/archivesunleashed/aut/releases/download/aut-0.16.0/aut-0.16.0-fatjar.jar

  2. Run the spark-shell with the jar file:

spark-shell --jars ~/Downloads/aut-0.16.0-fatjar.jar
  3. Download the warc file -
    https://www.cse.iitb.ac.in/~soumen/tmp/cw09/00.warc.gz

  4. Load the required modules:

scala> import io.archivesunleashed._
import io.archivesunleashed._

scala> import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox._
  5. Read the warc file:
scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/00.warc.gz", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[2] at map at package.scala:50

scala> r.take(1)
[Stage 0:>                                                          (0 + 1) / 1]2018-09-05 18:38:48 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.io.NotSerializableException: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
	at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
	at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
	at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:38:48 ERROR TaskSetManager:70 - Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1); not retrying
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.archive.io.warc.WARCRecord
Serialization stack:
	- object not serializable (class: org.archive.io.warc.WARCRecord, value: org.archive.io.warc.WARCRecord@5dc9ca4a)
	- field (class: io.archivesunleashed.ArchiveRecordImpl, name: warcRecord, type: class org.archive.io.warc.WARCRecord)
	- object (class io.archivesunleashed.ArchiveRecordImpl, io.archivesunleashed.ArchiveRecordImpl@56d06a37)
	- element of array (index: 0)
	- array (class [Lio.archivesunleashed.ArchiveRecord;, size 1)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
  ... 53 elided
  6. Uncompress the file and try reading again:
scala> val r = RecordLoader.loadArchives("/Users/akshanshgupta/Workspace/00.warc", sc)
r: org.apache.spark.rdd.RDD[io.archivesunleashed.ArchiveRecord] = MapPartitionsRDD[5] at map at package.scala:50

scala> r.take(1)
2018-09-05 18:39:45 WARN  ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/00.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character a(Expecting d)
java.io.IOException: Unexpected character a(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:175)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:39:45 WARN  ArchiveReader$ArchiveRecordIterator:462 - Trying skip of failed record cleanup of {reader-identifier=file:/Users/akshanshgupta/Workspace/00.warc, absolute-offset=0, WARC-Date=2009-03-65T08:43:19-0800, Content-Length=219, WARC-Record-ID=<urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>, WARC-Type=warcinfo, Content-Type=application/warc-fields}: Unexpected character 41(Expecting d)
java.io.IOException: Unexpected character 41(Expecting d)
	at org.archive.io.warc.WARCReader.readExpectedChar(WARCReader.java:80)
	at org.archive.io.warc.WARCReader.gotoEOR(WARCReader.java:68)
	at org.archive.io.ArchiveReader.cleanupCurrentRecord(ArchiveReader.java:176)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.hasNext(ArchiveReader.java:449)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:501)
	at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:436)
	at io.archivesunleashed.data.ArchiveRecordInputFormat$ArchiveRecordReader.nextKeyValue(ArchiveRecordInputFormat.java:186)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:214)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$28.apply(RDD.scala:1358)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
2018-09-05 18:39:45 WARN  WARCReaderFactory$UncompressedWARCReader:502 - Bad Record. Trying skip (Record start 409): Unexpected character 57(Expecting d)
res1: Array[io.archivesunleashed.ArchiveRecord] = Array()  

@ruebot
Member

ruebot commented Sep 5, 2018

@akshedu can you try to reproduce this with Apache Spark 2.1.3? The 0.16.0 release doesn't officially support Apache Spark 2.3.1.

@ianmilligan1
Member

ianmilligan1 commented Sep 5, 2018

I just ran this on 2.1.1.

The following script:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("/mnt/vol1/data_sets/aut_debug/*.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)

led to this error message:

unexpected extra data after record org.archive.io.warc.WARCRecord@790869a4

which, FWIW, is printed by this method in WARCReaderFactory:

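        protected void gotoEOR(ArchiveRecord rec) throws IOException {
            // Drain the rest of the record's input stream, counting the bytes read.
            long skipped = 0;
            while (getIn().read() > -1) {
                skipped++;
            }
            // More than 4 leftover bytes produces the message shown above.
            if (skipped > 4) {
                System.err.println("unexpected extra data after record " + rec);
            }
            return;
        }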
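
The method above drains whatever is left in the record's input stream and warns when more than four bytes remain. Here is a self-contained sketch of that counting logic, with a hypothetical `TrailingBytes` helper and in-memory streams standing in for the real reader. Reading the four-byte threshold as a CRLF CRLF record terminator is an assumption; the quoted snippet does not say so.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class TrailingBytes {
    // Reads the stream to its end and returns how many bytes were left in it.
    static long drain(InputStream in) {
        long skipped = 0;
        try {
            while (in.read() > -1) {
                skipped++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return skipped;
    }

    // Same threshold as the quoted method: more than 4 bytes counts as unexpected.
    static boolean hasUnexpectedExtraData(InputStream in) {
        return drain(in) > 4;
    }

    // Convenience helper for building an in-memory stream from ASCII text.
    static InputStream streamOf(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(hasUnexpectedExtraData(streamOf("\r\n\r\n")));      // 4 bytes left
        System.out.println(hasUnexpectedExtraData(streamOf("\n\nWARC/0.18"))); // 11 bytes left
    }
}
```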

@ruebot
Member

ruebot commented Sep 5, 2018

Same here with 2.1.3

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.3
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("/home/nruest/Downloads/00.warc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)


// Exiting paste mode, now interpreting.

unexpected extra data after record org.archive.io.warc.WARCRecord@6b2cebc5
import io.archivesunleashed._
import io.archivesunleashed.matchbox._
r: Array[(String, Int)] = Array()

@ianmilligan1
Member

@ruebot and I did a bit of digging into this, using JWAT-Tools and then manually looking at the WARC records themselves.

There are some issues with the WARC file itself. Here are the test results:

$ ./jwattools.sh -el test /mnt/vol1/data_sets/aut_debug/00.warc.gz
Showing errors: true
Validate digest: true
Using relaxed URI validation for ARC URL and WARC Target-URI.
Using 1 thread(s).
Output Thread started.
ThreadPool started.
Queued 1 file(s).
ThreadPool shut down.
Output Thread stopped.
#
# Job summary
#
GZip files: 0
  +  Arc: 0
  + Warc: 1
 Arc files: 0
Warc files: 0
    Errors: 124544
  Warnings: 17792
RuntimeErr: 0
   Skipped: 0
      Time: 00:01:02 (62324 ms.)
TotalBytes: 161.1 mb
  AvgBytes: 2.5 mb/s
INVALID: 35582
INVALID_EXPECTED: 71166
REQUIRED_INVALID: 17796
'WARC-Date' header: 17792
'WARC-Date' value: 17792
'WARC-Target-URI' value: 8
'WARC-Warcinfo-ID' value: 35578
Data before WARC version: 17791
Empty lines before WARC version: 17791
Trailing newlines: 17792

We looked into the headers, and here's the WARC header for the broken file:

WARC/0.18
WARC-Type: warcinfo
WARC-Date: 2009-03-65T08:43:19-0800
WARC-Record-ID: <urn:uuid:993d3969-9643-4934-b1c6-68d4dbe55b83>
Content-Type: application/warc-fields
Content-Length: 219

software: Nutch 1.0-dev (modified for clueweb09)
isPartOf: clueweb09-en
description: clueweb09 crawl with WARC output
format: WARC file version 0.18
conformsTo: http://www.archive.org/documents/WarcFileFormat-0.18.html

and here's a working header:

WARC/1.0^M
WARC-Type: warcinfo^M
WARC-Date: 2009-12-18T23:17:27Z^M
WARC-Filename: ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz^M
WARC-Record-ID: <urn:uuid:7bef4ed8-86df-4bd5-a419-95e33c02a667>^M
Content-Type: application/warc-fields^M
Content-Length: 595^M
^M

I have carriage returns turned on here, so we can see how (a) line-endings differ; (b) WARC-Date is different, etc. There are similar mismatches throughout the headers.
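
The two differences called out above can be checked mechanically. The sketch below is a hedged illustration, not how aut or webarchive-commons validate records. It tests whether a header block uses CRLF line endings and whether a `WARC-Date` value parses as a UTC timestamp of the form seen in the working header. `HeaderCheck` and its methods are hypothetical names, and `Instant.parse` is only a convenient approximation of the date rule.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

final class HeaderCheck {
    // True if the block contains at least one line feed and every
    // line feed is preceded by a carriage return.
    static boolean usesCrlf(String headerBlock) {
        for (int i = 0; i < headerBlock.length(); i++) {
            if (headerBlock.charAt(i) == '\n' && (i == 0 || headerBlock.charAt(i - 1) != '\r')) {
                return false;
            }
        }
        return headerBlock.indexOf('\n') >= 0;
    }

    // True if the value parses as an ISO-8601 instant such as 2009-12-18T23:17:27Z.
    static boolean isValidWarcDate(String value) {
        try {
            Instant.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(usesCrlf("WARC/1.0\r\nWARC-Type: warcinfo\r\n")); // working header
        System.out.println(usesCrlf("WARC/0.18\nWARC-Type: warcinfo\n"));    // broken header
        System.out.println(isValidWarcDate("2009-12-18T23:17:27Z"));
        System.out.println(isValidWarcDate("2009-03-65T08:43:19-0800"));
    }
}
```

The broken header's date fails here for two independent reasons: day 65 does not exist, and the `-0800` offset is not in the form `Instant.parse` accepts.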

I'm not an expert on WARCs - I'm not sure if the specification changed dramatically between 0.18 and 1.0, or whether this is an artefact of Nutch or being compressed/decompressed at some stage.

But I think since we rely on the webarchive-commons library, it might be worth opening up an issue there if you want to continue poking at this. It's probably out of scope for AUT. I did see a similar issue there that might be of help.

@ianmilligan1
Member

So, I did get this to work. Broken WARCs stick in my craw!

See the results of the top ten domains here:

r: Array[(String, Int)] = Array((directory.binarybiz.com,1473), (blog.pennlive.com,1037), (americanhistory.si.edu,931), (businessfinder.mlive.com,876), (bama.edebris.com,812), (basnect.info,754), (cbs5.com,665), (2modern.com,599), (clinicaltrials.gov,506), (dotwhat.net,439))

The WARC is basically all screwed up: line endings, headers, etc. (see above).

If you do need to get it to work, however, I used jwattools to decompress and recompress it. The recompressed warc.gz file is correct and now works with AUT. See the set of commands here:

./jwattools.sh decompress /mnt/vol1/data_sets/aut_debug/00.warc.gz
./jwattools.sh compress /mnt/vol1/data_sets/aut_debug/00.warc

After that it works: the re-compression process fixed the file. Not ideal, but I don't think this dataset is ideal from a WARC compliance standpoint. 😄

@ianmilligan1
Member

I'm going to close this issue in light of the above, but we do still have #246 and #258 open. Better error handling is something we do need to work on with AUT.
