Updates for changing RemoveHttpHeader to RemoveHTTPHeader. (#19)
- Add ScalaDF example for: Extract Plain Text Without HTTP Headers
- See also:
   - archivesunleashed/aut#368
   - archivesunleashed/aut#374
   - archivesunleashed/aut#370
Gursimran Singh authored and ruebot committed Nov 7, 2019
1 parent f40abd4 commit 4f73504
Showing 2 changed files with 22 additions and 12 deletions.
4 changes: 2 additions & 2 deletions current/collection-analysis.md
@@ -76,9 +76,9 @@ import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
-  .select(ExtractBaseDomain($"Url").as("Domain"))
+  .select(ExtractDomain($"Url").as("Domain"))
.groupBy("Domain").count().orderBy(desc("count"))
-  .show(20, False)
+  .show(20, false)
```

What do I do with the results? See [this guide](df-results.md)!
30 changes: 20 additions & 10 deletions current/text-analysis.md
@@ -53,15 +53,25 @@ import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-noheaders/")
```

As most plain text use cases do not require HTTP headers to be in the output, we are removing headers in the following examples.

### Scala DF

-TODO
+```scala
+import io.archivesunleashed._
+import io.archivesunleashed.df._
+
+RecordLoader.loadArchives("example.warc.gz", sc)
+  .extractValidPagesDF()
+  .select(RemoveHTML($"content"))
+  .write
+  .option("header","true")
+  .csv("plain-text-noheaders/")
+```
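As a quick check on the new DataFrame example, the CSV output can be loaded back and inspected. This is only a sketch: it assumes you are in `spark-shell` (so a `spark` session already exists) and that the `plain-text-noheaders/` output path from the example above was used.

```scala
// Read back the CSV written by the DataFrame example above.
// The `spark` session and the output path are assumptions for illustration.
val plainText = spark.read
  .option("header", "true")
  .csv("plain-text-noheaders/")

// Confirm the schema and preview a few extracted plain-text rows.
plainText.printSchema()
plainText.show(5, false)
```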

### Python DF

@@ -79,7 +89,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-domain/")
```
### Scala DF
@@ -104,7 +114,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepUrlPatterns(Set("(?i)http://www.archive.org/details/.*".r))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("details/")
```

@@ -128,7 +138,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, ExtractBoilerpipeText(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, ExtractBoilerpipeText(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-no-boilerplate/")
```

@@ -156,7 +166,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDate(List("200804"), ExtractDate.DateComponent.YYYYMM)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-200804/")
```

@@ -168,7 +178,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDate(List("2008"), ExtractDate.DateComponent.YYYY)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-2008/")
```

@@ -180,7 +190,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDate(List("2008","2015"), ExtractDate.DateComponent.YYYY)
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-date-filtered-2008-2015/")
```

@@ -213,7 +223,7 @@ import io.archivesunleashed.matchbox._
RecordLoader.loadArchives("example.arc.gz", sc).keepValidPages()
.keepDomains(Set("www.archive.org"))
.keepLanguages(Set("fr"))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-fr/")
```

@@ -239,7 +249,7 @@ import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz",sc).keepValidPages()
.keepContent(Set("radio".r))
-  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHttpHeader(r.getContentString))))
+  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(RemoveHTTPHeader(r.getContentString))))
.saveAsTextFile("plain-text-radio/")
```
