Patch for #269: Replace backslash with forward slash in URL #276

Merged on Oct 17, 2018 (4 commits)

@borislin (Collaborator) commented Oct 16, 2018

This PR improves ExtractDomain by replacing backslashes with forward slashes in the URL before passing it to the Java URL class.


GitHub issue(s): #269

What does this Pull Request do?

This PR improves URL parsing in ExtractDomain by replacing backslashes with forward slashes before passing the URL to the Java URL class, which allows ExtractDomain to capture the true domain of a URL.
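
The idea can be sketched in a few lines of Scala. This is a minimal illustration, not the actual ExtractDomain source: the object name `ExtractDomainSketch`, the method `hostOf`, and the `http://` scheme on the sample URL are assumptions made for the example.

```scala
import java.net.URL
import scala.util.Try

// Hypothetical helper for illustration only; not the aut ExtractDomain code.
object ExtractDomainSketch {
  /** Returns the host part of `url`, or "" if the URL cannot be parsed. */
  def hostOf(url: String): String = {
    // Turn backslashes into forward slashes so that java.net.URL
    // sees a path separator right after the host name.
    val normalized = url.replace('\\', '/')
    Try(new URL(normalized).getHost).getOrElse("")
  }
}

// Usage (sample URL modelled on the one reported in #269; scheme assumed):
// ExtractDomainSketch.hostOf("http://seetorontonow.canada-booknow.com\\booking_results.php")
```

With the normalization step, the backslash no longer ends up inside the host string, so only `seetorontonow.canada-booknow.com` is returned for the sample URL.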

How should this be tested?

  • git fetch --all
  • git checkout fix-url
  • mvn clean install
  • Create an output directory with sub-directories:
    mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
  • Adapt the script below:
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
sc.setLogLevel("INFO")

val input = "/tuna1/scratch/i2milligan/warcs.archive-it.org/cgi-bin/getarcs.pl/*.gz"

val output1 = "/tuna1/scratch/aut-issue-269/derivatives/all-domains"
val output2 = "/tuna1/scratch/aut-issue-269/derivatives/all-text"
val output3 = "/tuna1/scratch/aut-issue-269/derivatives/gephi"

RecordLoader.loadArchives(input, sc).map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile(output1)

RecordLoader.loadArchives(input, sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile(output2)

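// Build the domain-to-domain link graph: extract links per page, reduce both
// ends to their domain (minus a leading "www."), drop empty domains, and keep
// only edges seen more than 5 times.
val links = RecordLoader.loadArchives(input, sc).
  keepValidPages().
  map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).
  flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).
  filter(r => r._2 != "" && r._3 != "").
  countItems().
  filter(r => r._2 > 5)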
WriteGraphML(links, output3)

sys.exit
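
The link-graph step in the script above strips a leading `www.` from each extracted domain with the regex `^\s*www\.`. A small standalone sketch of that one step follows; the object name `StripWww` and the sample domains are made up for illustration.

```scala
// Hypothetical wrapper around the same regex used in the script above.
object StripWww {
  // Removes optional leading whitespace followed by a literal "www."
  // at the very start of the string; anything else is left untouched.
  def apply(domain: String): String = domain.replaceAll("^\\s*www\\.", "")
}

// Usage:
// StripWww("www.example.com")       // leading "www." removed
// StripWww("news.www.example.com")  // unchanged, "www." is not at the start
```

Because the pattern is anchored with `^`, only a prefix is removed, so domains that merely contain `www.` somewhere in the middle are not altered.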

Current Results

With this PR patch:

  • /tuna1/scratch/aut-issue-269/derivatives/all-domains or /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt (a combined version of all files in /tuna1/scratch/aut-issue-269/derivatives/all-domains)
  • /tuna1/scratch/aut-issue-269/derivatives/gephi (no longer contains backslashes; the proper domain seetorontonow.canada-booknow.com is extracted from the URL)

Without this PR patch (master branch):

  • /tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch or /tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch.txt (combined version)
  • /tuna1/scratch/aut-issue-269/derivatives/gephi-without-patch (still contains backslashes, as in seetorontonow.canada-booknow.com\booking_results.php)
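
One quick way to compare the two sets of results is to look for output lines that still contain a backslash. The sketch below assumes the lines have already been read into memory; the object name `BackslashCheck` and the sample lines (including their counts) are invented for illustration and are not taken from the actual output files.

```scala
// Hypothetical checker for illustration only.
object BackslashCheck {
  // Keeps only the lines that still contain a backslash character.
  def linesWithBackslash(lines: Seq[String]): Seq[String] =
    lines.filter(_.contains("\\"))
}

// Usage with made-up sample lines in the "(domain,count)" tuple format:
// val sample = Seq(
//   "(seetorontonow.canada-booknow.com,3)",
//   "(seetorontonow.canada-booknow.com\\booking_results.php,3)"
// )
// BackslashCheck.linesWithBackslash(sample)  // keeps only the second line
```

If the patch behaves as described, running such a check over the patched output should yield an empty result.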

Interested parties

@lintool @ianmilligan1 @ruebot @greebie

@codecov-io commented Oct 16, 2018

Codecov Report

Merging #276 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #276   +/-   ##
=======================================
  Coverage   70.36%   70.36%           
=======================================
  Files          41       41           
  Lines        1046     1046           
  Branches      192      192           
=======================================
  Hits          736      736           
  Misses        244      244           
  Partials       66       66

Impacted Files | Coverage Δ
.../io/archivesunleashed/matchbox/ExtractDomain.scala | 87.5% <100%> (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4fe05a5...bf458e9.

@ruebot (Member) commented Oct 17, 2018

@ianmilligan1 you want to test this one out since it is for #269?

@borislin can you update your branch?

@ianmilligan1 (Member) commented

@ruebot yep, will do!

@ianmilligan1 (Member) left a comment

Tested and works well – thanks @borislin!

@ruebot merged commit 7c3a80d into master on Oct 17, 2018
@ruebot deleted the fix-url branch on October 17, 2018 at 12:38