
Data frame implementation of extractors. Also added cmd arguments to resolve #235 (#236)

Merged 4 commits on May 28, 2018

Conversation

@TitusAn (Contributor) commented May 24, 2018

Data frame implementation of extractors. Also added cmd arguments to resolve #235


GitHub issue(s): #235

What does this Pull Request do?

Added a data frame implementation and tests for DomainFrequencyExtractor, DomainGraphExtractor, and PlainTextExtractor.

Also added new command line flags:

  • If --df is present, the program will use the data frame implementation to carry out the analysis.

  • If --split is present, the program will put the results for each input file in its own folder. Otherwise, they will be merged.

  • If --partition N is present, the program will repartition the RDD or data frame to N partitions before writing results. Otherwise, partitioning is left as is (see the sketch below).
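
For illustration, flags like these could be declared with Scallop and --partition applied before writing roughly as follows. This is a minimal sketch with hypothetical names (SketchConf, withPartition), not the PR's exact code:

import org.apache.spark.rdd.RDD
import org.rogach.scallop._

// Minimal sketch: declaring the new flags with Scallop.
// Option names mirror the flags above; this is not the PR's exact code.
class SketchConf(args: Seq[String]) extends ScallopConf(args) {
  val df = opt[Boolean]()        // --df: use the data frame implementation
  val split = opt[Boolean]()     // --split: one output folder per input file
  val partition = opt[Int]()     // --partition N: repartition before writing
  verify()
}

// Applying --partition: repartition to N if supplied, otherwise leave as is.
def withPartition[T](rdd: RDD[T], conf: SketchConf): RDD[T] =
  conf.partition.toOption match {
    case Some(n) => rdd.repartition(n)
    case None    => rdd
  }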

How should this be tested?

Run mvn install to run the tests. Run jobs with the --df option to use the data frame implementation (example invocations are given at the end of this thread).

Additional Notes:

Data frame tests for DomainGraphExtractor are missing because the result differs from the RDD implementation (it has more vertices and edges). I am investigating this and will provide an update.

Interested parties

@lintool @greebie @ianmilligan1

* partition is left as is.
*/

class CmdAppConf(args: Seq[String]) extends ScallopConf(args) {

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}
def apply(d: DataFrame): Dataset[Row] = {

import spark.implicits._

d.select($"CrawlDate",
df.RemovePrefixWWW(df.ExtractDomain($"Src")).as("SrcDomain"),

spaces, not tabs


object PlainTextExtractor {
def apply(records: RDD[ArchiveRecord]) = {
records
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
}
def apply(d: DataFrame): Dataset[Row] = {

Line break and doc comment needed: https://docs.scala-lang.org/style/scaladoc.html
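
For reference, a sketch of the kind of Scaladoc header the style guide asks for; the wording here is illustrative, not the PR's final comment:

/** Extracts plain text from a web archive using DataFrames and Spark SQL.
  *
  * @param d DataFrame of archive records
  * @return Dataset[Row] of crawl date, domain, URL, and extracted text.
  */
def apply(d: DataFrame): Dataset[Row] = {
  // ... body unchanged ...
}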

@@ -39,6 +39,11 @@ object WriteGEXF {
else makeFile (rdd, gexfPath)
}

def apply(ds: Dataset[Row], gexfPath: String): Boolean = {

true
}
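
For comparison with the RDD overload above, a sketch of how the Dataset overload could mirror it. The guard condition is hypothetical, since the RDD version's condition is not shown in this hunk:

def apply(ds: Dataset[Row], gexfPath: String): Boolean = {
  if (gexfPath.isEmpty) false  // hypothetical guard; not necessarily the merged code
  else makeFile(ds, gexfPath)
}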

def makeFile(ds: Dataset[Row], gexfPath: String): Boolean = {

codecov bot commented May 24, 2018

Codecov Report

Merging #236 into master will increase coverage by 8.05%.
The diff coverage is 81.65%.


@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   60.65%   68.71%   +8.05%     
==========================================
  Files          39       39              
  Lines         793      911     +118     
  Branches      139      168      +29     
==========================================
+ Hits          481      626     +145     
+ Misses        269      231      -38     
- Partials       43       54      +11
Impacted Files Coverage Δ
...ain/scala/io/archivesunleashed/app/WriteGEXF.scala 100% <100%> (ø) ⬆️
...o/archivesunleashed/app/DomainGraphExtractor.scala 100% <100%> (ø) ⬆️
...c/main/scala/io/archivesunleashed/df/package.scala 86.95% <100%> (+0.59%) ⬆️
...chivesunleashed/app/DomainFrequencyExtractor.scala 100% <100%> (ø) ⬆️
.../io/archivesunleashed/app/PlainTextExtractor.scala 100% <100%> (ø) ⬆️
...cala/io/archivesunleashed/app/CommandLineApp.scala 75% <75%> (ø)
src/main/scala/io/archivesunleashed/package.scala 84.11% <0%> (+10.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c0a8b78...7b2e300.

@ruebot (Member) commented May 24, 2018

@TitusAn you'll want to check the links to each file in the above CodeCov response. We dropped pretty badly here.

tl;dr

  • CommandLineApp.scala has no coverage
  • WriteGEXF.scala dropped about 50% in coverage
  • DomainGraphExtractor.scala dropped about 35% in coverage

@lintool (Member) commented May 24, 2018

@TitusAn I would suggest renaming the DF ExtractDomain to ExtractBaseDomain, since it also removes the www prefix. Giving it a different name will also reduce confusion with the matchbox version, which does something different.
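
A minimal sketch of what the rename might look like as a DataFrame UDF, assuming a simple host extraction; the actual df package implementation may differ:

import org.apache.spark.sql.functions.udf
import scala.util.Try

// Hypothetical sketch: extract the host and strip the "www." prefix,
// so the name ExtractBaseDomain reflects what is actually returned.
val ExtractBaseDomain = udf { (url: String) =>
  Try(new java.net.URL(url).getHost).getOrElse("").stripPrefix("www.")
}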

@lintool (Member) commented May 27, 2018

@TitusAn nudge on this issue? I want to get this merged in because there are a number of issues I want to address that this is blocking...

@TitusAn (Contributor, Author) commented May 27, 2018

Sorry about that! I will finish this by the end of today.

@ianmilligan1 (Member) commented

Looking good! A quick request, since we will eventually be documenting all this. In the PR you write:

Added a data frame implementation and tests for DomainFrequencyExtractor, DomainGraphExtractor, and PlainTextExtractor.

Also added new command line flags:

  • If --df is present, the program will use the data frame implementation to carry out the analysis.

  • If --split is present, the program will put the results for each input file in its own folder. Otherwise, they will be merged.

  • If --partition N is present, the program will repartition the RDD or data frame to N partitions before writing results. Otherwise, partitioning is left as is.

Would you be able to give a code example for each one (say, running on a directory of sample WARCs)? I'm worried that down the line, when we document this, we'll lose the context of where things are. 😄

@ruebot (Member) left a review comment

Just some doc comment cleanup, and we're good to go.

Thanks for taking care of the tests! Nice work! 😃

verify()
}

/** Main application that parse

Incomplete sentence?

}
}

/** Generic routine for saving RDD obtained from Map Reduce operation of extractors

Full stop (period) at the end of line.

})
)

/** Maps extractor type string to Data Frame Extractors

Full stop (period) at the end of line.

}
}

/** Prepare for invoking RDD implementation of extractors

Full stop (period) at the end of line.

}

/** Choose either Data Frame implementation or RDD implementation of extractors
* depending on the option specified in command line arguments

Full stop (period) at the end of line.

}

/** Entry point for testing.
* Takes an existed spark session to prevent new ones from being created

Full stop (period) at the end of line.

/** Entry point for testing.
* Takes an existed spark session to prevent new ones from being created
*
* @param argv command line arguments (array of strings) .

Remove extra space and full stop at end of line.

.countItems()
}

/** Extract domain frequency from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}

/** Extract domain graph from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

def apply(records: RDD[ArchiveRecord]) = {
records
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
}

/** Extract plain text from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

@ruebot merged commit c73a92b into archivesunleashed:master on May 28, 2018
@TitusAn (Contributor, Author) commented May 28, 2018

@ianmilligan1

Here are some example usages of the new flags:

--df 

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainGraphExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output1 --df

The data frame implementation of DomainGraphExtractor is used.

--partition:

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output2 --df --partition 1

Output will be a single file rather than PART-0000, PART-0001, etc.

--split 

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output3 --df --split

Results for example.arc.gz and example.warc.gz will be in their own directories, rather than merged together.

Successfully merging this pull request may close these issues:

CommandLineAppRunner.scala produces output per WARC instead of combined result.