Data frame implementation of extractors. Also added cmd arguments to resolve #235 #236
Conversation
  * partition is left as is.
  */

class CmdAppConf(args: Seq[String]) extends ScallopConf(args) {

Review comment: Missing lots of doc comments. https://docs.scala-lang.org/style/scaladoc.html
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
}

def apply(d: DataFrame): Dataset[Row] = {

Review comment: Line break, and doc comment. https://docs.scala-lang.org/style/scaladoc.html
import spark.implicits._

d.select($"CrawlDate",
  df.RemovePrefixWWW(df.ExtractDomain($"Src")).as("SrcDomain"),

Review comment: Spaces, not tabs.
object PlainTextExtractor {
  def apply(records: RDD[ArchiveRecord]) = {
    records
      .keepValidPages()
      .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  }

  def apply(d: DataFrame): Dataset[Row] = {

Review comment: Line break, and doc comment needed. https://docs.scala-lang.org/style/scaladoc.html
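The reviewer's Scaladoc request might be satisfied along these lines. This is a hedged sketch: the real extractor operates on `RDD[ArchiveRecord]` and `DataFrame`, so a Spark-free stand-in with illustrative names (`extractPlainText`, `removeHtml`) is used here purely to show the doc comment shape:

```scala
/** Extracts plain text from a sequence of archive records.
  *
  * Spark-free stand-in for the PlainTextExtractor apply methods;
  * only the Scaladoc structure is the point of this sketch.
  *
  * @param records tuples of (crawlDate, domain, url, rawHtml)
  * @return the same tuples with HTML stripped from the content field
  */
def extractPlainText(
    records: Seq[(String, String, String, String)]
): Seq[(String, String, String, String)] = {
  // Crude tag-stripping stand-in for AUT's RemoveHTML.
  def removeHtml(s: String): String = s.replaceAll("<[^>]*>", "").trim
  records.map { case (date, domain, url, html) =>
    (date, domain, url, removeHtml(html))
  }
}
```

Each overload would get its own comment in this style, with `@param` and `@return` tags per the Scala style guide.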
@@ -39,6 +39,11 @@ object WriteGEXF {
  else makeFile(rdd, gexfPath)
}

def apply(ds: Dataset[Row], gexfPath: String): Boolean = {

Review comment: Doc comment. https://docs.scala-lang.org/style/scaladoc.html
  true
}

def makeFile(ds: Dataset[Row], gexfPath: String): Boolean = {

Review comment: Doc comment. https://docs.scala-lang.org/style/scaladoc.html
Codecov Report

@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   60.65%   68.71%    +8.05%
==========================================
  Files          39       39
  Lines         793      911      +118
  Branches      139      168       +29
==========================================
+ Hits          481      626      +145
+ Misses        269      231       -38
- Partials       43       54       +11

Continue to review the full report at Codecov.
@TitusAn you'll want to check the links to each file in the CodeCov response above. Coverage dropped pretty badly here. tl;dr …

@TitusAn I would suggest renaming the DF …

@TitusAn nudge on this issue? I want to get this merged in, because there are a number of issues I want to address that this is blocking...

Sorry about that! I will finish this by the end of today.
Looking good! Quick request, as we will eventually be documenting all this. In the PR you write:

Would you be able to give a code example for each one (say, running on a directory of sample WARCs)? I'm worried that down the line, when we document this, we'll lose the context of where things are. 😄
Just some doc comment cleanup, and we're good to go.
Thanks for taking care of the tests! Nice work! 😃
  verify()
}

/** Main application that parse

Review comment: Incomplete sentence?
  }
}

/** Generic routine for saving RDD obtained from Map Reduce operation of extractors

Review comment: Full stop (period) at the end of the line.
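The "generic routine for saving" under review might look roughly like the following. This is a minimal sketch under stated assumptions: it uses plain `java.io` to write tab-separated lines instead of Spark's `saveAsTextFile`, and the helper name `saveAsTsv` and the output format are illustrative, not the PR's actual code:

```scala
import java.io.{File, PrintWriter}

/** Saves extractor output as tab-separated lines.
  *
  * Stand-in for saving an RDD with saveAsTextFile: here the
  * output is a single local file rather than partitioned parts.
  *
  * @param rows extractor results, e.g. (crawlDate, domain, count)
  * @param path output file path
  */
def saveAsTsv(rows: Seq[(String, String, Long)], path: String): Unit = {
  val writer = new PrintWriter(new File(path))
  try rows.foreach { case (date, domain, count) =>
    writer.println(s"$date\t$domain\t$count")
  } finally writer.close()
}
```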
  })
)

/** Maps extractor type string to Data Frame Extractors

Review comment: Full stop (period) at the end of the line.
  }
}

/** Prepare for invoking RDD implementation of extractors

Review comment: Full stop (period) at the end of the line.
}

/** Choose either Data Frame implementation or RDD implementation of extractors
  * depending on the option specified in command line arguments

Review comment: Full stop (period) at the end of the line.
}

/** Entry point for testing.
  * Takes an existed spark session to prevent new ones from being created

Review comment: Full stop (period) at the end of the line.
/** Entry point for testing.
  * Takes an existed spark session to prevent new ones from being created
  *
  * @param argv command line arguments (array of strings) .

Review comment: Remove the extra space, and add a full stop at the end of the line.
    .countItems()
}

/** Extract domain frequency from web archive using Data Frame and Spark SQL

Review comment: Full stop (period) at the end of the line.
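The `countItems()`-then-`filter` pattern used by these extractors can be sketched with plain Scala collections. This is a hedged, Spark-free stand-in: AUT's `countItems()` counts occurrences of each item in an RDD, which `groupBy` emulates here, and the function name `frequentDomains` is illustrative:

```scala
/** Counts occurrences of each domain and keeps the frequent ones,
  * mirroring the RDD chain .countItems().filter(r => r._2 > 5).
  */
def frequentDomains(domains: Seq[String], minCount: Long = 5): Seq[(String, Long)] =
  domains
    .groupBy(identity)                            // domain -> all occurrences
    .map { case (d, occ) => (d, occ.size.toLong) } // domain -> count
    .filter { case (_, count) => count > minCount }
    .toSeq
    .sortBy(-_._2)                                // most frequent first
```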
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)
}

/** Extract domain graph from web archive using Data Frame and Spark SQL

Review comment: Full stop (period) at the end of the line.
def apply(records: RDD[ArchiveRecord]) = {
  records
    .keepValidPages()
    .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
}

/** Extract plain text from web archive using Data Frame and Spark SQL

Review comment: Full stop (period) at the end of the line.
Here is some example usage for the new flags:

The data frame implementation of DomainGraphExtractor is used.
Output will be a single file rather than PART-0000, PART-0001, etc.
Results for example.arc.gz and example.warc.gz will be in their own directories, rather than merged together.
Data frame implementation of extractors. Also added cmd arguments to resolve #235

GitHub issue(s):

What does this Pull Request do?

Added data frame implementations and tests for DomainFrequencyExtractor, DomainGraphExtractor, and PlainTextExtractor.

Also added new command line flags:
If --df is present, the program will use data frames to carry out the analysis.
If --split is present, the program will put the results for each input file in its own folder. Otherwise, they will be merged.
If --partition N is present, the program will partition the RDD or data frame according to N before writing results. Otherwise, the partitioning is left as is.
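The handling of the three flags could be sketched as follows. This is a hand-rolled stand-in rather than the Scallop-based CmdAppConf actually added in the PR; the case class and function names here are illustrative assumptions:

```scala
/** Parsed command line options, mirroring the PR's new flags. */
case class ExtractorOpts(
    useDf: Boolean = false,        // --df: use the data frame implementation
    split: Boolean = false,        // --split: one output folder per input file
    partition: Option[Int] = None  // --partition N: repartition before writing
)

/** Minimal stand-in for the Scallop-based CmdAppConf. */
def parseOpts(args: Seq[String]): ExtractorOpts = {
  def loop(rest: List[String], acc: ExtractorOpts): ExtractorOpts = rest match {
    case "--df" :: tail             => loop(tail, acc.copy(useDf = true))
    case "--split" :: tail          => loop(tail, acc.copy(split = true))
    case "--partition" :: n :: tail => loop(tail, acc.copy(partition = Some(n.toInt)))
    case _ :: tail                  => loop(tail, acc) // ignore other args
    case Nil                        => acc
  }
  loop(args.toList, ExtractorOpts())
}
```

With this sketch, omitting --partition leaves `partition` as `None`, matching the "left as is" behavior described above.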
How should this be tested?

Run mvn install to run the tests. Run jobs with the --df option to use the data frame implementation.

Additional Notes:

Data frame tests for DomainGraphExtractor are missing because the result differs from that of the RDD implementation. (It has more vertices and edges.) I am investigating this and will provide an update.

Interested parties

@lintool @greebie @ianmilligan1