
Data frame implementation of extractors. Also added cmd arguments to resolve #235 (#236)

Merged 4 commits on May 28, 2018

Conversation

@TitusAn (Contributor) commented May 24, 2018

Data frame implementation of extractors. Also added cmd arguments to resolve #235


GitHub issue(s): #235

What does this Pull Request do?

Added a data frame implementation and tests for DomainFrequencyExtractor, DomainGraphExtractor, and PlainTextExtractor.

Also added new command line flags:

  • If --df is present, the program will use the data frame implementation to carry out the analysis.

  • If --split is present, the program will put the results for each input file in its own folder. Otherwise, they will be merged.

  • If --partition N is present, the program will repartition the RDD or data frame to N partitions before writing results. Otherwise, partitioning is left as is (see the sketch below).
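
For illustration, flags like these could be declared with Scallop and --partition applied before writing roughly as follows. This is a minimal sketch with hypothetical names (SketchConf, withPartition), not the PR's exact code:

import org.apache.spark.rdd.RDD
import org.rogach.scallop._

// Minimal sketch: declaring the new flags with Scallop.
// Option names mirror the flags above; this is not the PR's exact code.
class SketchConf(args: Seq[String]) extends ScallopConf(args) {
  val df = opt[Boolean]()        // --df: use the data frame implementation
  val split = opt[Boolean]()     // --split: one output folder per input file
  val partition = opt[Int]()     // --partition N: repartition before writing
  verify()
}

// Applying --partition: repartition to N if supplied, otherwise leave as is.
def withPartition[T](rdd: RDD[T], conf: SketchConf): RDD[T] =
  conf.partition.toOption match {
    case Some(n) => rdd.repartition(n)
    case None    => rdd
  }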

How should this be tested?

Run mvn install to run the tests. Run jobs with the --df option to use the data frame implementation (example invocations are given at the end of this thread).

Additional Notes:

Data frame tests for DomainGraphExtractor are missing because the result differs from the RDD implementation (it has more vertices and edges). I am investigating this and will provide an update.

Interested parties

@lintool @greebie @ianmilligan1

* partition is left as is.
*/

class CmdAppConf(args: Seq[String]) extends ScallopConf(args) {

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}
def apply(d: DataFrame): Dataset[Row] = {

import spark.implicits._

d.select($"CrawlDate",
df.RemovePrefixWWW(df.ExtractDomain($"Src")).as("SrcDomain"),

spaces, not tabs


object PlainTextExtractor {
def apply(records: RDD[ArchiveRecord]) = {
records
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
}
def apply(d: DataFrame): Dataset[Row] = {

Line break and doc comment needed: https://docs.scala-lang.org/style/scaladoc.html
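
For reference, a sketch of the kind of Scaladoc header the style guide asks for; the wording here is illustrative, not the PR's final comment:

/** Extracts plain text from a web archive using DataFrames and Spark SQL.
  *
  * @param d DataFrame of archive records
  * @return Dataset[Row] of crawl date, domain, URL, and extracted text.
  */
def apply(d: DataFrame): Dataset[Row] = {
  // ... body unchanged ...
}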

@@ -39,6 +39,11 @@ object WriteGEXF {
else makeFile (rdd, gexfPath)
}

def apply(ds: Dataset[Row], gexfPath: String): Boolean = {

true
}
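
For comparison with the RDD overload above, a sketch of how the Dataset overload could mirror it. The guard condition is hypothetical, since the RDD version's condition is not shown in this hunk:

def apply(ds: Dataset[Row], gexfPath: String): Boolean = {
  if (gexfPath.isEmpty) false  // hypothetical guard; not necessarily the merged code
  else makeFile(ds, gexfPath)
}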

def makeFile(ds: Dataset[Row], gexfPath: String): Boolean = {

codecov bot commented May 24, 2018

Codecov Report

Merging #236 into master will increase coverage by 8.05%.
The diff coverage is 81.65%.


@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   60.65%   68.71%   +8.05%     
==========================================
  Files          39       39              
  Lines         793      911     +118     
  Branches      139      168      +29     
==========================================
+ Hits          481      626     +145     
+ Misses        269      231      -38     
- Partials       43       54      +11
Impacted Files Coverage Δ
...ain/scala/io/archivesunleashed/app/WriteGEXF.scala 100% <100%> (ø) ⬆️
...o/archivesunleashed/app/DomainGraphExtractor.scala 100% <100%> (ø) ⬆️
...c/main/scala/io/archivesunleashed/df/package.scala 86.95% <100%> (+0.59%) ⬆️
...chivesunleashed/app/DomainFrequencyExtractor.scala 100% <100%> (ø) ⬆️
.../io/archivesunleashed/app/PlainTextExtractor.scala 100% <100%> (ø) ⬆️
...cala/io/archivesunleashed/app/CommandLineApp.scala 75% <75%> (ø)
src/main/scala/io/archivesunleashed/package.scala 84.11% <0%> (+10.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c0a8b78...7b2e300.

@ruebot (Member) commented May 24, 2018

@TitusAn you'll want to check the links to each file in the above CodeCov response. We dropped pretty badly here.

tl;dr

  • CommandLineApp.scala has no coverage
  • WriteGEXF.scala dropped about 50% in coverage
  • DomainGraphExtractor.scala dropped about 35% in coverage

@lintool (Member) commented May 24, 2018

@TitusAn I would suggest renaming the DF ExtractDomain to ExtractBaseDomain, since it also removes the www prefix. Giving it a different name will also reduce confusion with the matchbox version, which does something different.
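
A minimal sketch of what the rename might look like as a DataFrame UDF, assuming a simple host extraction; the actual df package implementation may differ:

import org.apache.spark.sql.functions.udf
import scala.util.Try

// Hypothetical sketch: extract the host and strip the "www." prefix,
// so the name ExtractBaseDomain reflects what is actually returned.
val ExtractBaseDomain = udf { (url: String) =>
  Try(new java.net.URL(url).getHost).getOrElse("").stripPrefix("www.")
}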

@lintool (Member) commented May 27, 2018

@TitusAn nudge on this issue? I want to get this merged in because there are a number of issues I want to address that this is blocking...

@TitusAn (Contributor, Author) commented May 27, 2018

Sorry about that! I will finish this by the end of today.

@ianmilligan1 (Member) commented

Looking good! A quick request, since we will eventually be documenting all this. In the PR you write:

Added a data frame implementation and tests for DomainFrequencyExtractor, DomainGraphExtractor, and PlainTextExtractor.

Also added new command line flags:

  • If --df is present, the program will use the data frame implementation to carry out the analysis.

  • If --split is present, the program will put the results for each input file in its own folder. Otherwise, they will be merged.

  • If --partition N is present, the program will repartition the RDD or data frame to N partitions before writing results. Otherwise, partitioning is left as is.

Would you be able to give a code example for each one (say, running on a directory of sample WARCs)? I'm worried that down the line, when we document this, we'll lose the context of where things are. 😄

@ruebot (Member) left a review comment

Just some doc comment cleanup, and we're good to go.

Thanks for taking care of the tests! Nice work! 😃

verify()
}

/** Main application that parse

Incomplete sentence?

}
}

/** Generic routine for saving RDD obtained from Map Reduce operation of extractors

Full stop (period) at the end of line.

})
)

/** Maps extractor type string to Data Frame Extractors

Full stop (period) at the end of line.

}
}

/** Prepare for invoking RDD implementation of extractors

Full stop (period) at the end of line.

}

/** Choose either Data Frame implementation or RDD implementation of extractors
* depending on the option specified in command line arguments

Full stop (period) at the end of line.

}

/** Entry point for testing.
* Takes an existed spark session to prevent new ones from being created

Full stop (period) at the end of line.

/** Entry point for testing.
* Takes an existed spark session to prevent new ones from being created
*
* @param argv command line arguments (array of strings) .

Remove extra space and full stop at end of line.

.countItems()
}

/** Extract domain frequency from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

.filter(r => r._2 != "" && r._3 != "")
.countItems()
.filter(r => r._2 > 5)
}

/** Extract domain graph from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

def apply(records: RDD[ArchiveRecord]) = {
records
.keepValidPages()
.map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
}

/** Extract plain text from web archive using Data Frame and Spark SQL

Full stop (period) at the end of line.

@ruebot merged commit c73a92b into archivesunleashed:master on May 28, 2018
@TitusAn (Contributor, Author) commented May 28, 2018

@ianmilligan1

Here are some example usages of the new flags:

--df 

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainGraphExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output1 --df

The data frame implementation of DomainGraphExtractor is used.

--partition:

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output2 --df --partition 1

Output will be a single file rather than PART-0000, PART-0001, etc.

--split 

./aut_runtree/spark-2.3.0-bin-hadoop2.7/bin/spark-submit \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  ./aut_self/aut/target/aut-0.16.1-SNAPSHOT-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input ./aut_self/aut/src/test/resources/warc/example.warc.gz \
          ./aut_self/aut/src/test/resources/arc/example.arc.gz \
  --output output3 --df --split

Results for example.arc.gz and example.warc.gz will be in their own directories, rather than merged together.

Successfully merging this pull request may close these issues:

CommandLineAppRunner.scala produces output per WARC instead of combined result.