Documentation updates for archivesunleashed/aut#450 (#56)
ruebot authored Apr 20, 2020
1 parent a6106de commit abd0a9b
Showing 1 changed file with 8 additions and 70 deletions.
current/aut-spark-submit-app.md

The Toolkit offers a variety of extraction jobs with
[`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html).
These extraction jobs have a few configuration options.

The extraction jobs have a basic outline of:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR ...
```
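
Options for `spark-submit` itself (master URL, memory, and so on) go before the application jar, as described in the Spark documentation linked above; everything after the jar is passed to the Toolkit app. A minimal sketch of a local run, with resource values that are only placeholders to adjust for your machine:

```shell
# Sketch: spark-submit's own options come before the jar;
# the Toolkit's options come after it.
spark-submit \
  --master local[*] \
  --driver-memory 4G \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  path/to/aut-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input /path/to/warcs/* \
  --output output/path
```
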
Additional flags include:

* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
  options are `CSV` (default) or `GEXF`.)
* `--split` (The extractor will put results for each input file in its own
  directory. Each directory name will be the name of the ARC/WARC file parsed.)
* `--partition N` (The extractor will partition the DataFrame according to N
  before writing results. This is useful for combining all of the results into
  a single file; see the sketch after this list.)
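
None of the examples below use `--split`; a minimal sketch of a run that writes the results for each input ARC/WARC into its own directory, using the same placeholder paths as the rest of this page:

```shell
# Sketch: one output directory per input ARC/WARC file.
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input /path/to/warcs/* \
  --output output/path \
  --split
```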

## Domain Frequency

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `domain` and `count`.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1
```
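
Even with `--partition 1`, Spark writes the result as a directory containing a single part file. A rough sketch of finding that file and peeking at the most frequent domains, assuming the default `part-*` naming, a plain `domain,count` layout, and no header row; check the actual output before trusting the sort:

```shell
# Sketch: locate the part file and show the highest-count domains.
# Assumes comma-separated domain,count rows with no header row.
ls output/path/
sort -t',' -k2 -nr output/path/part-* | head
```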

## Domain Graph

In addition to the standard text output, an additional flag `--output-format`
can output [GraphML](https://en.wikipedia.org/wiki/GraphML) or
[GEXF](https://gephi.org/gexf/format/).

Text output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format TEXT
```

GEXF output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format GEXF
```

GraphML output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format GRAPHML
```
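
GEXF and GraphML are both XML, so a quick well-formedness check before loading a large graph into Gephi can save time. A rough sketch, assuming `xmllint` is available and that the job writes files with `.gexf`/`.graphml` extensions (check the actual output names first):

```shell
# Sketch: find whatever graph files the job wrote and confirm the XML parses.
# Adjust the -name patterns if the Toolkit uses different file names.
find output/path -type f \( -name '*.gexf' -o -name '*.graphml' \) -exec xmllint --noout {} +
```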

## Image Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.
Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --partition 1
```
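
The `image_url` column can serve as a fetch list, but plain `cut` will split on commas inside quoted `alt_text` values. A sketch using a CSV-aware tool instead, assuming csvkit is installed, that the `--partition 1` run above produced a single part file, and that the file carries a header row; none of that is guaranteed, so verify before relying on it:

```shell
# Sketch: extract the image_url column with a CSV-aware tool (csvkit's csvcut).
# Assumes one part file with a header row naming the columns.
csvcut -c image_url output/path/part-* > image-urls.txt
```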

## Plain Text

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `domain`, `url`, and `text`.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1
```
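
Because the `text` column typically contains embedded newlines, `wc -l` over the output will overcount records. A CSV-aware row count is safer; a sketch with the same csvkit, single-part-file, and header-row assumptions as above:

```shell
# Sketch: count records with a CSV parser rather than counting physical lines,
# since the text column can span multiple physical lines.
csvstat --count output/path/part-*
```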

## Web Pages

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `url`, `mime_type_web_server`, …

**Note**: This extractor will only work with the DataFrame option.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --partition 1
```
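
If a job was already run without `--partition 1`, the part files can usually be merged after the fact instead of re-running the extraction. A sketch assuming Spark's default `part-*` file naming; if each part file carries a header row, the repeated headers will need to be stripped:

```shell
# Sketch: merge Spark part files into one CSV after the fact.
# Watch for repeated header rows if the job wrote headers.
cat output/path/part-* > web-pages.csv
```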
