Documentation updates for archivesunleashed/aut#450 (#56)
ruebot authored Apr 20, 2020
1 parent a6106de commit abd0a9b
Showing 1 changed file with 8 additions and 70 deletions.
current/aut-spark-submit-app.md

The Toolkit offers a variety of extraction jobs with
[`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html).
These extraction jobs have a few configuration options.

The extraction jobs have a basic outline of:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner PATH_TO_AUT_JAR ...
```
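
Options for `spark-submit` itself (master URL, memory, and so on) go before the application jar, as described in the Spark documentation linked above; everything after the jar is passed to the Toolkit app. A minimal sketch of a local run, with resource values that are only placeholders to adjust for your machine:

```shell
# Sketch: spark-submit's own options come before the jar;
# the Toolkit's options come after it.
spark-submit \
  --master local[*] \
  --driver-memory 4G \
  --class io.archivesunleashed.app.CommandLineAppRunner \
  path/to/aut-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input /path/to/warcs/* \
  --output output/path
```
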
Additional flags include:

* `--output-format FORMAT` (Used only for the `DomainGraphExtractor`, and the
  options are `CSV` (default) or `GEXF`.)
* `--split` (The extractor will put results for each input file in its own
  directory. Each directory name will be the name of the ARC/WARC file parsed.)
* `--partition N` (The extractor will partition the DataFrame according to N
  before writing results. This is useful for combining all of the results into
  a single file; see the sketch after this list.)
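
None of the examples below use `--split`; a minimal sketch of a run that writes the results for each input ARC/WARC into its own directory, using the same placeholder paths as the rest of this page:

```shell
# Sketch: one output directory per input ARC/WARC file.
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar \
  --extractor DomainFrequencyExtractor \
  --input /path/to/warcs/* \
  --output output/path \
  --split
```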

## Domain Frequency

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `domain` and `count`.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainFrequencyExtractor --input /path/to/warcs/* --output output/path --partition 1
```
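
Even with `--partition 1`, Spark writes the result as a directory containing a single part file. A rough sketch of finding that file and peeking at the most frequent domains, assuming the default `part-*` naming, a plain `domain,count` layout, and no header row; check the actual output before trusting the sort:

```shell
# Sketch: locate the part file and show the highest-count domains.
# Assumes comma-separated domain,count rows with no header row.
ls output/path/
sort -t',' -k2 -nr output/path/part-* | head
```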

## Domain Graph

In addition to the standard text output, an additional flag `--output-format`
can output [GraphML](https://en.wikipedia.org/wiki/GraphML) or
[GEXF](https://gephi.org/gexf/format/).

Text output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format TEXT
```

GEXF output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format GEXF
```

GraphML output:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor DomainGraphExtractor --input /path/to/warcs/* --output output/path --output-format GRAPHML
```
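
GEXF and GraphML are both XML, so a quick well-formedness check before loading a large graph into Gephi can save time. A rough sketch, assuming `xmllint` is available and that the job writes files with `.gexf`/`.graphml` extensions (check the actual output names first):

```shell
# Sketch: find whatever graph files the job wrote and confirm the XML parses.
# Adjust the -name patterns if the Toolkit uses different file names.
find output/path -type f \( -name '*.gexf' -o -name '*.graphml' \) -exec xmllint --noout {} +
```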

## Image Graph

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `src`, `image_url`, and `alt_text`.
Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor ImageGraphExtractor --input /path/to/warcs/* --output output/path --partition 1
```
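
The `image_url` column can serve as a fetch list, but plain `cut` will split on commas inside quoted `alt_text` values. A sketch using a CSV-aware tool instead, assuming csvkit is installed, that the `--partition 1` run above produced a single part file, and that the file carries a header row; none of that is guaranteed, so verify before relying on it:

```shell
# Sketch: extract the image_url column with a CSV-aware tool (csvkit's csvcut).
# Assumes one part file with a header row naming the columns.
csvcut -c image_url output/path/part-* > image-urls.txt
```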

## Plain Text

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `domain`, `url`, and `text`.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor PlainTextExtractor --input /path/to/warcs/* --output output/path --partition 1
```
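
Because the `text` column typically contains embedded newlines, `wc -l` over the output will overcount records. A CSV-aware row count is safer; a sketch with the same csvkit, single-part-file, and header-row assumptions as above:

```shell
# Sketch: count records with a CSV parser rather than counting physical lines,
# since the text column can span multiple physical lines.
csvstat --count output/path/part-*
```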

## Web Pages

This extractor outputs a directory of CSV files or a single CSV file with the
following columns: `crawl_date`, `url`, `mime_type_web_server`, …

**Note**: This extractor will only work with the DataFrame option.

Directory of CSV files:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path
```

A single CSV file:

```shell
spark-submit --class io.archivesunleashed.app.CommandLineAppRunner path/to/aut-fatjar.jar --extractor WebPagesExtractor --input /path/to/warcs/* --output output/path --partition 1
```
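
If a job was already run without `--partition 1`, the part files can usually be merged after the fact instead of re-running the extraction. A sketch assuming Spark's default `part-*` file naming; if each part file carries a header row, the repeated headers will need to be stripped:

```shell
# Sketch: merge Spark part files into one CSV after the fact.
# Watch for repeated header rows if the job wrote headers.
cat output/path/part-* > web-pages.csv
```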
