
Commit

stuff
mateiz committed May 28, 2014
1 parent 181f217 commit 61d72b4
Showing 9 changed files with 55 additions and 156 deletions.
4 changes: 3 additions & 1 deletion docs/_layouts/global.html
@@ -89,6 +89,8 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="spark-standalone.html">Standalone Mode</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
@@ -102,7 +104,7 @@
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
<li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li class="divider"></li>
108 changes: 6 additions & 102 deletions docs/cluster-overview.md
@@ -4,7 +4,8 @@ title: Cluster Mode Overview
---

This document gives a short overview of how Spark runs on clusters, to make it easier to understand
the components involved.
the components involved. Read through the [application submission guide](submitting-applications.html)
to learn how to submit applications to a cluster.

# Components

@@ -50,107 +51,10 @@ The system currently supports three cluster managers:
In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
cluster on Amazon EC2.

# Bundling and Launching Applications

### Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
your application in order to distribute the code to a Spark cluster. To do this,
create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
[sbt](https://github.com/sbt/sbt-assembly) and
[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
have assembly plugins. When creating assembly jars, list Spark and Hadoop
as `provided` dependencies; these need not be bundled since they are provided by
the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
script as shown here while passing your jar.
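
For example, with sbt you might mark Spark and Hadoop as `provided` in your build definition. The
sketch below is only illustrative — the artifact versions are placeholders, and it assumes the
sbt-assembly plugin is enabled in `project/plugins.sbt`:

{% highlight scala %}
// build.sbt (sketch)
name := "my-spark-app"

scalaVersion := "2.10.4"  // placeholder; match the Scala version of your Spark build

libraryDependencies ++= Seq(
  // "provided" keeps Spark and Hadoop out of the assembly jar,
  // since the cluster manager supplies them at runtime.
  "org.apache.spark" %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0" % "provided"
)
{% endhighlight %}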

For Python, you can use the `pyFiles` argument of SparkContext
or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.

### Launching Applications with Spark submit

Once a user application is bundled, it can be launched using the `spark-submit` script located in
the bin directory. This script takes care of setting up the classpath with Spark and its
dependencies, and can work with all of the cluster managers and deploy modes that Spark supports:

{% highlight bash %}
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
{% endhighlight %}

* `<main-class>`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`).
* `<master-url>`: The URL of the master node (e.g. `spark://23.195.26.187:7077`).
* `<deploy-mode>`: Whether to launch the driver inside the cluster (`cluster`) or in an external client process (`client`).
* `<application-jar>`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an `hdfs://` path or a `file://` path that is present on all nodes.
* `[application-arguments]`: Space-separated arguments passed to the main method of `<main-class>`, if any.

To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
examples of common options:

{% highlight bash %}
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster mode (use --master yarn-client for client mode)
HADOOP_CONF_DIR=XX ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
{% endhighlight %}

### Loading Configurations from a File

The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default it will read configuration options
from `conf/spark-defaults.conf`. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.
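
For example, here is a minimal Scala sketch of that precedence (the property value is illustrative);
the explicit `set` call below would win over both an `--executor-memory` flag and an entry in
`conf/spark-defaults.conf`:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PrecedenceExample")
  .set("spark.executor.memory", "4g")  // highest precedence: set directly on SparkConf

val sc = new SparkContext(conf)
{% endhighlight %}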

If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.

### Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. Spark uses the following URL schemes to allow
different strategies for disseminating jars:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
every executor pulls the file from the driver HTTP server.
- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
- **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node. This
means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
or shared via NFS, GlusterFS, etc.
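
The same schemes can also be used when adding dependencies programmatically through
`SparkContext.addJar`; a brief sketch (the paths below are hypothetical):

{% highlight scala %}
sc.addJar("/opt/libs/my-udfs.jar")                  // file: served by the driver's HTTP file server
sc.addJar("hdfs://namenode:8020/libs/common.jar")   // hdfs: pulled down by each executor
sc.addJar("local:/opt/libs/native-deps.jar")        // local: must already exist on every worker node
{% endhighlight %}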

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.

For Python, the equivalent `--py-files` option can be used to distribute `.egg` and `.zip` libraries
to executors.
# Submitting Applications

Applications can be submitted to a cluster of any type using the `spark-submit` script.
The [application submission guide](submitting-applications.html) describes how to do this.

# Monitoring

2 changes: 1 addition & 1 deletion docs/graphx-programming-guide.md
@@ -86,7 +86,7 @@ support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and
[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
explore the new GraphX API and comment on issues that may complicate the transition from Bagel.

## Upgrade Guide from Spark 0.9.1
## Migrating from Spark 0.9.1

GraphX in Spark {{site.SPARK_VERSION}} contains one user-facing interface change from Spark 0.9.1. [`EdgeRDD`][EdgeRDD] may now store adjacent vertex attributes to construct the triplets, so it has gained a type parameter. The edges of a graph of type `Graph[VD, ED]` are of type `EdgeRDD[ED, VD]` rather than `EdgeRDD[ED]`.

2 changes: 1 addition & 1 deletion docs/hadoop-third-party-distributions.md
@@ -1,6 +1,6 @@
---
layout: global
title: Running with Cloudera and HortonWorks
title: Third-Party Hadoop Distributions
---

Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
2 changes: 1 addition & 1 deletion docs/index.md
@@ -29,7 +29,7 @@ Spark comes with several sample programs. Scala, Java and Python examples are i
`examples/src/main` directory. To run one of the Java or Scala sample programs, use
`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
invokes the more general
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
[`spark-submit` script](submitting-applications.html) for
launching applications). For example,

./bin/run-example SparkPi 10
67 changes: 22 additions & 45 deletions docs/programming-guide.md
@@ -122,12 +122,6 @@ val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

<div data-lang="java" markdown="1">
@@ -141,12 +135,6 @@ SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

<div data-lang="python" markdown="1">
@@ -160,16 +148,19 @@ conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

</div>

The `appName` parameter is a name for your application to show on the cluster UI.
`master` is a [Spark, Mesos or YARN cluster URL](submitting-applications.html#master-urls),
or a special "local" string to run in local mode.
In practice, when running on a cluster, you will not want to hardcode `master` in the program,
but rather [launch the application with `spark-submit`](submitting-applications.html) and
receive the master URL there. However, for local testing and unit tests, you can pass "local" to run Spark
in-process.
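
For example, a minimal sketch of both patterns (the application name is a placeholder):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// For cluster use: leave the master unset here and supply it via
// `spark-submit --master ...` or conf/spark-defaults.conf.
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

// For local testing, hardcoding a local master is fine:
val testConf = new SparkConf().setAppName("MyAppTest").setMaster("local[2]")
{% endhighlight %}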


## Using the Shell

<div class="codetabs">
@@ -193,7 +184,7 @@ $ ./bin/spark-shell --master local[4] --jars code.jar
{% endhighlight %}

For a complete list of options, run `spark-shell --help`. Behind the scenes,
`spark-shell` invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit).
`spark-shell` invokes the more general [`spark-submit` script](submitting-applications.html).

</div>

@@ -216,7 +207,7 @@ $ ./bin/pyspark --master local[4] --py-files code.py
{% endhighlight %}

For a complete list of options, run `pyspark --help`. Behind the scenes,
`pyspark` invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit).
`pyspark` invokes the more general [`spark-submit` script](submitting-applications.html).

It is also possible to launch the PySpark shell in [IPython](http://ipython.org), the
enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To
@@ -1221,33 +1212,19 @@ vecAccum = sc.accumulator(Vector(...))(VectorAccumulatorParam())

</div>


# Deploying to a Cluster

### Master URLs

The master URL passed to Spark can be in one of the following formats:
The [application submission guide](submitting-applications.html) describes how to submit applications to a cluster.
In short, once you package your application into a JAR (for Java/Scala) or a set of `.py` or `.zip` files (for Python),
the `bin/spark-submit` script lets you submit it to any supported cluster manager.

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The port must be whichever one your cluster is configured to use, which is 5050 by default.
Or, for a Mesos cluster using ZooKeeper, use mesos://zk://....
</td></tr>
<tr><td> yarn-client </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
</td></tr>
<tr><td> yarn-cluster </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
</td></tr>
</table>
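
For illustration, these are the kinds of values you might pass to `SparkConf.setMaster` (the host
names, ports and ZooKeeper paths below are made up):

{% highlight scala %}
new SparkConf().setMaster("local[*]")                     // use all local cores
new SparkConf().setMaster("spark://host1:7077")           // standalone cluster master
new SparkConf().setMaster("mesos://zk://zk1:2181/mesos")  // Mesos cluster via ZooKeeper
{% endhighlight %}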
# Unit Testing

Spark is friendly to unit testing with any popular unit test framework.
Simply create a `SparkContext` in your test with the master URL set to `local`, run your operations,
and then call `SparkContext.stop()` to tear it down.
Make sure you stop the context within a `finally` block or the test framework's `tearDown` method,
as Spark does not support two contexts running concurrently in the same program.
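
A minimal sketch of this pattern in Scala (the computation and assertion are illustrative):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

def testWordCount(): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  try {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2)
  } finally {
    sc.stop()  // always stop; Spark does not allow two concurrent contexts in one program
  }
}
{% endhighlight %}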

# Migrating from pre-1.0 Versions of Spark

@@ -1288,8 +1265,8 @@ have changed from returning (key, list of values) pairs to (key, iterable of val

</div>

Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x)
and [MLlib](mllib-guide.html#migration-guide).
Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x),
[MLlib](mllib-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).


# Where to Go from Here
19 changes: 16 additions & 3 deletions docs/quick-start.md
@@ -442,6 +442,19 @@ Lines with a: 46, Lines with b: 23
# Where to Go from Here
Congratulations on running your first Spark application!

* For an in-depth overview of the API see "Programming Guides" menu section.
* For running applications on a cluster head to the [deployment overview](cluster-overview.html).
* For configuration options available to Spark applications see the [configuration page](configuration.html).
* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html),
or see "Programming Guides" menu for other components.
* For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
* Finally, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
You can run them as follows:

{% highlight bash %}
# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
{% endhighlight %}
5 changes: 4 additions & 1 deletion docs/running-on-mesos.md
@@ -116,7 +116,7 @@ The driver also needs some configuration in `spark-env.sh` to interact properly
2. Also set `spark.executor.uri` to `<URL of spark-{{site.SPARK_VERSION}}.tar.gz>`.

Now when starting a Spark application against the cluster, pass a `mesos://`
or `zk://` URL as the master when creating a `SparkContext`. For example:
URL as the master when creating a `SparkContext`. For example:

{% highlight scala %}
val conf = new SparkConf()
@@ -126,6 +126,9 @@ val conf = new SparkConf()
val sc = new SparkContext(conf)
{% endhighlight %}

(You can also use [`spark-submit`](submitting-applications.html) and configure `spark.executor.uri`
in the [conf/spark-defaults.conf](configuration.html#loading-default-configurations) file.)

When running a shell, the `spark.executor.uri` parameter is inherited from `SPARK_EXECUTOR_URI`, so
it does not need to be redundantly passed in as a system property.

2 changes: 1 addition & 1 deletion docs/spark-standalone.md
@@ -237,7 +237,7 @@ You can also pass an option `--cores <numCores>` to control the number of cores

Spark supports two deploy modes: applications may run with the driver inside the client process or
entirely inside the cluster. The
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) provides the
[`spark-submit` script](submitting-applications.html) provides the
most straightforward way to submit a compiled Spark application to the cluster in either deploy
mode.

