
Commit

stuff
mateiz committed May 28, 2014
1 parent 181f217 commit 61d72b4
Showing 9 changed files with 55 additions and 156 deletions.
4 changes: 3 additions & 1 deletion docs/_layouts/global.html
@@ -89,6 +89,8 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="spark-standalone.html">Standalone Mode</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
@@ -102,7 +104,7 @@
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
<li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li class="divider"></li>
108 changes: 6 additions & 102 deletions docs/cluster-overview.md
@@ -4,7 +4,8 @@ title: Cluster Mode Overview
---

This document gives a short overview of how Spark runs on clusters, to make it easier to understand
the components involved.
the components involved. Read through the [application submission guide](submitting-applications.html)
to learn how to submit applications to a cluster.

# Components

@@ -50,107 +51,10 @@ The system currently supports three cluster managers:
In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
cluster on Amazon EC2.

# Bundling and Launching Applications

### Bundling Your Application's Dependencies
If your code depends on other projects, you will need to package them alongside
your application in order to distribute the code to a Spark cluster. To do this,
create an assembly jar (or "uber" jar) containing your code and its dependencies. Both
[sbt](https://github.com/sbt/sbt-assembly) and
[Maven](http://maven.apache.org/plugins/maven-shade-plugin/)
have assembly plugins. When creating assembly jars, list Spark and Hadoop
as `provided` dependencies; these need not be bundled since they are provided by
the cluster manager at runtime. Once you have an assembled jar you can call the `bin/spark-submit`
script as shown here while passing your jar.
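
For example, with sbt you might mark Spark and Hadoop as `provided` in your build definition. The
sketch below is only illustrative — the artifact versions are placeholders, and it assumes the
sbt-assembly plugin is enabled in `project/plugins.sbt`:

{% highlight scala %}
// build.sbt (sketch)
name := "my-spark-app"

scalaVersion := "2.10.4"  // placeholder; match the Scala version of your Spark build

libraryDependencies ++= Seq(
  // "provided" keeps Spark and Hadoop out of the assembly jar,
  // since the cluster manager supplies them at runtime.
  "org.apache.spark" %% "spark-core"    % "1.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.2.0" % "provided"
)
{% endhighlight %}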

For Python, you can use the `pyFiles` argument of SparkContext
or its `addPyFile` method to add `.py`, `.zip` or `.egg` files to be distributed.

### Launching Applications with Spark submit

Once a user application is bundled, it can be launched using the `spark-submit` script located in
the bin directory. This script takes care of setting up the classpath with Spark and its
dependencies, and can work with all of the cluster managers and deploy modes that Spark supports:

{% highlight bash %}
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  ... # other options
  <application-jar> \
  [application-arguments]
{% endhighlight %}

* `<main-class>`: The entry point for your application (e.g. `org.apache.spark.examples.SparkPi`).
* `<master-url>`: The URL of the master node (e.g. `spark://23.195.26.187:7077`).
* `<deploy-mode>`: Whether to launch the driver inside the cluster (`cluster`) or in an external client process (`client`).
* `<application-jar>`: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an `hdfs://` path or a `file://` path that is present on all nodes.
* `[application-arguments]`: Space-separated arguments passed to the main method of `<main-class>`, if any.

To enumerate all options available to `spark-submit` run it with the `--help` flag. Here are a few
examples of common options:

{% highlight bash %}
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster mode (use --master yarn-client for client mode)
HADOOP_CONF_DIR=XX ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
{% endhighlight %}

### Loading Configurations from a File

The `spark-submit` script can load default [Spark configuration values](configuration.html) from a
properties file and pass them on to your application. By default it will read configuration options
from `conf/spark-defaults.conf`. For more detail, see the section on
[loading default configurations](configuration.html#loading-default-configurations).

Loading default Spark configurations this way can obviate the need for certain flags to
`spark-submit`. For instance, if the `spark.master` property is set, you can safely omit the
`--master` flag from `spark-submit`. In general, configuration values explicitly set on a
`SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the
defaults file.
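
For example, here is a minimal Scala sketch of that precedence (the property value is illustrative);
the explicit `set` call below would win over both an `--executor-memory` flag and an entry in
`conf/spark-defaults.conf`:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("PrecedenceExample")
  .set("spark.executor.memory", "4g")  // highest precedence: set directly on SparkConf

val sc = new SparkContext(conf)
{% endhighlight %}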

If you are ever unclear where configuration options are coming from, you can print out fine-grained
debugging information by running `spark-submit` with the `--verbose` option.

### Advanced Dependency Management
When using `spark-submit`, the application jar along with any jars included with the `--jars` option
will be automatically transferred to the cluster. Spark uses the following URL schemes to allow
different strategies for disseminating jars:

- **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
every executor pulls the file from the driver HTTP server.
- **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
- **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node. This
means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
or shared via NFS, GlusterFS, etc.
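
The same schemes can also be used when adding dependencies programmatically through
`SparkContext.addJar`; a brief sketch (the paths below are hypothetical):

{% highlight scala %}
sc.addJar("/opt/libs/my-udfs.jar")                  // file: served by the driver's HTTP file server
sc.addJar("hdfs://namenode:8020/libs/common.jar")   // hdfs: pulled down by each executor
sc.addJar("local:/opt/libs/native-deps.jar")        // local: must already exist on every worker node
{% endhighlight %}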

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.

For Python, the equivalent `--py-files` option can be used to distribute `.egg` and `.zip` libraries
to executors.
# Submitting Applications

Applications can be submitted to a cluster of any type using the `spark-submit` script.
The [application submission guide](submitting-applications.html) describes how to do this.

# Monitoring

2 changes: 1 addition & 1 deletion docs/graphx-programming-guide.md
@@ -86,7 +86,7 @@ support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and
[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
explore the new GraphX API and comment on issues that may complicate the transition from Bagel.

## Upgrade Guide from Spark 0.9.1
## Migrating from Spark 0.9.1

GraphX in Spark {{site.SPARK_VERSION}} contains one user-facing interface change from Spark 0.9.1. [`EdgeRDD`][EdgeRDD] may now store adjacent vertex attributes to construct the triplets, so it has gained a type parameter. The edges of a graph of type `Graph[VD, ED]` are of type `EdgeRDD[ED, VD]` rather than `EdgeRDD[ED]`.

2 changes: 1 addition & 1 deletion docs/hadoop-third-party-distributions.md
@@ -1,6 +1,6 @@
---
layout: global
title: Running with Cloudera and HortonWorks
title: Third-Party Hadoop Distributions
---

Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
2 changes: 1 addition & 1 deletion docs/index.md
@@ -29,7 +29,7 @@ Spark comes with several sample programs. Scala, Java and Python examples are i
`examples/src/main` directory. To run one of the Java or Scala sample programs, use
`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
invokes the more general
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) for
[`spark-submit` script](submitting-applications.html) for
launching applications). For example,

./bin/run-example SparkPi 10
67 changes: 22 additions & 45 deletions docs/programming-guide.md
@@ -122,12 +122,6 @@ val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

<div data-lang="java" markdown="1">
@@ -141,12 +135,6 @@ SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

<div data-lang="python" markdown="1">
@@ -160,16 +148,19 @@ conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
{% endhighlight %}

The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master url in your application.

</div>

</div>

The `appName` parameter is a name for your application to show on the cluster UI.
`master` is a [Spark, Mesos or YARN cluster URL](submitting-applications.html#master-urls),
or a special "local" string to run in local mode.
In practice, when running on a cluster, you will not want to hardcode `master` in the program,
but rather [launch the application with `spark-submit`](submitting-applications.html) and
receive the master URL there. However, for local testing and unit tests, you can pass "local" to run Spark
in-process.
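
For example, a minimal sketch of both patterns (the application name is a placeholder):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// For cluster use: leave the master unset here and supply it via
// `spark-submit --master ...` or conf/spark-defaults.conf.
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)

// For local testing, hardcoding a local master is fine:
val testConf = new SparkConf().setAppName("MyAppTest").setMaster("local[2]")
{% endhighlight %}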


## Using the Shell

<div class="codetabs">
@@ -193,7 +184,7 @@ $ ./bin/spark-shell --master local[4] --jars code.jar
{% endhighlight %}

For a complete list of options, run `spark-shell --help`. Behind the scenes,
`spark-shell` invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit).
`spark-shell` invokes the more general [`spark-submit` script](submitting-applications.html).

</div>

@@ -216,7 +207,7 @@ $ ./bin/pyspark --master local[4] --py-files code.py
{% endhighlight %}

For a complete list of options, run `pyspark --help`. Behind the scenes,
`pyspark` invokes the more general [Spark submit script](cluster-overview.html#launching-applications-with-spark-submit).
`pyspark` invokes the more general [`spark-submit` script](submitting-applications.html).

It is also possible to launch the PySpark shell in [IPython](http://ipython.org), the
enhanced Python interpreter. PySpark works with IPython 1.0.0 and later. To
@@ -1221,33 +1212,19 @@ vecAccum = sc.accumulator(Vector(...))(VectorAccumulatorParam())

</div>


# Deploying to a Cluster

### Master URLs

The master URL passed to Spark can be in one of the following formats:
The [application submission guide](submitting-applications.html) describes how to submit applications to a cluster.
In short, once you package your application into a JAR (for Java/Scala) or a set of `.py` or `.zip` files (for Python),
the `bin/spark-submit` script lets you submit it to any supported cluster manager.

<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The port must be whichever one your cluster is configured to use, which is 5050 by default.
Or, for a Mesos cluster using ZooKeeper, use mesos://zk://....
</td></tr>
<tr><td> yarn-client </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
client mode. The cluster location will be found based on the HADOOP_CONF_DIR variable.
</td></tr>
<tr><td> yarn-cluster </td><td> Connect to a <a href="running-on-yarn.html"> YARN </a> cluster in
cluster mode. The cluster location will be found based on HADOOP_CONF_DIR.
</td></tr>
</table>
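
For illustration, these are the kinds of values you might pass to `SparkConf.setMaster` (the host
names, ports and ZooKeeper paths below are made up):

{% highlight scala %}
new SparkConf().setMaster("local[*]")                     // use all local cores
new SparkConf().setMaster("spark://host1:7077")           // standalone cluster master
new SparkConf().setMaster("mesos://zk://zk1:2181/mesos")  // Mesos cluster via ZooKeeper
{% endhighlight %}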
# Unit Testing

Spark is friendly to unit testing with any popular unit test framework.
Simply create a `SparkContext` in your test with the master URL set to `local`, run your operations,
and then call `SparkContext.stop()` to tear it down.
Make sure you stop the context within a `finally` block or the test framework's `tearDown` method,
as Spark does not support two contexts running concurrently in the same program.
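
A minimal sketch of this pattern in Scala (the computation and assertion are illustrative):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

def testWordCount(): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  try {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2)
  } finally {
    sc.stop()  // always stop; Spark does not allow two concurrent contexts in one program
  }
}
{% endhighlight %}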

# Migrating from pre-1.0 Versions of Spark

@@ -1288,8 +1265,8 @@ have changed from returning (key, list of values) pairs to (key, iterable of val

</div>

Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x)
and [MLlib](mllib-guide.html#migration-guide).
Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x),
[MLlib](mllib-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).


# Where to Go from Here
19 changes: 16 additions & 3 deletions docs/quick-start.md
@@ -442,6 +442,19 @@ Lines with a: 46, Lines with b: 23
# Where to Go from Here
Congratulations on running your first Spark application!

* For an in-depth overview of the API see "Programming Guides" menu section.
* For running applications on a cluster head to the [deployment overview](cluster-overview.html).
* For configuration options available to Spark applications see the [configuration page](configuration.html).
* For an in-depth overview of the API, start with the [Spark programming guide](programming-guide.html),
or see "Programming Guides" menu for other components.
* For running applications on a cluster, head to the [deployment overview](cluster-overview.html).
* Finally, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
[Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
You can run them as follows:

{% highlight bash %}
# For Scala and Java, use run-example:
./bin/run-example SparkPi

# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
{% endhighlight %}
5 changes: 4 additions & 1 deletion docs/running-on-mesos.md
@@ -116,7 +116,7 @@ The driver also needs some configuration in `spark-env.sh` to interact properly
2. Also set `spark.executor.uri` to `<URL of spark-{{site.SPARK_VERSION}}.tar.gz>`.

Now when starting a Spark application against the cluster, pass a `mesos://`
or `zk://` URL as the master when creating a `SparkContext`. For example:
URL as the master when creating a `SparkContext`. For example:

{% highlight scala %}
val conf = new SparkConf()
@@ -126,6 +126,9 @@ val conf = new SparkConf()
val sc = new SparkContext(conf)
{% endhighlight %}

(You can also use [`spark-submit`](submitting-applications.html) and configure `spark.executor.uri`
in the [conf/spark-defaults.conf](configuration.html#loading-default-configurations) file.)

When running a shell, the `spark.executor.uri` parameter is inherited from `SPARK_EXECUTOR_URI`, so
it does not need to be redundantly passed in as a system property.

2 changes: 1 addition & 1 deletion docs/spark-standalone.md
@@ -237,7 +237,7 @@ You can also pass an option `--cores <numCores>` to control the number of cores

Spark supports two deploy modes: applications may run with the driver inside the client process or
entirely inside the cluster. The
[Spark submit script](cluster-overview.html#launching-applications-with-spark-submit) provides the
[`spark-submit` script](submitting-applications.html) provides the
most straightforward way to submit a compiled Spark application to the cluster in either deploy
mode.

