Commit

First pass at updating programming guide to support all languages, plus
other tweaks throughout

mateiz committed May 28, 2014
1 parent 3b6a876 commit a33d6fe
Showing 15 changed files with 60 additions and 497 deletions.
11 changes: 7 additions & 4 deletions docs/_layouts/global.html
@@ -9,6 +9,11 @@
<title>{{ page.title }} - Spark {{site.SPARK_VERSION_SHORT}} Documentation</title>
<meta name="description" content="">

{% if page.redirect %}
<meta http-equiv="refresh" content="0; url={{page.redirect}}">
<link rel="canonical" href="{{page.redirect}}" />
{% endif %}

<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
@@ -61,15 +66,13 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="scala-programming-guide.html">Spark in Scala</a></li>
<li><a href="java-programming-guide.html">Spark in Java</a></li>
<li><a href="python-programming-guide.html">Spark in Python</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">Spark SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
</ul>
</li>

2 changes: 1 addition & 1 deletion docs/bagel-programming-guide.md
@@ -21,7 +21,7 @@ To use Bagel in your program, add the following SBT or Maven dependency:

# Programming Model

Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
Bagel operates on a graph represented as a [distributed dataset](programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.

For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
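As a rough sketch of that model (hypothetical types, not the actual Bagel API — `PRVertex` and `PRMessage` are made up for illustration), a PageRank compute function has this general shape:

```scala
// Schematic only: a PageRank-style compute function that takes the current
// vertex state plus the messages received in the previous superstep, and
// returns the new state plus outgoing messages.
case class PRVertex(id: String, rank: Double, outEdges: Seq[String])
case class PRMessage(targetId: String, rankShare: Double)

def compute(self: PRVertex, msgs: Option[Seq[PRMessage]], superstep: Int)
    : (PRVertex, Seq[PRMessage]) = {
  // Sum the PageRank shares sent to this vertex during the previous superstep
  val newRank = msgs match {
    case Some(ms) => 0.15 + 0.85 * ms.map(_.rankShare).sum
    case None     => self.rank
  }
  // Send each out-neighbor an equal share of the new rank (stop after 10 supersteps)
  val outgoing =
    if (superstep < 10) self.outEdges.map(d => PRMessage(d, newRank / self.outEdges.size))
    else Seq.empty[PRMessage]
  (self.copy(rank = newRank), outgoing)
}
```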

2 changes: 1 addition & 1 deletion docs/css/bootstrap.min.css

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/graphx-programming-guide.md
@@ -690,7 +690,7 @@ class GraphOps[VD, ED] {

In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.**

[RDD Persistence]: scala-programming-guide.html#rdd-persistence
[RDD Persistence]: programming-guide.html#rdd-persistence
[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]

In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.**
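As a hedged sketch of the pattern described above (the `step` function and iteration count are placeholders, and the exact unpersist calls may vary slightly across GraphX versions):

```scala
import org.apache.spark.graphx.Graph

// Sketch only: iterative update with explicit caching and uncaching.
// `step` stands in for whatever per-iteration transformation is applied.
def iterate[VD, ED](initial: Graph[VD, ED], numIters: Int)
                   (step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = initial.cache()
  for (_ <- 1 to numIters) {
    val prev = g
    g = step(g).cache()      // cache the new graph before reusing it
    g.vertices.count()       // force materialization before dropping the old one
    prev.unpersistVertices(blocking = false)
    prev.edges.unpersist(blocking = false)
  }
  g
}
```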
18 changes: 9 additions & 9 deletions docs/index.md
@@ -4,18 +4,19 @@ title: Spark Overview
---

Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
It provides high-level APIs in Java, Scala and Python,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
contains Spark packages for many popular HDFS versions. If you'd like to build Spark from
scratch, visit the [building with Maven](building-with-maven.html) page.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is
to have `java` to installed on your system `PATH`, or the `JAVA_HOME` environment variable
pointing to a Java installation.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
locally on one machine -- all you need is to have `java` installed on your system `PATH`,
or the `JAVA_HOME` environment variable pointing to a Java installation.

For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}.
If you write applications in Scala, you will need to use a compatible Scala version
@@ -39,7 +40,7 @@ great way to learn the framework.
./bin/spark-shell --master local[2]

The `--master` option specifies the
[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
[master URL for a distributed cluster](programming-guide.html#master-urls), or `local` to run
locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing. For a full list of options, run Spark shell with the `--help` option.
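As a small, hedged illustration (the application name is a placeholder), the same master URL can also be set programmatically when constructing a `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting the master URL in code rather than via --master.
val conf = new SparkConf()
  .setAppName("MasterUrlExample")
  .setMaster("local[2]")   // or "local", "local[N]", or a cluster master URL
val sc = new SparkContext(conf)
```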

@@ -69,9 +70,8 @@ options for deployment:
**Programming guides:**

* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Programming Guide](programming-guide.html): a detailed overview of Spark concepts
in all supported languages (Scala, Java, Python)
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
10 changes: 5 additions & 5 deletions docs/java-programming-guide.md
@@ -5,7 +5,7 @@ title: Java Programming Guide

The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](scala-programming-guide.html) first; it should be
[Scala programming guide](programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.

@@ -80,16 +80,16 @@ package. Each interface has a single abstract method, `call()`.

## Storage Levels

RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
RDD [storage level](programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
define your own storage level, you can use StorageLevels.create(...).

# Other Features

The Java API supports other Spark features, including
[accumulators](scala-programming-guide.html#accumulators),
[broadcast variables](scala-programming-guide.html#broadcast-variables), and
[caching](scala-programming-guide.html#rdd-persistence).
[accumulators](programming-guide.html#accumulators),
[broadcast variables](programming-guide.html#broadcast-variables), and
[caching](programming-guide.html#rdd-persistence).

# Upgrading From Pre-1.0 Versions of Spark

2 changes: 1 addition & 1 deletion docs/mllib-optimization.md
@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniB
$\cdot n$`.

In each iteration, the sampling over the distributed dataset
([RDD](scala-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
computation of the sum of the partial results from each worker machine is performed by the
standard spark routines.
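As a hedged sketch (not MLlib's actual implementation), one such sampled gradient step over an RDD could be written like this, with `gradient` standing in for the per-example gradient function:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: one mini-batch stochastic gradient step. `data` holds
// (label, features) pairs; `gradient` computes a per-example gradient.
def miniBatchGradientSum(
    data: RDD[(Double, Array[Double])],
    weights: Array[Double],
    gradient: (Array[Double], (Double, Array[Double])) => Array[Double],
    miniBatchFraction: Double): Array[Double] = {
  data.sample(false, miniBatchFraction, 42)                       // the sampled subset S
      .map(point => gradient(weights, point))                     // partial gradients on workers
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })    // sum of the partial results
}
```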

2 changes: 1 addition & 1 deletion docs/python-programming-guide.md
@@ -6,7 +6,7 @@ title: Python Programming Guide

The Spark Python API (PySpark) exposes the Spark programming model to Python.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](scala-programming-guide.html) first; it should be
[Scala programming guide](programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Python.

18 changes: 9 additions & 9 deletions docs/quick-start.md
@@ -9,7 +9,7 @@ title: Quick Start
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
interactive shell (in Python or Scala),
then show how to write standalone applications in Java, Scala, and Python.
See the [programming guide](scala-programming-guide.html) for a more complete reference.
See the [programming guide](programming-guide.html) for a more complete reference.

To follow along with this guide, first download a packaged release of Spark from the
[Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,
@@ -35,7 +35,7 @@ scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
{% endhighlight %}

RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

{% highlight scala %}
scala> textFile.count() // Number of items in this RDD
@@ -45,7 +45,7 @@ scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
{% endhighlight %}

Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

{% highlight scala %}
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
@@ -70,7 +70,7 @@ Spark's primary abstraction is a distributed collection of items called a Resili
>>> textFile = sc.textFile("README.md")
{% endhighlight %}

RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

{% highlight python %}
>>> textFile.count() # Number of items in this RDD
@@ -80,7 +80,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values
u'# Apache Spark'
{% endhighlight %}

Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

{% highlight python %}
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
@@ -125,7 +125,7 @@ scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (w
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
{% endhighlight %}

Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

{% highlight scala %}
scala> wordCounts.collect()
@@ -162,7 +162,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
{% endhighlight %}

Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

{% highlight python %}
>>> wordCounts.collect()
@@ -192,7 +192,7 @@ res9: Long = 15
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
that these same functions can be used on very large data sets, even when they are striped across
tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

</div>
<div data-lang="python" markdown="1">
@@ -210,7 +210,7 @@ a cluster, as described in the [programming guide](scala-programming-guide.html#
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
that these same functions can be used on very large data sets, even when they are striped across
tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

</div>
</div>
2 changes: 1 addition & 1 deletion docs/running-on-mesos.md
@@ -103,7 +103,7 @@ the `make-distribution.sh` script included in a Spark source tarball/checkout.
## Using a Mesos Master URL

The Master URLs for Mesos are in the form `mesos://host:5050` for a single-master Mesos
cluster, or `zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.
cluster, or `mesos://zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.
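As a brief, hedged sketch (hostnames are placeholders), a driver might select one of these master URLs when building its `SparkConf`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick the URL form that matches your Mesos deployment.
val conf = new SparkConf()
  .setAppName("SparkOnMesos")
  .setMaster("mesos://host:5050")           // single Mesos master
  // .setMaster("mesos://zk://host:2181")   // multi-master cluster via ZooKeeper
val sc = new SparkContext(conf)
```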

The driver also needs some configuration in `spark-env.sh` to interact properly with Mesos:
