Commit

First pass at updating programming guide to support all languages, plus
other tweaks throughout

mateiz committed May 28, 2014
1 parent 3b6a876 commit a33d6fe
Showing 15 changed files with 60 additions and 497 deletions.
11 changes: 7 additions & 4 deletions docs/_layouts/global.html
@@ -9,6 +9,11 @@
<title>{{ page.title }} - Spark {{site.SPARK_VERSION_SHORT}} Documentation</title>
<meta name="description" content="">

{% if page.redirect %}
<meta http-equiv="refresh" content="0; url={{page.redirect}}">
<link rel="canonical" href="{{page.redirect}}" />
{% endif %}

<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
@@ -61,15 +66,13 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="scala-programming-guide.html">Spark in Scala</a></li>
<li><a href="java-programming-guide.html">Spark in Java</a></li>
<li><a href="python-programming-guide.html">Spark in Python</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">Spark SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
</ul>
</li>

2 changes: 1 addition & 1 deletion docs/bagel-programming-guide.md
@@ -21,7 +21,7 @@ To use Bagel in your program, add the following SBT or Maven dependency:

# Programming Model

Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
Bagel operates on a graph represented as a [distributed dataset](programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.

For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
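As a rough sketch of that model (hypothetical types, not the actual Bagel API — `PRVertex` and `PRMessage` are made up for illustration), a PageRank compute function has this general shape:

```scala
// Schematic only: a PageRank-style compute function that takes the current
// vertex state plus the messages received in the previous superstep, and
// returns the new state plus outgoing messages.
case class PRVertex(id: String, rank: Double, outEdges: Seq[String])
case class PRMessage(targetId: String, rankShare: Double)

def compute(self: PRVertex, msgs: Option[Seq[PRMessage]], superstep: Int)
    : (PRVertex, Seq[PRMessage]) = {
  // Sum the PageRank shares sent to this vertex during the previous superstep
  val newRank = msgs match {
    case Some(ms) => 0.15 + 0.85 * ms.map(_.rankShare).sum
    case None     => self.rank
  }
  // Send each out-neighbor an equal share of the new rank (stop after 10 supersteps)
  val outgoing =
    if (superstep < 10) self.outEdges.map(d => PRMessage(d, newRank / self.outEdges.size))
    else Seq.empty[PRMessage]
  (self.copy(rank = newRank), outgoing)
}
```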

2 changes: 1 addition & 1 deletion docs/css/bootstrap.min.css

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/graphx-programming-guide.md
@@ -690,7 +690,7 @@ class GraphOps[VD, ED] {

In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.**

[RDD Persistence]: scala-programming-guide.html#rdd-persistence
[RDD Persistence]: programming-guide.html#rdd-persistence
[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]

In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.**
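As a hedged sketch of the pattern described above (the `step` function and iteration count are placeholders, and the exact unpersist calls may vary slightly across GraphX versions):

```scala
import org.apache.spark.graphx.Graph

// Sketch only: iterative update with explicit caching and uncaching.
// `step` stands in for whatever per-iteration transformation is applied.
def iterate[VD, ED](initial: Graph[VD, ED], numIters: Int)
                   (step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  var g = initial.cache()
  for (_ <- 1 to numIters) {
    val prev = g
    g = step(g).cache()      // cache the new graph before reusing it
    g.vertices.count()       // force materialization before dropping the old one
    prev.unpersistVertices(blocking = false)
    prev.edges.unpersist(blocking = false)
  }
  g
}
```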
18 changes: 9 additions & 9 deletions docs/index.md
@@ -4,18 +4,19 @@ title: Spark Overview
---

Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
It provides high-level APIs in Java, Scala and Python,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page
contains Spark packages for many popular HDFS versions. If you'd like to build Spark from
scratch, visit the [building with Maven](building-with-maven.html) page.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is
to have `java` to installed on your system `PATH`, or the `JAVA_HOME` environment variable
pointing to a Java installation.
Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
locally on one machine -- all you need is to have `java` installed on your system `PATH`,
or the `JAVA_HOME` environment variable pointing to a Java installation.

For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}.
If you write applications in Scala, you will need to use a compatible Scala version
@@ -39,7 +40,7 @@ great way to learn the framework.
./bin/spark-shell --master local[2]

The `--master` option specifies the
[master URL for a distributed cluster](scala-programming-guide.html#master-urls), or `local` to run
[master URL for a distributed cluster](programming-guide.html#master-urls), or `local` to run
locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing. For a full list of options, run Spark shell with the `--help` option.
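As a small, hedged illustration (the application name is a placeholder), the same master URL can also be set programmatically when constructing a `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting the master URL in code rather than via --master.
val conf = new SparkConf()
  .setAppName("MasterUrlExample")
  .setMaster("local[2]")   // or "local", "local[N]", or a cluster master URL
val sc = new SparkContext(conf)
```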

@@ -69,9 +70,8 @@ options for deployment:
**Programming guides:**

* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Programming Guide](programming-guide.html): a detailed overview of Spark concepts
in all supported languages (Scala, Java, Python)
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [Spark SQL](sql-programming-guide.html): Support for running relational queries on Spark
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
10 changes: 5 additions & 5 deletions docs/java-programming-guide.md
@@ -5,7 +5,7 @@ title: Java Programming Guide

The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](scala-programming-guide.html) first; it should be
[Scala programming guide](programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.

@@ -80,16 +80,16 @@ package. Each interface has a single abstract method, `call()`.

## Storage Levels

RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
RDD [storage level](programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
define your own storage level, you can use StorageLevels.create(...).

# Other Features

The Java API supports other Spark features, including
[accumulators](scala-programming-guide.html#accumulators),
[broadcast variables](scala-programming-guide.html#broadcast-variables), and
[caching](scala-programming-guide.html#rdd-persistence).
[accumulators](programming-guide.html#accumulators),
[broadcast variables](programming-guide.html#broadcast-variables), and
[caching](programming-guide.html#rdd-persistence).

# Upgrading From Pre-1.0 Versions of Spark

2 changes: 1 addition & 1 deletion docs/mllib-optimization.md
@@ -116,7 +116,7 @@ is a stochastic gradient. Here `$S$` is the sampled subset of size `$|S|=$ miniB
$\cdot n$`.

In each iteration, the sampling over the distributed dataset
([RDD](scala-programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
([RDD](programming-guide.html#resilient-distributed-datasets-rdds)), as well as the
computation of the sum of the partial results from each worker machine is performed by the
standard spark routines.
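As a hedged sketch (not MLlib's actual implementation), one such sampled gradient step over an RDD could be written like this, with `gradient` standing in for the per-example gradient function:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: one mini-batch stochastic gradient step. `data` holds
// (label, features) pairs; `gradient` computes a per-example gradient.
def miniBatchGradientSum(
    data: RDD[(Double, Array[Double])],
    weights: Array[Double],
    gradient: (Array[Double], (Double, Array[Double])) => Array[Double],
    miniBatchFraction: Double): Array[Double] = {
  data.sample(false, miniBatchFraction, 42)                       // the sampled subset S
      .map(point => gradient(weights, point))                     // partial gradients on workers
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })    // sum of the partial results
}
```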

2 changes: 1 addition & 1 deletion docs/python-programming-guide.md
@@ -6,7 +6,7 @@ title: Python Programming Guide

The Spark Python API (PySpark) exposes the Spark programming model to Python.
To learn the basics of Spark, we recommend reading through the
[Scala programming guide](scala-programming-guide.html) first; it should be
[Scala programming guide](programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Python.

18 changes: 9 additions & 9 deletions docs/quick-start.md
@@ -9,7 +9,7 @@ title: Quick Start
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark's
interactive shell (in Python or Scala),
then show how to write standalone applications in Java, Scala, and Python.
See the [programming guide](scala-programming-guide.html) for a more complete reference.
See the [programming guide](programming-guide.html) for a more complete reference.

To follow along with this guide, first download a packaged release of Spark from the
[Spark website](http://spark.apache.org/downloads.html). Since we won't be using HDFS,
@@ -35,7 +35,7 @@ scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
{% endhighlight %}

RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

{% highlight scala %}
scala> textFile.count() // Number of items in this RDD
@@ -45,7 +45,7 @@ scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
{% endhighlight %}

Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

{% highlight scala %}
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
@@ -70,7 +70,7 @@ Spark's primary abstraction is a distributed collection of items called a Resili
>>> textFile = sc.textFile("README.md")
{% endhighlight %}

RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
RDDs have _[actions](programming-guide.html#actions)_, which return values, and _[transformations](programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:

{% highlight python %}
>>> textFile.count() # Number of items in this RDD
@@ -80,7 +80,7 @@ RDDs have _[actions](scala-programming-guide.html#actions)_, which return values
u'# Apache Spark'
{% endhighlight %}

Now let's use a transformation. We will use the [`filter`](scala-programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.
Now let's use a transformation. We will use the [`filter`](programming-guide.html#transformations) transformation to return a new RDD with a subset of the items in the file.

{% highlight python %}
>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)
@@ -125,7 +125,7 @@ scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (w
wordCounts: spark.RDD[(String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
{% endhighlight %}

Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

{% highlight scala %}
scala> wordCounts.collect()
@@ -162,7 +162,7 @@ One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can i
>>> wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
{% endhighlight %}

Here, we combined the [`flatMap`](scala-programming-guide.html#transformations), [`map`](scala-programming-guide.html#transformations) and [`reduceByKey`](scala-programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](scala-programming-guide.html#actions) action:
Here, we combined the [`flatMap`](programming-guide.html#transformations), [`map`](programming-guide.html#transformations) and [`reduceByKey`](programming-guide.html#transformations) transformations to compute the per-word counts in the file as an RDD of (string, int) pairs. To collect the word counts in our shell, we can use the [`collect`](programming-guide.html#actions) action:

{% highlight python %}
>>> wordCounts.collect()
@@ -192,7 +192,7 @@ res9: Long = 15
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
that these same functions can be used on very large data sets, even when they are striped across
tens or hundreds of nodes. You can also do this interactively by connecting `bin/spark-shell` to
a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

</div>
<div data-lang="python" markdown="1">
@@ -210,7 +210,7 @@ a cluster, as described in the [programming guide](scala-programming-guide.html#
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
that these same functions can be used on very large data sets, even when they are striped across
tens or hundreds of nodes. You can also do this interactively by connecting `bin/pyspark` to
a cluster, as described in the [programming guide](scala-programming-guide.html#initializing-spark).
a cluster, as described in the [programming guide](programming-guide.html#initializing-spark).

</div>
</div>
2 changes: 1 addition & 1 deletion docs/running-on-mesos.md
@@ -103,7 +103,7 @@ the `make-distribution.sh` script included in a Spark source tarball/checkout.
## Using a Mesos Master URL

The Master URLs for Mesos are in the form `mesos://host:5050` for a single-master Mesos
cluster, or `zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.
cluster, or `mesos://zk://host:2181` for a multi-master Mesos cluster using ZooKeeper.
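As a brief, hedged sketch (hostnames are placeholders), a driver might select one of these master URLs when building its `SparkConf`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Pick the URL form that matches your Mesos deployment.
val conf = new SparkConf()
  .setAppName("SparkOnMesos")
  .setMaster("mesos://host:5050")           // single Mesos master
  // .setMaster("mesos://zk://host:2181")   // multi-master cluster via ZooKeeper
val sc = new SparkContext(conf)
```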

The driver also needs some configuration in `spark-env.sh` to interact properly with Mesos:
