From 318d2c9384e48859ae1cda45069e9b334f3d9130 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Tue, 27 May 2014 01:04:10 -0700
Subject: [PATCH] tweaks

---
 docs/programming-guide.md | 124 ++++++++++++++++++++------------------
 1 file changed, 67 insertions(+), 57 deletions(-)

diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index f7e1ae05a1765..6391b3685f7af 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -275,9 +275,7 @@ We describe operations on distributed datasets later on.
 **Note:** *In this guide, we'll often use the concise Java 8 lambda syntax to specify Java functions, but
 in older versions of Java you can implement the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
-For example, for the `reduce` above, we could create a
-[Function2](api/java/org/apache/spark/api/java/function/Function2.html) that adds two numbers.
-We describe [writing functions in Java](#java-functions) in more detail below.*
+We describe [passing functions to Spark](#passing-functions-to-spark) in more detail below.*
@@ -409,7 +407,7 @@ By default, each transformed RDD may be recomputed each time you run an action o
-
+
 To illustrate RDD basics, consider the simple program below:
 
@@ -435,7 +433,71 @@ lineLengths.persist()
 
 which would cause it to be saved in memory after the first time it is computed.
 
-### Passing Functions in Scala
+</div>
+
+<div data-lang="java" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight java %}
+JavaRDD<String> lines = sc.textFile("data.txt");
+JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
+int totalLength = lineLengths.reduce((a, b) -> a + b);
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight java %}
+lineLengths.persist();
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+To illustrate RDD basics, consider the simple program below:
+
+{% highlight python %}
+lines = sc.textFile("data.txt")
+lineLengths = lines.map(lambda s: len(s))
+totalLength = lineLengths.reduce(lambda a, b: a + b)
+{% endhighlight %}
+
+The first line defines a base RDD from an external file. This dataset is not loaded in memory or
+otherwise acted on: `lines` is merely a pointer to the file.
+The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
+is *not* immediately computed, due to laziness.
+Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
+to run on separate machines, and each machine runs both its part of the map and a local reduction,
+returning only its answer to the driver program.
+
+If we also wanted to use `lineLengths` again later, we could add:
+
+{% highlight scala %}
+lineLengths.persist()
+{% endhighlight %}
+
+which would cause it to be saved in memory after the first time it is computed.
+
+</div>
+
+</div>
+
+### Passing Functions to Spark
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
 
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are two recommended ways to do this:
@@ -491,32 +553,6 @@ def doStuff(rdd: RDD[String]): RDD[String] = {
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight java %}
-JavaRDD<String> lines = sc.textFile("data.txt");
-JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
-int totalLength = lineLengths.reduce((a, b) -> a + b);
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight java %}
-lineLengths.persist();
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-### Passing Functions in Java
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 In Java, functions are represented by classes implementing the interfaces in the
 [org.apache.spark.api.java.function](api/java/org/apache/spark/api/java/function/package-summary.html) package.
@@ -563,32 +599,6 @@ for other languages.
-To illustrate RDD basics, consider the simple program below:
-
-{% highlight python %}
-lines = sc.textFile("data.txt")
-lineLengths = lines.map(lambda s: len(s))
-totalLength = lineLengths.reduce(lambda a, b: a + b)
-{% endhighlight %}
-
-The first line defines a base RDD from an external file. This dataset is not loaded in memory or
-otherwise acted on: `lines` is merely a pointer to the file.
-The second line defines `lineLengths` as the result of a `map` transformation. Again, `lineLengths`
-is *not* immediately computed, due to laziness.
-Finally, we run `reduce`, which is an action. At this point Spark breaks the computation into tasks
-to run on separate machines, and each machine runs both its part of the map and a local reduction,
-returning only its answer to the driver program.
-
-If we also wanted to use `lineLengths` again later, we could add:
-
-{% highlight scala %}
-lineLengths.persist()
-{% endhighlight %}
-
-which would cause it to be saved in memory after the first time it is computed.
-
-### Passing Functions in Python
-
 Spark's API relies heavily on passing functions in the driver program to run on the cluster.
 There are three recommended ways to do this:
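
For reference, the pre-Java-8 form that the note's removed example described (creating a [Function2](api/java/org/apache/spark/api/java/function/Function2.html) that adds two numbers) would look roughly like the following sketch. It is not part of the patch itself, and it assumes `lineLengths` is the `JavaRDD<Integer>` from the Java example above.

{% highlight java %}
// Java 8 lambda form, as used in the guide's example:
int totalLength = lineLengths.reduce((a, b) -> a + b);

// Equivalent pre-Java-8 form: implement Function2 explicitly.
// Requires: import org.apache.spark.api.java.function.Function2;
int totalLengthOldStyle = lineLengths.reduce(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) {
      return a + b;
    }
  });
{% endhighlight %}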