Updated based on Josh's comments, updated receiver reliability and deploying section, and also updated configuration.
tdas committed Dec 10, 2014
1 parent 17b99fb commit 195852c
Showing 2 changed files with 259 additions and 180 deletions.
125 changes: 67 additions & 58 deletions docs/configuration.md
@@ -23,8 +23,8 @@ application. These properties can be set directly on a
(e.g. master URL and application name), as well as arbitrary key-value pairs through the
`set()` method. For example, we could initialize an application with two threads as follows:

Note that we run with `local[2]`, meaning two threads, which represents "minimal" parallelism
and can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
@@ -35,7 +35,7 @@ val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may actually
require more than one to prevent any sort of starvation issues.

## Dynamically Loading Spark Properties
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For
@@ -48,8 +48,8 @@ val sc = new SparkContext(new SparkConf())

Then, you can supply configuration values at runtime:
{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}
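
To check which values actually reached the application, a small driver can read them back from an empty `SparkConf`. The following is a minimal sketch, assuming the values passed with `--conf` arrive as `spark.*` Java system properties that `new SparkConf()` loads; the object name `ShowRuntimeConf` and the helper `effective` are hypothetical and used only for illustration.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper object, not part of Spark.
object ShowRuntimeConf {
  // Returns the configured value for `key`, or `fallback` when it was not supplied.
  def effective(conf: SparkConf, key: String, fallback: String): String =
    conf.get(key, fallback)

  def main(args: Array[String]): Unit = {
    // An empty SparkConf loads any spark.* system properties by default.
    val conf = new SparkConf()
    println("spark.shuffle.spill = " + effective(conf, "spark.shuffle.spill", "<not set>"))
    // Print every key-value pair currently held by the configuration.
    println(conf.toDebugString)
  }
}
```

Launched with the `spark-submit` command above, this would be expected to report the supplied `spark.shuffle.spill` value; run without any `--conf` flags, the lookup falls back to the placeholder string.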

The Spark shell and [`spark-submit`](submitting-applications.html)
@@ -123,7 +123,7 @@ of the most common options to set are:
<td>
Limit of total size of serialized results of all partitions for each Spark action (e.g. collect).
Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size
is above this limit.
Having a high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
and memory overhead of objects in JVM). Setting a proper limit can protect the driver from
out-of-memory errors.
@@ -217,6 +217,45 @@ Apart from these, the following properties are also available, and may be useful
Set a special library path to use when launching executor JVMs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
Set the strategy for rolling executor logs. Rolling is disabled by default. It can
be set to "time" (time-based rolling) or "size" (size-based rolling). For "time",
use <code>spark.executor.logs.rolling.time.interval</code> to set the rolling interval.
For "size", use <code>spark.executor.logs.rolling.size.maxBytes</code> to set
the maximum file size for rolling.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.time.interval</code></td>
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
Rolling is disabled by default. Valid values are <code>daily</code>, <code>hourly</code>, <code>minutely</code> or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
<td>(none)</td>
<td>
Set the maximum file size at which the executor logs will be rolled over.
Rolling is disabled by default. The value is specified in bytes.
See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
</tr>
<tr>
<td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
<td>(none)</td>
<td>
Sets the number of most recent rolling log files that will be retained by the system.
Older log files will be deleted. Disabled by default.
</td>
</tr>
<tr>
<td><code>spark.files.userClassPathFirst</code></td>
<td>false</td>
@@ -250,10 +289,11 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.python.profile.dump</code></td>
<td>(none)</td>
<td>
The directory which is used to dump the profile result before the driver exits.
The results will be dumped as a separate file for each RDD. They can be loaded
by pstats.Stats(). If this is specified, the profile result will not be displayed
automatically.
</td>
</tr>
<tr>
<td><code>spark.python.worker.reuse</code></td>
@@ -269,8 +309,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
<td>(none)</td>
<td>
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
process. The user can specify multiple of these to set multiple environment variables.
</td>
</tr>
<tr>
@@ -475,9 +515,9 @@ Apart from these, the following properties are also available, and may be useful
<td>
The codec used to compress internal data such as RDD partitions, broadcast variables and
shuffle outputs. By default, Spark provides three codecs: <code>lz4</code>, <code>lzf</code>,
and <code>snappy</code>. You can also use fully qualified class names to specify the codec,
e.g.
<code>org.apache.spark.io.LZ4CompressionCodec</code>,
<code>org.apache.spark.io.LZFCompressionCodec</code>,
and <code>org.apache.spark.io.SnappyCompressionCodec</code>.
</td>
@@ -945,7 +985,7 @@ Apart from these, the following properties are also available, and may be useful
(resources are executors in yarn mode, CPU cores in standalone mode)
to wait for before scheduling begins. Specified as a double between 0.0 and 1.0.
Regardless of whether the minimum ratio of resources has been reached,
the maximum amount of time it will wait before scheduling begins is controlled by config
<code>spark.scheduler.maxRegisteredResourcesWaitingTime</code>.
</td>
</tr>
@@ -954,7 +994,7 @@ Apart from these, the following properties are also available, and may be useful
<td>30000</td>
<td>
Maximum amount of time to wait for resources to register before scheduling begins
(in milliseconds).
</td>
</tr>
<tr>
@@ -1023,7 +1063,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Whether Spark ACLs should be enabled. If enabled, this checks to see if the user has
access permissions to view or modify the job. Note this requires the user to be known,
so if the user comes across as null no checks are done. Filters can be used with the UI
to authenticate and set the user.
</td>
@@ -1062,19 +1102,27 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.streaming.blockInterval</code></td>
<td>200</td>
<td>
Interval (milliseconds) at which data received by Spark Streaming receivers is chunked
into blocks of data before storing them in Spark. The recommended minimum is 50 ms.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.maxRate</code></td>
<td>infinite</td>
<td>
Maximum number of records per second at which each receiver will receive data.
Effectively, each stream will consume at most this number of records per second.
Setting this configuration to 0 or a negative number will put no limit on the rate.
</td>
</tr>
<tr>
<td><code>spark.streaming.receiver.writeAheadLogs.enable</code></td>
<td>false</td>
<td>
Enable write ahead logs for receivers. All the input data received through receivers
will be saved to write ahead logs that will allow it to be recovered after driver failures.
</td>
</tr>
<tr>
<td><code>spark.streaming.unpersist</code></td>
<td>true</td>
@@ -1086,45 +1134,6 @@ Apart from these, the following properties are also available, and may be useful
higher memory usage in Spark.
</td>
</tr>
</table>
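
As an illustration of how the executor log rolling and Spark Streaming properties listed in the table might be set programmatically, here is a minimal sketch using `SparkConf.set()`. The property names come from the table; every value, the application name, and the object name `TableSettingsSketch` are assumptions chosen for the example, not recommendations.

```scala
import org.apache.spark.SparkConf

// Hypothetical example object, not part of Spark.
object TableSettingsSketch {
  // Builds a SparkConf with illustrative values for properties from the table above.
  def build(): SparkConf = new SparkConf()
    .setAppName("ConfigSketch") // hypothetical application name
    // Roll executor logs by time, once per hour, keeping the 24 most recent files.
    .set("spark.executor.logs.rolling.strategy", "time")
    .set("spark.executor.logs.rolling.time.interval", "hourly")
    .set("spark.executor.logs.rolling.maxRetainedFiles", "24")
    // Streaming: 200 ms block interval (the listed default), capped receiver rate,
    // and write ahead logs enabled for receivers.
    .set("spark.streaming.blockInterval", "200")
    .set("spark.streaming.receiver.maxRate", "10000")
    .set("spark.streaming.receiver.writeAheadLogs.enable", "true")

  def main(args: Array[String]): Unit =
    // Print the resulting key-value pairs for inspection.
    println(build().toDebugString)
}
```

The master URL is deliberately left unset here so that it can be supplied at launch time, for example through `spark-submit --master`.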

#### Cluster Managers