diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index e231d3773b7cd..43da6a570a9c1 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -48,12 +48,13 @@ all of which are presented in this guide.
 You will find tabs throughout this guide that let you choose between code snippets of different languages.
 
-**Note:** *Python API has been introduced in Spark 1.2. It has all the DStream transformations
-and almost all the output operations available in Scala and Java interfaces.
+**Note:** The Python API for Spark Streaming was introduced in Spark 1.2. It has all the DStream
+transformations and almost all the output operations available in the Scala and Java interfaces.
 However, it has only support for basic sources like text files and text data over sockets.
 API for creating more sources like Kafka, and Flume will be available in future.
 Further information about available features in Python API are mentioned throughout this
-document; look out for the tag* "**Note on Python API**".
+document; look out for the tag
+<span class="badge" style="background-color: grey">Python API</span>.
 
 ***************************************************************************************************
 
@@ -588,14 +589,13 @@ These operations are discussed in detail in later sections.
 ## Input DStreams and Receivers
 
 Input DStreams are DStreams representing the stream of input data received from streaming
 sources. In the [quick example](#a-quick-example), `lines` was an input DStream as it represented
-the stream of data received from the netcat server.
+the stream of data received from the netcat server. Every input DStream
+(except file stream, discussed later in this section) is associated with a **Receiver**
+([Scala doc](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver),
+[Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)) object which receives the
+data from a source and stores it in Spark's memory for processing.
 
-Every input DStream (except file stream, discussed later) is associated with a **Receiver**
-([Scala](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver),
-[Java](api/java/org/apache/spark/streaming/receiver/Receiver.html)) object which receives the
-data from a source and stores it in Spark's memory for processing. There are
-
-Spark Streaming has two categories of streaming sources.
+Spark Streaming provides two categories of built-in streaming sources.
 
 - *Basic sources*: Sources directly available in the StreamingContext API.
   Example: file systems, socket connections, and Akka actors.
@@ -631,6 +631,7 @@ as well as, to run the receiver(s).
 
 ### Basic Sources
 {:.no_toc}
+
 We have already taken a look at the `ssc.socketTextStream(...)` in the
 [quick example](#a-quick-example) which creates a DStream from text data received over a TCP
 socket connection. Besides sockets, the StreamingContext API provides
@@ -659,15 +660,14 @@ methods for creating DStreams from files and Akka actors as input sources.
   For simple text files, there is an easier method `streamingContext.textFileStream(dataDirectory)`.
   And file streams do not require running a receiver, hence does not require allocating cores.
 
-  **Note on Python API:** *As of Spark 1.2, `fileStream` is not available in the Python API, only
-  `textFileStream` is available*.
+  <span class="badge" style="background-color: grey">Python API</span> As of Spark 1.2,
+  `fileStream` is not available in the Python API, only `textFileStream` is available.
 
 - **Streams based on Custom Actors:** DStreams can be created with data streams received through
   Akka actors by using `streamingContext.actorStream(actorProps, actor-name)`. See the [Custom Receiver
-  Guide](streaming-custom-receivers.html#implementing-and-using-a-custom-actor-based-receiver) for
-  more details.
+  Guide](streaming-custom-receivers.html) for more details.
 
-  *Note on Python API:** Since actors are available only in the Java and Scala
+  <span class="badge" style="background-color: grey">Python API</span> Since actors are available only in the Java and Scala
   libraries, `actorStream` is not available in the Python API.
 
 - **Queue of RDDs as a Stream:** For testing a Spark Streaming application with test data, one can
   also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`.
   Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed
   like a stream.
@@ -680,17 +680,22 @@ for Java, and [StreamingContext].
 
 ### Advanced Sources
 {:.no_toc}
 
-**Note on Python API:** *As of Spark 1.2, these sources are not available in the Python API.*
+<span class="badge" style="background-color: grey">Python API</span> As of Spark 1.2,
+these sources are not available in the Python API.
 
 This category of sources require interfacing with external non-Spark libraries, some of them with
 complex dependencies (e.g., Kafka and Flume). Hence, to minimize issues related to version conflicts
 of dependencies, the functionality to create DStreams from these sources have been moved to separate
-libraries, that can be [linked to](#linking) explicitly as necessary. For example, if you want to
+libraries that can be [linked to](#linking) explicitly when necessary. For example, if you want to
 create a DStream using data from Twitter's stream of tweets, you have to do the following.
 
-1. *Linking*: Add the artifact `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` to the SBT/Maven project dependencies.
-1. *Programming*: Import the `TwitterUtils` class and create a DStream with `TwitterUtils.createStream` as shown below.
-1. *Deploying*: Generate an uber JAR with all the dependencies (including the dependency `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` and its transitive dependencies) and then deploy the application. This is further explained in the [Deploying section](#deploying-applications).
+1. *Linking*: Add the artifact `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` to the
+   SBT/Maven project dependencies.
+1. *Programming*: Import the `TwitterUtils` class and create a DStream with
+   `TwitterUtils.createStream` as shown below.
+1. *Deploying*: Generate an uber JAR with all the dependencies (including the dependency
+   `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` and its transitive dependencies) and
+   then deploy the application. This is further explained in the [Deploying section](#deploying-applications).
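
For the basic sources discussed in the hunks above (`socketTextStream` and `textFileStream`), a minimal Scala driver might look like the sketch below. It is illustrative only and not part of this patch; the host, port, and directory path are placeholder assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BasicSourcesExample {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BasicSourcesExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Receiver-based basic source: text lines arriving over a TCP socket
    // (host and port are placeholders).
    val socketLines = ssc.socketTextStream("localhost", 9999)

    // File-based basic source: new files moved into the directory are read as text;
    // no receiver runs, so no extra core is consumed (the path is a placeholder).
    val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/logs/")

    // Count the records arriving from both sources in each batch.
    socketLines.union(fileLines).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```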
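
Likewise, for the *Programming* step of the advanced Twitter source, a minimal sketch of the `TwitterUtils.createStream` usage referred to as "shown below" might look like this. It assumes the `spark-streaming-twitter_{{site.SCALA_BINARY_VERSION}}` artifact (and its transitive `twitter4j` dependencies) has been added per the *Linking* step, e.g. something like `libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.2.0"` in SBT, and the OAuth values are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter._

object TwitterSourceExample {
  def main(args: Array[String]): Unit = {
    // twitter4j reads OAuth credentials from these system properties
    // (the values here are placeholders).
    System.setProperty("twitter4j.oauth.consumerKey", "YOUR_CONSUMER_KEY")
    System.setProperty("twitter4j.oauth.consumerSecret", "YOUR_CONSUMER_SECRET")
    System.setProperty("twitter4j.oauth.accessToken", "YOUR_ACCESS_TOKEN")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "YOUR_ACCESS_TOKEN_SECRET")

    val conf = new SparkConf().setMaster("local[2]").setAppName("TwitterSourceExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // None = use the default OAuth authorization built from the properties above.
    val tweets = TwitterUtils.createStream(ssc, None)
    tweets.map(_.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```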