
[SPARK-3954][Streaming] Optimization to FileInputDStream #2811

Closed
wants to merge 5 commits

Conversation

@surq surq commented Oct 15, 2014

When converting files to RDDs, the Spark source loops over the files sequence three times:
1. files.map(...)
2. files.zip(fileRDDs)
3. files-size.foreach
This can be very time consuming when there are lots of files, so I made the following correction:
3 loops over the files sequence => only one loop
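
For context, a minimal sketch of the change in FileInputDStream.filesToRDD, simplified from the diff further down (the log message is shortened to its first sentence):

// Before: three passes over the files sequence
val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, V, F](file))
files.zip(fileRDDs).foreach { case (file, rdd) =>
  if (rdd.partitions.size == 0) logError("File " + file + " has no data in it.")
}
new UnionRDD(context.sparkContext, fileRDDs)

// After: a single pass that creates and validates each RDD as it goes
val fileRDDs = for (file <- files; rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)) yield {
  if (rdd.partitions.size == 0) logError("File " + file + " has no data in it.")
  rdd
}
new UnionRDD(context.sparkContext, fileRDDs)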

@AmplabJenkins

Can one of the admins verify this patch?

@@ -27,6 +27,7 @@ import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.UnionRDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.util.TimeStampedHashMap
import scala.collection.mutable.ArrayBuffer
Contributor

Does your modification use ArrayBuffer? It seems it is not used.

@jerryshao
Contributor

The improvement looks good to me.

@jerryshao
Contributor

Besides, would you mind creating a related JIRA and changing the title like other PRs?

@surq surq changed the title from "promote the speed of convert files to RDDS" to "[SPARK-3954][Streaming] promote the speed of convert files to RDDS" Oct 15, 2014
@AmplabJenkins

Can one of the admins verify this patch?

@surq
Author

surq commented Oct 23, 2014

Has anyone taken notice of this patch?

@surq
Author

surq commented Oct 28, 2014

@jerryshao: Is this an inessential patch? Why has no committer merged it?

@jerryshao
Contributor

Maybe they are quite busy; let me ping @tdas.

@surq
Author

surq commented Oct 28, 2014

@tdas
This patch was proposed some days ago. If you have time, please take a look at it.
@jerryshao
Thanks for your kind help.

@@ -120,14 +120,14 @@ class FileInputDStream[K: ClassTag, V: ClassTag, F <: NewInputFormat[K,V] : Clas

/** Generate one RDD from an array of files */
private def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, V, F](file))
files.zip(fileRDDs).foreach { case (file, rdd) => {
val fileRDDs = for (file <- files; rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)) yield {
Contributor

Exceeds 100 columns.
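
For illustration, one way the new for...yield line could be wrapped to stay under 100 columns (a sketch only, not necessarily the wrapping used in the follow-up commit):

val fileRDDs = for (
  file <- files;
  rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)
) yield {
  if (rdd.partitions.size == 0) {
    logError("File " + file + " has no data in it.")
  }
  rdd
}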

@rxin
Contributor

rxin commented Nov 2, 2014

Does this actually improve performance?

@tdas
Contributor

tdas commented Nov 2, 2014

I don't see any reference to file-size.foreach in the patch. Which line are you referring to? Were you dealing with so many files that this proved to be a problem? Could you post some benchmark numbers on the improvement in speed?

@tianyi
Contributor

tianyi commented Nov 4, 2014

I think this PR concentrates more on logical optimization than on speed. Spark used three iterations to get the fileRDD list, which is not necessary.

@surq surq changed the title from "[SPARK-3954][Streaming] promote the speed of convert files to RDDS" to "[SPARK-3954][Streaming] source code optimization" Nov 4, 2014
@surq
Author

surq commented Nov 4, 2014

@tdas
After running the benchmark, the time difference before and after the correction is actually very small. From a code-elegance point of view, I still think this patch is worthwhile, so the title has been changed to [source code optimization].

@surq
Author

surq commented Nov 4, 2014

Benchmark:
10000 files, run 10 times.

import org.apache.spark.Logging
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import org.apache.hadoop.mapreduce.{ InputFormat => NewInputFormat }
import org.apache.spark.rdd.UnionRDD
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import scala.sys.process._
import java.io.File
import java.util.Calendar
import java.io.FileWriter

object SPARK_3954Bench {
  // directory for the newly created test files
  val newFilePath = "/home/hadoop/test/bid/bids"
  // source file to copy
  val sourceCopyFile = "/home/hadoop/test/bid/web415-bid-k1-.1414554678231"
  val resultFile = "/home/hadoop/test/bid/result.txt"
  var newFileList: Seq[String] = _
  // writer for the test result output
  val fw = new FileWriter(resultFile)

  def main(args: Array[String]): Unit = {

    createFiles(sourceCopyFile, newFilePath, 10000)
    newFileList = getNewFileList(newFilePath)
    val ssc = new StreamingContext("local[2]", "BenchWork_test", Seconds(20), System.getenv("SPARK_HOME"))

    //----Choose one of the following test cases to test----------
    0 until 10 foreach benchMarkTest(ssc, "filesToRDD", true)
    // 0 until 10 foreach benchMarkTest(ssc, "filesToRDD_SPARK_3954", true)
    // 0 until 10 foreach benchMarkTest(ssc, "threeLoops", true)
    // 0 until 10 foreach benchMarkTest(ssc, "oneLoop", true)
  }

  /**
   * bench mark test
   * @param ssc: StreamingContext
   */
  def benchMarkTest(ssc: StreamingContext, funName: String, writeFile_flg: Boolean) = (count: Int) => {

    val bt = new BenchTest[LongWritable, Text, TextInputFormat](ssc)
    var streamingStartTime = 0L
    var streamingEndTime = 0L
    funName match {
      case "filesToRDD_SPARK_3954" => {
        streamingStartTime = Calendar.getInstance().getTimeInMillis()
        bt.filesToRDD_SPARK_3954(newFileList)
        streamingEndTime = Calendar.getInstance().getTimeInMillis()
      }
      case "filesToRDD" => {
        streamingStartTime = Calendar.getInstance().getTimeInMillis()
        bt.filesToRDD(newFileList)
        streamingEndTime = Calendar.getInstance().getTimeInMillis()
      }
      case "threeLoops" => {
        streamingStartTime = Calendar.getInstance().getTimeInMillis()
        bt.threeLoops(newFileList)
        streamingEndTime = Calendar.getInstance().getTimeInMillis()
      }
      case "oneLoop" => {
        streamingStartTime = Calendar.getInstance().getTimeInMillis()
        bt.oneLoop(newFileList)
        streamingEndTime = Calendar.getInstance().getTimeInMillis()
      }
    }
    outPint(funName, streamingStartTime, streamingEndTime, writeFile_flg)
  }

  /**
   * Copy the source file `num` times into `path`.
   * @param sourceFile: file to be copied.
   * @param path: directory for the newly created files.
   */
  def createFiles(sourceFile: String, path: String, num: Int) = 0 until num foreach (f => {
    (("cat " + sourceFile) #> new File(path + "/copy" + f)).!
  })

  /**
   * List the created files under `path`.
   * @param path: directory containing the newly created files.
   */
  def getNewFileList(path: String) = {
    val dir = new File(path)
    val list = for (file <- dir.listFiles) yield file.getAbsoluteFile().toString()
    list.toSeq
  }

  /**
   * print output result
   */
  def outPint(funName: String, startTime: Long, endTime: Long, writeFile_flg: Boolean) = {
    val contents = "Test function name:[" + funName + "] time consuming:" +
      (endTime - startTime) + "ms (" + startTime + "~" + endTime + ")"
    println(contents)
    if (writeFile_flg) {
      fw.write(contents + System.getProperty("line.separator"))
      fw.flush
    }
  }
}

class BenchTest[K: ClassTag, V: ClassTag, F <: NewInputFormat[K, V]: ClassTag](context: StreamingContext) extends Logging {
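  // filesToRDD and threeLoops traverse the files sequence three times (map, zip, foreach),
  // while filesToRDD_SPARK_3954 and oneLoop build and validate each entry in a single pass,
  // mirroring the patched FileInputDStream.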

  /**
   * patch SPARK-3954
   */
  /** Generate one RDD from an array of files */
  //  def filesToRDD_new(files: Seq[String]): RDD[(K, V)] = {
  def filesToRDD_SPARK_3954(files: Seq[String]) = {
    val fileRDDs = for (file <- files; rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)) yield {
      if (rdd.partitions.size == 0) {
        logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
          "files that have been \"moved\" to the directory assigned to the file stream. " +
          "Refer to the streaming programming guide for more details.")
      }
      rdd
    }
    new UnionRDD(context.sparkContext, fileRDDs)
  }

  /** Generate one RDD from an array of files */
  //  def filesToRDD(files: Seq[String]): RDD[(K, V)] = {
  def filesToRDD(files: Seq[String]) = {
    val fileRDDs = files.map(file => context.sparkContext.newAPIHadoopFile[K, V, F](file))
    files.zip(fileRDDs).foreach {
      case (file, rdd) => {
        if (rdd.partitions.size == 0) {
          logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
            "files that have been \"moved\" to the directory assigned to the file stream. " +
            "Refer to the streaming programming guide for more details.")
        }
      }
    }
    new UnionRDD(context.sparkContext, fileRDDs)
  }

  /**
   * three recursions Test.
   */
  def threeLoops(files: Seq[String]) = {

    val fileRDDs = files.map(file => file)
    files.zip(fileRDDs).foreach {
      case (file, rdd) => {
        if (rdd.size == 0) {
          logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
            "files that have been \"moved\" to the directory assigned to the file stream. " +
            "Refer to the streaming programming guide for more details.")
        }
      }
    }
  }

  /**
   * only one recursion Test.
   */
  def oneLoop(files: Seq[String]) = {
    val fileRDDs = for (file <- files; rdd = file) yield {
      if (rdd.size == 0) {
        logError("File " + file + " has no data in it. Spark Streaming can only ingest " +
          "files that have been \"moved\" to the directory assigned to the file stream. " +
          "Refer to the streaming programming guide for more details.")
      }
      rdd
    }
  }
}

@surq surq changed the title from "[SPARK-3954][Streaming] source code optimization" to "[SPARK-3954][Streaming] Optimization to FileInputDStream" Nov 5, 2014
@tdas
Contributor

tdas commented Nov 7, 2014

It's not a big deal, but it's a good code update nonetheless. However, I am not sure for...yield is used much in the code base. So a better approach would be to do the following:

val fileRDDs = files.map { file =>
  val rdd = context.sparkContext.newAPIHadoopFile[K, V, F](file)
  if (rdd.partitions.size == 0) { ... }
  rdd
}

Mind updating the code to use this style?

@tdas
Contributor

tdas commented Nov 11, 2014

Jenkins, this is ok to test.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23173 has started for PR 2811 at commit 321bbe8.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 11, 2014

Test build #23173 has finished for PR 2811 at commit 321bbe8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23173/

@tdas
Contributor

tdas commented Nov 11, 2014

Thanks, I have merged this!

@asfgit asfgit closed this in ce6ed2a Nov 11, 2014
asfgit pushed a commit that referenced this pull request Nov 11, 2014
When converting files to RDDs, the Spark source loops over the files sequence three times:
1. files.map(...)
2. files.zip(fileRDDs)
3. files-size.foreach
This can be very time consuming when there are lots of files, so I made the following correction:
3 loops over the files sequence => only one loop

Author: surq <[email protected]>

Closes #2811 from surq/SPARK-3954 and squashes the following commits:

321bbe8 [surq] updated the code style. The style from [for...yield] to [files.map(file=>{})]
88a2c20 [surq] Merge branch 'master' of https://github.com/apache/spark into SPARK-3954
178066f [surq] modify code's style. [Exceeds 100 columns]
626ef97 [surq] remove redundant import(ArrayBuffer)
739341f [surq] promote the speed of convert files to RDDS

(cherry picked from commit ce6ed2a)
Signed-off-by: Tathagata Das <[email protected]>