[SPARK-2759][CORE] Generic Binary File Support in Spark #1658
Conversation
Updating to the latest Spark repository
Do you mind opening a JIRA issue on https://issues.apache.org/jira/browse/SPARK to track this? Also, I wonder if we should make the API just return an RDD of InputStreams. That way users can read directly from a stream and don't need to load the whole file into memory as a byte array. The only awkward thing is that calling cache() on an RDD of InputStreams wouldn't work, but hopefully this is obvious (and will be documented). Or if that doesn't sound good, we could return some objects that let you open a stream repeatedly (some kind of BinaryFile object with a stream method).
Jenkins, this is ok to test
*
* @note Small files are preferred; large files are also allowed, but may cause bad performance.
*/
def byteFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, Array[Byte])] = {
I'd call this binaryFiles. Also, please add it to JavaSparkContext, and ideally we'd have a way to add it to Python as well. That one will be trickier -- we probably need to read the file in chunks and pass them to Python. But I think it's important to design the API as part of this change.
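For reference, a minimal sketch of what such a Java-friendly wrapper could look like, assuming the byteFiles signature from the diff above (the wrapper object and its method name are illustrative, not the merged API):

```scala
import org.apache.spark.api.java.{JavaPairRDD, JavaSparkContext}

object BinaryFilesJavaApi {
  // Expose the Scala-side method (assumed here to be the byteFiles shown above)
  // as a JavaPairRDD so it can be called comfortably from Java.
  def binaryFiles(jsc: JavaSparkContext, path: String, minPartitions: Int): JavaPairRDD[String, Array[Byte]] =
    JavaPairRDD.fromRDD(jsc.sc.byteFiles(path, minPartitions))
}
```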
QA tests have started for PR 1658. This patch merges cleanly.
QA results for PR 1658:
…ion for more complicated readers (HDF5 perhaps), and renamed several of the functions and files to be more consistent. Also added parallel functions to the java api
QA tests have started for PR 1658. This patch merges cleanly.
QA results for PR 1658:
QA tests have started for PR 1658. This patch merges cleanly.
Thanks for the feedback. I have made the changes requested, created an issue (https://issues.apache.org/jira/browse/SPARK-2759), and added a dataStreamFiles method to both SparkContext and JavaSparkContext which returns the DataInputStream itself (I have a feeling this might create a few new issues with serialization, properly closing streams, or rerunning tasks, but I guess we'll see). My recommendation (as I have done in my own code) would be to use the abstract class. As for PySpark, my guess is that it would be easiest to create a library of StreamBasedRecordReader classes for common file types, since it is much less expensive to do IO at the Scala/Java level. Alternatively, a Spark function could copy the file into a local directory on demand and provide the local filename to Python.
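A rough usage sketch of the dataStreamFiles idea described above, assuming it returns an RDD of (fileName, DataInputStream) pairs (the exact signature in the patch may differ):

```scala
import java.io.DataInputStream
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Read only the first four bytes (e.g. a magic number) of each file in a folder.
def firstIntPerFile(sc: SparkContext, path: String): RDD[(String, Int)] = {
  val streams: RDD[(String, DataInputStream)] = sc.dataStreamFiles(path) // proposed API
  streams.map { case (name, in) =>
    try {
      (name, in.readInt())
    } finally {
      in.close() // streams are not closed automatically (see the @note discussed later)
    }
  }
}
```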
QA results for PR 1658:
@kmader @mateiz this looks really useful! I was about to submit a related PR for an InputFormat that reads and splits large flat binary files into records (of a specified length), rather than reading one file per record as here. We find this is the easiest way for users to bring large numerical data from existing NumPy / Matlab pipelines into Spark. Here's a gist. Would these be compatible? Perhaps, analogous to the text case, we could have both byteFile and wholeByteFiles?
@freeman-lab looks good, I will add it to this pull request if that's ok with you. I think my personal preference would be to keep binaryFiles for standard operations and fixedLengthBinaryFiles for other files, since many standard binary formats are not so easily partitionable, and trying to read tif, jpg, even hdf5 and raw under such conditions will be rather difficult to do correctly, whereas for text files line-by-line is a common partitioning. Perhaps there are other use cases that I am not familiar with that speak against this, though.
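For context, a local, single-machine sketch of the fixed-length-record splitting that the proposed fixedLengthBinaryFiles / freeman-lab's input format would do in a distributed way (the function name and the divisibility check here are illustrative):

```scala
import java.nio.file.{Files, Paths}

// Split one flat binary file (e.g. a raw dump written from NumPy or Matlab)
// into records of exactly recordLength bytes each.
def splitIntoRecords(path: String, recordLength: Int): Iterator[Array[Byte]] = {
  val bytes = Files.readAllBytes(Paths.get(path))
  require(bytes.length % recordLength == 0,
    s"file length ${bytes.length} is not a multiple of record length $recordLength")
  bytes.grouped(recordLength)
}
```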
…and added them to both the JavaSparkContext and the SparkContext as fixedLengthBinaryFile
QA tests have started for PR 1658. This patch DID NOT merge cleanly!
QA results for PR 1658:
* @param path Directory to the input data files
* @return An RDD of the raw data, RDD[Array[Byte]]
*/
def fixedLengthBinaryFiles(path: String): RDD[Array[Byte]] = {
This has been taken almost directly from https://github.com/freeman-lab/thunder/blob/master/scala/src/main/scala/thunder/util/Load.scala, without the extra formatting to load it as a list of doubles.
QA tests have started for PR 1658. This patch merges cleanly.
QA results for PR 1658:
Hey, sorry for taking a bit of time to get back to this (I've been looking at 1.1 stuff), but I have a few comments on the API:
*
* @param minPartitions A suggested minimum number of partitions for the input data.
*
* @note Care must be taken to close the files afterwards
It is a bit unfortunate that users have to close the streams by hand. If you want to get around this, you can create a custom RDD wrapping around the HadoopRDD, whose compute() method can add a cleanup hook to its TaskContext to close the stream. Take a look at TaskContext.addOnCompleteCallback().
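A hedged sketch of that suggestion, assuming the stream-returning RDD yields (fileName, DataInputStream) pairs; addOnCompleteCallback is the TaskContext method named above (later Spark versions replaced it with addTaskCompletionListener):

```scala
import java.io.DataInputStream
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps a stream-producing RDD and registers a cleanup hook on the TaskContext,
// so every stream opened by a task is closed when that task completes.
class AutoClosingStreamRDD(prev: RDD[(String, DataInputStream)])
  extends RDD[(String, DataInputStream)](prev) {

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(String, DataInputStream)] =
    prev.iterator(split, context).map { case (name, stream) =>
      context.addOnCompleteCallback(() => stream.close()) // close on task completion
      (name, stream)
    }
}
```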
Hey Kevin, is this @note still relevant? Using addOnCompleteCallback you might be able to avoid this.
…s, formatted code more nicely
QA tests have started for PR 1658 at commit
QA tests have finished for PR 1658 at commit
So I made the requested changes and added a few more tests, but the tests appear not to have run, for a strange reason: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21922/console. The build runs out of memory with Maven but works fine in IntelliJ. Also, I do not get any feedback on the style; is there any single Maven phase I can run to get that?
There might've been some Jenkins issues recently; going to restart it.
BTW for the style, you can do "sbt/sbt scalastyle" locally if you want. Not sure there's a command in Maven.
Jenkins, retest this please
QA tests have started for PR 1658 at commit
QA tests have finished for PR 1658 at commit
QA tests have started for PR 1658 at commit
QA tests have finished for PR 1658 at commit
QA tests have started for PR 1658 at commit
QA tests have finished for PR 1658 at commit
Thanks for the update, Kevin. Note that there are still a few comments from me on https://github.com/apache/spark/pull/1658/files; do you mind dealing with those?
@kmader btw if you don't have time to deal with these comments, let me know; I might be able to take the patch from where it is and implement them.
…functions to make their usage clearer
…n as in BinaryData files
Thanks for the update, Kevin. Looks like Jenkins had some issues with git, will retry it.
Jenkins, retest this please
Test build #22503 has finished for PR 1658 at commit
Test build #22505 has finished for PR 1658 at commit
Thanks @kmader, I merged this now. I manually amended the patch a bit to fix style issues (there were still a bunch of commas without spaces, etc.), and I also changed the name of the recordLength property in Hadoop JobConfs to start with org.apache.spark so that it's less likely to clash with other Hadoop properties. Finally, I marked this API as experimental.
classOf[LongWritable],
classOf[BytesWritable],
conf = conf)
val data = br.map { case (k, v) => v.getBytes }
It turns out that getBytes returns a padded byte array, so I think you may need to copy / slice out the subarray with the data using v.getLength; see HADOOP-6298: "BytesWritable#getBytes is a bad name that leads to programming mistakes" for more details.
Using getBytes without getLength has caused bugs in Spark in the past: #2712.
Is the use of getBytes in this patch a bug? Or is it somehow safe due to our use of FixedLengthBinaryInputFormat? If it is somehow safe, we should have a comment which explains this so that readers who know about the getBytes issue aren't confused (or better yet, an assert that getBytes returns an array of the expected length).
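For illustration, the kind of guard that comment is asking for, a copy that respects getLength rather than taking the padded backing array (a sketch, not the code that was merged):

```scala
import java.util.Arrays
import org.apache.hadoop.io.BytesWritable

// Only the first getLength bytes of the backing array hold record data;
// anything beyond that may be stale padding from Hadoop's buffer reuse.
def validBytes(v: BytesWritable): Array[Byte] =
  Arrays.copyOfRange(v.getBytes, 0, v.getLength)

// In the snippet above this would become:
//   val data = br.map { case (k, v) => validBytes(v) }
```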
The additions add the abstract BinaryFileInputFormat and BinaryRecordReader classes for reading in data as a byte stream and converting it to another format using the def parseByteArray(inArray: Array[Byte]): T function. As a trivial example, ByteInputFormat and ByteRecordReader are included, which just return the Array[Byte] from a given file. Finally, an RDD for BinaryFileInputFormat (to allow for easier partitioning changes, as was done for WholeFileInput) was added, along with the appropriate byteFiles method on the SparkContext, so the functions can be easily used by others. A common use case might be to read in a folder
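As a rough illustration of the intended usage, a sketch that follows the byteFiles name and RDD[(String, Array[Byte])] type from this description (the path is hypothetical, and the eventually merged API used different names and types):

```scala
import org.apache.spark.SparkContext

object ByteFilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "byte-files-demo")
    // One (fileName, rawBytes) pair per file in the folder.
    val files = sc.byteFiles("/data/scans")
    files.map { case (name, bytes) => s"$name: ${bytes.length} bytes" }
      .collect()
      .foreach(println)
    sc.stop()
  }
}
```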