
[SPARK-2024] Add saveAsSequenceFile to PySpark #1338

Closed · wants to merge 9 commits

Conversation

@kanzhang (Contributor) commented Jul 9, 2014:

JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024

This PR is a follow-up to #455 and adds the ability to save PySpark RDDs as SequenceFiles or via any Hadoop OutputFormat.

  • Added RDD methods saveAsSequenceFile, saveAsHadoopFile and saveAsHadoopDataset, for both the old and new MapReduce APIs.
  • Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
  • No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to Object[] and they get pickled to Python tuples.
  • Added HBase and Cassandra output examples to show how custom output formats and converters can be used.

cc @MLnick @mateiz @ahirreddy @pwendell
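
A minimal usage sketch of the new save/read round trip (the output path and app name below are illustrative, and this assumes a Spark build that includes this patch):

```python
from pyspark import SparkContext

sc = SparkContext("local", "seqfile-demo")  # illustrative master and app name

# The default converter maps common Python types (int, unicode, float, ...) to Writables.
rdd = sc.parallelize([(1, u"aa"), (2, u"bb"), (3, u"cc")])
rdd.saveAsSequenceFile("/tmp/py-seqfile")  # illustrative output path; must not already exist

# Reading back goes through the default converter in the other direction.
print(sorted(sc.sequenceFile("/tmp/py-seqfile").collect()))
# [(1, u'aa'), (2, u'bb'), (3, u'cc')]
```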

@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished.

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16446/

@kanzhang kanzhang changed the title [SPARK-2024] Add saveAsSequenceFile and saveAsHadoopFile to PySpark [SPARK-2024] Add saveAsSequenceFile to PySpark Jul 9, 2014
@AmplabJenkins: Merged build triggered.

@AmplabJenkins: Merged build started.

@AmplabJenkins: Merged build finished.

@AmplabJenkins: Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16468/

@davies (Contributor) commented Jul 11, 2014:

LGTM, awesome!

@mateiz (Contributor) commented Jul 12, 2014:

Jenkins, retest this please

@SparkQA commented Jul 12, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16588/consoleFull

@@ -403,31 +403,30 @@ PySpark SequenceFile support loads an RDD within Java, and pickles the resulting
<tr><td>BooleanWritable</td><td>bool</td></tr>
<tr><td>BytesWritable</td><td>bytearray</td></tr>
<tr><td>NullWritable</td><td>None</td></tr>
<tr><td>ArrayWritable</td><td>list of primitives, or tuple of objects</td></tr>
Contributor:

Did this work before and get removed now, or was it a mistake in the docs?

Contributor (PR author):

@mateiz we don't handle arrays currently, and this is also the case for the Scala API. The reason is that the ArrayWritable class doesn't have a no-arg constructor for creating an empty instance upon reading; users need to create subtypes. Although we could add subtypes for handling primitive arrays, that would make Spark a dependency for users, which we probably don't want.

For conversion between arrays and ArrayWritable subtypes: when reading, we can convert automatically as long as the subtype is on the classpath. However, when writing, we can't convert arrays to ArrayWritable subtypes automatically since we don't know which subtype to use; users need to specify custom converters.

We should look into ArrayPrimitiveWritable, which is not available in Hadoop v1.0.4.
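
A hedged sketch of the read path just described; the value class is a hypothetical user-provided ArrayWritable subtype that must be on the classpath, and the default converter turns it into Object[], which gets pickled to a Python tuple:

```python
from pyspark import SparkContext

sc = SparkContext("local", "arraywritable-read-demo")  # illustrative

rdd = sc.sequenceFile(
    "/tmp/array-seqfile",                            # hypothetical input path
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="com.example.io.TextArrayWritable")   # hypothetical ArrayWritable subtype

# Each value arrives as a Python tuple of the array's elements.
print(rdd.first())  # e.g. (1, (u'a', u'b'))
```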

Contributor:

Ah, I see, it looks like in Scala we can write them but not read them. It's probably fine to remove them from the table then.

Contributor (PR author):

We can't write arrays in Scala either (the implicit conversion from Array to ArrayWritable is marked private). Otherwise it would be awkward, since we couldn't read them back given that ArrayWritable doesn't have a no-arg constructor. We can read user-supplied ArrayWritable subtypes; they just won't be implicitly converted. Essentially the same support as we have in Python.

@mateiz (Contributor) commented Jul 12, 2014:

This looks awesome, thanks for putting it together! One comment I have, though, is that we should add more test coverage to make sure we cover all the supported data types. Instead of doing this in doc comments, which gets unwieldy, you can do it in python/pyspark/tests.py, which is a standalone test file. Just make sure we have tests covering each data type supported in sequence files.

@MLnick you should look at this too when you have a chance.
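
A minimal sketch of the kind of round-trip test being suggested here (the class and test names are illustrative, not the ones that ended up in python/pyspark/tests.py):

```python
import os
import shutil
import tempfile
import unittest

from pyspark import SparkContext


class SequenceFileRoundTripTests(unittest.TestCase):
    def setUp(self):
        self.sc = SparkContext("local[2]", "sequencefile-roundtrip-tests")
        self.tempdir = tempfile.mkdtemp()

    def tearDown(self):
        self.sc.stop()
        shutil.rmtree(self.tempdir, ignore_errors=True)

    def test_int_string_roundtrip(self):
        # int keys become IntWritable, unicode values become Text, and back again.
        path = os.path.join(self.tempdir, "int-string")
        data = [(1, u"aa"), (2, u"bb"), (3, u"cc")]
        self.sc.parallelize(data).saveAsSequenceFile(path)
        self.assertEqual(sorted(self.sc.sequenceFile(path).collect()), data)

    def test_double_roundtrip(self):
        # float values go through DoubleWritable.
        path = os.path.join(self.tempdir, "double")
        data = [(1, 1.0), (2, 2.5)]
        self.sc.parallelize(data).saveAsSequenceFile(path)
        self.assertEqual(sorted(self.sc.sequenceFile(path).collect()), data)


if __name__ == "__main__":
    unittest.main()
```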

@MLnick (Contributor) commented Jul 12, 2014:

I have had a quick look over and will try to do a more detailed one this weekend.

At a high level it looks good; two comments so far:

  1. Agree with Matei that the tests should live in tests.py rather than in docstrings, with tests for the other data types added in a similar manner to the input-format tests.
  2. It would be great to add a couple of examples of using a custom Converter in reverse, for output. Again, Cassandra and HBase examples in a similar vein to the input-format examples would be valuable, I think.

Will provide more feedback as I go through it in more detail.

(btw thanks for fixing up the ArrayWritable stuff too).

@SparkQA commented Jul 12, 2014:

QA results for PR 1338:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
dict of public properties (via JavaBean getters and setters) class for the class type

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16588/consoleFull

@kanzhang (Contributor, PR author):

@davies @mateiz @MLnick thanks for the review and suggestions. I'll try to add standalone tests for every data type.

@MLnick (Contributor) commented Jul 16, 2014:

Great - I will review in more detail after that. It would be great to get this merged before the 1.1 freeze so that PySpark I/O for InputFormats and OutputFormats is in for the next release!

On Tue, Jul 15, 2014 at 1:07 AM, kanzhang [email protected] wrote:

@MLnick https://github.com/MLnick I'll see if I can add a couple of output converter examples as well. Thx.

@kanzhang (Contributor, PR author):

@MLnick I'm thinking of removing the tests and programming-guide entry for custom classes (JavaBeans). It seems to be a feature of Pyrolite, and I can't think of any obvious use for it in the context of RDDs. For example, Pyrolite maps a JavaBean to a dict of its attributes in Python, but one can't go in the reverse direction. Listing it as a supported data type may confuse users. Thoughts?

@mateiz (Contributor) commented Jul 21, 2014:

Regarding the JavaBeans, is there a reason to believe Pyrolite won't support them in the future? Or are you just suggesting to remove it because we can't also save data? That would be a bit of a regression for the reading side, though maybe InputFormats that return JavaBeans are not that common.

@MLnick (Contributor) commented Jul 22, 2014:

@kanzhang @mateiz Yeah, this is one issue with Pyrolite vs MsgPack. MsgPack supported case classes out of the box, which would likely be a bit more common than beans.

I'd say that custom serde via Converter will be far more common (as we've already seen with the various Avro commentary, etc.).

Thinking about it some more, I would be OK with removing it from the docs. The functionality would still be available, though undocumented, so if relevant use cases did come up on the mailing list we could point to it, and in the unlikely case that there was demand we could simply document it as read-only functionality.

Bear in mind this is also still marked experimental; we'll need to see how users use it in the wild and make amendments as required.

@kanzhang (Contributor, PR author):

@MLnick I merely removed it from the programming guide. The functionality (and your test) is still there should anyone want to try it.

@mateiz I meant that when reading JavaBeans, you get a dict of attributes to values on the Python side, but you can't turn around and save it back as JavaBeans from Python. What you save is a Java Map, since that's what Pyrolite will pickle a dict to. I was trying to confirm the same asymmetry on the saving side (i.e., saving a Python custom object as a Java Map and reading it back as a Python dict), but I got the following exception and gave up.

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsSequenceFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 33.0:2 failed 1 times, most recent failure: Exception failure in TID 70 on host localhost: net.razorvine.pickle.InvalidOpcodeException: opcode not implemented: OBJ
        net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:223)
        net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
        net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

@SparkQA commented Jul 22, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16994/consoleFull

@SparkQA commented Jul 22, 2014:

QA results for PR 1338:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16994/consoleFull

@kanzhang (Contributor, PR author):

Major changes in the updated patch:

  1. Replaced doctests with standalone tests.
  2. Fixed the converter for BytesWritable and added read/write tests for BytesWritable and byte arrays.
  3. Added HBase and Cassandra output format and converter examples (a sketch of the HBase case follows below).
  4. I used to inspect array element types and try to convert Object[] to arrays of primitive types whenever possible (so that they get pickled to Python arrays, whereas Object[] gets pickled to Python tuples). I removed that code, since I can't determine element types for empty arrays. Users have to supply custom converters if they want Java arrays to appear as Python arrays (assuming they know their array types a priori).
  5. No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon deserializing. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to Object[] and they get pickled to Python tuples.
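
A hedged sketch of the HBase output case from item 3. The converter class names follow the examples added in this patch but should be treated as assumptions to verify against the examples directory; the ZooKeeper host, table name and rows are illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local", "hbase-output-demo")  # illustrative

# Hadoop configuration for the new-API TableOutputFormat.
conf = {
    "hbase.zookeeper.quorum": "zk-host",           # illustrative host
    "hbase.mapred.outputtable": "test_table",      # illustrative table
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
}

# Converters (assumed names) that turn (row_key, [row, family, qualifier, value])
# into the ImmutableBytesWritable/Put pair TableOutputFormat expects.
key_conv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
value_conv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

rows = [("row1", ["row1", "f1", "q1", "value1"]),
        ("row2", ["row2", "f1", "q1", "value2"])]

sc.parallelize(rows).saveAsNewAPIHadoopDataset(
    conf=conf, keyConverter=key_conv, valueConverter=value_conv)
```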

@kanzhang (Contributor, PR author):

@pwendell I renamed the file HBaseConverter.scala to HBaseConverters.scala, and now the Scala style checks fail. How can I fix it? Thx.

@mateiz (Contributor) commented Jul 23, 2014:

The style-check error is different; see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16994/consoleFull. It's a bit hidden in there, but it says:

error file=/home/jenkins/workspace/SparkPullRequestBuilder@7/core/src/main/scala/org/apache/spark/api/python/PythonHadoopUtil.scala message=There should be a space after the plus (+) sign line=34 column=19

@@ -31,13 +31,14 @@ import org.apache.spark.annotation.Experimental
* transformation code by overriding the convert method.
*/
@Experimental
trait Converter[T, U] extends Serializable {
trait Converter[T, +U] extends Serializable {
Contributor:

Actually the style checker seems to be complaining about this +, which is a mistake in the style checker. You can add a space after the + for now. But do we really need covariance here?

Contributor:

(For better or worse, we don't really use it elsewhere in Spark.)

Contributor (PR author):

@mateiz thanks, Matei. I saw it, but I couldn't believe that was the reason :-). I added the + sign because some of our converters have more specific types like [Any, Writable], and the compiler complains when assigning them where [Any, Any] is required. I don't have a strong preference here and could change them back to [Any, Any]. Let me know.

@SparkQA commented Jul 23, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17042/consoleFull

@SparkQA commented Jul 29, 2014:

QA results for PR 1338:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17384/consoleFull

@kanzhang (Contributor, PR author):

Now I get the following error, since saveAsHadoopFile has 11 parameters. Can we relax the limit a bit?

Scalastyle checks failed at following occurrences:
error file=/home/jenkins/workspace/SparkPullRequestBuilder@3/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala message=The number of parameters should not exceed 10 line=627 column=6

@SparkQA commented Jul 29, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17386/consoleFull

@kanzhang (Contributor, PR author):

Never mind, I'm refactoring.

@SparkQA commented Jul 29, 2014:

QA results for PR 1338:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17386/consoleFull

@SparkQA commented Jul 29, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17389/consoleFull

@SparkQA commented Jul 29, 2014:

QA results for PR 1338:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17389/consoleFull

@SparkQA commented Jul 30, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull

@SparkQA commented Jul 30, 2014:

QA tests have started for PR 1338. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull

@SparkQA commented Jul 30, 2014:

QA results for PR 1338:
- This patch FAILED unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17419/consoleFull

@SparkQA commented Jul 30, 2014:

QA results for PR 1338:
- This patch PASSES unit tests.
- This patch merges cleanly.
- This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17424/consoleFull

@JoshRosen (Contributor):

I think we should remove the batchSerialized arguments from PythonRDD's saveAs* methods and add a batchSerialized field to PythonRDD's constructor, since it's an attribute of the RDD itself rather than an option.

}
pyRDD.mapPartitions { iter =>
val unpickle = new Unpickler
val unpickled =
Contributor:

This batchSerialized-respecting unpickling logic should probably live in its own function so that it can also be used by pythonToJavaMap.

Contributor (PR author):

Can we defer this refactoring to when we update pythonToJavaMap, since I don't want to touch SchemaRDD code in this patch?

@kanzhang (Contributor, PR author):

> I think we should remove the batchSerialized arguments from PythonRDD's saveAs* methods and add a batchSerialized field to PythonRDD's constructor, since it's an attribute of the RDD itself rather than an option.

The problem with that is that currently PythonRDD objects are only created by PipelinedRDD, whereas in other cases (e.g., PythonRDD.readRDDFromFile and SchemaRDD.javaToPython) the _jrdd (a JavaRDD[Array[Byte]]) is created directly, without a PythonRDD object. I feel the change to use PythonRDD everywhere is too big for this patch. Maybe a follow-up JIRA?

@JoshRosen (Contributor):

Ah, I see. I don't mind deferring that refactoring to a later patch. I'll create some PySpark refactoring JIRAs later.

@JoshRosen (Contributor):

I've merged this. Thanks!

@kanzhang kanzhang changed the title [SPARK-2024] Add saveAsSequenceFile to PySpark [SPARK-2024] [PySpark] Add saveAsSequenceFile to PySpark Jul 30, 2014
@kanzhang kanzhang changed the title [SPARK-2024] [PySpark] Add saveAsSequenceFile to PySpark [SPARK-2024] Add saveAsSequenceFile to PySpark Jul 30, 2014
@asfgit asfgit closed this in 94d1f46 Jul 30, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

Author: Kan Zhang <[email protected]>

Closes apache#1338 from kanzhang/SPARK-2024 and squashes the following commits:

c01e3ef [Kan Zhang] [SPARK-2024] code formatting
6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
@kanzhang kanzhang deleted the SPARK-2024 branch December 12, 2014 01:32