Added SparkSubmitTask and deprecated SparkJob, Spark1xJob and PySpark1xJob #812
Conversation
Strictly follows spark-submit usage::

    Usage: spark-submit [options] <app jar | python file> [app options]
It would be better to just say something like "See `spark-submit -h` for more information.", or to add a link to the spark-submit documentation.
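A sketch of how the docstring could read with that suggestion applied; the class body below is illustrative, not the actual diff:

```python
import luigi


class SparkSubmitTask(luigi.Task):
    """
    Template task for submitting an app with spark-submit.

    Usage: spark-submit [options] <app jar | python file> [app options]

    See ``spark-submit -h`` for more information.
    """
```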
Looks great, but on a more general note I think it would be awesome if we could support inline Python code like in …
Thanks for the review. The updated PR contains: …
@erikbern, that would be awesome as an additional feature. Simple support for spark-submit is necessary, though, to be able to use existing Python drivers as well as Java or Scala drivers. The inline variant could be introduced as a subclass. On an unrelated note: is there any kind of support in luigi for streaming tasks? It would be good to also be able to incorporate Spark Streaming in pipelines.
(force-pushed from 83884a0 to 058bfb0)
@erikbern, added … Usage looks like this:
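(The original usage block wasn't captured here; the snippet below is a minimal sketch of what a `SparkSubmitTask` subclass might look like, assuming the `app` and `master` attributes and the `app_options()` hook described in this PR.)

```python
import luigi
from luigi.contrib.spark import SparkSubmitTask


class MySparkApp(SparkSubmitTask):
    # jar or python file that spark-submit should run
    app = 'my_app.py'
    master = 'yarn-client'

    def app_options(self):
        # arguments appended after the app file on the spark-submit command line
        return ['--date', '2015-03-01']
```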
That's really cool – can you map over an input source as well?
If you want to, feel free to leave out the last commit for a separate PR. That way we can merge what we have so far.
(force-pushed from e05320e to 1a9b820)
I removed the last commit and squashed the existing ones together. Yes, the input path can be used as a source by doing something like the sketch below.
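(The inline code span was lost in rendering; this sketch only illustrates the idea — `InputText` is a hypothetical upstream task, and the `app_options()` hook is assumed from this PR.)

```python
import luigi
from luigi.contrib.spark import SparkSubmitTask


class InputText(luigi.ExternalTask):
    """Hypothetical upstream task pointing at the input data."""

    def output(self):
        return luigi.LocalTarget('input.txt')


class WordCount(SparkSubmitTask):
    app = 'wordcount.py'  # hypothetical driver script

    def requires(self):
        return InputText()

    def output(self):
        return luigi.LocalTarget('counts.tsv')

    def app_options(self):
        # the upstream output's path is handed to the Spark app as its input source
        return [self.input().path, self.output().path]
```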
(force-pushed from 30ef92c to b7b7449)
env = os.environ.copy()
temp_stderr = tempfile.TemporaryFile()
logger.info('Running: %s', repr(args))
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=temp_stderr, env=env, close_fds=True)
If stdout and stderr will always be text, you can open the process with `universal_newlines=True`. The process will then accept unicode on stdin and return unicode.

Also, any reason for `env=env`? According to my understanding of the docs, inheriting the parent environment is the default behavior.
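A sketch of the suggested revision, with a placeholder `args` so it stands alone; whether text-mode pipes are safe here depends on what the Spark process emits, so treat it as an assumption:

```python
import logging
import subprocess
import tempfile

logger = logging.getLogger(__name__)

args = ['spark-submit', '--version']  # placeholder command for illustration
temp_stderr = tempfile.TemporaryFile()
logger.info('Running: %s', repr(args))
# universal_newlines=True gives text-mode pipes (unicode in, unicode out);
# env is omitted because Popen inherits os.environ when env is None
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=temp_stderr,
                        close_fds=True, universal_newlines=True)
proc.wait()
```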
Something seems weird with the delta now. Can you rebase on master?
Done @erikbern, does that fix it?
Yes, looks good. Is this ready to be merged? How backwards-incompatible is it? Happy to merge, just want to make sure there's no risk.
It should be fully backwards compatible: all existing unit tests pass, SparkJob is unchanged, and the (Py)Spark1xJob classes have more features now since they extend SparkSubmitTask, but their behaviour is unchanged.
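For illustration, a backwards-compatible shim of this kind might look like the sketch below — this shows only the pattern, not the actual diff, and the deprecation warning is an assumption:

```python
import warnings

from luigi.contrib.spark import SparkSubmitTask


class Spark1xJob(SparkSubmitTask):
    """Deprecated shim: same behaviour, now backed by SparkSubmitTask."""

    def __init__(self, *args, **kwargs):
        warnings.warn('Spark1xJob is deprecated, use SparkSubmitTask instead',
                      DeprecationWarning)
        super(Spark1xJob, self).__init__(*args, **kwargs)
```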
@Tarrasch, @berkerpeksag, this is my follow-up PR from the comments in #806.
I deprecated both Spark1xJob and PySpark1xJob and created a single SparkSubmitTask that follows the spark-submit command usage and can handle Java, Scala and Python apps on Spark standalone, YARN or Mesos. I'm wondering whether SparkJob should be deprecated too.
Happy to discuss and make any changes.
Thanks