Added SparkSubmitTask and deprecated SparkJob, Spark1xJob and PySpark1xJob #812
Conversation
Strictly follows spark-submit usage::

    Usage: spark-submit [options] <app jar | python file> [app options]
It would be better to just say something like "See `spark-submit -h` for more information.", or to add a link to the spark-submit documentation.
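A sketch of how the docstring could read with that suggestion applied; the class body below is illustrative, not the actual diff:

```python
import luigi


class SparkSubmitTask(luigi.Task):
    """
    Template task for submitting an app with spark-submit.

    Usage: spark-submit [options] <app jar | python file> [app options]

    See ``spark-submit -h`` for more information.
    """
```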
Looks great, but on a more general note I think it would be awesome if we could support inline Python code like in …
Thanks for the review. The updated PR contains: …
@erikbern, that would be awesome as an additional feature. Simple support for spark-submit is necessary, though, to be able to use existing Python drivers as well as Java or Scala drivers. The inline variant could be introduced as a subclass. On an unrelated note: is there any kind of support in luigi for streaming tasks? It would be good to also be able to incorporate Spark Streaming in pipelines.
(force-pushed from 83884a0 to 058bfb0)
@erikbern, added … Usage looks like this:
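(The original usage block wasn't captured here; the snippet below is a minimal sketch of what a `SparkSubmitTask` subclass might look like, assuming the `app` and `master` attributes and the `app_options()` hook described in this PR.)

```python
import luigi
from luigi.contrib.spark import SparkSubmitTask


class MySparkApp(SparkSubmitTask):
    # jar or python file that spark-submit should run
    app = 'my_app.py'
    master = 'yarn-client'

    def app_options(self):
        # arguments appended after the app file on the spark-submit command line
        return ['--date', '2015-03-01']
```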
That's really cool – can you map over an input source as well?
If you want to, feel free to leave out the last commit for a separate PR. That way we can merge what we have so far.
(force-pushed from e05320e to 1a9b820)
I removed the last commit and squashed the existing ones together. Yes, the input path can be used as a source by doing something like the sketch below.
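(The inline code span was lost in rendering; this sketch only illustrates the idea — `InputText` is a hypothetical upstream task, and the `app_options()` hook is assumed from this PR.)

```python
import luigi
from luigi.contrib.spark import SparkSubmitTask


class InputText(luigi.ExternalTask):
    """Hypothetical upstream task pointing at the input data."""

    def output(self):
        return luigi.LocalTarget('input.txt')


class WordCount(SparkSubmitTask):
    app = 'wordcount.py'  # hypothetical driver script

    def requires(self):
        return InputText()

    def output(self):
        return luigi.LocalTarget('counts.tsv')

    def app_options(self):
        # the upstream output's path is handed to the Spark app as its input source
        return [self.input().path, self.output().path]
```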
(force-pushed from 30ef92c to b7b7449)
env = os.environ.copy()
temp_stderr = tempfile.TemporaryFile()
logger.info('Running: %s', repr(args))
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=temp_stderr, env=env, close_fds=True)
If stdout and stderr will always be text, you can open the process with `universal_newlines=True`. The process will then accept unicode on stdin and return unicode.

Also, any reason for `env=env`? According to my understanding of the docs, inheriting the parent environment is the default behavior.
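A sketch of the suggested revision, with a placeholder `args` so it stands alone; whether text-mode pipes are safe here depends on what the Spark process emits, so treat it as an assumption:

```python
import logging
import subprocess
import tempfile

logger = logging.getLogger(__name__)

args = ['spark-submit', '--version']  # placeholder command for illustration
temp_stderr = tempfile.TemporaryFile()
logger.info('Running: %s', repr(args))
# universal_newlines=True gives text-mode pipes (unicode in, unicode out);
# env is omitted because Popen inherits os.environ when env is None
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=temp_stderr,
                        close_fds=True, universal_newlines=True)
proc.wait()
```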
Something seems weird with the delta now. Can you rebase on master?
Done @erikbern, does that fix it?
Yes, looks good. Is this ready to be merged? How backwards-incompatible is it? Happy to merge, just want to make sure there's no risk.
It should be fully backwards compatible: all existing unit tests pass, SparkJob is unchanged, and the (Py)Spark1xJob classes have more features now since they extend SparkSubmitTask, but their behaviour is unchanged.
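For illustration, a backwards-compatible shim of this kind might look like the sketch below — this shows only the pattern, not the actual diff, and the deprecation warning is an assumption:

```python
import warnings

from luigi.contrib.spark import SparkSubmitTask


class Spark1xJob(SparkSubmitTask):
    """Deprecated shim: same behaviour, now backed by SparkSubmitTask."""

    def __init__(self, *args, **kwargs):
        warnings.warn('Spark1xJob is deprecated, use SparkSubmitTask instead',
                      DeprecationWarning)
        super(Spark1xJob, self).__init__(*args, **kwargs)
```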
@Tarrasch, @berkerpeksag, this is my follow-up PR from the comments in #806.
I deprecated both Spark1xJob and PySpark1xJob and created a single SparkSubmitTask that follows the spark-submit command usage and can handle Java, Scala and Python apps on Spark standalone, YARN or Mesos. I'm wondering whether SparkJob should be deprecated too.
Happy to discuss and make any changes.
Thanks