Allow a Job to be a DAG instead of a chain? #18
Comments
I like this implementation. It seems backward compatible. It is similar to the MultiMapper class (examples/multicount.py), except that it allows separate reducers for the two maps. In the example above, the reducers (and mappers?) are the same, so both maps could be done with MultiMapper and fed into the same reducer, and the third map/reduce could then process that (single) output. But if we wanted two distinct map/reduce pipelines feeding into a third, I don't think MultiMapper could do it.
Agreed, this is more general than MultiMapper and would be a good addition to dumbo.
Here's my attempt at the modification: http://github.com/hydropyrum/dumbo/commit/545a11b64d67404fbb30987cbe9d0f8fc885d202
This is now in my master branch. I made some minor stylistic changes, but apart from those it went in unchanged. Updated example:
I don't see how I can mix this nice feature with MultiMapper/JoinReducer. edit: I may have found a way:
So
should do the trick.
It would be nice if it were possible, when calling Job.additer, to specify that the input for an iteration should be the output of one or more previous iterations of the Job. Something like...
job = dumbo.Job()
job.input  # an id for the job's input (i.e., what -input specifies on the command line)
o0 = job.additer(mapper, reducer)  # returns an id for the iteration's output
o1 = job.additer(mapper, reducer, input=job.input)  # take input from the job input instead of iteration 0
o2 = job.additer(mapper, reducer, input=[o0, o1])  # take input from both iteration 0 and 1
The job's output would be the output of the last iteration as always.
It seems to me this would be a fairly easy modification that would add a lot of flexibility.
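To make the proposal concrete, here is a minimal in-memory sketch of the bookkeeping it implies. This is not dumbo's implementation: `DagJob`, `run_local`, and the toy mappers and reducer are hypothetical names invented for illustration, and only the `additer(..., input=...)` call shape is taken from the proposal above. Since an iteration can only name ids that already exist, the graph stays acyclic by construction.

```python
from collections import defaultdict


class DagJob:
    """Hypothetical stand-in for dumbo.Job; it only records the wiring."""

    def __init__(self):
        self.input = "job-input"  # id for the data given via -input
        self.iters = []           # one (mapper, reducer, [source ids]) per iteration

    def additer(self, mapper, reducer, input=None):
        if input is None:
            # Chain behaviour: read the previous iteration, or the job
            # input if this is the first iteration.
            input = [len(self.iters) - 1] if self.iters else [self.input]
        elif not isinstance(input, (list, tuple)):
            input = [input]
        for src in input:
            known = src == self.input or (
                isinstance(src, int) and 0 <= src < len(self.iters))
            if not known:
                raise ValueError("unknown input id: %r" % (src,))
        self.iters.append((mapper, reducer, list(input)))
        return len(self.iters) - 1  # id of this iteration's output


def run_local(job, records):
    """Run every iteration in memory and return the last one's output."""
    outputs = {job.input: list(records)}
    for iter_id, (mapper, reducer, sources) in enumerate(job.iters):
        grouped = defaultdict(list)
        for src in sources:
            for key, value in outputs[src]:
                for k, v in mapper(key, value):
                    grouped[k].append(v)
        out = []
        for k in sorted(grouped):
            out.extend(reducer(k, grouped[k]))
        outputs[iter_id] = out
    return outputs[len(job.iters) - 1]


# Toy functions, assumed for this example only.
def count_words(key, line):
    for word in line.split():
        yield word, 1


def count_initials(key, line):
    for word in line.split():
        yield word[0], 1


def identity(key, value):
    yield key, value


def total(key, values):
    yield key, sum(values)


job = DagJob()
o0 = job.additer(count_words, total)
o1 = job.additer(count_initials, total, input=job.input)
o2 = job.additer(identity, total, input=[o0, o1])
result = run_local(job, [(0, "ab a b")])
```

Leaving `input` unset keeps the existing chain behaviour, which is what would make such a change backward compatible. A real implementation would map each id to an HDFS path and pass the paths to Hadoop Streaming rather than keep lists in memory.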