Allow a Job to be a DAG instead of a chain? #18

hydropyrum · 2010-08-24T23:03:47Z

It would be nice if it were possible, when calling Job.additer, to specify that the input for an iteration should be the output of one or more previous iterations of the Job. Something like...

job = dumbo.Job()
job.input # is an id for job's input (i.e., specified by -input on the command line)
o0 = job.additer(mapper, reducer) # returns an id for the iteration's output
o1 = job.additer(mapper, reducer, input=job.input) # take input from the job input instead of iteration 0
o2 = job.additer(mapper, reducer, input=[o0,o1]) # take input from both iteration 0 and 1

The job's output would be the output of the last iteration as always.

It seems to me this would be a fairly easy modification that would add a lot of flexibility.

sdeneefe · 2010-08-27T17:14:37Z

I like this implementation. It seems backward compatible.

This seems similar to the MultiMapper class (examples/multicount.py), except that it allows separate reducers for the two maps. In the example above, the reducers (and mappers?) are the same, so both maps could be done with MultiMapper and fed into the same reducer, and the third map/reduce could then process that (single) output. But if we wanted two distinct map/reduce pipelines feeding into a third, I don't think MultiMapper could do it.

desilinguist · 2010-08-27T18:53:10Z

Agreed, this is more general than MultiMapper and would be a good addition to dumbo.

hydropyrum · 2010-08-27T21:32:45Z

Here's my attempt at the modification:

http://github.com/hydropyrum/dumbo/commit/545a11b64d67404fbb30987cbe9d0f8fc885d202

klbostee · 2010-12-10T16:58:32Z

This is now in my master branch. I made some minor stylistic changes, but apart from those it went in unchanged.

Updated example:

job = dumbo.Job()
job.root # id for the job's root input (i.e., specified by -input on the command line)
o0 = job.additer(mapper1, reducer1) # returns an id for the iteration's output
o1 = job.additer(mapper2, reducer2, input=job.root) # consume root input, not the output of iteration 0
o2 = job.additer(mapper3, reducer3, input=[o0,o1]) # take input from both iteration 0 and 1
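For readers who want to see how the ids in this example relate to each other, here is a minimal toy model of the bookkeeping. It is not dumbo's implementation: the class name `SketchJob`, the `inputs_of` helper, and the choice of -1 as the root id are all assumptions made for illustration, and iterations are represented by plain labels instead of mapper/reducer pairs.

```python
class SketchJob:
    """Toy model of a job whose iterations form a DAG (hypothetical, not dumbo code)."""

    def __init__(self):
        self.root = -1   # assumed id standing for the job's -input
        self.iters = []  # one (label, input ids) pair per iteration

    def additer(self, label, input=None):
        """Register an iteration and return the id of its output."""
        iter_id = len(self.iters)
        if input is None:
            # Chain behaviour: the first iteration reads the root input,
            # every later one reads the output of the previous iteration.
            inputs = [self.root] if iter_id == 0 else [iter_id - 1]
        elif isinstance(input, int):
            inputs = [input]
        else:
            inputs = list(input)
        for i in inputs:
            # Only the root or an earlier iteration may be referenced,
            # which keeps the graph acyclic.
            if i != self.root and not (0 <= i < iter_id):
                raise ValueError("unknown input id: %r" % (i,))
        self.iters.append((label, inputs))
        return iter_id

    def inputs_of(self, iter_id):
        """Return the list of input ids recorded for an iteration."""
        return self.iters[iter_id][1]


job = SketchJob()
o0 = job.additer("first")                   # reads the root input
o1 = job.additer("second", input=job.root)  # also reads the root input
o2 = job.additer("third", input=[o0, o1])   # reads both earlier outputs
```

Because an iteration can only name the root or an earlier iteration as input, the order in which iterations were added is already a valid execution order, and a call without `input` behaves like the original chain.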

brisssou · 2011-07-13T14:59:15Z

I don't see how I can mix this nice feature with MultiMapper/JoinReducer.
Did I miss something?

edit: I may have found a way:

kwargs['output'] = job_output + "_pre" + str(iter + 1)

So

multimapper = MultiMapper()
multimapper.add("_pre1", primary(mapper))
multimapper.add("_pre2", secondary(mapper))

should do the trick.
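A small sketch of the naming idea behind this workaround, under stated assumptions: `intermediate_output` just restates the formula quoted above, while `pick_pattern` assumes that a pattern is chosen by checking whether it occurs as a substring of the input path. Both function names are hypothetical and neither is part of dumbo.

```python
def intermediate_output(job_output, iteration):
    """Output path of an intermediate iteration, per the formula quoted above."""
    return job_output + "_pre" + str(iteration + 1)


def pick_pattern(path, patterns):
    """Return the first pattern found in path, or None (assumed matching rule)."""
    for pattern in patterns:
        if pattern in path:
            return pattern
    return None


# "out" is a made-up job output path used only for illustration.
first = intermediate_output("out", 0)
second = intermediate_output("out", 1)
chosen = pick_pattern(second + "/part-00000", ["_pre1", "_pre2"])
```

If matching really is by substring, a pattern such as "_pre1" would also match a path ending in "_pre10", so this approach would need care in jobs with ten or more iterations.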

This issue was closed.
