
Collate data sets in UI #511

Open · 4 of 9 tasks
greenbeard1 opened this issue Feb 4, 2016 · 7 comments

greenbeard1 commented Feb 4, 2016

When working with a series of runs on a series of datasets that produce multiple outputs (for instance, the mixed HCV pipeline on several samples), it is useful to be able to download the same kind of results file for all samples as a batch. Perhaps a list of results files next to each run, with tick boxes and a download button?

Updated changes

This is a pretty old issue, but the current plan is to support pipeline arguments with multiple input files, and then write a pipeline that will collate multiple files. As described in the "Features to Add" section of #739, the full set of arguments supported in a Singularity configuration would look something like this:

KIVE_INPUTS=name1 name2 --option_name --multiple_option_name*
KIVE_OUTPUTS=name1 name2 --output_option --directory_name/

If we can't come up with an easy user interface at first, these extra option types might only be available when launching runs through the API.
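As a sketch of where this configuration might live, a Singularity definition file could look like the following. This assumes, per #739, that Kive reads these settings from the image's %labels section; that location and all the argument names here are assumptions for illustration.

# Hypothetical example; argument names are invented to exercise the
# positional, multi-valued, and directory forms described above.
Bootstrap: docker
From: python:3

%labels
    KIVE_INPUTS main_csv --extra_files*
    KIVE_OUTPUTS summary_csv --log_dir/

%runscript
    /usr/local/bin/collate.py "$@"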

  • Configure named arguments in Singularity files, like --option_name, in addition to positional arguments.
  • Launch runs with named arguments, at least through the API.
  • Configure named arguments with multiple values, like --multiple_option_name*.
  • Launch runs with multivalue arguments.
  • Write an example pipeline for collating files.
  • Configure named output arguments in Singularity files.
  • Launch runs with named output arguments.
  • Configure output directory arguments, like --directory_name/. Allow positional or named arguments.
  • Collect datasets from all files in an output directory. Should they use the file name as the dataset name?
@donkirkby
Copy link
Member

This is symmetrical with issue #488, which proposes a new batch interface for launching runs. The user would have to choose a pipeline, then choose which outputs to download, then choose which runs to include.

It might also be useful to collate the outputs from several runs into a single file, with the run name as a new column. Record the result as a new dataset so it can be traced.
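A minimal sketch of that kind of collation, outside of any Kive machinery (the script shape, the run-name/path pairing on the command line, and the 'run' column name are all illustrative assumptions, not an existing Kive feature):

import csv
import sys


def collate(run_files, out_path):
    # run_files maps run name -> path of that run's output CSV.
    # Assumes every run's CSV has the same columns.
    writer = None
    with open(out_path, 'w', newline='') as out:
        for run_name, path in run_files.items():
            with open(path, newline='') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    if writer is None:
                        # Prepend a 'run' column so each row can be
                        # traced back to the run that produced it.
                        fields = ['run'] + reader.fieldnames
                        writer = csv.DictWriter(out, fieldnames=fields)
                        writer.writeheader()
                    row['run'] = run_name
                    writer.writerow(row)


if __name__ == '__main__':
    # Usage: collate_runs.py combined.csv run1=run1/results.csv run2=run2/results.csv
    out_path, *pairs = sys.argv[1:]
    collate(dict(pair.split('=', 1) for pair in pairs), out_path)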

@ArtPoon ArtPoon added this to the 0.8 - user batch processing milestone Mar 7, 2016
donkirkby (Member) commented

As we discussed in the Kive meeting this week, we have to track how a collated dataset was created. I think the cleanest way is to add a feature for methods to accept an arbitrary list of input files, then to let the user choose a pipeline for collating outputs from a bunch of runs.
However, when I thought through all the impacts of allowing an arbitrary list of inputs, this issue looked too big to fit in the current milestone. Do you all think we should push it to the next milestone?

Here's a summary of the changes I thought of so far:

  • Document how to accept an arbitrary list of input files in a code resource.
  • Add an extra_inputs flag to methods, and possibly a compound datatype for extra inputs.
  • Add an extra inputs node to pipelines.
  • Wire extra inputs to methods in a pipeline.
  • Allow a typed source to feed a raw input. This lets you write a generic collating pipeline.
  • Add an extra input on the run form.
  • Pass extra inputs during execution.

Here's a proof of concept for accepting an arbitrary list of input files in a code resource:

from argparse import ArgumentParser, FileType
from itertools import chain


def main():
    parser = ArgumentParser(description='Concatenate input files.')
    parser.add_argument('outfile', type=FileType('w'))
    # nargs='*' accepts an arbitrary number of input files, in order.
    parser.add_argument('infiles', type=FileType('r'), nargs='*')
    args = parser.parse_args()
    # Chain the open files together and copy every line to the output.
    for line in chain(*args.infiles):
        args.outfile.write(line)


if __name__ == '__main__':
    main()

The example just concatenates a list of files together into a single output file.
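For example, it might be invoked like this (the script and file names are illustrative):

python concat_files.py combined.csv part1.csv part2.csv

which writes the lines of part1.csv followed by part2.csv into combined.csv.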

rhliang (Contributor) commented Jul 22, 2016

I agree that arbitrary lists of inputs are a big change and should be punted. I wonder if it would be more sensible for us to introduce command-line options in our Methods, so that when we have multiple files to deal with we could use an invocation like

foo.py --multiinput input1 --multiinput input2 --multiinput input3 --multioutput output1 --multioutput output2 --multioutput output3

where multiinput is something the Method would define and Kive would have to know about. This way we could specify more than one arbitrary-length input.
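A minimal argparse sketch of that invocation style (the option names follow the example above; the script itself is an illustration, not Kive code):

from argparse import ArgumentParser, FileType


def parse_args():
    parser = ArgumentParser(description='Demonstrate repeated options.')
    # action='append' collects one value per occurrence of the option,
    # so --multiinput a --multiinput b yields ['a', 'b'] in that order.
    parser.add_argument('--multiinput', action='append',
                        type=FileType('r'), default=[])
    parser.add_argument('--multioutput', action='append',
                        type=FileType('w'), default=[])
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    print(len(args.multiinput), 'inputs,', len(args.multioutput), 'outputs')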

If we're only collating outputs, though, an easier alternative might be to create something akin to a RunCable with a ManyToManyField tracking which Runs it collates output from, or perhaps just a ForeignKey to a RunBatch. In that scheme, I'd imagine we could avoid using the fleet for these collations.

donkirkby (Member) commented

Using names for the extra inputs is interesting. I didn't like the order I had: standard inputs, outputs, extra inputs. Also, as you say, names let you specify more than one set of extra inputs.

I thought about doing something simpler that only supports collating, but I couldn't think of something simple enough that I don't mind throwing it away when we add support for lists of inputs. For the next release or two, I suggest we collate outside of Kive the same way MiCall does.

@donkirkby donkirkby modified the milestones: 0.9 - Streamline pipeline creation workflow, 0.8 - user batch processing Jul 25, 2016
@donkirkby donkirkby removed the ready label Jul 25, 2016
@donkirkby donkirkby added the ready label Oct 4, 2016
@donkirkby donkirkby modified the milestones: 0.9 - Streamline pipeline creation workflow, 0.10 - Slurm execution Jan 6, 2017
@donkirkby donkirkby removed the ready label Jan 6, 2017
@rhliang rhliang modified the milestones: 0.10 - Slurm execution, 0.11 - TBD Mar 28, 2017
@donkirkby donkirkby modified the milestones: 0.11 - Slurm UI, Near future Sep 25, 2017
@donkirkby donkirkby removed the ready label Sep 25, 2017
@donkirkby donkirkby modified the milestones: Near future, 0.12 simpler pipeline setup Dec 18, 2017
donkirkby (Member) commented

We've discussed a couple of options for methods with a flexible number of input files:

  1. Pass a directory as the command-line argument, and put multiple files in that directory.
  2. Use a named command-line argument instead of a positional argument, and accept multiple values.

Input directory

This option is probably simpler, but it's not clear how to specify an order for the input files. We could use a naming convention, but that feels like a hack.
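For concreteness, such a naming convention might just lean on lexicographic order (the zero-padded prefixes are an invented convention, which is exactly the hack in question):

from pathlib import Path


def ordered_inputs(input_dir):
    # Relies on names like 01_english_names.csv, 02_french_names.csv:
    # sorting the file names stands in for an explicit argument order.
    return sorted(Path(input_dir).glob('*.csv'))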

Named argument

This makes it more complicated to parse the command line, but Python's argparse module will do it for us. We can use nargs='*', just like we did in the docker wrapper. A command line might look like this:

my_script.py --names english_names.csv french_names.csv -- greetings.csv

The -- tells the script that the list of optional arguments has ended, and greetings.csv is the only positional argument: an output.
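A minimal sketch of how argparse handles that command line (the argument names follow the example above):

from argparse import ArgumentParser

parser = ArgumentParser(description='Collate name files into greetings.')
# nargs='*' greedily consumes values after --names; the bare -- is what
# stops it, leaving greetings.csv for the positional output argument.
parser.add_argument('--names', nargs='*', default=[])
parser.add_argument('outfile')

args = parser.parse_args(
    ['--names', 'english_names.csv', 'french_names.csv',
     '--', 'greetings.csv'])
print(args.names)    # ['english_names.csv', 'french_names.csv']
print(args.outfile)  # greetings.csv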

Recommendation

Especially when we're collating results, we will care about the order of the input files, so I recommend using named arguments.

The method definition will need a new section: optional inputs. They will need a name, and they can have an option to allow multiple files.

We'll have to figure out how to wire optional inputs in a pipeline. Are they only useful for pipeline inputs?

@donkirkby donkirkby self-assigned this May 30, 2018
donkirkby added a commit that referenced this issue Jul 25, 2018
donkirkby added a commit that referenced this issue Jul 25, 2018
Also check optional inputs in Pipeline.check_inputs().
donkirkby (Member) commented

I made some progress on this in the optional-inputs branch, but I think we should postpone the change until after we decide about #739. It would be much easier to support different arguments if the launching code were simpler.

@donkirkby donkirkby modified the milestones: Near future, 0.17 Mar 5, 2020
@donkirkby donkirkby changed the title from "Batch Download of Results Across Runs" to "Collate data sets" Mar 5, 2020
donkirkby (Member) commented

Now that we've finished #739, #752, and #721, it's time to come back to this issue. I've reduced the scope a bit from a batch download in the UI to supporting a pipeline that can collate all the result files from a batch of runs.

The main benefit of this is that we'll be able to do the collating inside Kive, so we can use the file lookup feature on the collated result files. At first, we might have to launch all the collating jobs through Kive's API.
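A loudly hypothetical sketch of launching such a collating run through the API; the endpoint path, payload shape, and argument names are all invented for illustration, and the real Kive API may differ:

import requests

session = requests.Session()
session.auth = ('kive_user', 'kive_password')  # hypothetical credentials

# Hypothetical endpoint and payload: a collating app whose multi-valued
# --scores argument is fed by the dataset ids from one batch of runs.
response = session.post(
    'https://kive.example.com/api/containerruns/',
    json={'app': 42,
          'datasets': [{'argument_name': '--scores', 'dataset': ds_id}
                       for ds_id in [101, 102, 103]]})
response.raise_for_status()
print('Launched run', response.json()['id'])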

Once this is done, we can think about how to add a batch download feature to the UI.

@nathanielknight nathanielknight self-assigned this Jun 2, 2020
@donkirkby donkirkby removed their assignment Jun 3, 2020
nathanielknight added a commit that referenced this issue Jun 13, 2020
This commit adds multi-valued optional input parameters to Kive
pipelines.

This partially addresses #511
nathanielknight added a commit that referenced this issue Jun 26, 2020
This commit adds two examples to the `api` directory.

One example shows how to start a pipeline using a mono-valued
keyword-style pipeline argument. The other shows how to use a
multi-valued keyword-style pipeline argument.

Both examples depend on scripts that have been added to the samplecode
singularity image in a separate branch, pending merge.

This commit pertains to #511.
@donkirkby donkirkby changed the title from "Collate data sets" to "Collate data sets in UI" Oct 5, 2022