
Collate data sets in UI #511

Open · 4 of 9 tasks
greenbeard1 opened this issue Feb 4, 2016 · 7 comments

greenbeard1 commented Feb 4, 2016

When working with a series of runs on a series of datasets that produce multiple outputs (for instance, the mixed HCV pipeline on several samples), it is useful to be able to download the same kind of results file for all samples as a batch. Perhaps a list of results files next to each run, with tick boxes and a download button?

Updated changes

This is a pretty old issue, but the current plan is to support pipeline arguments with multiple input files, and then write a pipeline that will collate multiple files. As described in the "Features to Add" section of #739, the full set of arguments supported in a Singularity configuration would look something like this:

KIVE_INPUTS=name1 name2 --option_name --multiple_option_name*
KIVE_OUTPUTS=name1 name2 --output_option --directory_name/

If we can't come up with an easy user interface at first, these extra option types might only be available when launching runs through the API.
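As a sketch of where this configuration might live, a Singularity definition file could look like the following. This assumes, per #739, that Kive reads these settings from the image's %labels section; that location and all the argument names here are assumptions for illustration.

# Hypothetical example; argument names are invented to exercise the
# positional, multi-valued, and directory forms described above.
Bootstrap: docker
From: python:3

%labels
    KIVE_INPUTS main_csv --extra_files*
    KIVE_OUTPUTS summary_csv --log_dir/

%runscript
    /usr/local/bin/collate.py "$@"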

  • Configure named arguments in Singularity files, like --option_name, in addition to positional arguments.
  • Launch runs with named arguments, at least through the API.
  • Configure named arguments with multiple values, like --multiple_option_name*.
  • Launch runs with multivalue arguments.
  • Write an example pipeline for collating files.
  • Configure named output arguments in Singularity files.
  • Launch runs with named output arguments.
  • Configure output directory arguments, like --directory_name/. Allow positional or named arguments.
  • Collect datasets from all files in an output directory. Should they use the file name as the dataset name?
@donkirkby
Copy link
Member

This is symmetrical with issue #488, which proposes a new batch interface for launching runs. The user would have to choose a pipeline, then choose which outputs to download, then choose which runs to include.

It might also be useful to collate the outputs from several runs into a single file, with the run name as a new column. Record the result as a new dataset so it can be traced.
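A minimal sketch of that kind of collation, outside of any Kive machinery (the script shape, the run-name/path pairing on the command line, and the 'run' column name are all illustrative assumptions, not an existing Kive feature):

import csv
import sys


def collate(run_files, out_path):
    # run_files maps run name -> path of that run's output CSV.
    # Assumes every run's CSV has the same columns.
    writer = None
    with open(out_path, 'w', newline='') as out:
        for run_name, path in run_files.items():
            with open(path, newline='') as f:
                reader = csv.DictReader(f)
                for row in reader:
                    if writer is None:
                        # Prepend a 'run' column so each row can be
                        # traced back to the run that produced it.
                        fields = ['run'] + reader.fieldnames
                        writer = csv.DictWriter(out, fieldnames=fields)
                        writer.writeheader()
                    row['run'] = run_name
                    writer.writerow(row)


if __name__ == '__main__':
    # Usage: collate_runs.py combined.csv run1=run1/results.csv run2=run2/results.csv
    out_path, *pairs = sys.argv[1:]
    collate(dict(pair.split('=', 1) for pair in pairs), out_path)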

@ArtPoon ArtPoon added this to the 0.8 - user batch processing milestone Mar 7, 2016
donkirkby (Member) commented

As we discussed in the Kive meeting this week, we have to track how a collated dataset was created. I think the cleanest way is to add a feature for methods to accept an arbitrary list of input files, then to let the user choose a pipeline for collating outputs from a bunch of runs.
However, when I thought through all the impacts of allowing an arbitrary list of inputs, this issue looked too big to fit in the current milestone. Do you all think we should push it to the next milestone?

Here's a summary of the changes I thought of so far:

  • Document how to accept an arbitrary list of input files in a code resource.
  • Add an extra_inputs flag to methods, and possibly a compound datatype for extra inputs.
  • Add an extra inputs node to pipelines.
  • Wire extra inputs to methods in a pipeline.
  • Allow a typed source to feed a raw input. This lets you write a generic collating pipeline.
  • Add an extra input on the run form.
  • Pass extra inputs during execution.

Here's a proof of concept for accepting an arbitrary list of input files in a code resource:

from argparse import ArgumentParser, FileType
from itertools import chain


def main():
    parser = ArgumentParser(description='Concatenate input files.')
    parser.add_argument('outfile', type=FileType('w'))
    # nargs='*' accepts an arbitrary number of input files, in order.
    parser.add_argument('infiles', type=FileType('r'), nargs='*')
    args = parser.parse_args()
    # Chain the open files together and copy every line to the output.
    for line in chain(*args.infiles):
        args.outfile.write(line)


if __name__ == '__main__':
    main()

The example just concatenates a list of files together into a single output file.
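For example, it might be invoked like this (the script and file names are illustrative):

python concat_files.py combined.csv part1.csv part2.csv

which writes the lines of part1.csv followed by part2.csv into combined.csv.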

rhliang (Contributor) commented Jul 22, 2016

I agree that arbitrary lists of inputs are a big change and should be punted. I wonder if it would be more sensible for us to introduce command-line options in our Methods, so that when we have multiple files to deal with we could use an invocation like

foo.py --multiinput input1 --multiinput input2 --multiinput input3 --multioutput output1 --multioutput output2 --multioutput output3

where multiinput is something the Method would define and Kive would have to know about. This way we could specify more than one arbitrary-length input.
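A minimal argparse sketch of that invocation style (the option names follow the example above; the script itself is an illustration, not Kive code):

from argparse import ArgumentParser, FileType


def parse_args():
    parser = ArgumentParser(description='Demonstrate repeated options.')
    # action='append' collects one value per occurrence of the option,
    # so --multiinput a --multiinput b yields ['a', 'b'] in that order.
    parser.add_argument('--multiinput', action='append',
                        type=FileType('r'), default=[])
    parser.add_argument('--multioutput', action='append',
                        type=FileType('w'), default=[])
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_args()
    print(len(args.multiinput), 'inputs,', len(args.multioutput), 'outputs')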

If we're only collating outputs, though, an easier alternative might be to create something akin to a RunCable with a ManyToManyField tracking which Runs it collates output from, or perhaps just a ForeignKey to a RunBatch. In that scheme, I'd imagine we could avoid using the fleet for these collations.

donkirkby (Member) commented

Using names for the extra inputs is interesting. I didn't like the order I had: standard inputs, outputs, extra inputs. Also, as you say, names let you specify more than one set of extra inputs.

I thought about doing something simpler that only supports collating, but I couldn't think of something simple enough that I don't mind throwing it away when we add support for lists of inputs. For the next release or two, I suggest we collate outside of Kive the same way MiCall does.

@donkirkby donkirkby modified the milestones: 0.9 - Streamline pipeline creation workflow, 0.8 - user batch processing Jul 25, 2016
@donkirkby donkirkby removed the ready label Jul 25, 2016
@donkirkby donkirkby added the ready label Oct 4, 2016
@donkirkby donkirkby modified the milestones: 0.9 - Streamline pipeline creation workflow, 0.10 - Slurm execution Jan 6, 2017
@donkirkby donkirkby removed the ready label Jan 6, 2017
@rhliang rhliang modified the milestones: 0.10 - Slurm execution, 0.11 - TBD Mar 28, 2017
@donkirkby donkirkby modified the milestones: 0.11 - Slurm UI, Near future Sep 25, 2017
@donkirkby donkirkby removed the ready label Sep 25, 2017
@donkirkby donkirkby modified the milestones: Near future, 0.12 simpler pipeline setup Dec 18, 2017
donkirkby (Member) commented

We've discussed a couple of options for methods with a flexible number of input files:

  1. Pass a directory as the command-line argument, and put multiple files in that directory.
  2. Use a named command-line argument instead of a positional argument, and accept multiple values.

Input directory

This option is probably simpler, but it's not clear how to specify an order for the input files. We could use a naming convention, but that feels like a hack.
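For concreteness, such a naming convention might just lean on lexicographic order (the zero-padded prefixes are an invented convention, which is exactly the hack in question):

from pathlib import Path


def ordered_inputs(input_dir):
    # Relies on names like 01_english_names.csv, 02_french_names.csv:
    # sorting the file names stands in for an explicit argument order.
    return sorted(Path(input_dir).glob('*.csv'))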

Named argument

This makes it more complicated to parse the command line, but Python's argparse module will do it for us. We can use nargs='*', just like we did in the docker wrapper. A command line might look like this:

my_script.py --names english_names.csv french_names.csv -- greetings.csv

The -- tells the script that the list of optional arguments has ended, and greetings.csv is the only positional argument: an output.
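A minimal sketch of how argparse handles that command line (the argument names follow the example above):

from argparse import ArgumentParser

parser = ArgumentParser(description='Collate name files into greetings.')
# nargs='*' greedily consumes values after --names; the bare -- is what
# stops it, leaving greetings.csv for the positional output argument.
parser.add_argument('--names', nargs='*', default=[])
parser.add_argument('outfile')

args = parser.parse_args(
    ['--names', 'english_names.csv', 'french_names.csv',
     '--', 'greetings.csv'])
print(args.names)    # ['english_names.csv', 'french_names.csv']
print(args.outfile)  # greetings.csv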

Recommendation

Especially when we're collating results, we will care about the order of the input files, so I recommend using named arguments.

The method definition will need a new section: optional inputs. They will need a name, and they can have an option to allow multiple files.

We'll have to figure out how to wire optional inputs in a pipeline. Are they only useful for pipeline inputs?

@donkirkby donkirkby self-assigned this May 30, 2018
donkirkby added a commit that referenced this issue Jul 25, 2018
donkirkby added a commit that referenced this issue Jul 25, 2018
Also check optional inputs in Pipeline.check_inputs().
donkirkby (Member) commented

I made some progress on this in the optional-inputs branch, but I think we should postpone the change until after we decide about #739. It would be much easier to support different arguments if the launching code were simpler.

@donkirkby donkirkby modified the milestones: Near future, 0.17 Mar 5, 2020
@donkirkby donkirkby changed the title from "Batch Download of Results Across Runs" to "Collate data sets" Mar 5, 2020
donkirkby (Member) commented

Now that we've finished #739, #752, and #721, it's time to come back to this issue. I've reduced the scope a bit from a batch download in the UI to supporting a pipeline that can collate all the result files from a batch of runs.

The main benefit of this is that we'll be able to do the collating inside Kive, so we can use the file lookup feature on the collated result files. At first, we might have to launch all the collating jobs through Kive's API.
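A loudly hypothetical sketch of launching such a collating run through the API; the endpoint path, payload shape, and argument names are all invented for illustration, and the real Kive API may differ:

import requests

session = requests.Session()
session.auth = ('kive_user', 'kive_password')  # hypothetical credentials

# Hypothetical endpoint and payload: a collating app whose multi-valued
# --scores argument is fed by the dataset ids from one batch of runs.
response = session.post(
    'https://kive.example.com/api/containerruns/',
    json={'app': 42,
          'datasets': [{'argument_name': '--scores', 'dataset': ds_id}
                       for ds_id in [101, 102, 103]]})
response.raise_for_status()
print('Launched run', response.json()['id'])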

Once this is done, we can think about how to add a batch download feature to the UI.

@nathanielknight nathanielknight self-assigned this Jun 2, 2020
@donkirkby donkirkby removed their assignment Jun 3, 2020
nathanielknight added a commit that referenced this issue Jun 13, 2020
This commit adds multi-valued optional input parameters to Kive
pipelines.

This partially addresses #511
nathanielknight added a commit that referenced this issue Jun 26, 2020
This commit adds two examples to the `api` directory.

One example shows how to start a pipeline using a mono-valued
keyword-style pipeline argument. The other shows how to use a
multi-valued keyword-style pipeline argument.

Both examples depend on scripts that have been added to the samplecode
singularity image in a separate branch, pending merge.

This commit pertains to #511.
@donkirkby donkirkby changed the title from "Collate data sets" to "Collate data sets in UI" Oct 5, 2022