Collate data sets in UI #511
This is symmetrical with issue #488 to create a new batch interface for launching runs. The user would have to choose a pipeline, then choose which outputs to download, then choose which runs to include. It might also be useful to collate the outputs from several runs into a single file with the run name as a new column. Record this as a new dataset so it can be traced.
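As a sketch of that collation idea, assuming the run outputs are CSV files, the run name could be added as a new column like this (the function and column names are illustrative, not Kive's):

```python
import csv
import io

def collate_runs(run_outputs):
    """Combine per-run CSV output into one CSV, tagging each row with
    the run name in a new 'run' column so the origin stays traceable.

    run_outputs: list of (run_name, csv_text) pairs.
    """
    out = io.StringIO()
    writer = None
    for run_name, csv_text in run_outputs:
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["run"] = run_name
            if writer is None:
                # First row decides the column order: originals, then 'run'.
                writer = csv.DictWriter(out, fieldnames=list(row))
                writer.writeheader()
            writer.writerow(row)
    return out.getvalue()

combined = collate_runs([("runA", "x,y\r\n1,2\r\n"),
                         ("runB", "x,y\r\n3,4\r\n")])
```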
As we discussed in the Kive meeting this week, we have to track how a collated dataset was created. I think the cleanest way is to add a feature for methods to accept an arbitrary list of input files, then to let the user choose a pipeline for collating outputs from a bunch of runs. Here's a summary of the changes I thought of so far:
Here's a proof of concept for accepting an arbitrary list of input files in a code resource:
The example just concatenates a list of files together into a single output file.
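A minimal sketch of such a driver, which simply concatenates an arbitrary list of input files, in order, into one output file (all names here are illustrative, not from the actual proof of concept):

```python
import os
import tempfile

def concatenate(input_paths, output_path):
    """Concatenate an arbitrary list of input files, in order,
    into a single output file."""
    with open(output_path, "w") as out:
        for path in input_paths:
            with open(path) as f:
                out.write(f.read())

# Tiny demonstration with throwaway files:
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "first\n"), ("b.txt", "second\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as f:
        f.write(text)
    paths.append(path)

combined_path = os.path.join(tmpdir, "combined.txt")
concatenate(paths, combined_path)
with open(combined_path) as f:
    combined = f.read()  # "first\nsecond\n"
```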
I agree that arbitrary lists of inputs are a big change and should be punted. I wonder if it would be more sensible for us to introduce the use of command-line options in our Methods, so that when we have multiple files to deal with we could use an invocation like
If we're only collating outputs, though, perhaps a reasonable alternative approach that might be easier would be to create something more akin to a RunCable that has a ManyToManyField to track which Runs it's collating output from, or perhaps just a ForeignKey to a RunBatch. In this scheme, I'd imagine that we could avoid using the fleet for doing these collations.
Using names for the extra inputs is interesting. I didn't like the order I had: standard inputs, outputs, extra inputs. Also, as you say, names let you specify more than one set of extra inputs. I thought about doing something simpler that only supports collating, but I couldn't think of something simple enough that I don't mind throwing it away when we add support for lists of inputs. For the next release or two, I suggest we collate outside of Kive the same way MiCall does.
We've discussed a couple of options for methods with a flexible number of input files:
Input directory: This option is probably simpler, but it's not clear how to specify an order for the input files. We could use a naming convention, but that feels like a hack.

Named argument: This makes it more complicated to parse the command line, but Python's

Recommendation: Especially when we're collating results, we will care about the order of the input files, so I recommend using named arguments. The method definition will need a new section: optional inputs. They will need a name, and they can have an option to allow multiple files. We'll have to figure out how to wire optional inputs in a pipeline. Are they only useful for pipeline inputs?
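As a sketch of the named-argument approach, Python's standard argparse can collect a multi-valued option while preserving its command-line order (the option names here are hypothetical, not from the actual method definition):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Collate run outputs.")
    parser.add_argument("output", help="collated output file")
    # nargs="+" collects one or more values, preserving their
    # command-line order, which matters when collating.
    parser.add_argument("--extra_inputs", nargs="+", default=[],
                        help="optional multi-valued input files")
    return parser.parse_args(argv)

args = parse_args(["out.csv", "--extra_inputs", "run1.csv", "run2.csv"])
```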
Also check optional inputs in Pipeline.check_inputs().
I made some progress on this in the
Now that we've finished #739, #752, and #721, it's time to come back to this issue. I've reduced the scope a bit from a batch download in the UI to supporting a pipeline that can collate all the result files from a batch of runs. The main benefit of this is that we'll be able to do the collating inside Kive, so we can use the file lookup feature on the collated result files. At first, we might have to launch all the collating jobs through Kive's API. Once this is done, we can think about how to add a batch download feature to the UI.
This commit adds multi-valued optional input parameters to Kive pipelines. This partially addresses #511.
This commit adds two examples to the `api` directory. One example shows how to start a pipeline using a mono-valued keyword-style pipeline argument. The other shows how to use a multi-valued keyword-style pipeline argument. Both examples depend on scripts that have been added to the samplecode singularity image in a separate branch, pending merge. This commit pertains to #511.
When working with a series of runs on a series of datasets that produce multiple outputs (for instance, the mixed HCV pipeline on several samples), it is advantageous to be able to download one kind of results file for all samples easily as a batch. Perhaps a list of results files next to each run with tick boxes and a download button?
Updated changes
This is a pretty old issue, but the current plan is to support pipeline arguments with multiple input files, and then write a pipeline that will collate multiple files. As described in the "Features to Add" section of #739, the full set of arguments supported in a Singularity configuration would look something like this:
- `--option_name`, in addition to positional arguments.
- `--multiple_option_name*`.
- `--directory_name/`. Allow positional or named arguments.

If we can't come up with an easy user interface at first, these extra option types might only be available when launching runs through the API.
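A hedged sketch of a driver that would accept all three extra option types, using Python's standard argparse (the option names are placeholders echoing the list above; the trailing `*` and `/` are notation for multi-valued and directory options, not literal syntax):

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical collation driver.")
parser.add_argument("main_input")              # positional argument
parser.add_argument("--option_name")           # single-valued named option
parser.add_argument("--multiple_option_name",  # multi-valued named option
                    nargs="*", default=[])
parser.add_argument("--directory_name")        # a directory of input files

args = parser.parse_args([
    "in.csv",
    "--option_name", "x",
    "--multiple_option_name", "a.csv", "b.csv",
    "--directory_name", "results_dir",
])
```

Files inside a directory option would still need an agreed ordering, such as lexical sorting, which is the naming-convention hack mentioned earlier in the thread.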