Discussion of Documentation Content & Workflows #61
Titus pointed to spacegraphcats, which provides a nice example of how the Snakemake call can be made from Python, allowing it to be wrapped with ArgumentParser and providing a more user-friendly command line interface than Snakemake. I like this idea for a couple of reasons:
A couple of thoughts ref the spacegraphcats approach, and also command-line interfaces:
could run the
would show all workflows, data sets, and parameter files. For this to work, we'd need a naming convention, and I think something like the following scheme might work:
Basically I think this is a situation where a nice "simple"-looking layer of CLI syntax and Python could result in a tool that is acceptable both for beginners and for experts used to command-line software. cc @luizirber
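A minimal sketch of the spacegraphcats-style pattern described above: calling Snakemake from Python behind an `argparse` front end. The program name, arguments, Snakefile path, and config keys here are all hypothetical; the only documented call is `snakemake.snakemake()`, Snakemake's Python entry point (its import is deferred so the parser can be inspected without Snakemake installed).

```python
# Sketch: wrap the Snakemake call in an argparse-based CLI.
# All names (prog, arguments, Snakefile, config keys) are assumptions.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="dahak")  # hypothetical program name
    parser.add_argument("workflow",
                        help="workflow to run, e.g. taxonomic-classification")
    parser.add_argument("paramset",
                        help="parameter set to use, e.g. lenient")
    parser.add_argument("-n", "--dry-run", action="store_true",
                        help="show what would be done without running it")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    import snakemake  # deferred: only needed when actually running a workflow
    # snakemake.snakemake() returns True on success.
    ok = snakemake.snakemake(
        "Snakefile",
        config={"workflow": args.workflow, "paramset": args.paramset},
        dryrun=args.dry_run,
    )
    return 0 if ok else 1


if __name__ == "__main__":
    raise SystemExit(main())
```

The point of the wrapper is that `dahak taxonomic-classification lenient` reads like a purpose-built tool, while everything under the hood is still plain Snakemake.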
I'm totally on board with these thoughts, I think they provide a clear path
to move forward (with a few questions we can clear up along the way).
> [2.] I like the idea of having orthogonal parameter sets to data set descriptions,
This is a good idea for keeping the logic of input parameters easily
understandable. Is this equivalent to saying "input parameters from output
parameters", or is it more subtle than that? (Or, could you expand on the
parameter set vs data set verbiage?)
> [3.] it has also been useful to be able to override select parameters and data sets at the command line
When implementing a command line interface, it could be a useful planning
tool to draw the user's actions as nodes on a graph, where each layer of
subcommand presents a set of choices (children nodes), each of which may
expose a further layer of choices, or just expose various --flag options.
It could help provide a roadmap for what choices are presented where. (Can
be implemented as a flat text file with yaml/bullet list/json.)
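The "user actions as nodes on a graph" planning idea above might look like the following sketch: the whole subcommand tree lives in one flat data structure (it could equally be a YAML or JSON file). Every command and flag name here is hypothetical.

```python
# Sketch: the planned CLI as a flat tree, plus a walker that lists every
# subcommand path. Names are placeholders, not a real dahak interface.
COMMAND_TREE = {
    "dahak": {
        "read-filtering": {"flags": ["--quality", "--no-reports"]},
        "taxonomic-classification": {"flags": ["--kmer-size"]},
        "datasets": {
            "build": {"flags": ["--output"]},
            "show": {"flags": []},
        },
    }
}


def walk(tree, prefix=""):
    """Yield every full subcommand path in the tree."""
    for name, child in tree.items():
        if name == "flags":  # leaf metadata, not a subcommand
            continue
        path = f"{prefix} {name}".strip()
        yield path
        if isinstance(child, dict):
            yield from walk(child, path)


for path in walk(COMMAND_TREE):
    print(path)
```

Keeping the tree in a data file rather than in code makes it a design document first and an implementation input second.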
> [4.] stricter default naming scheme would be better for dahak... default
> names be names of config files... prefix everything with a dahak. by
> default.
Agreed - although the user should not need to programmatically extract the
parameter value from the file name (outputs should include a json file[s]
with values for all settings.)
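A sketch of the point just made: instead of forcing users to parse parameter values out of file names, each run also writes a JSON sidecar with every resolved setting. The file name and keys below are assumptions, not an agreed format.

```python
# Sketch: dump all resolved settings to a JSON sidecar next to the outputs.
# The file name "dahak.run-settings.json" and the keys are hypothetical.
import json


def write_run_settings(settings, path="dahak.run-settings.json"):
    with open(path, "w") as f:
        json.dump(settings, f, indent=2, sort_keys=True)


write_run_settings({"paramset": "lenient", "kmer_size": 51, "quality": 30})
```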
> [5.] simple command to build the data set description file
Is this just a settings file with different parameter values related to the
data set, or something else?
> [6.] we'd also want to make it possible to discover which exact files are
> being used quite easily, so that someone skilled could look at the output
> and go 'ah-hah, if I want to know which parameters are being used in
> lenient, I can look in this file here.'
Agreed - this seems important enough that any command using a settings file
should print what settings file it is using and in which order.
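One way the "print which settings file, in which order" behavior could be sketched is as layered settings resolution, where each layer is announced as it is consulted. The layer names, file names, and keys are hypothetical; `collections.ChainMap` does the actual layering (earlier maps win).

```python
# Sketch: resolve settings from layered sources, printing each source in
# the order it is consulted. CLI overrides beat the settings file, which
# beats built-in defaults. All names and defaults are assumptions.
from collections import ChainMap

DEFAULTS = {"kmer_size": 51, "quality": 30}


def resolve_settings(file_settings, cli_overrides):
    for name, layer in [("CLI overrides", cli_overrides),
                        ("settings file", file_settings),
                        ("built-in defaults", DEFAULTS)]:
        print(f"consulting {name}: {sorted(layer)}")
    # ChainMap searches left to right, so the first layer wins.
    return dict(ChainMap(cli_overrides, file_settings, DEFAULTS))


settings = resolve_settings({"quality": 20}, {"kmer_size": 31})
```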
> [7.] workflow, data set description, and parameter file discovery is also
> important - we'd want users to be able to show workflows/data
> sets/parameter files under the current directory
I follow this at a high level, but I'd need to think about it more
concretely to know how to implement this.
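One concrete implementation sketch: if the `dahak.` prefix convention from item [4] were adopted, discovery under the current directory could be a simple glob. The infix categories (`workflow`, `dataset`, `params`) are hypothetical.

```python
# Sketch: discover dahak files under a directory by naming convention,
# e.g. dahak.params.lenient.json -> category "params". Names hypothetical.
from pathlib import Path


def discover(root="."):
    found = {"workflow": [], "dataset": [], "params": []}
    for path in sorted(Path(root).glob("dahak.*.json")):
        parts = path.name.split(".")
        if len(parts) >= 3 and parts[1] in found:
            found[parts[1]].append(path.name)
    return found
```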
Charles
One other follow-up question: @ctb gave an example of trimmed reads, which (I think - correct me if I'm wrong) is a workflow step and not a workflow itself (Workflows.md). Do we want to provide subcommands for workflow steps, in addition to subcommands for entire workflows? Or just focus on workflows and only add standalone subcommands for a few of the steps? I will put together a prototype command line interface for a single workflow, the taxonomic classification workflow, and we can use that as a basis for further discussions.
That’s a great question that will likely vary depending on the workflow. For the read filtering protocol we can generate the FastQC reports by default and then maybe add a `--no-reports` flag. For the trimming step in the read filtering workflow we will need a flag for the quality-score cutoff. Similarly, for taxonomic classification the user will need to be able to specify a k-mer size, and the default will likely be 51. We shouldn’t get carried away with flags, but we also need to ensure that the user can use all of the features of the software. One option for 1.0 might be including a small set of subcommands and then providing guidance for running things in ‘interactive’ mode if more tweaking is necessary. Just a thought.
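The per-workflow flags mentioned above might be sketched with `argparse` subcommands. The flag spellings (`--no-reports`, `--quality`, `--kmer-size`) and the defaults are assumptions drawn from this comment, not a settled interface.

```python
# Sketch: per-workflow subcommands with a small set of flags.
# Flag names and defaults (quality 30, k-mer size 51) are assumptions.
import argparse


def build_workflow_parser():
    parser = argparse.ArgumentParser(prog="dahak")  # hypothetical
    sub = parser.add_subparsers(dest="workflow", required=True)

    rf = sub.add_parser("read-filtering")
    rf.add_argument("--quality", type=int, default=30,
                    help="trimming quality-score cutoff")
    rf.add_argument("--no-reports", action="store_true",
                    help="skip the FastQC reports generated by default")

    tc = sub.add_parser("taxonomic-classification")
    tc.add_argument("--kmer-size", type=int, default=51,
                    help="k-mer size (default 51)")
    return parser
```

This keeps the flag surface small per subcommand, in line with the "don't get carried away with flags" point.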
On Wed, Mar 14, 2018 at 10:56:42PM +0000, Chaz Reid wrote:
> I will put together a prototype command line interface for a single workflow, the taxonomic classification workflow, and we can use that as a basis for further discussions.
great, thanks!
On Mar 14, 2018, at 1:06 PM, Chaz Reid ***@***.***> wrote:
> I'm totally on board with these thoughts, I think they provide a clear path
> to move forward (with a few questions we can clear up along the way).
> > [2.] I like the idea of having orthogonal parameter sets to data set
> > descriptions,
> This is a good idea for keeping the logic of input parameters easily
> understandable. Is this equivalent to saying "input parameters from output
> parameters", or is it more subtle than that? (Or, could you expand on the
> parameter set vs data set verbiage?)
data set: here are the files I’m analyzing.
parameter: here are the parameters with which I’m analyzing the data set.
So I could have data sets mydata1, mydata2, etc. (JSON files describing a set of FASTA/FASTQ), that I then analyze with lenient, stringent, etc. parameter sets.
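A sketch of what a data set description like `mydata1` might contain, per the definition above: just a JSON record of the sequence files being analyzed. The field names and file paths are hypothetical.

```python
# Sketch: a data set description as a JSON-serializable record listing
# the FASTA/FASTQ files to analyze. Field names and paths are made up.
import json

mydata1 = {
    "name": "mydata1",
    "files": [
        "reads/sample1_R1.fastq.gz",
        "reads/sample1_R2.fastq.gz",
        "reference/contigs.fasta",
    ],
}
print(json.dumps(mydata1, indent=2))
```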
> > [3.] it has also been useful to be able to override select parameters and
> > data sets at the command line
> When implementing a command line interface, it could be a useful planning
> tool to draw the user's actions as nodes on a graph, where each layer of
> subcommand presents a set of choices (children nodes), each of which may
> expose a further layer of choices, or just expose various --flag options.
> It could help provide a roadmap for what choices are presented where. (Can
> be implemented as a flat text file with yaml/bullet list/json.)
Maybe? No objection :). An alternative approach would be to write a tutorial and game out various tasks to be performed.
> > [4.] stricter default naming scheme would be better for dahak... default
> > names be names of config files... prefix everything with a dahak. by
> > default.
> Agreed - although the user should not need to programmatically extract the
> parameter value from the file name (outputs should include a json file[s]
> with values for all settings.)
nice idea!
> > [5.] simple command to build the data set description file
> Is this just a settings file with different parameter values related to the
> data set, or something else?
List of sequence files.
> > [6.] we'd also want to make it possible to discover which exact files are
> > being used quite easily, so that someone skilled could look at the output
> > and go 'ah-hah, if I want to know which parameters are being used in
> > lenient, I can look in this file here.'
> Agreed - this seems important enough that any command using a settings file
> should print what settings file it is using and in which order.
yep!
> > [7.] workflow, data set description, and parameter file discovery is also
> > important - we'd want users to be able to show workflows/data
> > sets/parameter files under the current directory
> I follow this at a high level, but I'd need to think about it more
> concretely to know how to implement this.
ok!
Since this issue got a bit sidetracked with the spacegraphcats stuff, I wanted to add:
Hi all, here's a simple expansion of the spacegraphcats idea with parameterized workflows: https://github.com/ctb/2018-snakemake-cli. With only a little elaboration I think the idea could be powerful enough to provide a good, configurable CLI for dahak, where you could run either entire workflows or specific stages of workflows.
After looking at that example I now understand exactly what you were getting at. Thank you.
Great! May not work in the end but worth a shot since it’s simple...
--
Titus Brown, [email protected]
This issue is to kick off a discussion of strategy around documentation content, with an eye toward issue #43, implementing documentation of workflows in sphinx.
There are a number of different versions of workflows that we might cover, and/or formats we might use when discussing workflows in the documentation:
The readme with shell commands represents maximum human-readability with minimal flexibility - the user can read the document and follow each command without even needing to run them. This is good for users if they are running an unfamiliar workflow, or if they want to see an example. This results in more transparent pipelines that are easier to modify.
Snakemake/Python scripts represent the maximum-flexibility solution, but are much less readable (although bash flags like `set -x` still allow you to see the commands being run). These are good if the user has a known, tried-and-tested pipeline and they know what output they want from what input. This results in a set of defined pipelines that are not intended to be modified. How much do we balance the two approaches?
I also think it is worth mentioning any projects we might look to for inspiration. I'll start: