
Discussion of Documentation Content & Workflows #61

Open
charlesreid1 opened this issue Mar 12, 2018 · 11 comments

Comments


charlesreid1 commented Mar 12, 2018

This issue is to kick off a discussion of strategy around documentation content, with an eye toward issue #43, implementing documentation of workflows in sphinx.

There are a number of different versions of workflows that we might cover, and/or formats we might use when discussing workflows in the documentation:

  • Readme with shell commands for the user to copy-and-paste
  • Bash scripts for the user to run
  • Python scripts for the user to run
  • Snakemake file with various make targets
  • Snakemake modules
  • Wrapping around Snakemake - cloud/hpc deployment scripts

The readme with shell commands represents maximum human-readability with minimal flexibility: the user can read the document and follow each command without even needing to run it. This is good for users running an unfamiliar workflow, or who want to see an example, and it results in more transparent pipelines that are easier to modify.

Snakemake/Python scripts represent the maximum-flexibility solution, but are much less readable (although bash flags like set -x still let you see the commands being run). These are good if the user has a known, tried-and-tested pipeline and knows what output they want from what input. This results in a set of defined pipelines that are not intended to be modified.

How do we balance the two approaches?

I also think it is worth mentioning any projects we might look to for inspiration. I'll start:


charlesreid1 commented Mar 13, 2018

Titus pointed to spacegraphcats, which provides a nice example of how the Snakemake call can be made from Python, allowing it to be wrapped with ArgumentParser and providing a more user-friendly command line interface than Snakemake.

I like this idea for a couple of reasons:

  • You can separate the Snakemake logic from the command line logic - and you can validate user input before running the Snakemake workflow.
  • The user can be more easily guided along different workflows
  • Most importantly, this should resolve the documentation question, because it provides a layer of separation between the user and the shell commands being run. The documentation can focus on the command line interface tool, because the user won't need to tweak the Snakemake file or the shell commands being run, because the command line tool can validate user input!
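The spacegraphcats-style separation described above can be sketched in a few lines. This is a hypothetical illustration, not dahak's actual code: the workflow names, Snakefile path, and flag names are assumptions. The key point is that argparse validates user input before anything touches Snakemake, and the Snakemake invocation is built as data that can be inspected or logged.

```python
import argparse
import shlex

# Hypothetical workflow registry; dahak's real workflow names may differ.
KNOWN_WORKFLOWS = {"read-filtering", "taxonomic-classification"}


def build_snakemake_command(workflow, config_file, dry_run=False):
    """Validate inputs, then return the snakemake argv to execute.

    Keeping this a pure function separates the CLI logic from the
    Snakemake logic, so input can be validated (and tested) up front.
    """
    if workflow not in KNOWN_WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")
    cmd = ["snakemake", "--snakefile", "Snakefile",
           "--configfile", config_file, workflow]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def main(argv=None):
    parser = argparse.ArgumentParser(prog="dahak")
    parser.add_argument("workflow", choices=sorted(KNOWN_WORKFLOWS))
    parser.add_argument("config_file")
    parser.add_argument("-n", "--dry-run", action="store_true")
    args = parser.parse_args(argv)
    cmd = build_snakemake_command(args.workflow, args.config_file,
                                  args.dry_run)
    # A real tool would subprocess.run(cmd); printing keeps things visible.
    print(shlex.join(cmd))


if __name__ == "__main__":
    main()
```

Because the command is assembled before execution, the wrapper can also print it, which helps with the transparency concern raised earlier in this thread.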

@ctb ctb changed the title Discussion of Documentation Content Discussion of Documentation Content & Workflows Mar 14, 2018

ctb commented Mar 14, 2018

A couple of thoughts on the spacegraphcats approach, and on command-line interfaces:

  1. the spacegraphcats approach meshes well with the subcommand style of sourmash, e.g.

dahak trim-reads lenient mydata

could run the trim-reads workflow with lenient parameters, on the data set described in mydata.

  2. the parameter set idea is an extension based on my experience with spacegraphcats; I like the idea of having parameter sets orthogonal to the data set descriptions, unlike the way we're doing it in spacegraphcats (where they are all in one file).

  3. it has also been useful to be able to override select parameters and data sets at the command line, e.g. dahak trim-reads lenient mydata --qual-cutoff 30. BUT, in order to track provenance and reduce confusion, this comes with additional documentation and reporting needs; the output files need to be tied to the modified parameters somehow.

  4. on the provenance front, we use a fairly simplistic naming method in spacegraphcats that fits well with snakemake: the output directories are named dataset_k31_r1, which describes the essential parameters. However, I think a stricter default naming scheme would be better for dahak: something like mydata.trim-reads.lenient (with this being overridable with a -o parameter). And I'd suggest that the default names be the names of the config files, e.g. trim-reads is the workflow description name, lenient is the name of the parameter file, and mydata is the name of the data description. We could prefix everything with a dahak. or output. by default, too.

  5. we'd need a simple command to build the data set description file.

  6. we'd also want to make it easy to discover which exact files are being used, so that someone skilled could look at the output and go 'ah-hah, if I want to know which parameters are being used in lenient, I can look in this file here.'

  7. workflow, data set description, and parameter file discovery is also important - we'd want users to be able to list workflows/data sets/parameter files under the current directory, e.g.

dahak ls 

would show all workflows, data sets, and parameter files; and dahak ls --workflow etc. etc.

For this to work, we'd need a naming convention, and I think something like the following scheme might work:

  • workflows are Snakefiles; a set of default workflows come with dahak, and new ones can be added easily for someone familiar with snakemake; this would be a place for help docs, but is lowest priority at this point;

  • data set descriptions are something like {name}.data.json;

  • parameter files are something like param.{name}.json or {name}.param.json and come with dahak, but new ones can be created easily by people who are not Python or snakemake experts.

Basically I think this is a situation where a nice "simple" (looking) layer of CLI syntax and Python could result in a tool that is acceptable for both beginner and expert people used to command line software.
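The naming and discovery scheme proposed above could be prototyped in a few lines. This is a sketch built on the assumed conventions from this comment ({name}.data.json, {name}.param.json, and mydata.trim-reads.lenient output names with a -o override); none of it is dahak's actual implementation.

```python
import pathlib

# Assumed conventions from the discussion: data set descriptions are
# {name}.data.json and parameter files are {name}.param.json.
DATA_SUFFIX = ".data.json"
PARAM_SUFFIX = ".param.json"


def default_output_name(data, workflow, params, override=None):
    """Build the default provenance-carrying output name.

    e.g. mydata.trim-reads.lenient, unless a -o override is given.
    """
    return override if override else f"{data}.{workflow}.{params}"


def discover(directory="."):
    """List data set and parameter file names under a directory (dahak ls)."""
    d = pathlib.Path(directory)
    return {
        "data": sorted(p.name[:-len(DATA_SUFFIX)]
                       for p in d.glob("*" + DATA_SUFFIX)),
        "params": sorted(p.name[:-len(PARAM_SUFFIX)]
                         for p in d.glob("*" + PARAM_SUFFIX)),
    }
```

Deriving the output name from the config file names means anyone looking at a results directory can work backwards to the exact workflow, parameter file, and data description that produced it.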

cc @luizirber


charlesreid1 commented Mar 14, 2018 via email


charlesreid1 commented Mar 14, 2018

One other follow-up question: @ctb gave an example of trimming reads, which (I think, correct me if I'm wrong) is a workflow step and not a workflow itself (see Workflows.md). Do we want to provide subcommands for workflow steps, in addition to subcommands for entire workflows? Or should we focus on workflows and only add standalone subcommands for a few of the steps?

I will put together a prototype command line interface for a single workflow, the taxonomic classification workflow, and we can use that as a basis for further discussions.

brooksph (Contributor) commented:

That's a great question, and the answer will likely vary depending on the workflow. For the read filtering protocol we can generate the fastqc reports by default and then maybe add a "--no-reports" flag. For the trimming step in the read filtering workflow we will need a flag for the quality score. Similarly, for taxonomic classification the user will need to be able to specify a k-mer size, and the default will likely be 51. We shouldn't get carried away with flags, but we also need to ensure that the user can use all of the features of the software. One option for 1.0 might be a small set of subcommands, plus guidance for running things in 'interactive' mode if more tweaking is necessary. Just a thought.
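The per-workflow defaults suggested above map naturally onto argparse subparsers. A hypothetical sketch, where the subcommand names, flag names, and default values are illustrative assumptions from this thread rather than dahak's real CLI:

```python
import argparse

# Sketch: one subparser per workflow, each carrying only the flags that
# workflow needs, with the defaults discussed above (quality cutoff for
# trimming, k-mer size 51 for taxonomic classification).
def make_parser():
    parser = argparse.ArgumentParser(prog="dahak")
    sub = parser.add_subparsers(dest="workflow", required=True)

    rf = sub.add_parser("read-filtering")
    rf.add_argument("--no-reports", action="store_true",
                    help="skip generating the fastqc reports")
    rf.add_argument("--qual-cutoff", type=int, default=30,
                    help="trimming quality score cutoff (assumed default)")

    tc = sub.add_parser("taxonomic-classification")
    tc.add_argument("--ksize", type=int, default=51,
                    help="k-mer size")
    return parser
```

With this shape, dahak read-filtering --qual-cutoff 20 overrides one parameter while everything else keeps its documented default, which keeps the flag surface small.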


ctb commented Mar 14, 2018 via email


ctb commented Mar 14, 2018 via email


ctb commented Mar 15, 2018

Since this issue got a bit sidetracked with the spacegraphcats stuff, I wanted to add:

  • documentation doesn't need to be comprehensive;
  • a quickstart tutorial for expert command-line users is probably the top priority, since no one other than the developers can use this until we have that;
  • as mentioned above a few times, the commands should be self-documenting to a great extent - e.g. when running a task or workflow, it should be easy to figure out where the config files being used are coming from, as well as what commands are being run. This could be enabled by a command-line switch that turns on this information, and how to enable it should be printed with every command, IMO.
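The self-documenting behaviour described in the last bullet could be as simple as a provenance report printed when a verbose switch is on. A sketch, where the report format and file names are assumptions for illustration:

```python
# Hypothetical provenance report: every run can state which config files
# it resolved and which shell commands it would execute, so a user can
# trace any output back to its inputs.
def report_provenance(workflow, param_file, data_file, commands):
    """Return a human-readable summary of a run's configuration."""
    lines = [
        f"workflow:   {workflow}",
        f"parameters: {param_file}",
        f"data:       {data_file}",
        "commands:",
    ]
    lines += [f"  $ {cmd}" for cmd in commands]
    return "\n".join(lines)
```

Emitting this with every run (or behind a --verbose-style switch whose existence is advertised in the output) answers the "which file is lenient, exactly?" question without the user reading any Snakemake.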


ctb commented Mar 17, 2018

Hi all, here's a simple expansion of the spacegraphcats idea with parameterized workflows: https://github.com/ctb/2018-snakemake-cli. With only a little elaboration of the idea, I think it could be powerful enough to provide a good, configurable CLI for dahak, where you could run either entire workflows or specific stages of workflows.

charlesreid1 (Member Author) commented:

After looking at that example I now understand exactly what you were getting at. Thank you.


ctb commented Mar 20, 2018 via email
