
Discussion of Documentation Content & Workflows #61

Open
charlesreid1 opened this issue Mar 12, 2018 · 11 comments

Comments


charlesreid1 commented Mar 12, 2018

This issue is to kick off a discussion of strategy around documentation content, with an eye toward issue #43, implementing documentation of workflows in sphinx.

There are a number of different versions of workflows that we might cover, and/or formats we might use when discussing workflows in the documentation:

  • Readme with shell commands for the user to copy-and-paste
  • Bash scripts for the user to run
  • Python scripts for the user to run
  • Snakemake file with various make targets
  • Snakemake modules
  • Wrapping around Snakemake - cloud/hpc deployment scripts

The readme with shell commands represents maximum human-readability with minimal flexibility: the user can read the document and follow each command without even needing to run it. This is good for users running an unfamiliar workflow, or who want to see an example, and it results in more transparent pipelines that are easier to modify.

Snakemake/Python scripts represent the maximum-flexibility solution, but are much less readable (although bash flags like set -x still let you see the commands being run). These are good if the user has a known, tried-and-tested pipeline and knows what output they want from what input. This results in a set of defined pipelines that are not intended to be modified.

How do we balance the two approaches?

I also think it is worth mentioning any projects we might look to for inspiration. I'll start:


charlesreid1 commented Mar 13, 2018

Titus pointed to spacegraphcats, which provides a nice example of how the Snakemake call can be made from Python, allowing it to be wrapped with ArgumentParser and providing a more user-friendly command line interface than Snakemake.

I like this idea for a couple of reasons:

  • You can separate the Snakemake logic from the command line logic - and you can validate user input before running the Snakemake workflow.
  • The user can be more easily guided along different workflows
  • Most importantly, this should resolve the documentation question, because it provides a layer of separation between the user and the shell commands being run. The documentation can focus on the command line interface tool, because the user won't need to tweak the Snakemake file or the shell commands being run, because the command line tool can validate user input!
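The spacegraphcats-style separation described above can be sketched in a few lines. This is a hypothetical illustration, not dahak's actual code: the workflow names, Snakefile path, and flag names are assumptions. The key point is that argparse validates user input before anything touches Snakemake, and the Snakemake invocation is built as data that can be inspected or logged.

```python
import argparse
import shlex

# Hypothetical workflow registry; dahak's real workflow names may differ.
KNOWN_WORKFLOWS = {"read-filtering", "taxonomic-classification"}


def build_snakemake_command(workflow, config_file, dry_run=False):
    """Validate inputs, then return the snakemake argv to execute.

    Keeping this a pure function separates the CLI logic from the
    Snakemake logic, so input can be validated (and tested) up front.
    """
    if workflow not in KNOWN_WORKFLOWS:
        raise ValueError(f"unknown workflow: {workflow!r}")
    cmd = ["snakemake", "--snakefile", "Snakefile",
           "--configfile", config_file, workflow]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def main(argv=None):
    parser = argparse.ArgumentParser(prog="dahak")
    parser.add_argument("workflow", choices=sorted(KNOWN_WORKFLOWS))
    parser.add_argument("config_file")
    parser.add_argument("-n", "--dry-run", action="store_true")
    args = parser.parse_args(argv)
    cmd = build_snakemake_command(args.workflow, args.config_file,
                                  args.dry_run)
    # A real tool would subprocess.run(cmd); printing keeps things visible.
    print(shlex.join(cmd))


if __name__ == "__main__":
    main()
```

Because the command is assembled before execution, the wrapper can also print it, which helps with the transparency concern raised earlier in this thread.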

@ctb ctb changed the title Discussion of Documentation Content Discussion of Documentation Content & Workflows Mar 14, 2018

ctb commented Mar 14, 2018

A couple of thoughts on the spacegraphcats approach, and on command-line interfaces:

  1. the spacegraphcats approach meshes well with the subcommand style of sourmash, e.g.

dahak trim-reads lenient mydata

could run the trim-reads workflow with lenient parameters, on the data set described in mydata.

  2. the parameter set idea is an extension based on my experience with spacegraphcats; I like the idea of having parameter sets orthogonal to the data set descriptions, unlike the way we're doing it in spacegraphcats (where they are all in one file).

  3. it has also been useful to be able to override select parameters and data sets at the command line, e.g. dahak trim-reads lenient mydata --qual-cutoff 30. BUT, in order to track provenance and reduce confusion, this comes with additional documentation and reporting needs; the output files need to be tied to the modified parameters somehow.

  4. on the provenance front, we use a fairly simplistic naming method in spacegraphcats that fits well with snakemake: the output directories are named dataset_k31_r1, which describes the essential parameters. However, I think a stricter default naming scheme would be better for dahak: something like mydata.trim-reads.lenient (with this being overridable with a -o parameter). And I'd suggest that the default names be the names of the config files, e.g. trim-reads is the workflow description name, lenient is the name of the parameter file, and mydata is the name of the data description. We could prefix everything with a dahak. or output. by default, too.

  5. we'd need a simple command to build the data set description file.

  6. we'd also want to make it easy to discover which exact files are being used, so that someone skilled could look at the output and go 'ah-hah, if I want to know which parameters are being used in lenient, I can look in this file here.'

  7. workflow, data set description, and parameter file discovery is also important - we'd want users to be able to list workflows/data sets/parameter files under the current directory, e.g.

dahak ls 

would show all workflows, data sets, and parameter files; and dahak ls --workflow etc. etc.

For this to work, we'd need a naming convention, and I think something like the following scheme might work:

  • workflows are Snakefiles; a set of default workflows come with dahak, and new ones can be added easily for someone familiar with snakemake; this would be a place for help docs, but is lowest priority at this point;

  • data set descriptions are something like {name}.data.json;

  • parameter files are something like param.{name}.json or {name}.param.json and come with dahak, but new ones can be created easily by people who are not Python or snakemake experts.

Basically I think this is a situation where a nice "simple" (looking) layer of CLI syntax and Python could result in a tool that is acceptable for both beginner and expert people used to command line software.
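The naming and discovery scheme proposed above could be prototyped in a few lines. This is a sketch built on the assumed conventions from this comment ({name}.data.json, {name}.param.json, and mydata.trim-reads.lenient output names with a -o override); none of it is dahak's actual implementation.

```python
import pathlib

# Assumed conventions from the discussion: data set descriptions are
# {name}.data.json and parameter files are {name}.param.json.
DATA_SUFFIX = ".data.json"
PARAM_SUFFIX = ".param.json"


def default_output_name(data, workflow, params, override=None):
    """Build the default provenance-carrying output name.

    e.g. mydata.trim-reads.lenient, unless a -o override is given.
    """
    return override if override else f"{data}.{workflow}.{params}"


def discover(directory="."):
    """List data set and parameter file names under a directory (dahak ls)."""
    d = pathlib.Path(directory)
    return {
        "data": sorted(p.name[:-len(DATA_SUFFIX)]
                       for p in d.glob("*" + DATA_SUFFIX)),
        "params": sorted(p.name[:-len(PARAM_SUFFIX)]
                         for p in d.glob("*" + PARAM_SUFFIX)),
    }
```

Deriving the output name from the config file names means anyone looking at a results directory can work backwards to the exact workflow, parameter file, and data description that produced it.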

cc @luizirber


charlesreid1 commented Mar 14, 2018 via email


charlesreid1 commented Mar 14, 2018

One other follow-up question: @ctb gave an example of trimming reads, which (I think, correct me if I'm wrong) is a workflow step and not a workflow itself (see Workflows.md). Do we want to provide subcommands for workflow steps, in addition to subcommands for entire workflows? Or should we focus on workflows and only add standalone subcommands for a few of the steps?

I will put together a prototype command line interface for a single workflow, the taxonomic classification workflow, and we can use that as a basis for further discussions.

brooksph (Contributor) commented:

That's a great question, and the answer will likely vary depending on the workflow. For the read filtering protocol we can generate the fastqc reports by default and then maybe add a "--no-reports" flag. For the trimming step in the read filtering workflow we will need a flag for the quality score. Similarly, for taxonomic classification the user will need to be able to specify a k-mer size, and the default will likely be 51. We shouldn't get carried away with flags, but we also need to ensure that the user can use all of the features of the software. One option for 1.0 might be a small set of subcommands, plus guidance for running things in 'interactive' mode if more tweaking is necessary. Just a thought.
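The per-workflow defaults suggested above map naturally onto argparse subparsers. A hypothetical sketch, where the subcommand names, flag names, and default values are illustrative assumptions from this thread rather than dahak's real CLI:

```python
import argparse

# Sketch: one subparser per workflow, each carrying only the flags that
# workflow needs, with the defaults discussed above (quality cutoff for
# trimming, k-mer size 51 for taxonomic classification).
def make_parser():
    parser = argparse.ArgumentParser(prog="dahak")
    sub = parser.add_subparsers(dest="workflow", required=True)

    rf = sub.add_parser("read-filtering")
    rf.add_argument("--no-reports", action="store_true",
                    help="skip generating the fastqc reports")
    rf.add_argument("--qual-cutoff", type=int, default=30,
                    help="trimming quality score cutoff (assumed default)")

    tc = sub.add_parser("taxonomic-classification")
    tc.add_argument("--ksize", type=int, default=51,
                    help="k-mer size")
    return parser
```

With this shape, dahak read-filtering --qual-cutoff 20 overrides one parameter while everything else keeps its documented default, which keeps the flag surface small.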


ctb commented Mar 14, 2018 via email


ctb commented Mar 14, 2018 via email


ctb commented Mar 15, 2018

Since this issue got a bit sidetracked with the spacegraphcats stuff, I wanted to add:

  • documentation doesn't need to be comprehensive;
  • a quickstart tutorial for expert command-line users is probably the top priority, since no one other than the developers can use this until we have that;
  • as mentioned above a few times, the commands should be self-documenting to a great extent - e.g. when running a task or workflow, it should be easy to figure out where the config files being used are coming from, as well as what commands are being run. This could be enabled by a command-line switch that turns on this information, and how to enable it should be printed with every command, IMO.
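The self-documenting behaviour described in the last bullet could be as simple as a provenance report printed when a verbose switch is on. A sketch, where the report format and file names are assumptions for illustration:

```python
# Hypothetical provenance report: every run can state which config files
# it resolved and which shell commands it would execute, so a user can
# trace any output back to its inputs.
def report_provenance(workflow, param_file, data_file, commands):
    """Return a human-readable summary of a run's configuration."""
    lines = [
        f"workflow:   {workflow}",
        f"parameters: {param_file}",
        f"data:       {data_file}",
        "commands:",
    ]
    lines += [f"  $ {cmd}" for cmd in commands]
    return "\n".join(lines)
```

Emitting this with every run (or behind a --verbose-style switch whose existence is advertised in the output) answers the "which file is lenient, exactly?" question without the user reading any Snakemake.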


ctb commented Mar 17, 2018

Hi all, here's a simple expansion of the spacegraphcats idea with parameterized workflows: https://github.com/ctb/2018-snakemake-cli. With only a little elaboration of the idea, I think it could be powerful enough to provide a good, configurable CLI for dahak, where you could run either entire workflows or specific stages of workflows.

charlesreid1 (Member Author) commented:

After looking at that example I now understand exactly what you were getting at. Thank you.


ctb commented Mar 20, 2018 via email
