Young edited this page Jan 24, 2023 · 6 revisions

Frequently Asked Questions (aka FAQ)

What do I do if I encounter an error?

TELL ME ABOUT IT!!!

Be sure to include the command used, what config file was used, and what the nextflow error was.
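One way to gather those three pieces of information is sketched below. The function name make_error_report and the output file name grandeur_error_report.txt are made up for this example. It relies on Nextflow writing its log to .nextflow.log in the directory where the run was launched.

```shell
# Hypothetical helper: bundle the details requested above into one text file.
# Usage: make_error_report "<command that was run>" <config file> [log file]
make_error_report() {
  local cmd="$1"
  local config="$2"
  local log="${3:-.nextflow.log}"   # Nextflow writes this log in the launch directory
  local out="grandeur_error_report.txt"

  {
    echo "## Command"
    echo "$cmd"
    echo
    echo "## Config file: $config"
    cat "$config"
    echo
    echo "## Last 50 lines of $log"
    tail -n 50 "$log"
  } > "$out"

  # Print the report's file name so it can be attached to an issue
  echo "$out"
}
```

Run it from the launch directory, passing the exact command in quotes and the path of the config file that was used.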

Where is an example config file?

There is a template file with all the variables in this repo at configs/grandeur_template.config that the End User can copy and edit. All of the parameters are included in that file.

There's also a config file that we use here at UPHL, UPHL.config.

To get a copy of the template config file (this will not run the workflow):

nextflow run UPHL-BioNGS/Grandeur --config_file true

To use a config file edited by the End User, specify its path with -c:

nextflow run UPHL-BioNGS/Grandeur -profile singularity -c <path to user edited config file>

Do you have test data?

There are three test profiles for "Grandeur". Each downloads reads from the SRA using the sra-toolkit.

  • test downloads six samples from the SRA to run through the workflow with default settings
  • test1 uses those same samples, but does not download genomes from NCBI
  • test2 downloads some CRPA and creates a multiple sequence alignment

nextflow run UPHL-BioNGS/Grandeur -profile test,singularity

There are also six genomes from NCBI Genome in this repository under data/fasta. To run the workflow on them:

nextflow run UPHL-BioNGS/Grandeur -profile singularity --fastas data/fasta

The directory data/msa contains one gff file and 6 fasta files of Stenotrophomonas maltophilia that can be used to test multiple sequence alignment. A resulting treefile (iqtree.treefile), snp_matrix (snp_matrix.txt), and roary summary file (summary_statistics.txt) are included for comparison.

To test creating a phylogenetic tree from a core gene comparison:

nextflow run UPHL-BioNGS/Grandeur -profile singularity,just_msa --fastas data/msa --gff data/msa --iqtree2_outgroup GCF_900475405.1_44087_C01_genomic

How do I turn processes off?

Prior versions allowed more flexibility about which analyses were run. This was difficult to maintain. There are some processes that can be turned off:

  • params.msa = false (the default) skips multiple sequence alignment
  • params.current_datasets = false will skip downloading genomes from NCBI and will instead use the genomes in the workflow
  • params.information = false will skip the information subworkflow

What about CLIA validation?

At UPHL, we use this workflow to determine the serotype of Salmonella and E. coli under CLIA. Therefore, all containers with their versions are explicitly selected if available, and any updates to this repo will come with a version change. In future endeavors, we hope to use this workflow for organism identification and AMR gene identification.

The CLIA officer of the End User may request additional locks be put in place, like having all of the containers specified. If additional help is needed, please submit an issue or email me.

How were serotyping tools chosen for this workflow?

They perform well, their containers were easy to create, and @erinyoung had heard about them.

Are any other tools getting added to "Grandeur"?

As "Grandeur" is intended to be a species agnostic workflow for a local public health laboratory, and sequencing is continuing to expand in its utility, new tools are constantly being needed to analyze isolates to further public health goals.

Many of these additional tools are added by need locally or from the End User, so if the End User knows of other serotyping/analysis tools, please submit an issue or tell @erinyoung about it, and we'll work in some options.

@erinyoung also appreciates pull requests from forks.

Warning : If there's not a reliable container of the suggested tool, @erinyoung will request that the End User create a container for that tool and contribute to StaPH-B's docker repositories.

What about organisms with large genomes?

Organisms with large genomes can still contribute to disease, but this is not the workflow for those. "Grandeur" uses spades for de novo assembly, and large genomes may be too much for spades.

What about SARS-CoV-2?

As of the time of writing this README, reference-based alignment of SARS-CoV-2 is still the norm. "Grandeur" is for de novo assembly of things with small genomes. Cecret would be a better workflow for SARS-CoV-2 sequencing.

What is genome_sizes.json used for?

genome_sizes.json has a list of commonly sequenced organisms and the approximate expected genome size for each organism. This is only used for the "cg-pipeline" process to estimate coverage. A file from the End User can be used instead and specified with params.genome_sizes.
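To illustrate why an expected genome size is needed, here is a rough sketch of the usual coverage arithmetic: total sequenced bases divided by genome size. This is an assumption about the general calculation, not a description of how cg-pipeline computes it internally, and estimate_coverage is a hypothetical name.

```shell
# Hypothetical helper: coverage = total sequenced bases / expected genome size.
# Usage: estimate_coverage <total bases> <genome size in bases>
estimate_coverage() {
  # awk handles the floating-point division; the result is printed with one decimal
  awk -v bases="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bases / size }'
}
```

For example, estimate_coverage 250000000 5000000 prints 50.0, meaning 250 Mb of reads would cover a 5 Mb genome about 50 times.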

Can I create my custom FastANI database?

The default of "Grandeur" is to use the most current genomes from NCBI. This is mainly due to email chains relating to old organism names. There is a defined list of genomes included in "Grandeur", but the End User can create their own custom FastANI reference.

Things to consider:

  • fastas need to be named $genus_$species_$id.fna
  • fasta headers should have spaces replaced with '_'
  • fastas should be in a directory called genomes (depth = 1)
  • the genomes directory should be compressed using tar (tar -czvf fastani_refs.tar.gz genomes/)

The config file lines for the above example (can be copied and pasted into a config file):

params.fastani_ref = "fastani_refs.tar.gz"
params.current_datasets = false
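The preparation steps listed above can be sketched as a short shell function. The name build_fastani_refs is hypothetical, and the sketch assumes the input fasta files already follow the $genus_$species_$id.fna naming and live in a directory other than genomes.

```shell
# Hypothetical helper: build fastani_refs.tar.gz from a directory of fasta files
# that are already named $genus_$species_$id.fna.
# The source directory must not be the genomes directory itself.
# Usage: build_fastani_refs <directory with .fna files>
build_fastani_refs() {
  local src="$1"
  local fasta
  mkdir -p genomes
  for fasta in "$src"/*.fna; do
    [ -e "$fasta" ] || continue   # skip when no .fna files match
    # Replace spaces with '_' on header lines only; sequence lines are untouched
    sed '/^>/ s/ /_/g' "$fasta" > "genomes/$(basename "$fasta")"
  done
  # Same tar command as in the list above, without -v to keep the output quiet
  tar -czf fastani_refs.tar.gz genomes/
}
```

Calling build_fastani_refs on a directory of downloaded genomes leaves fastani_refs.tar.gz in the current directory, ready to be referenced by params.fastani_ref.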

How do I cite this workflow?

This workflow stands on the shoulders of giants. As such, please cite the individual tools that were useful for your manuscript so that those developers can continue to get funding. They are listed above. Mentioning this workflow in the text as "The Grandeur workflow v.VERSION (www.github.com/UPHL-BioNGS/Grandeur)" is good enough for @erinyoung's ego.

Can I use roary's QC options with kraken?

No. If there is interest in this feature, please contribute to the conversation at StaPH-B/docker-builds.

Can I re-use files?

Yes. The main use-case at UPHL is to run "Grandeur" per sequencing run, which contains a variety of different organisms. Samples involved in outbreaks are generally spread over multiple runs.

The process at UPHL goes as follows:

  1. Run "Grandeur" on all the paired-end sequencing reads from a MiSeq run to get fasta files (located at /grandeur/contigs) with the uphl profile
  2. Gather the fasta files from their respective sequencing runs and put them in a new directory
  3. Add a representative genome from NCBI to this new directory
  4. Run "Grandeur" on the collected fasta files with the profile just_msa and specify the representative genome from NCBI as an outgroup

A real use case from UPHL with Pseudomonas aeruginosa:

nextflow run UPHL-BioNGS/Grandeur \
  -with-tower \
  -profile singularity,just_msa \
  --outgroup GCF_000006765.1_ASM676v1_genomic \
  --fastas fastas
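Steps 2 and 3 of the process above could be scripted roughly as follows. This is only a sketch: gather_fastas is a hypothetical helper, it assumes each run directory keeps its contigs under grandeur/contigs, and it copies every file found there, so in practice the list would be narrowed to the samples of interest.

```shell
# Hypothetical helper for steps 2 and 3: collect contig fasta files from
# several run directories, plus one outgroup genome, into a single directory.
# Usage: gather_fastas <dest dir> <outgroup fasta> <run dir> [<run dir> ...]
gather_fastas() {
  local dest="$1"
  local outgroup="$2"
  local run contig
  shift 2
  mkdir -p "$dest"
  for run in "$@"; do
    # Assumes each run directory holds its contigs under grandeur/contigs
    for contig in "$run"/grandeur/contigs/*; do
      [ -f "$contig" ] || continue   # skip empty or missing directories
      cp "$contig" "$dest/"
    done
  done
  # Step 3: add the representative genome from NCBI
  cp "$outgroup" "$dest/"
}
```

The destination directory is then what gets passed to --fastas in the just_msa run.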

Can I start with prokka-annotated gff files?

Yes, although this is now a more-hidden option because several End Users were trying to use gff files downloaded from NCBI instead of re-using gff files created from prokka.

Place Prokka-annotated gff files (ending with 'gff') in a directory as follows, or designate the directory with 'params.gff' or '--gff':

directory
└── gff
     └── *gff
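Because gff files downloaded from NCBI have caused confusion, a quick heuristic check before running may help. Prokka appends the nucleotide sequences after a ##FASTA line, which gff files downloaded from NCBI generally lack. The function check_gffs below is a hypothetical sketch of that check, not part of "Grandeur".

```shell
# Hypothetical check: report gff files in a directory that lack a '##FASTA' section.
# Returns 0 when every gff file has one, 1 otherwise.
# Usage: check_gffs <directory>
check_gffs() {
  local dir="$1"
  local gff
  local missing=0
  for gff in "$dir"/*gff; do
    [ -f "$gff" ] || continue   # skip when no gff files match
    if ! grep -q '^##FASTA' "$gff"; then
      echo "no ##FASTA section: $gff"
      missing=1
    fi
  done
  return "$missing"
}
```

Any file it reports is worth a second look, since it was probably not produced by prokka.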