FAQ
TELL ME ABOUT IT!!!
- Github issue
- Email me
- Send me a message on slack
Be sure to include the command used, what config file was used, and what the nextflow error was.
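The three items requested above can be gathered into one text file before opening an issue. The following is only a sketch: the function name collect_report and the output file name are hypothetical, and it assumes the script is run from the directory where the workflow was launched, since that is where Nextflow writes .nextflow.log.

```shell
#!/usr/bin/env bash
set -euo pipefail

# collect_report CMD CONFIG OUT
# Hypothetical helper: writes the command, the config file contents, and the
# tail of .nextflow.log (from the current directory) into OUT.
collect_report() {
    local cmd="$1" config="$2" out="$3"
    {
        echo "## Command"
        echo "$cmd"
        echo
        echo "## Config file: $config"
        if [ -f "$config" ]; then cat "$config"; else echo "(not found)"; fi
        echo
        echo "## Last lines of .nextflow.log"
        if [ -f .nextflow.log ]; then
            tail -n 50 .nextflow.log
        else
            echo "(no .nextflow.log in this directory)"
        fi
    } > "$out"
}

# Example call (file names are placeholders):
# collect_report "nextflow run UPHL-BioNGS/Grandeur -profile singularity -c my.config" my.config grandeur_report.txt
```

The resulting file can be pasted into a Github issue, an email, or a slack message.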
There is a template file with all the variables in this repo at configs/grandeur_template.config that the End User can copy and edit. All of the parameters are included in that file.
There's also a config file that we use here at UPHL, UPHL.config.
To get a copy of this config file (will not run workflow)
nextflow run UPHL-BioNGS/Grandeur --config_file true
To use the config file created by the End User, simply specify the path with -c
nextflow run UPHL-BioNGS/Grandeur -profile singularity -c <path to user edited config file>
There are three test profiles for "Grandeur". They download reads from the SRA using the sra-toolkit.
- test : downloads six samples from the SRA to run through the workflow with default settings
- test1 : uses those same samples, but does not download genomes from NCBI
- test2 : downloads some CRPA and creates a multiple sequence alignment
nextflow run UPHL-BioNGS/Grandeur -profile test,singularity
There are also 6 genomes from NCBI genome that are in this repository under data/fasta:
- GCF_000005845.2_ASM584v2 : Escherichia coli
- GCF_000006925.2_ASM692v2 : Shigella flexneri
- GCF_000006945.2_ASM694v2 : Salmonella enterica
- GCF_000240185.1_ASM24018v2 : Klebsiella pneumoniae
- GCF_008632635.1_ASM863263v1 : Acinetobacter baumannii
- GCF_900475405.1_44087_C01 : Stenotrophomonas maltophilia
And then run with
nextflow run UPHL-BioNGS/Grandeur -profile singularity --fastas data/fasta
The directory data/msa contains one gff file and 6 fasta files of Stenotrophomonas maltophilia that can be used to test multiple sequence alignment. A resulting treefile (iqtree.treefile), snp_matrix (snp_matrix.txt), and roary summary file (summary_statistics.txt) are included for comparison.
Testing creating a phylogenetic tree from a core gene comparison:
nextflow run UPHL-BioNGS/Grandeur -profile singularity,just_msa --fastas data/msa --gff data/msa --iqtree2_outgroup GCF_900475405.1_44087_C01_genomic
Prior versions allowed more flexibility about which analyses were run. This was difficult to maintain. There are still some processes that can be turned off:
- params.msa = false : skips multiple sequence alignment (this is the default)
- params.current_datasets = false : skips downloading genomes from NCBI and instead uses the genomes in the workflow
- params.information = false : skips the information subworkflow
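As a minimal sketch, the three settings listed above can be placed in a small config file and passed with -c. The file name skip_optional.config is an arbitrary choice for illustration, and the launch command is only printed here rather than executed.

```shell
# Write a small custom config; the file name is an arbitrary choice.
cat > skip_optional.config <<'EOF'
// Values taken from the list of processes that can be turned off
params.msa              = false   // skip multiple sequence alignment (default)
params.current_datasets = false   // use the genomes included in the workflow
params.information      = false   // skip the information subworkflow
EOF

# Print the command instead of launching it, so the snippet is safe to paste.
echo "nextflow run UPHL-BioNGS/Grandeur -profile singularity -c skip_optional.config"
```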
At UPHL, we use this workflow to determine the serotype of Salmonella and E. coli under CLIA. Therefore, all containers with their versions are explicitly selected if available, and any updates to this repo will come with a version change. In future endeavors, we hope to use this workflow for organism identification and AMR gene identification.
The CLIA officer of the End User may request additional locks be put in place, like having all of the containers specified. If additional help is needed, please submit an issue or Email me.
They perform well, their containers were easy to create, and @erinyoung had heard about them.
As "Grandeur" is intended to be a species agnostic workflow for a local public health laboratory, and sequencing is continuing to expand in its utility, new tools are constantly being needed to analyze isolates to further public health goals.
Many of these additional tools are added by need locally or from the End User, so if the End User knows of other serotyping/analysis tools, please submit an issue or tell @erinyoung about it, and we'll work in some options.
@erinyoung also appreciates pull requests from forks.
Warning : If there's not a reliable container of the suggested tool, @erinyoung will request that the End User create a container for that tool and contribute to StaPH-B's docker repositories.
Organisms with large genomes can still contribute to disease, but this is not the workflow for those. "Grandeur" uses spades for de novo assembly, and large genomes may be too much for spades.
As of the time of writing this README, reference-based alignment of SARS-CoV-2 is still the norm. "Grandeur" is for de novo assembly of things with small genomes. Cecret would be a better workflow for SARS-CoV-2 sequencing.
genome_sizes.json has a list of commonly sequenced organisms and the approximate expected genome size for each organism. This is only used for the "cg-pipeline" process to estimate coverage. A file from the End User can be used instead and specified with params.genome_sizes.
The default of "Grandeur" is to use the most current genomes from NCBI. This is mainly due to email chains relating to old organism names. There is a defined list of genomes included in "Grandeur", but the End User can create their own custom FastANI reference.
Things to consider:
- fastas need to be named $genus_$species_$id.fna
- fasta headers should have spaces replaced with '_'
- fastas should be in a directory called genomes (depth = 1)
- the genomes directory should be compressed using tar (tar -czvf fastani_refs.tar.gz genomes/)
The config file lines for the above example (can be copied and pasted into a config file):
params.fastani_ref = "fastani_refs.tar.gz"
params.current_datasets = false
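The preparation steps for a custom FastANI reference could be scripted roughly as follows. This is a sketch under assumptions: the function names add_reference and build_archive are hypothetical, the genus, species, and id are supplied by hand for each input file, and tar is run without -v to keep the output quiet.

```shell
#!/usr/bin/env bash
set -euo pipefail

# add_reference SRC GENUS SPECIES ID
# Hypothetical helper: copies one fasta into genomes/ as
# ${GENUS}_${SPECIES}_${ID}.fna and replaces spaces in header lines with '_'.
add_reference() {
    local src="$1" genus="$2" species="$3" id="$4"
    mkdir -p genomes
    # Only lines starting with '>' are headers; sequence lines are untouched.
    sed '/^>/ s/ /_/g' "$src" > "genomes/${genus}_${species}_${id}.fna"
}

# build_archive
# Compresses the genomes directory into fastani_refs.tar.gz.
build_archive() {
    tar -czf fastani_refs.tar.gz genomes/
}

# Example calls (the input file name is a placeholder):
# add_reference my_download.fasta Escherichia coli GCF_000005845.2
# build_archive
```

Because every file is written directly into genomes/, the directory stays at a depth of 1.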
This workflow stands on the shoulders of giants. As such, please cite the individual tools that were useful for your manuscript so that those developers can continue to get funding. They are listed above. Mentioning this workflow in the text as "The Grandeur workflow v.VERSION (www.github.com/UPHL-BioNGS/Grandeur)" is good enough for @erinyoung's ego.
No. If there is interest in this feature, please contribute to the conversation at StaPH-B/docker-builds.
Yes. The main use-case at UPHL is to run "Grandeur" per sequencing run, which is a variety of different organisms. Samples involved in outbreaks are generally spread over multiple runs.
The process at UPHL goes as follows:
- Run "Grandeur" on all the paired-end sequencing reads from a MiSeq run to get fasta files (located at /grandeur/contigs) with the uphl profile
- Gather the fasta files from their respective sequencing runs and put them in a new directory
- Add a representative genome from NCBI to this new directory
- Run "Grandeur" on the collected fasta files with the profile just_msa and specify the representative genome from NCBI as an outgroup
A real use case from UPHL with Pseudomonas aeruginosa:
nextflow run UPHL-BioNGS/Grandeur \
-with-tower \
-profile singularity,just_msa \
--outgroup GCF_000006765.1_ASM676v1_genomic \
--fastas fastas
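The gathering step in the process above might look like the sketch below. It assumes each sequencing run directory contains grandeur/contigs, as described in the first step; the function name gather_contigs and the run directory names are hypothetical. Files with the same name in different runs would overwrite each other, so sample names are assumed to be unique across runs.

```shell
#!/usr/bin/env bash
set -euo pipefail

# gather_contigs DEST RUN_DIR [RUN_DIR ...]
# Hypothetical helper: copies every regular file found in
# RUN_DIR/grandeur/contigs into DEST.
gather_contigs() {
    local dest="$1"
    shift
    mkdir -p "$dest"
    local run f
    for run in "$@"; do
        for f in "$run"/grandeur/contigs/*; do
            # Skip the unexpanded glob when a run has no contigs.
            [ -f "$f" ] || continue
            cp "$f" "$dest/"
        done
    done
}

# Example call (run directory names are placeholders):
# gather_contigs fastas run_A run_B
```

After the representative genome from NCBI is added to the same directory, that directory is what gets passed to --fastas.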
Yes, although this is now a more-hidden option because several End Users were trying to use gff files downloaded from NCBI instead of re-using gff files created from prokka.
Place Prokka-annotated gff files (ending with 'gff') in a directory as follows, or designate the directory with 'params.gff' or '--gff':
directory
└── gff
    └── *gff
- amrfinderplus
- bbduk
- blastn
- blobtools_*
- core_genome_evaluation
- circulocov
- datasets_*
- drprg
- elgato
- emmtyper
- fastani
- fastp
- fastqc
- heatcluster
- iqtree2
- kaptive
- kleborate
- kraken2
- mash_*
- mashtree
- mlst
- multiqc
- mykrobe
- panaroo
- pbptyper
- phytreeviz
- plasmidfinder
- prokka
- quast
- seqsero2
- serotypefinder
- shigatyper
- snp_dists
- spades