by Rodrigo de Paula Baptista, PhD
-
Genome Annotation
-
Prokaryotic annotation
- Running Prokka
-
Eukaryotic Annotation
-
Training your prediction dataset
-
Running AUGUSTUS
-
Glimpse on BRAKER
-
-
Direct RNA sequencing for annotation
-
Understanding the output:
-
prokka output files
-
the gff file
-
-
Gene Ontology and InterproScan (adding functions)
-
Visualization
-
-
Pan-Genomes
-
Core Genomes
-
Accessory Genomes
-
Running Roary the Pan-Genome pipeline
-
preparing dataset
-
Running Roary
-
Output files
-
Running Raxml to make a phylogenetic tree
-
-
Visualization using ITOL
-
All Files from this session will be available during the bootcamp at "/projects/k2i/session4/Files/"
Genome annotation involves the identification of functional elements along the sequence of a genome, assigning meaning to it. This process is essential because DNA sequencing often yields sequences of unknown function. Over the past three decades, genome annotation has transformed from computationally annotating long protein-coding genes in single genomes (one per species) and experimentally annotating short regulatory elements on a limited number of them, to the population-wide annotation of individual nucleotides across thousands of genomes (many per species). This enhanced resolution and inclusivity in genome annotations, spanning from genotypes to phenotypes, are providing precise insights into the biology of species, populations, and individuals. A key distinction between prokaryotic and eukaryotic genes lies in their structural composition. Prokaryotes possess only exons, whereas eukaryotes have both exons and introns.
The focus here is to demonstrate the methodology for gene prediction (annotation) in both prokaryotic and eukaryotic genomes.
There are several options of instances available:
- PROKKA (Which will be used in this boot camp)
- RAST (Webserver application)
- PGAP (Usually the best option, but runs slow)
Prokka is a software tool designed for the rapid annotation of bacterial, archaeal, and viral genomes, generating output files that adhere to standard specifications.
File name | Description | Location in the cluster |
---|---|---|
ARLG-4673_assembly.fasta | Assembled genome generated on session 1 | /projects/k2i/session4/Files/Prokka |
my_genome.gff | Pre-generated gff output | /projects/k2i/session4/Files/Prokka |
Basic Usage:
###Prepare to run in interactive node
srun --pty --export=ALL --ntasks=1 --reservation=workshop --cpus-per-task=8 --mem=15GB --time=04:00:00 /bin/bash
###Call prokka in bootcamp server
source /projects/k2i/radmicrobes-s4/bin/activate
##Run prokka
prokka --outdir prokka_test --prefix my_genome --rfam genome.fasta
Flag explanation
--outdir [name] Output folder (in this case it will be a folder created as prokka_test in your current location)
--prefix [name] Filename output prefix (in this case it will name your files with the my_genome prefix, e.g. my_genome.gff)
--rfam Enable searching also for ncRNAs with Infernal+Rfam and not just protein-coding genes
For more options type:
prokka -h
prokka options
General:
--help This help
--version Print version and exit
--citation Print citation for referencing Prokka
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
Setup:
--listdb List all configured databases
--setupdb Index all installed databases
--cleandb Remove all database indices
--depends List all software dependencies
Outputs:
--outdir [X] Output folder [auto] (default '')
--force Force overwriting existing output folder (default OFF)
--prefix [X] Filename output prefix [auto] (default '')
--addgenes Add 'gene' features for each 'CDS' feature (default OFF)
--locustag [X] Locus tag prefix (default 'PROKKA')
--increment [N] Locus tag counter increment (default '1')
--gffver [N] GFF version (default '3')
--compliant Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
--centre [X] Sequencing centre ID. (default '')
Organism details:
--genus [X] Genus name (default 'Genus')
--species [X] Species name (default 'species')
--strain [X] Strain name (default 'strain')
--plasmid [X] Plasmid name or identifier (default '')
Annotations:
--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')
--prodigaltf [X] Prodigal training file (default '')
--gram [X] Gram: -/neg +/pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)
--proteins [X] Fasta file of trusted proteins to first annotate from (default '')
--hmms [X] Trusted HMM to first annotate from (default '')
--metagenome Improve gene predictions for highly fragmented genomes (default OFF)
--rawproduct Do not clean up /product annotation (default OFF)
Computation:
--fast Fast mode - skip CDS /product searching (default OFF)
--cpus [N] Number of CPUs to use [0=all] (default '8')
--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
--norrna Don't run rRNA search (default OFF)
--notrna Don't run tRNA search (default OFF)
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
There are several options of instances available:
- WEBAUGUSTUS (Web application used in this boot-camp)
- Original AUGUSTUS (Can be installed on your server or local machine)
- COMPANION (automated pipeline for eukaryotic pathogens)
File name | Description | Location in the cluster |
---|---|---|
Reference.fasta | Close related reference genome used for training | /projects/k2i/session4/Files/WEBAUGUSTUS |
Euk_genome.fasta | Genome to be annotated | /projects/k2i/session4/Files/WEBAUGUSTUS |
Protein_ref.fasta | Reference protein evidence for training | /projects/k2i/session4/Files/WEBAUGUSTUS |
Euk_genome_augustus.gff | Pre-generated gff output | /projects/k2i/session4/Files/WEBAUGUSTUS |
For the training submission you need:
- Your Reference Genome in Fasta format: Reference.fasta
- Your Reference Genome Protein file in fasta format: Protein_ref.fasta

Very Similar to submitting the Training but now providing the training parameter file generated from your training
The parameters.tar.gz archive has a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters that as observed have the probabilities for the feature prediction in your sample
- species/species_parameters.cfg
- species/species_metapars.cfg
- species/species_metapars.utr.cfg
- species/species_exon_probs.pbl.withoutCRF
- species/species_exon_probs.pbl
- species/species_weightmatrix.txt
- species/species_intron_probs.pbl
- species/species_intron_probs.pbl.withoutCRF
- species/species_igenic_probs.pbl
- species/species_igenic_probs.pbl.withoutCRF
- Prokaryotic Expected Features:
- Gene, exon, CDS, short UTR length
- Eukaryotic expected Feature:
- Gene, more than one exon per gene (single exon are also expected), intron, CDS (including splice variants), longer and more complex UTR lengths
BRAKER mainly features semi-unsupervised, extrinsic evidence data (RNA-Seq and/or protein spliced alignment information) supported by the training of GeneMark-ES/ET/EP/ETP and subsequent training of AUGUSTUS with the integration of extrinsic evidence in the final gene prediction step. It automates all the AUGUSTUS prediction process.
Basic usage using protein data or/and RNAseq data:
braker.pl --genome=genome.fa --prot_seq=orthodb.fa --bam=/path/to/SRA_ID1.bam
Nanopore direct RNA sequencing (DRS) involves the continuous reading of native RNA strands, providing a valuable tool for validating ab initio annotations. This technique is particularly useful for obtaining information about gene boundaries, including untranslated regions (UTRs), and is adept at detecting alternative splicing events.

Since the results from this assay would give you full-length transcripts, there is no need to perform a prediction or transcriptome assembly. Usually, you run minimap2 against your assembled genome and curate the annotation using tools such as Webapollo2.
To run the alignment using minmap2 and DRS reads against a reference genome:
./minimap2 -ax splice -uf -k14 reference.fa reads.fq > aln.sam
Many files are generated in prokka:
Extension | Description |
---|---|
.gff | This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV. |
.gbk | This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence. |
.fna | Nucleotide FASTA file of the input contig sequences. |
.faa | Protein FASTA file of the translated CDS sequences. |
.ffn | Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) |
.sqn | An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. |
.fsa | Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. |
.tbl | Feature Table file, used by "tbl2asn" to create the .sqn file. |
.err | Unacceptable annotations - the NCBI discrepancy report. |
.log | Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. |
.txt | Statistics relating to the annotated features found. |
.tsv | Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product |
The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.
Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'
1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.
2. source - name of the program that generated this feature, or the data source (database or project name)
3. feature - feature type name, e.g. gene, mRNA, exon, CDS, ncRNA, UTR, etc.
4. start - Start position of the feature, with sequence numbering starting at 1.
5. end - End position of the feature, with sequence numbering starting at 1.
6. score - A floating point value. (not important here)
7. strand - defined as + (forward) or - (reverse).
8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.
The Gene Ontology (GO) serves as a framework and set of concepts for delineating the functions of gene products across various organisms. Tailored for supporting the computational representation of biological systems, GO annotations establish associations between specific gene products and GO concepts. Together, these annotations create statements relevant to the function of the respective genes.
Practically, an ontology serves as a representation of our knowledge about a subject. In the context of biology, 'ontologies' encompass representations of detectable or directly observable entities and the relationships between them. The lack of a universal standard terminology in biology and related fields results in varied term usage specific to species, research areas, or even individual research groups, creating challenges in communication and data sharing.
To address this, the Gene Ontology project offers an ontology comprising defined terms that represent properties of gene products. This ontology spans three domains:
1. Cellular component: Encompassing the parts of a cell or its extracellular environment.
2. Molecular function: Describing the elemental activities of a gene product at the molecular level, such as binding or catalysis.
3. Biological process: Defining operations or sets of molecular events with a clear beginning and end, relevant to the functioning of integrated living units, including cells, tissues, organs, and organisms.
Blast2GO is one of the best tools available to look for gene functions. (Besides being paid as part of OMICs Box it has a free trial and for some cases free basic versiohn for non-profit organization accounts)
-
Simple usage with all analysis needed to add GO terms into your genome
-
Basic worflow:
Using InterproScan ou can also visualize for other features such as domain location transmembrane domains, and signal peptides in your proteins.
You just need to submit your protein sequence:
Exercise: copy the first sequence on your prokka faa result and paste here. What do you see?
Integrative Genomics Viewer (IGV)
-
To visualize your genome and annotation:
- Upload your genome:
- Upload your gff track:
In molecular biology and genetics, a pan-genome refers to the complete collection of genes found across all strains within a clade. Broadly speaking, it represents the integration of all genomes within a clade. The pan-genome can be dissected into a 'core pangenome,' comprising genes present in all individuals, and an 'accessory genome', which encompasses genes found in only a subset of the strains, including strain-specific genes.
Whole genome sequencing (WGS) is increasingly employed in epidemiological investigations of pathogens. While SNP variant calling is presently considered the most suitable method, challenges arise from the selection of a representative reference genome and the results being dependent on the isolate, limiting standardization and impacting resolution in an uncertain manner.
In the quest for a pangenome analysis, the core genome emerges as an appealing alternative. This is because the core genome comprises loci present in all isolates (>95% of the isolates), facilitating the comparison and identification of sequence differences in protein-coding genes among the analyzed isolates. This approach distinguishes itself from whole-genome sequencing by incorporating a selection of accessory loci (not present in all isolates) and intergenic regions.
The accessory genome is frequently divided into the shell genome, which comprises genes shared by more than 15% of the strains but less than 95% of the strains, and the cloud genome, encompassing genes shared by less than 15% of the strains but present in more than one strain. These genes have the potential to confer distinctive characteristics to a strain and/or offer a niche-specific advantage to the host strains. Acquisition of these genes may occur through horizontal gene transfer (HGT), and they are sustained by a subset of all the strains within a species.
There are several options of instances available:
-
Roary (Besides not having updates anymore, it is still very useful for the community and will be used for this session)
-
Panaroo (Considered the substitute for roary)
-
PIRATE (another good alternative)
File name | Description | Location in the cluster |
---|---|---|
*.gff | all pre made annotations using prokka for our analysis | /projects/k2i/session4/Files/roary_files/GFF/ |
my_genome_ST.tsv | Strain type from the assembled genome from sessions 1 and 3 | /projects/k2i/session4/Files/roary_files |
core_gene_alignment.aln | Core alignment pre generated from roary for phylogeny | /projects/k2i/session4/Files/roary_files |
accessory_binary_genes.fa.newick | Accessory phylogenetic tree generated by Roary | /projects/k2i/session4/Files/roary_files |
mkdir roary_analysis
cp /path/session4/roary_files/* roary_analysis
cd roary_analysis
Basic usage
source /projects/k2i/roary/bin/activate
roary -p 5 -n -e -v *.gff
Flag explanation
-p 5 number of threads (5 in this example)
-n fast core gene alignment with MAFFT
-e create a multiFASTA alignment of core genes using PRANK
-v verbose output to STDOUT
(*)gff All annotated files from your samples
File name | Description |
---|---|
core_gene_alignment.aln | Core gene aligment |
accessory_binary_genes.fa.newick | Accessory gene tree |
gene_presence_absence.csv | Table with information of gene presence and absense |
Basic usage:
source /projects/k2i/radmicrobes-s4/bin/activate
raxml-ng --msa core_gene_alignment.aln --model GTR+G --all --bs-trees 10
Flag explanation
--msa file.aln Alignment file generated by Roary
--model GTR+G model of evolution for DNA and protein alignments for tree reconstruction (best model can be predicted using tools such as Modeltest-NG)
--all all-in-one (ML search + bootstrapping)
--bs-trees 10 number of bootstraps replicates (10 replicates is not recommended since is too low, but we are using here to get results faster)
- Step 1: Upload your raxml tree here
- Create a dataset:
- Add information about your samples (my_genomes_STs.tsv):
-
Input file
- Contains three columns (see example below):
Sample name | Color code | Information (ST) |
---|---|---|
C1016_genome | #0A11C8 | 307 |
C1069_genome | #048B52 | 258 |
C1122_genome | #0A11C8 | 307 |
C1151_genome | #0A11C8 | 307 |
- Core gene tree
- What are the differences that you see between core and Accessory trees? What information you can extract from it?