Session 4 - Functional Annotation and Pan Genomes

by Rodrigo de Paula Baptista, PhD

Session Summary

Genome Annotation
- Prokaryotic annotation
  - Running Prokka
- Eukaryotic Annotation
  - Training your prediction dataset
  - Running AUGUSTUS
  - Glimpse on BRAKER
- Direct RNA sequencing for annotation
- Understanding the output:
  - prokka output files
  - the gff file
- Gene Ontology and InterproScan (adding functions)
- Visualization
Pan-Genomes
- Core Genomes
- Accessory Genomes
- Running Roary the Pan-Genome pipeline
  - preparing dataset
  - Running Roary
  - Output files
  - Running Raxml to make a phylogenetic tree
- Visualization using ITOL

All Files from this session will be available during the bootcamp at "/projects/k2i/session4/Files/"

Genome Annotation

Genome annotation involves the identification of functional elements along the sequence of a genome, assigning meaning to it. This process is essential because DNA sequencing often yields sequences of unknown function. Over the past three decades, genome annotation has transformed from computationally annotating long protein-coding genes in single genomes (one per species) and experimentally annotating short regulatory elements on a limited number of them, to the population-wide annotation of individual nucleotides across thousands of genomes (many per species). This enhanced resolution and inclusivity in genome annotations, spanning from genotypes to phenotypes, are providing precise insights into the biology of species, populations, and individuals. A key distinction between prokaryotic and eukaryotic genes lies in their structural composition. Prokaryotes possess only exons, whereas eukaryotes have both exons and introns.

(Castellana & Bafna, 2010)

The focus here is to demonstrate the methodology for gene prediction (annotation) in both prokaryotic and eukaryotic genomes.

PROKARYOTIC Gene Annotation

There are several options of instances available:

PROKKA (Which will be used in this boot camp)
RAST (Webserver application)
PGAP (Usually the best option, but runs slow)

RUNNING PROKKA

Prokka is a software tool designed for the rapid annotation of bacterial, archaeal, and viral genomes, generating output files that adhere to standard specifications.

Prokka documentation

Files to be used in this hands-on

File name	Description	Location in the cluster
ARLG-4673_assembly.fasta	Assembled genome generated on session 1	/projects/k2i/session4/Files/Prokka
my_genome.gff	Pre-generated gff output	/projects/k2i/session4/Files/Prokka

Basic Usage:

###Prepare to run in interactive node
srun --pty --export=ALL --ntasks=1 --reservation=workshop --cpus-per-task=8 --mem=15GB --time=04:00:00 /bin/bash
###Call prokka in bootcamp server
source /projects/k2i/radmicrobes-s4/bin/activate
##Run prokka
prokka --outdir prokka_test --prefix my_genome --rfam genome.fasta

Flag explanation

--outdir [name] Output folder (in this case it will be a folder created as prokka_test in your current location)

--prefix [name] Filename output prefix (in this case it will name your files with the my_genome prefix, e.g. my_genome.gff)

--rfam Enable searching also for ncRNAs with Infernal+Rfam and not just protein-coding genes

For more options type:

prokka -h

prokka options

General:
  --help            This help
  --version         Print version and exit
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
  --outdir [X]      Output folder [auto] (default '')
  --force           Force overwriting existing output folder (default OFF)
  --prefix [X]      Filename output prefix [auto] (default '')
  --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
  --locustag [X]    Locus tag prefix (default 'PROKKA')
  --increment [N]   Locus tag counter increment (default '1')
  --gffver [N]      GFF version (default '3')
  --compliant       Force Genbank/ENA/DDJB compliance: --genes --mincontiglen 200 --centre XXX (default OFF)
  --centre [X]      Sequencing centre ID. (default '')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --prodigaltf [X]  Prodigal training file (default '')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    Fasta file of trusted proteins to first annotate from (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
Computation:
  --fast            Fast mode - skip CDS /product searching (default OFF)
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Eukaryotic Annotation using AUGUSTUS

There are several options of instances available:

WEBAUGUSTUS (Web application used in this boot-camp)
Original AUGUSTUS (Can be installed on your server or local machine)
COMPANION (automated pipeline for eukaryotic pathogens)

RUNNING WEBAUGUSTUS

Files to be used in this hands-on

File name	Description	Location in the cluster
Reference.fasta	Close related reference genome used for training	/projects/k2i/session4/Files/WEBAUGUSTUS
Euk_genome.fasta	Genome to be annotated	/projects/k2i/session4/Files/WEBAUGUSTUS
Protein_ref.fasta	Reference protein evidence for training	/projects/k2i/session4/Files/WEBAUGUSTUS
Euk_genome_augustus.gff	Pre-generated gff output	/projects/k2i/session4/Files/WEBAUGUSTUS

Training dataset for prediction

Submission Form

For the training submission you need:

Your Reference Genome in Fasta format: Reference.fasta
Your Reference Genome Protein file in fasta format: Protein_ref.fasta

Running the prediction using your trained dataset

Very Similar to submitting the Training but now providing the training parameter file generated from your training

The parameters.tar.gz archive has a folder containing the following files is required for predicting genes in a new genome with pre-trained parameters that as observed have the probabilities for the feature prediction in your sample

species/species_parameters.cfg
species/species_metapars.cfg
species/species_metapars.utr.cfg
species/species_exon_probs.pbl.withoutCRF
species/species_exon_probs.pbl
species/species_weightmatrix.txt
species/species_intron_probs.pbl
species/species_intron_probs.pbl.withoutCRF
species/species_igenic_probs.pbl
species/species_igenic_probs.pbl.withoutCRF

Expected differences between Prokaryotic and Eukaryotic GFF Features

Prokaryotic Expected Features:
- Gene, exon, CDS, short UTR length
Eukaryotic expected Feature:
- Gene, more than one exon per gene (single exon are also expected), intron, CDS (including splice variants), longer and more complex UTR lengths

Glimpse on BRAKER

BRAKER mainly features semi-unsupervised, extrinsic evidence data (RNA-Seq and/or protein spliced alignment information) supported by the training of GeneMark-ES/ET/EP/ETP and subsequent training of AUGUSTUS with the integration of extrinsic evidence in the final gene prediction step. It automates all the AUGUSTUS prediction process.

BRAKER Documentation

Basic usage using protein data or/and RNAseq data:

braker.pl --genome=genome.fa --prot_seq=orthodb.fa --bam=/path/to/SRA_ID1.bam

Direct RNA sequencing for annotation using Oxford Nanopore sequencing (DRS)

Nanopore direct RNA sequencing (DRS) involves the continuous reading of native RNA strands, providing a valuable tool for validating ab initio annotations. This technique is particularly useful for obtaining information about gene boundaries, including untranslated regions (UTRs), and is adept at detecting alternative splicing events.

(Parker et al., 2020)

Since the results from this assay would give you full-length transcripts, there is no need to perform a prediction or transcriptome assembly. Usually, you run minimap2 against your assembled genome and curate the annotation using tools such as Webapollo2.

To run the alignment using minmap2 and DRS reads against a reference genome:

./minimap2 -ax splice -uf -k14 reference.fa reads.fq > aln.sam

Understanding the Annotation output:

Prokka output files

Many files are generated in prokka:

Output details

Extension	Description
.gff	This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
.gbk	This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
.fna	Nucleotide FASTA file of the input contig sequences.
.faa	Protein FASTA file of the translated CDS sequences.
.ffn	Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA)
.sqn	An ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
.fsa	Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
.tbl	Feature Table file, used by "tbl2asn" to create the .sqn file.
.err	Unacceptable annotations - the NCBI discrepancy report.
.log	Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled.
.txt	Statistics relating to the annotated features found.
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product

The GFF File

The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines.

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix.

2. source - name of the program that generated this feature, or the data source (database or project name)

3. feature - feature type name, e.g. gene, mRNA, exon, CDS, ncRNA, UTR, etc.

4. start - Start position of the feature, with sequence numbering starting at 1.

5. end - End position of the feature, with sequence numbering starting at 1.

6. score - A floating point value. (not important here)

7. strand - defined as + (forward) or - (reverse).

8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.

9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

Lets do some exercises!

###How does your GFF file looks like?
less my_genome.gff

###How many features are in your gff file?
cat my_genome.gff | cut -s -f 3| sort| uniq -c

###Do we have a "Carbapenem-hydrolyzing beta-lactamase KPC" gene annotated in this genome?
more my_genome.gff| grep "KPC"

Gene Ontology and InterproScan (adding functions)

The Gene Ontology (GO) serves as a framework and set of concepts for delineating the functions of gene products across various organisms. Tailored for supporting the computational representation of biological systems, GO annotations establish associations between specific gene products and GO concepts. Together, these annotations create statements relevant to the function of the respective genes.

Practically, an ontology serves as a representation of our knowledge about a subject. In the context of biology, 'ontologies' encompass representations of detectable or directly observable entities and the relationships between them. The lack of a universal standard terminology in biology and related fields results in varied term usage specific to species, research areas, or even individual research groups, creating challenges in communication and data sharing.

To address this, the Gene Ontology project offers an ontology comprising defined terms that represent properties of gene products. This ontology spans three domains:

1. Cellular component: Encompassing the parts of a cell or its extracellular environment.

2. Molecular function: Describing the elemental activities of a gene product at the molecular level, such as binding or catalysis.

3. Biological process: Defining operations or sets of molecular events with a clear beginning and end, relevant to the functioning of integrated living units, including cells, tissues, organs, and organisms.

Blast2GO

Blast2GO is one of the best tools available to look for gene functions. (Besides being paid as part of OMICs Box it has a free trial and for some cases free basic versiohn for non-profit organization accounts)

Simple usage with all analysis needed to add GO terms into your genome
Basic worflow:

Interproscan

Using InterproScan ou can also visualize for other features such as domain location transmembrane domains, and signal peptides in your proteins.

You just need to submit your protein sequence:

Exercise: copy the first sequence on your prokka faa result and paste here. What do you see?

Visualization

Integrative Genomics Viewer (IGV)

To visualize your genome and annotation:
- Upload your genome:

Upload your gff track:

Pan-Genomes

In molecular biology and genetics, a pan-genome refers to the complete collection of genes found across all strains within a clade. Broadly speaking, it represents the integration of all genomes within a clade. The pan-genome can be dissected into a 'core pangenome,' comprising genes present in all individuals, and an 'accessory genome', which encompasses genes found in only a subset of the strains, including strain-specific genes.

Core Genome

Whole genome sequencing (WGS) is increasingly employed in epidemiological investigations of pathogens. While SNP variant calling is presently considered the most suitable method, challenges arise from the selection of a representative reference genome and the results being dependent on the isolate, limiting standardization and impacting resolution in an uncertain manner.

In the quest for a pangenome analysis, the core genome emerges as an appealing alternative. This is because the core genome comprises loci present in all isolates (>95% of the isolates), facilitating the comparison and identification of sequence differences in protein-coding genes among the analyzed isolates. This approach distinguishes itself from whole-genome sequencing by incorporating a selection of accessory loci (not present in all isolates) and intergenic regions.

Accessory Genome

The accessory genome is frequently divided into the shell genome, which comprises genes shared by more than 15% of the strains but less than 95% of the strains, and the cloud genome, encompassing genes shared by less than 15% of the strains but present in more than one strain. These genes have the potential to confer distinctive characteristics to a strain and/or offer a niche-specific advantage to the host strains. Acquisition of these genes may occur through horizontal gene transfer (HGT), and they are sustained by a subset of all the strains within a species.

Pangenome tools

There are several options of instances available:

Roary (Besides not having updates anymore, it is still very useful for the community and will be used for this session)
Panaroo (Considered the substitute for roary)
PIRATE (another good alternative)

RUNNING ROARY

Roary Documentation

Preparing dataset

Files to be used in this hands-on

File name	Description	Location in the cluster
*.gff	all pre made annotations using prokka for our analysis	/projects/k2i/session4/Files/roary_files/GFF/
my_genome_ST.tsv	Strain type from the assembled genome from sessions 1 and 3	/projects/k2i/session4/Files/roary_files
core_gene_alignment.aln	Core alignment pre generated from roary for phylogeny	/projects/k2i/session4/Files/roary_files
accessory_binary_genes.fa.newick	Accessory phylogenetic tree generated by Roary	/projects/k2i/session4/Files/roary_files

mkdir roary_analysis
cp /path/session4/roary_files/* roary_analysis
cd roary_analysis

Running Roary

Basic usage

source /projects/k2i/roary/bin/activate
roary -p 5 -n -e -v *.gff

Flag explanation

-p 5 number of threads (5 in this example)

-n fast core gene alignment with MAFFT

-e create a multiFASTA alignment of core genes using PRANK

-v verbose output to STDOUT

(*)gff All annotated files from your samples

Output files

File name	Description
core_gene_alignment.aln	Core gene aligment
accessory_binary_genes.fa.newick	Accessory gene tree
gene_presence_absence.csv	Table with information of gene presence and absense

Running Raxml to make a phylogenetic tree

Basic usage:

source /projects/k2i/radmicrobes-s4/bin/activate
raxml-ng --msa core_gene_alignment.aln --model GTR+G --all --bs-trees 10

Flag explanation

--msa file.aln Alignment file generated by Roary

--model GTR+G model of evolution for DNA and protein alignments for tree reconstruction (best model can be predicted using tools such as Modeltest-NG)

--all all-in-one (ML search + bootstrapping)

--bs-trees 10 number of bootstraps replicates (10 replicates is not recommended since is too low, but we are using here to get results faster)

Visualization using ITOL

ITOL Website

Visualizing your core tree

Step 1: Upload your raxml tree here

Step 2: In the control panel select Advanced and then click on midpoint root

Adding Information (Sequencing types)

Create a dataset:

Add information about your samples (my_genomes_STs.tsv):

Input file
- Contains three columns (see example below):

Sample name	Color code	Information (ST)
C1016_genome	#0A11C8	307
C1069_genome	#048B52	258
C1122_genome	#0A11C8	307
C1151_genome	#0A11C8	307

Core gene tree

How about the Accessory?

Discussion

What are the differences that you see between core and Accessory trees? What information you can extract from it?

Files

README.md

Latest commit

History

README.md

File metadata and controls

Session 4 - Functional Annotation and Pan Genomes

Session Summary

All Files from this session will be available during the bootcamp at "/projects/k2i/session4/Files/"

Genome Annotation

PROKARYOTIC Gene Annotation

RUNNING PROKKA

Files to be used in this hands-on

Eukaryotic Annotation using AUGUSTUS

RUNNING WEBAUGUSTUS

Files to be used in this hands-on

Training dataset for prediction

Submission Form

Running the prediction using your trained dataset

Expected differences between Prokaryotic and Eukaryotic GFF Features

Glimpse on BRAKER

Direct RNA sequencing for annotation using Oxford Nanopore sequencing (DRS)

Understanding the Annotation output:

Prokka output files

Output details

The GFF File

Lets do some exercises!

Gene Ontology and InterproScan (adding functions)

Blast2GO

Interproscan

Visualization

Pan-Genomes

Core Genome

Accessory Genome

Pangenome tools

RUNNING ROARY

Roary Documentation

Preparing dataset

Files to be used in this hands-on

Running Roary

Output files

Running Raxml to make a phylogenetic tree

Visualization using ITOL

ITOL Website

Visualizing your core tree

Adding Information (Sequencing types)

How about the Accessory?

Discussion