HNHalstead edited this page May 30, 2019 · 4 revisions

Questions Regarding Genome Analysis Tools


Reads Quality Control (FastQC) To think about:

  • How many reads have been discarded after trimming?

Only 421 bp were trimmed with Trimmomatic, which confirms that the initial reads were already of good quality.

  • How can this affect your future analyses and results?

Trimming causes the coverage of the sequenced region to go down, but the remaining data are of better quality. This means there is less of the sequenced region available to analyze, but the results are more trustworthy.

  • How is the quality of your data after trimming?

To be honest, the results were already good before I trimmed. However, to get more practice with Trimmomatic, I set it to trim where the Q score dropped below 27 instead of the rather permissive default of 15.

  • What do the LEADING, TRAILING and SLIDINGWINDOW options do?

These are all options available in Trimmomatic. SLIDINGWINDOW scans the read in windows of a defined number of bases and cuts the read once the average quality within a window falls below the set quality threshold. LEADING removes low-quality bases from the leading (5') end of the read. TRAILING does the same at the trailing (3') end of the read. The leading and trailing ends often have the lowest quality scores, so it makes sense that the two ends are examined with more scrutiny than the middle.
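As an illustration, the SLIDINGWINDOW idea can be sketched in a few lines of Python. This is a simplified sketch, not Trimmomatic's actual implementation; the function name and the example read are made up.

```python
# Simplified SLIDINGWINDOW-style trimming: scan fixed-size windows and cut
# the read where the mean Phred quality first drops below the threshold.
def sliding_window_trim(seq, quals, window=4, threshold=27):
    """Return (sequence, qualities) truncated at the first low-quality window."""
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return seq[:i], quals[:i]
    return seq, quals

# Hypothetical read whose quality decays toward the 3' end.
seq = "ACGTACGTACGT"
quals = [35, 34, 36, 33, 30, 28, 26, 20, 15, 10, 8, 5]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
# The read is cut where the 4-base window average first falls below Q27.
```

LEADING and TRAILING would be the analogous single-base scans from each end of the read.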


DNA assembly (Canu) To think about:

  • What intermediate steps generate informative output about the assembly?

Canu has its own trimming and error-correction stages and reports their output as it runs. Early in the slurm log, Canu also shows the length of the mers it is processing, how many of them are unique, and other basic statistics about the mer content of the reads.

  • How many contigs do you expect? How many do you obtain?

Paper I reports results for both trimmed and untrimmed data, while I only consider the trimmed number of contigs, since I trimmed the sequences to a higher quality standard before assembling the genome. The paper obtained 15 contigs before going back and trimming the original sequences; after trimming (with settings that are not listed), the authors obtained 10 contigs. I got 8, but I also set strict thresholds in Trimmomatic, which may have been stricter than those used in the original paper.

  • What is the difference between a ‘contig’ and a ‘unitig’?

A contig is a continuous sequence built from a set of overlapping reads with no gaps. Contigs are built from one or more unitigs, where a unitig is a high-confidence contig: there is little to no variance within the overlapping sequences that make it up.

  • What is the difference between a ‘contig’ and a ‘scaffold’?

A scaffold is made up of contigs, but is made longer by including gaps, which are represented by runs of the letter 'N'.
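A toy Python sketch of that relationship (the sequences are made up): splitting a scaffold on its runs of N recovers the contigs it was built from.

```python
import re

# Hypothetical scaffold: two contigs joined by a 10-base gap of unknown sequence.
scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA"

# Splitting on runs of 'N' recovers the underlying contigs.
contigs = [c for c in re.split(r"N+", scaffold) if c]
```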

  • What are the kmers? What kmer(s) should you use? What are the problems and benefits of choosing a small kmer? And a big kmer?

A kmer is a string of nucleotides of length K taken from a larger sequence. For example, a 17-mer is a string of 17 nucleotides. Which kmer size should be used is a complicated question. Longer kmers are more likely to contain sequencing errors, so shorter ones are more accurate per kmer; however, shorter kmers are more likely to match incorrect areas of the genome just by chance.
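A small Python sketch of what kmer extraction looks like (the example sequence is made up): every window of length K in the sequence is one kmer, so a sequence of length L yields L − K + 1 overlapping kmers.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every k-length substring (kmer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# "ATATAT" contains 4 overlapping 3-mers: ATA, TAT, ATA, TAT.
counts = count_kmers("ATATAT", 3)
```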

  • Some assemblers can include a read-correction step before doing the assembly. What is this step doing?

When the Canu assembler runs its correction step, it replaces "noisy" parts of sequence reads with consensus sequence that Canu has built from overlapping reads.
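A toy Python sketch of the underlying idea (not Canu's actual algorithm): reads that overlap at the same position vote on each base, and the majority base replaces a noisy one.

```python
from collections import Counter

# Three hypothetical reads covering the same region; the first has a
# sequencing error in its last base.
reads = ["ACGTACGA",
         "ACGTACGT",
         "ACGTACGT"]

# Majority vote per column corrects the error.
consensus = "".join(
    Counter(column).most_common(1)[0][0] for column in zip(*reads)
)
```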

  • Can you see any other letter apart from AGTC in your assembly? If so, what are those?

I do not see any other letter apart from AGTC in my assembly.

Assembly evaluation (QUAST) To think about:

  • What do measures like N50, N90, etc. mean? How can they help you evaluate the quality of your assembly? Which measure is the best to summarize the quality of the assembly (N50, number of ORFs, completeness, total size, longest contig ...etc)

N50 is related to the mean or median contig length of an assembly, but gives greater weight to longer contigs: it is the length of the contig at which contigs of that length or longer cover at least 50% of the total assembly. L50 is the number of contigs needed to reach that 50%. N90 is defined the same way with a 90% threshold, so contigs of N90 length or longer make up at least 90% of the assembly.
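These definitions translate directly into code. A minimal Python sketch (the function name and contig lengths are made up for illustration):

```python
def nx_lx(lengths, x=50):
    """Nx: length of the contig at which contigs this long or longer
    cover x% of the total assembly length; Lx: how many contigs that takes."""
    lengths = sorted(lengths, reverse=True)
    target = sum(lengths) * x / 100
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, i

# Hypothetical assembly of 7 contigs, total 300 bp.
lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = nx_lx(lengths, 50)  # 80 + 70 = 150 >= 50% of 300
n90, l90 = nx_lx(lengths, 90)
```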

I tend to look at completeness, number of contigs, total size, GC%, and the number of mismatches per 100 kb, since no single statistic is enough to indicate good quality. However, a single statistic can be enough to reveal bad quality, which is another reason not to rely on just one.

  • How does your assembly compare with the reference assembly? What can have caused the differences?

This is hard to say, as the original authors' assembly is not public; the reference genome for E. faecium was used in its place, so some slight differences are expected. With that said, some differences could have been caused simply by chance, or by different settings used in the various tools.

  • Why do you think your assembly is better/worse than the public one?

The authors' assembly was not made public, so the reference genome for the species was used instead, which makes the two assemblies harder to compare directly.

  • When running metaQuast for a metagenome, it may happen that very few contigs map back to the reference genomes. Is this expected? Does that mean your assembly is bad? Why?

This is expected, since metagenomics sequences everything in the provided sample. Expecting everything to map back to a reference genome would ignore the fact that a reference genome may not exist for everything you are sequencing.

It does not mean that your assembly is bad for the reasons stated above.


Annotation (Prokka) To think about:

  • What types of features are detected by the software? Which ones are more reliable a priori?

Prokka detects both structural and functional features of a genome: it detects both where protein-coding sequences are and what those proteins do. For functional annotation, Prokka runs sequences through BLAST and also uses HMMER3 as a "trained" method to assign a likely function to the protein encoded by each coding sequence. One could argue that the strength of the functional annotation backs up the results of the structural annotation. Overall, though, I would say the functional annotation is the stronger evidence, as the many hypothetical proteins suggest there is more potential for coding sequences to have been mislabeled as such.

  • How many features of each kind are detected in your contigs? Do you detect the same number of features than the authors? How do they differ?

The author of paper I does not actually report this intermediate result, but within my 8 contigs I found:

  • bases: 3136314
  • tmRNA: 1
  • rRNA: 18
  • tRNA: 70
  • CDS: 3044

  • Why is more difficult to do the functional annotation in eukaryotic genomes?

When dealing with eukaryotic genomes, you have to deal with exons and introns. This is not the case for prokaryotes, whose genomes are more compact: genes are typically uninterrupted, so coding sequences are easier to delimit and annotate.

  • How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?

1292 proteins were labeled as hypothetical because they were not similar enough to known protein sequences. This could be due to errors in this sequence, or because the hypothetical proteins in question have simply not been studied thoroughly. One way to tackle the problem would be to search those proteins against additional databases or with domain-based methods.

  • How can you evaluate the quality of the obtained functional annotation?

There are several ways to evaluate the quality of an obtained functional annotation. One is synteny comparison: the genome I sequenced should share most of the same genes, in a similar arrangement, with the reference genome used.

  • How comparable are the results obtained from two different structural annotation softwares?

This really comes down to the algorithms used for the annotation. For example, software that just looks for potential start and stop codons will not perform as well as software that uses a trained HMM to recognize what coding regions should look like, and that also takes into consideration the sequence context surrounding the region being analyzed.


Mapping (BWA) Questions To think about:

  • What percentage of your reads map back to your contigs? Why do you think that is?

95.26% of the PacBio assembly was covered by Illumina reads when mapped. I think this is not 100% because I trimmed a fair amount off the Illumina sequences due to quality, and also because some reads may have been sequenced incorrectly by either Illumina or PacBio, so that they did not align under the standards set in BWA-MEM.

  • What potential issues can cause mRNA reads not to map properly to gene in the chromosome? Do you expect this to differ between prokaryotic and eukaryotic projects?

A major difference between mapping RNA for prokaryotic and eukaryotic projects is that eukaryotic projects have to deal with exons and introns. Some parts of the RNA are spliced out before the RNA is translated into amino acids, which requires mapping tools that are aware of this and can search for the exons, i.e. the parts of the RNA that are kept and processed in translation. In this project for paper I, the BWA-MEM mapper could be used for both RNA and DNA mapping since exons and introns were not a consideration; that would not be the case for a eukaryotic project. So the difference between prokaryotic and eukaryotic projects is two-fold: the considerations differ as described above, and eukaryotic projects need more specialized, splice-aware tools.

  • What percentage of reads map to genes?

Out of 3044 potential CDS, 1292 did not match up to known proteins. That means 42.44% did not match, which is not great.

  • How many reads do not map to genes? What does that mean? How does that relate to the type of sequencing data you are mapping?

1292 potential CDS did not match up to known genes. This means they might be genes of unknown function, or they might not actually be coding regions at all. The RNA could also have been mapped incorrectly to the DNA, making regions falsely appear to be coding sequences because of where the RNA was misaligned.

  • Do you see big differences between replicates?

At this stage there is not a huge difference between replicates; however, there is a noticeable difference between replicate read counts, as can be seen in the DESeq2 PCA analysis.


Post-mapping analyses (SAMtools) Questions To think about:

  • What is the structure of a SAM file, and how does it relate to a BAM file?

SAM files are Sequence Alignment Map files and BAM files are Binary Alignment Map files: BAM is just a binary version of SAM, which makes it easier to store since it takes up significantly less space. Within a SAM file there is a header, followed by alignment lines with 11 mandatory fields plus optional additional fields. The mandatory fields, in order from left to right, are:

  1. QNAME: The name of the template
  2. FLAG: The bitwise flag
  3. RNAME: Name of the reference sequence
  4. POS: The 1-based mapping position of the leftmost base
  5. MAPQ: Quality of the mapping
  6. CIGAR: Concise Idiosyncratic Gapped Alignment Report
  7. RNEXT: Reference name of the mate/next read
  8. PNEXT: Position of the mate/next read
  9. TLEN: Observed length of the template
  10. SEQ: Sequence
  11. QUAL: Quality score
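A minimal Python sketch of reading those 11 fields from one alignment line (the record itself is made up for illustration):

```python
# The 11 mandatory SAM fields, in order.
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# One hypothetical tab-separated alignment line.
line = "read1\t99\tcontig1\t1042\t60\t8M\t=\t1200\t166\tACGTACGT\tIIIIIIII"
record = dict(zip(FIELDS, line.split("\t")[:11]))

# FLAG is a bitmask; for example, bit 0x4 set means the read is unmapped.
is_unmapped = bool(int(record["FLAG"]) & 0x4)
```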

Read counting (HTSeq) Questions To think about:

  • What is the distribution of the counts per gene? Are most genes expressed? How many counts would indicate that a gene is expressed?

This is a somewhat tricky question to answer, as this data is normalized by DESeq2 as the first step of analysis; after normalization the counts roughly follow a bell curve. Some genes appear more expressed than others simply because the read depth of the RNA was greater at certain points due to the sequencing technique, which makes normalization an essential first step. Not all genes are expressed, but this is hard to tell. A commonly used cut-off is 10 counts, but it is arbitrary: genes work differently and have different requirements for expression, and sometimes a certain threshold needs to be reached before expression can be noticed.

  • In the metagenomics project, ​the data doesn’t offer enough statistical power for a differential expression analysis. Why not? What can you still tell from the data only from the read counts?

The data doesn't offer enough statistical power for differential expression analysis because there are not enough reads per organism to do the comparison. The read counts can still give an idea of how many organisms were sequenced, as the read counts will likely group by organism in a PCA analysis.


Expression Analysis (DESeq2) Questions To think about:

  • If your expression results differ from those in the published article, why could it be?

There are many reasons why this could be, but in a nutshell, results could differ due to different flags and settings used in the steps that handle RNA. An example would be the authors setting a different base quality threshold than the one used in this demonstration; the authors did not include this information in their publication.

  • How do the different samples and replicates cluster together?

For paper I, the RNA from E. faecium grown in BHI clustered together, and the RNA from E. faecium grown in human serum clustered together. This makes sense, as BHI is an enriched medium while human serum is more restrictive. More restrictive conditions mean fewer options for nutrient intake, so organisms grown under them are more likely to express the same set of genes than those grown in an enriched environment, where organisms are not under the same selective and environmental pressures.

  • What effect and implications has the p-value selection in the expression results?

The p-value shouldn't be used as much as the q-value, or adjusted p-value (see the next question). The adjusted p-value plays an important role when looking at expression results: the genes it deems significant are the ones looked into further, while the rest are usually ignored. This means the adjusted p-value cut-off determines the balance between accurate results, false positives, and false negatives. Adjusting the alpha level sets the cut-off below which adjusted p-values are accepted as significant.

  • What is the q-value and how does it differ from the p-value? Which one should you use to determine if the result is statistically significant?

A q-value is an adjusted p-value, and each is better suited to different scenarios. A p-value represents the probability of a certain result happening just by chance, which is perfectly fine when comparing only a couple of groups. It becomes more complicated when thousands of comparisons are being made at once: with that many tests, more results will be deemed significant just by chance according to their p-values. A q-value re-evaluates the p-values to account for the number of comparisons, so that there are fewer false positives. Since differential expression analysis compares thousands of genes at once, the q-value is the better choice here.
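The standard way to turn p-values into q-values in this setting is the Benjamini-Hochberg procedure (the adjustment DESeq2 uses for its padj column, though its implementation is in R). A minimal Python sketch:

```python
# Benjamini-Hochberg adjustment: sort p-values, scale each by m / rank,
# then enforce monotonicity from the largest rank down.
def bh_adjust(pvalues):
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four tests.
qvals = bh_adjust([0.01, 0.04, 0.03, 0.20])
```

At alpha = 0.05, only the first test survives the adjustment in this toy example, even though three raw p-values were below 0.05.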

  • Do you need any normalization step? What would you normalize against? Does DESeq do it?

Normalization is needed for RNA expression analysis; otherwise it is hard to determine whether read counts differ because genes truly are differentially expressed, or whether some genes only appear differentially expressed because of the quality and read depth of a particular part of a sequence, which has more to do with how the RNA was sequenced. DESeq2 does this normalization automatically when the data are entered into its data structure, and its documentation explicitly says NOT to normalize the data before processing it with DESeq2. In this instance for paper I, the reads of RNA taken from E. faecium grown in human serum were compared against reads of RNA taken from E. faecium grown in BHI medium.
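DESeq2's normalization is based on median-of-ratios size factors. A simplified Python sketch of the idea (the counts are made up; real DESeq2 works on thousands of genes and skips genes with zero counts):

```python
import math
from statistics import median

# Hypothetical raw counts: gene -> counts in [sample 1, sample 2].
# Sample 2 was sequenced at exactly twice the depth here.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [30, 60],
}

n_samples = 2
size_factors = []
for s in range(n_samples):
    ratios = []
    for per_sample in counts.values():
        # Geometric mean of this gene's counts across samples.
        geo_mean = math.exp(sum(math.log(c) for c in per_sample) / n_samples)
        ratios.append(per_sample[s] / geo_mean)
    # The size factor is the median gene-wise ratio for this sample.
    size_factors.append(median(ratios))

# Dividing each sample's counts by its size factor equalizes the depths.
normalized_geneA = [c / f for c, f in zip(counts["geneA"], size_factors)]
```

Using the median of ratios makes the factors robust: a handful of genuinely differentially expressed genes will not drag the size factor the way a simple total-count ratio would.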

  • What would you do to increase the statistical power of your expression analysis?

There are several ways to improve the statistical power of an expression analysis, though not all of them are attainable once the data has been collected, such as increasing the sample size. The most straightforward option for data already collected is to change the alpha level. This should be done cautiously, though: a too-restrictive alpha makes it more likely that a false null hypothesis will fail to be rejected, whereas an alpha that is not restrictive enough makes it more likely that a true null hypothesis will be rejected.
