Questions Regarding Analysis Tools
Only 421 bp were trimmed with Trimmomatic, which reaffirms that the initial reads were already of good quality.
Trimming reduces the coverage of the sequenced region, but the remaining reads are of better quality. This means that there is less of the sequenced region available to analyze, but the results are more trustworthy.
To be honest, the results were great before I trimmed. However, to learn Trimmomatic better, I set it to trim where the Q score dropped below 27 instead of the rather permissive default of 15.
These are all options available in Trimmomatic. SLIDINGWINDOW trims the read when the average quality within a window of a defined number of base pairs falls below the set quality threshold. LEADING looks at low-quality base pairs and removes them from the leading end of the sequence. TRAILING does the same as LEADING but for the trailing end of the sequence. The leading and trailing ends often have the lowest quality scores, so it makes sense that the two ends are looked at with more scrutiny than the middle.
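To illustrate how SLIDINGWINDOW-style trimming works, here is a minimal Python sketch of the idea (this is not Trimmomatic's actual implementation; the window size of 4 and threshold of 27 are example values):

```python
def sliding_window_trim(qualities, window=4, threshold=27):
    """Return the position at which to cut the read: the start of the first
    window whose average Phred quality falls below the threshold."""
    for start in range(len(qualities) - window + 1):
        if sum(qualities[start:start + window]) / window < threshold:
            return start          # trim from here to the end of the read
    return len(qualities)         # no low-quality window found; keep everything

# Example: qualities decoded from a FASTQ quality string (Phred+33 encoding)
quals = [ord(c) - 33 for c in "IIIIIIIIIIIIII###"]
print(f"keep the first {sliding_window_trim(quals)} bases")  # keep the first 12 bases
```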
Canu has its own trimming and error-correction features and shows what those are as it runs. Canu also reports early in the slurm file the length of the k-mers it is processing, how many of those are unique, and other basic statistics about the mers being processed.
Paper I looks at data that is both trimmed and untrimmed, while I only look at the number of contigs from trimmed data, as I trimmed the sequences to be of higher quality before assembling the genome. The paper got 15 contigs before going back and trimming the original sequences. Then, with unlisted settings, paper I trimmed the sequences before assembling and aligning and got 10 contigs. I got 8, but I also set strict quality standards in Trimmomatic, which may have been stricter than those in the original paper.
A contig is the resulting continuous sequence that is built from a set of overlapping reads with no gaps. Contigs are built from at least one unitig, which is a high-confidence contig. This means that there is little to no variance within the overlapping sequences that make up the unitig.
A scaffold is made up of contigs, but is made longer by including gaps, which are shown by the letter 'N'.
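As a toy illustration of that (the contig sequences and gap size below are made up), joining two contigs into a scaffold with an estimated gap looks like this:

```python
contig_1 = "ATGCGTACGTTAGC"    # made-up contig sequences
contig_2 = "GGATCCTTAGCAAT"
gap = "N" * 50                 # estimated gap length between the contigs
scaffold = contig_1 + gap + contig_2
print(len(scaffold))           # 14 + 50 + 14 = 78
```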
A k-mer is a string of nucleotides of length k that is part of a larger string. For example, a 17-mer is a string of 17 nucleotides. As for which k should be used, that is a complicated question: longer k-mers are more likely to contain sequencing errors, so shorter ones are more accurate, but shorter k-mers are also more likely to line up with incorrect areas of the genome just by chance.
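A minimal sketch of what "the k-mers of a sequence" means (the sequence and k = 5 are arbitrary example values):

```python
def kmers(sequence, k):
    """Yield every substring of length k (the k-mers) of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

seq = "AGTCCGATA"           # toy example
print(list(kmers(seq, 5)))  # ['AGTCC', 'GTCCG', 'TCCGA', 'CCGAT', 'CGATA']
```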
When the Canu assembler does the correction step, it replaces "noisy" parts of sequence reads with sequence that Canu has built from overlapping reads.
I do not see any letters apart from A, G, T, and C in my assembly.
N50 is close to being the mean or median contig length of an assembly, with the difference being that N50 gives greater weight to longer contigs: it is the contig length at which the contigs of that length or longer contain at least 50% of the assembled bases. The L50 is the number of contigs it takes to reach the N50, i.e. to cover half of the assembly. The N90 statistic is defined the same way but for 90% of the assembled bases, so only the shortest contigs making up the final 10% of the assembly are shorter than the N90 length.
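A small sketch of how N50, L50, and N90 can be computed from a list of contig lengths (the lengths below are made up):

```python
def nx_stats(contig_lengths, fraction):
    """Return (Nx, Lx): the contig length at which the running sum of
    sorted contig lengths first reaches `fraction` of the assembly size,
    and how many contigs it took to get there."""
    lengths = sorted(contig_lengths, reverse=True)
    target = fraction * sum(lengths)
    running = 0
    for i, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, i

lengths = [800_000, 600_000, 300_000, 150_000, 50_000]  # toy contig lengths
n50, l50 = nx_stats(lengths, 0.5)
n90, _ = nx_stats(lengths, 0.9)
print(n50, l50, n90)  # 600000 2 150000
```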
I tend to look at completeness, number of contigs, total size, GC%, and the number of mismatches per 100 kb, since one statistic alone is not enough to indicate good quality. However, a single statistic alone can reveal bad quality, which is another reason why looking at just one stat is not enough.
This is hard to say, as the original authors' assembly is not public and the reference genome for _E. faecium_ was used in its place, so there were some slight differences. With that said, some differences could have been caused simply by chance or by having different settings in the different tools used.
The authors' assembly was not made public, so the reference genome for the species was used in its place, and it is therefore harder to compare the two.
This is fine since metagenomics sequences everything in the provided sample. To expect everything to map back to a reference genome would ignore the fact that there might not be a reference genome out there for everything you are sequencing.
It does not mean that your assembly is bad for the reasons stated above.
Prokka detects both structural and functional features of a genome: it detects both where protein-coding sequences are and what those proteins do. The tools used for functional annotation, i.e. detecting what the proteins do, run sequences through BLAST and also use HMMER3 as a "trained" method to find the supposed function of a protein encoded in a coding sequence. One could also say that the strength of the functional annotation backs up the results of the structural annotation. Overall, though, I would say that the functional annotation is stronger, since the many hypothetical proteins suggest there is more potential for those regions to have been mislabeled as coding sequences in the first place.
The author of paper I actually does not give this intermediate result, but I found that within my 8 contigs there are:
- bases: 3136314
- tmRNA: 1
- rRNA: 18
- tRNA: 70
- CDS: 3044
When dealing with eukaryotic genomes, you have to deal with exons and introns. This is not the case for prokaryotes, whose genomes are more compact and generally lack introns.
1292 proteins were labeled as hypothetical proteins because they were not similar enough to known protein sequences. This could be due to errors in this assembly, or to the proteins in question simply not having been studied thoroughly.
There are several ways to evaluate the quality of an obtained functional annotation. One of them is synteny comparison, since the genome I assembled should share most of the same genes, in roughly the same order, with the reference genome used.
This really just comes down to the algorithm used for the annotation. For example, software that just looks for potential start and stop codons will not perform as well as software that has been trained, e.g. with hidden Markov models, to know what coding regions should look like and that also takes into consideration the sequence context surrounding the region being directly analyzed.
95.26% of the PacBio assembly was covered by Illumina reads when mapped. I think this is not 100% because I trimmed a decent number of bases off the Illumina sequences due to quality, and also because some sequences from either Illumina or PacBio could have been sequenced incorrectly, such that they did not align by the standards set in BWA-MEM.
A major difference between mapping RNA reads in prokaryotic versus eukaryotic projects is that eukaryotic projects have to deal with exons and introns. Some parts of the RNA are spliced out before the RNA is translated into amino acids, which requires mapping tools that are aware of this and can search for the exons, i.e. the parts of the RNA that are kept and used in translation. In this project for paper I, the BWA-MEM mapper could be used for both RNA and DNA mapping since exons and introns were not a consideration, which would not be the case for a eukaryotic project. So the difference between prokaryotic and eukaryotic projects is two-fold: the biological considerations differ, and eukaryotic projects need more specialized, splice-aware tools.
Out of 3044 potential CDS, 1292 did not match up to known proteins. That means that 42.44% did not match, which is not great.
1292 potential CDS did not match up to known genes. This means that they might be genes of unknown function or might not have been accurately identified as coding regions. The RNA could also have been mapped incorrectly to the DNA, making regions falsely appear to be coding sequences because of where the RNA was incorrectly aligned.
At this stage there is not a huge difference between replicates; however, there is a noticeable difference between the replicates' read counts, as can be seen with DESeq2's PCA analysis.
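DESeq2 provides its own PCA plot of the samples; purely to illustrate the idea (this is not DESeq2's code), here is a small Python sketch that projects a made-up count matrix onto two principal components. The counts, sample labels, and the simple log2 transform are all assumptions for the example:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy count matrix: rows are samples, columns are genes (values are made up)
counts = np.array([
    [120,  5, 300,  40],   # BHI replicate 1
    [110,  7, 280,  35],   # BHI replicate 2
    [ 30, 90,  60, 200],   # serum replicate 1
    [ 25, 95,  55, 210],   # serum replicate 2
])

log_counts = np.log2(counts + 1)               # simple variance-damping transform
pcs = PCA(n_components=2).fit_transform(log_counts)
print(pcs)  # replicates from the same medium should land close together
```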
SAM files are Sequence Alignment Map files and BAM files are Binary Alignment Map files, meaning that BAM is just a binary version of SAM that is more easily stored because it takes up significantly less space. Within a SAM file, there is a header followed by alignment lines containing 11 mandatory fields, with the option of adding other fields. These required fields, in order from left to right, are (a small parsing sketch follows the list):
- QNAME: The name of the template
- FLAG: The bitwise flag
- RNAME: Name of the reference sequence
- POS: The 1-based leftmost mapping position
- MAPQ: Quality of the mapping
- CIGAR: Concise Idiosyncratic Gapped Alignment Report
- RNEXT: Reference sequence name of the mate/next read
- PNEXT: Position of the mate/next read
- TLEN: Observed length of the template
- SEQ: Sequence
- QUAL: Quality score
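For illustration, a minimal Python sketch of splitting one alignment line into these eleven mandatory fields (the example record is made up):

```python
FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
          "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# a made-up alignment line; real records come after the '@'-prefixed header lines
line = ("read_1\t99\tcontig_1\t1042\t60\t100M\t=\t1200\t258\t"
        + "A" * 100 + "\t" + "I" * 100)

values = line.rstrip("\n").split("\t")
record = dict(zip(FIELDS, values[:11]))   # any extra columns are optional tags
print(record["RNAME"], record["POS"], record["CIGAR"])
```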
This is a somewhat tricky question to answer, as the data are normalized by DESeq2 as the first step of the analysis, so in this case the normalized counts roughly follow a bell curve. Some genes appear to be more expressed than others simply because the read depth of the RNA was deeper at certain points due to the sequencing technique, which is what makes normalization an essential first step. Not all genes are expressed, but this is hard to tell. The cut-off that is commonly used is 10 counts, but this is arbitrary. Genes work differently and have different requirements for expression; sometimes a certain threshold needs to be reached before expression can be noticed.
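To illustrate the kind of cut-off mentioned above (a convenience pre-filtering step, not something DESeq2 requires), here is a hedged pandas sketch that drops genes whose total count across samples falls below 10; the table, sample names, and gene names are made up:

```python
import pandas as pd

# made-up count table: rows are genes, columns are samples
counts = pd.DataFrame(
    {"BHI_1": [0, 250, 3], "BHI_2": [1, 230, 2], "serum_1": [0, 40, 1]},
    index=["gene_A", "gene_B", "gene_C"],
)

keep = counts.sum(axis=1) >= 10   # the arbitrary cut-off of 10 counts
filtered = counts[keep]
print(filtered.index.tolist())    # ['gene_B']
```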
The data don't offer enough statistical power for differential expression analysis because there are not enough reads per organism to do the comparison. The read counts can still help give an idea of how many organisms are being sequenced, as the read counts will likely be grouped by organism in a PCA analysis.
There are many reasons why this could be, but in a nutshell the results could differ due to different flags and settings used during the steps that deal with RNA. An example would be if the authors set a different baseline quality level than was used in this demonstration. The authors did not include this information in their publication.
For paper I, the RNA from _E. faecium_ grown in BHI clustered together and the RNA from _E. faecium_ grown in human serum clustered together. This makes sense, as BHI is an enriched medium while human serum is more restrictive. Having more restrictive conditions means that there are fewer options for nutrient intake, so organisms grown in those conditions are more likely to express the same set of genes than those grown in an enriched environment. Those grown in an enriched environment do not have to undergo the same selective and environmental pressures, so they need not express the same set of genes.
The p-value shouldn't be used as much as the q-value, or adjusted p-value (see the next question). The adjusted p-value plays an important role when looking at expression results: the genes it deems significant are the ones that are looked into further, while the rest are usually ignored. This means that the adjusted p-value is responsible for the accuracy of the reported results, including the false positives and false negatives. The chosen alpha level determines the cut-off below which an adjusted p-value is accepted as significant.
A q-value is an adjusted p-value, and each is better to use in different scenarios. A p-value in general represents the probability of a certain result happening just by chance. This is perfectly fine when only comparing a couple of groups, but it becomes more complicated when thousands of comparisons are being made. When this many comparisons are done, more results are going to be deemed significant just by chance according to their raw p-values. A q-value re-evaluates the p-values in light of how many comparisons were made, so that there are fewer false positives. Since differential expression analysis compares thousands of genes at once, the q-value is better in this instance.
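DESeq2 reports adjusted p-values computed with the Benjamini-Hochberg procedure by default; as a minimal illustration of what that adjustment does, here is a Python sketch using statsmodels on a handful of made-up p-values:

```python
from statsmodels.stats.multitest import multipletests

pvalues = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]  # made-up p-values
reject, qvalues, _, _ = multipletests(pvalues, alpha=0.05, method="fdr_bh")
for p, q, sig in zip(pvalues, qvalues, reject):
    print(f"p={p:.3f}  adjusted={q:.3f}  significant={sig}")
```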
Normalization is needed for RNA expression analysis; otherwise it is hard to determine whether read counts differ because genes truly are differentially expressed or whether some genes only appear differentially expressed because of the quality and read depth of a particular part of a sequence, which has more to do with how the RNA was sequenced. However, DESeq2 does normalization automatically once the data are loaded into its data set, and it explicitly says NOT to try to normalize the data before processing it with DESeq2. In this instance for paper I, the reads of RNA taken from _E. faecium_ grown in human serum were normalized against reads of RNA taken from _E. faecium_ grown in BHI medium.
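DESeq2's automatic normalization estimates a size factor per sample with the median-of-ratios method; purely as an illustration of that idea (not DESeq2 itself), here is a numpy sketch on a made-up count matrix with no zero counts (real implementations skip genes with zeros when taking the geometric means):

```python
import numpy as np

# made-up counts: rows are genes, columns are samples (e.g. BHI vs serum replicates)
counts = np.array([
    [100, 200, 400],
    [ 50, 100, 200],
    [ 30,  60, 120],
], dtype=float)

# geometric mean of each gene across samples, computed in log space
log_counts = np.log(counts)
log_geo_means = log_counts.mean(axis=1)

# each sample's size factor is the median ratio of its counts to the geometric means
size_factors = np.exp(np.median(log_counts - log_geo_means[:, None], axis=0))
normalized = counts / size_factors
print(size_factors)   # [0.5, 1.0, 2.0] for this toy matrix
print(normalized)     # every gene ends up with equal normalized counts across samples
```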
There are several ways to improve the statistical power of expression analysis; however, not all of them are attainable once the data have been collected, such as increasing the sample size. The most straightforward way to improve the analysis for data already collected is to change the alpha level. This should be done cautiously, though: too restrictive an alpha value makes it more likely that a false null hypothesis will fail to be rejected, whereas an alpha value that is not restrictive enough makes it more likely that a true null hypothesis will be rejected.