Home
While long-read sequencing technologies (like Oxford Nanopore) have gotten a lot better in recent years, long-read-only assemblies still suffer from some consensus sequence errors. Homopolymer-length errors are the most common type, e.g. AAAAAAAA becoming AAAAAAA. One can use short reads (like from an Illumina platform) to correct errors in a long-read assembly, a process known as short-read polishing. There are a number of short-read polishing tools, including HyPo, NextPolish, ntEdit, Pilon, POLCA and Racon.
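To make the homopolymer-length error above concrete, here is a minimal Python sketch that run-length encodes two sequences and reports where their homopolymer lengths differ. The function names and the toy sequences are made up for illustration and are not part of Polypolish or any other tool.

```python
from itertools import groupby


def run_lengths(seq):
    """Collapse a sequence into (base, run length) pairs, e.g. AAAC -> [(A, 3), (C, 1)]."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def homopolymer_differences(truth, draft):
    """List the runs whose length differs between two otherwise identical sequences.

    Each result is (run index, base, length in truth, length in draft).
    This only handles the simple case where the order of bases matches.
    """
    truth_runs, draft_runs = run_lengths(truth), run_lengths(draft)
    if [b for b, _ in truth_runs] != [b for b, _ in draft_runs]:
        raise ValueError("sequences differ by more than homopolymer lengths")
    return [
        (i, base, t_len, d_len)
        for i, ((base, t_len), (_, d_len)) in enumerate(zip(truth_runs, draft_runs))
        if t_len != d_len
    ]


# Hypothetical toy sequences: the draft assembly has lost one A from an 8-base run.
truth = "GATC" + "A" * 8 + "CGT"
draft = "GATC" + "A" * 7 + "CGT"
diffs = homopolymer_differences(truth, draft)
print(diffs)
```

A real comparison would need a proper alignment between assembly and reference. This only shows what the error type looks like.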
However, errors in repeat sequences can be difficult to fix. Most of those short-read polishing tools rely on alignments generated from tools like BWA. When run with default settings, aligners put each read in a single best location (randomly chosen in the case of a tie). So if the assembly has an error in a repeat, reads may not align to it because they can get a better alignment in other instances of the repeat. For example, consider a genome with a two-copy exact repeat (I'll call them copy A and copy B), and the assembly of this genome has an error in copy A. When aligning short reads to the assembly, all reads which originated from the error-containing region of copy A will instead align to the corresponding region of copy B, because they can achieve a more accurate alignment there. This leaves no reads aligned over the error, and a short-read polishing tool will therefore have no information with which to fix the error.
Long-read assemblies with short-read polishing can be very accurate, and their non-repeat sequences may in fact be perfect. But due to the scenario described above, errors often remain in repeat sequences. This frustratingly keeps truly error-free genome assemblies out of reach.
Polypolish is a short-read polishing tool that differs from existing tools in an important way: it takes as input short-read alignments where each read is aligned to all possible locations. This means that errors in repeats will be covered by short-read alignments, and Polypolish can therefore fix those errors. For an illustrated walk-through of how it works, check out the Toy example page of this wiki.
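The difference between the two alignment strategies can be sketched with a toy simulation. This is not how BWA or Polypolish work internally. It assumes ungapped placement scored by mismatch count, a made-up 70 bp genome with a two-copy exact repeat, and a substitution error in copy A (real errors are often homopolymer indels, but a substitution keeps the sketch short). All names and sequences here are hypothetical.

```python
READ_LEN = 8
REPEAT = "ACGTTGCAAGGCTTACGATC"

# True genome: copy A at positions 10-29, copy B at positions 40-59.
genome = "G" * 10 + REPEAT + "T" * 10 + REPEAT + "C" * 10

# Draft assembly: same as the genome, but with one wrong base inside copy A.
ERROR_POS = 20
assembly = genome[:ERROR_POS] + "A" + genome[ERROR_POS + 1:]

# Error-free short reads drawn from copy A of the true genome, all spanning ERROR_POS.
reads = [genome[s:s + READ_LEN] for s in range(13, 21)]


def placements(read, ref):
    """Score every ungapped placement of the read as (position, mismatch count)."""
    k = len(read)
    return [
        (i, sum(a != b for a, b in zip(read, ref[i:i + k])))
        for i in range(len(ref) - k + 1)
    ]


def best_only(read, ref):
    """Keep a single best placement (first on ties; real aligners pick randomly)."""
    hits = placements(read, ref)
    fewest = min(m for _, m in hits)
    return [pos for pos, m in hits if m == fewest][:1]


def all_locations(read, ref, max_mismatches=1):
    """Keep every placement with at most max_mismatches mismatches."""
    return [pos for pos, m in placements(read, ref) if m <= max_mismatches]


def coverage_at(pos, strategy):
    """Count read placements that overlap the given assembly position."""
    return sum(
        1
        for read in reads
        for start in strategy(read, assembly)
        if start <= pos < start + READ_LEN
    )


best_cov = coverage_at(ERROR_POS, best_only)
all_cov = coverage_at(ERROR_POS, all_locations)
print("coverage over the error, single best location:", best_cov)
print("coverage over the error, all locations:", all_cov)
```

With single-best placement, every read lands on copy B, where it matches exactly, so nothing covers the error in copy A. Keeping all placements puts the same reads over the error as well, which is the information a polisher needs.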
In my tests, Polypolish outperformed other short-read polishing tools on long-read assemblies of bacterial genomes. A manuscript is in the works – stay tuned!
No polishing tool is perfect, Polypolish included. This means that you should make your long-read assembly as accurate as possible before doing any short-read polishing. Trycycler can deliver clean assemblies which are free from medium-to-large scale errors. Long-read polishing with Medaka can further improve accuracy, so this should be done before running Polypolish.
Since different short-read polishers use different algorithms, I have found that using a combination of tools can deliver the most accurate assemblies. Aside from Polypolish, my favourite short-read polishers are HyPo, ntEdit, Pilon and POLCA – try using them in addition to Polypolish. In particular, I've seen very nice results by using a combination of Polypolish, HyPo and ntEdit.
Nothing about Polypolish is intrinsically specific to bacterial genomes – its approach should work on eukaryotes too. However, I've only ever used it on bacterial genomes, and repeat-rich eukaryote genomes might cause issues. Specifically, aligning short reads to all possible locations could result in a lot of alignments if the genome is big and highly repetitive. Try at your own risk!
Check out the Software requirements and Installation pages to get Polypolish up and running. Then the How to run Polypolish page will show you how to use it. Don't worry – it's pretty simple!