Genome build
We have included a suite of tools, including a genome size survey, a genetic map heatmap and a Hi-C heatmap, to check the quality of a genome build.
Tip
Download the test dataset here.
The raw sequencing data provides a way to estimate the size, ploidy, heterozygosity and repeat content of a genome, similar to GenomeScope. Let's say that you have a kmer count histogram (commonly generated by Jellyfish or another kmer counter) in a file reads.histo.
1 1281576854
2 89292133
3 21588481
4 9347716
5 5569400
6 4705214
The first column is the frequency of a kmer in the sequencing data, and the second column is the number of kmers observed at that frequency. From this, it is easy to infer the genome statistics and annotate them directly on the kmer histogram.
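To make the two-column layout concrete, here is a minimal sketch that parses the six rows shown above into (frequency, count) pairs. The function name `parse_histo` is hypothetical and is not part of jcvi; the sketch only assumes whitespace-separated integer columns.

```python
def parse_histo(lines):
    """Parse two-column kmer histogram lines into (frequency, count) tuples."""
    hist = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, count = line.split()[:2]
        hist.append((int(freq), int(count)))
    return hist


# The rows shown in this document (only an excerpt of a full histogram).
rows = """1 1281576854
2 89292133
3 21588481
4 9347716
5 5569400
6 4705214""".splitlines()

hist = parse_histo(rows)

# Total kmer occurrences covered by these rows: sum of frequency * count.
total_kmers = sum(freq * count for freq, count in hist)
print(hist[0], total_kmers)
```

Each kmer seen at frequency f contributes f occurrences to the read set, which is why the total is a frequency-weighted sum rather than a plain sum of the second column.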
python -m jcvi.assembly.kmer histogram reads.histo "*S. species* ‘Variety 1’" 21
This takes the kmer counts and the species name that goes in the title of the plot. The final argument is the kmer size K that was used to generate the kmer histogram. Behind the scenes, a negative binomial mixture model is applied to approximate the various genome statistics, including the ploidy of the genome.
You can then read the various genome statistics directly from the plot, including, in this example, that the genome is a tetraploid.
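As a rough sanity check on a model-based estimate, a much cruder heuristic divides the total number of non-error kmer occurrences by the depth of the main coverage peak. The sketch below is not the negative binomial mixture model that the tool applies; the function `naive_genome_size`, the `min_freq` error cutoff and the toy histogram are all hypothetical and chosen only for illustration. For a polyploid or highly heterozygous genome the tallest peak may not correspond to the haploid depth, so treat the result with caution.

```python
def naive_genome_size(hist, min_freq=5):
    """Estimate genome size as (total kmer occurrences) / (peak depth).

    hist: list of (frequency, count) pairs.
    min_freq: frequencies below this are treated as sequencing errors
              (an assumed cutoff, not a jcvi default).
    Returns (estimated_size, peak_frequency).
    """
    usable = [(f, c) for f, c in hist if f >= min_freq]
    if not usable:
        raise ValueError("no histogram rows at or above min_freq")
    # Depth of the main peak: the frequency with the largest count.
    peak_freq = max(usable, key=lambda fc: fc[1])[0]
    total = sum(f * c for f, c in usable)
    return total / peak_freq, peak_freq


# Synthetic toy histogram: an error tail at low frequency plus a
# triangular coverage peak centred at depth 20.
toy = [(1, 5000), (2, 800), (3, 100), (4, 40)]
toy += [(f, 100 - 5 * abs(f - 20)) for f in range(5, 36)]

size, peak = naive_genome_size(toy)
print(f"peak depth = {peak}, naive size estimate = {size:.0f}")
```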
After genome assembly, we would often like to perform quality control. One QC step is to compare the assembly to the genetic maps of the organism. Assume that you have the genetic map input matrix (MSTMap format) in the file geneticmap.matrix.
The first column indicates the position of each marker in the current genome assembly, in the format chr1.12345, and the following columns indicate the genotypes of each mapping individual.
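The marker naming convention can be illustrated with a small sketch that splits a name such as chr1.12345 into a sequence ID and an integer position. The helper `parse_marker` and the example row are hypothetical; the sketch assumes whitespace-separated columns and splits on the last dot so that sequence names containing dots still parse.

```python
def parse_marker(name):
    """Split 'chr1.12345' into ('chr1', 12345), using the last dot."""
    seqid, pos = name.rsplit(".", 1)
    return seqid, int(pos)


def parse_row(line):
    """Split one matrix row into ((seqid, pos), [genotypes...])."""
    fields = line.split()
    return parse_marker(fields[0]), fields[1:]


# Illustrative row only; genotype codes here are made up for the example.
marker, genotypes = parse_row("chr1.12345\tA\tB\tA\tA")
print(marker, genotypes)

# Markers sort naturally by (seqid, position) once parsed.
names = ["chr1.500", "chr1.12345", "chr1.42"]
ordered = sorted(names, key=parse_marker)
print(ordered)
```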
The genetic map quality control can then be visualized as a heatmap with one command:
python -m jcvi.assembly.geneticmap heatmap geneticmap.matrix
Entries in the heatmap correspond to the linkage disequilibrium between pairs of markers. In this example there is signal between chr4 and chr6, suggesting a potential mis-assembly (or possibly a rearrangement between the mapping parents).
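For intuition about what a single heatmap entry measures, the sketch below computes a squared correlation (r²) between the genotype vectors of two markers, a common way to quantify linkage disequilibrium. This is an assumption made for illustration and is not necessarily how the tool computes its values; the function `r_squared` and the A/B genotype coding are hypothetical, and any other code is treated as missing.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors.

    g1, g2: sequences of genotype codes, one per mapping individual.
    Only individuals where both markers are 'A' or 'B' are used.
    """
    valid = {"A", "B"}
    pairs = [(int(a == "A"), int(b == "A"))
             for a, b in zip(g1, g2) if a in valid and b in valid]
    n = len(pairs)
    if n == 0:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic marker carries no linkage information
    return cov * cov / (vx * vy)


linked = r_squared("AABB", "AABB")     # identical segregation pattern
unlinked = r_squared("AABB", "ABAB")   # independent segregation pattern
print(linked, unlinked)
```

Markers that sit close together on the same chromosome should segregate together and give values near 1, so a strong value between markers assigned to different chromosomes is the kind of off-diagonal signal worth investigating.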
© Haibao Tang, 2010-2024