Bioinformatics measure comparison framework with several implemented measure functions.
This README is currently out-of-date. I will update it as soon as everything is working again.
This software is under active development and you should expect it to be buggy. It might or might not work, and it is certainly not guaranteed to be usable for any purpose whatsoever.
This software accompanies a paper that is yet to be written; a citation will be added here.
The following are needed:
- A C++ compiler supporting at least C++11. The code is tested with clang++ 4.0.1 and g++ 7.1.1.
- Boost development libraries. For my system, this is boost_1_64-devel.
- Boost edit distance. This needs to be installed as a subdirectory that matches INC in the Makefile. I use the default of edit_distance.
- If you want to create a text version of this README, you need pandoc.
The Makefile should work on recent enough systems. See the comments at
the top of the file for choosing a compiler, etc. If it does not work,
fix it and send a pull request.
The program checkpoints after calculating each row of the distance
matrix. You need to create the checkpoint directory before running.
The default is metrictest.checkpoint.
These data files are not mine, but they are in the data/ directory
for experimentation and testing.
AF091148
A small (1408 sequences) FASTA file that works well for testing since it tends to be fast.
1688_seqs_nophix
A set of sequences from the paper Open-Source Sequence Clustering Methods Improve the State Of the Art. To save you reading the paper, you can download the FASTQ file here. This dataset is claimed to come from the Bokulich et al. paper Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. You have to convert from FASTQ to FASTA format; many conversion utilities exist. I have already done this for the version in the data/ directory.
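For readers who prefer not to install a separate conversion utility, the FASTQ-to-FASTA step can be sketched in a few lines of C++. This is only an illustration and is not part of metrictest: the function names are hypothetical, and the code assumes the common four-line FASTQ layout (header, sequence, `+`, quality) with no line-wrapped sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Strip the leading '@' from a FASTQ header line.
// Precondition (checked by the caller below): the header starts with '@'.
inline std::string fasta_name(const std::string& fastq_header) {
    assert(!fastq_header.empty() && fastq_header[0] == '@');
    return fastq_header.substr(1);
}

// Convert four-line FASTQ records to FASTA and return the number of
// records written. Conversion stops at the first malformed header.
// Assumption: sequences are not wrapped across several lines.
inline std::size_t fastq_to_fasta(std::istream& in, std::ostream& out) {
    std::string header, seq, plus, qual;
    std::size_t written = 0;
    while (std::getline(in, header) && std::getline(in, seq) &&
           std::getline(in, plus) && std::getline(in, qual)) {
        if (header.empty() || header[0] != '@') break;
        out << '>' << fasta_name(header) << '\n' << seq << '\n';
        ++written;
    }
    return written;
}

// Convenience wrapper for in-memory text, handy for quick checks.
inline std::string fastq_text_to_fasta(const std::string& fastq) {
    std::istringstream in(fastq);
    std::ostringstream out;
    fastq_to_fasta(in, out);
    return out.str();
}
```

To convert a file, open it with std::ifstream, open the destination with std::ofstream, and pass both to fastq_to_fasta. Quality scores are simply discarded, since FASTA has no place for them.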
The definitive option description is in Options.h. This summary
might be out-of-date.
--restart
Restart from a checkpoint. The default is to not restart. If you are restarting, you must ensure that the checkpointdir is correct. checkpointdir is the only command-line option that is valid when restarting. All other command-line options will come from the checkpoint; you cannot change them mid-run.
--fasta=foo
Read FASTA-format sequences from the file foo. Required.
--measure=foo
The distance measure to use (foo in this example). This must be a value known to createmetric.cpp. Required.
--submeasure=foo
Some distance measurements have submeasures. For example, kmer distances can be calculated via Euclidean or cosine variants. The measure function you use must understand the submeasure you supply. Optional, depending on the distance measure you use.
--measureopt=foo
Supply the option foo to the measure function. For example, kmers need to know the value for k. Optional, depending on the distance measure you use.
--distmatfname=foo
Write the resulting distance matrix to the file foo. Required. No default.
--ncores=n
Use n threads. The maximum (and default) value is the number of cores that the system has. Optional.
--checkpointdir=foo
Write all checkpoint information to files in the directory foo.
--printresult=true|false
Whether or not to print the resulting distance matrix. For a matrix of any size, it is impractical to print. The default is false. If you set this to true then the result is printed.
./metrictest --measure=kmer --submeasure=cosine --measureopt=7 --fasta=data/AF091148.fasta --distmatfname=AF091148-7mercosine-distances
./metrictest --measure=kmer --submeasure=euclidean --measureopt=7 --fasta=data/AF091148.fasta --distmatfname=AF091148-7mereuclidean-distances
./metrictest --ncores=7 --measure=edit --fasta=data/AF091148.fasta --distmatfname=AF091148-edit-distances
./metrictest --measure=edit --measureopt=pam250ish --fasta=data/AF091148.short.fasta --distmatfname=AF091148-edit-distances
This last one uses edit weights from the file pam250ish (which must
exist before you run it). That file is not included due to questions
about the biological meanings of the values in it.
If you want to use the edit distance with weights, you need to create a weight file. This file contains 16 (4x4 in order of ACGT for both rows and columns) long double values that are the cost of a mutation from one base to the other. Numbers are separated by white space; I put four values per line. The matrix must be symmetric. A cost of 0 for the diagonal is a good idea, but will never be used (why have a cost for doing nothing?).
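The weight-file layout described above can be checked with a short C++ sketch before handing the file to metrictest. This is not the program's own parser: the names WeightMatrix, read_weights, parse_weights, and base_index are hypothetical, and the code only assumes what the paragraph states (16 whitespace-separated long double values, rows and columns in ACGT order, symmetric).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// 4x4 substitution costs; rows and columns are both in the order A, C, G, T.
using WeightMatrix = std::array<std::array<long double, 4>, 4>;

// Read 16 whitespace-separated values and verify that the matrix is
// symmetric. Throws std::runtime_error on a short or asymmetric file.
inline WeightMatrix read_weights(std::istream& in) {
    WeightMatrix w{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            if (!(in >> w[r][c]))
                throw std::runtime_error("weight file needs 16 numeric values");
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = r + 1; c < 4; ++c)
            if (w[r][c] != w[c][r])
                throw std::runtime_error("weight matrix must be symmetric");
    return w;
}

// Convenience wrapper for in-memory text.
inline WeightMatrix parse_weights(const std::string& text) {
    std::istringstream in(text);
    return read_weights(in);
}

// Map a base to its row/column index (hypothetical helper).
// Precondition: base is one of 'A', 'C', 'G', 'T'.
inline std::size_t base_index(char base) {
    const std::string order = "ACGT";
    const std::size_t i = order.find(base);
    assert(i != std::string::npos);
    return i;
}
```

With this in place, the cost of an A-to-T substitution would be looked up as w[base_index('A')][base_index('T')]. The diagonal is read but, as noted above, never needs to be consulted.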
The following functions currently exist:
- Measure edit uses Levenshtein distance between sequences. The default is unit cost per operation (insertion, deletion, substitution). You can provide a cost matrix in the file foo with the --measureopt=foo command-line option.
- Measure kmer uses k-mers. You must supply a value for k by using --measureopt=k. You must supply a --submeasure=foo where foo is either euclidean or cosine.
  - Euclidean is currently Euclidean squared, as described in K-mer based distance estimation.
  - Cosine is described in the Wikipedia page and (probably?) used in Apostolico, A; Denas, O (March 2008). Fast algorithms for computing sequence distances by exhaustive substring composition.
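To make the two kmer submeasures concrete, here is a minimal C++ sketch of k-mer counting followed by squared Euclidean and cosine distance. It is an illustration of the general technique only, not the code in this repository: the names KmerCounts, count_kmers, euclidean_squared, and cosine_distance are hypothetical, and defining cosine distance as one minus the cosine similarity of the count vectors is an assumption about the variant used.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>

// Sparse k-mer count vector: k-mer text -> number of occurrences.
using KmerCounts = std::map<std::string, double>;

// Count every overlapping k-mer in seq. Precondition: k >= 1.
inline KmerCounts count_kmers(const std::string& seq, std::size_t k) {
    assert(k >= 1);
    KmerCounts counts;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        counts[seq.substr(i, k)] += 1.0;
    return counts;
}

// Squared Euclidean distance: sum over all k-mers of (a_i - b_i)^2.
inline double euclidean_squared(const KmerCounts& a, const KmerCounts& b) {
    double sum = 0.0;
    for (const auto& kv : a) {
        const auto it = b.find(kv.first);
        const double diff = kv.second - (it == b.end() ? 0.0 : it->second);
        sum += diff * diff;
    }
    // k-mers that occur only in b contribute (0 - b_i)^2.
    for (const auto& kv : b)
        if (a.find(kv.first) == a.end()) sum += kv.second * kv.second;
    return sum;
}

// Cosine distance, assumed here to be 1 - (a . b) / (|a| |b|).
// Precondition: both sequences contain at least one k-mer.
inline double cosine_distance(const KmerCounts& a, const KmerCounts& b) {
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (const auto& kv : a) {
        norm_a += kv.second * kv.second;
        const auto it = b.find(kv.first);
        if (it != b.end()) dot += kv.second * it->second;
    }
    for (const auto& kv : b) norm_b += kv.second * kv.second;
    assert(norm_a > 0.0 && norm_b > 0.0);
    return 1.0 - dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

For example, with k = 2 the sequences ACGT and ACGA share the k-mers AC and CG and differ in one k-mer each (GT versus GA), so the squared Euclidean distance is 2 and the cosine distance is 1 - 2/3. A std::map keeps the sketch short; a dense array indexed by an integer encoding of the k-mer would be the usual choice when speed matters.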