TagGD

TagGD is a tool to create ("multiplex") and search for ("demultiplex") unique genetic barcodes. Its main strength is that the demultiplexing algorithm can cope with a large number of simultaneous barcodes and also allows for indels when demultiplexing. It uses a kmer-based algorithm internally, and the user can tune several parameters.

TagGD has been published in PLoS One. The multiplexer is available as C++ code from another GitHub repository, along with the old C++ version of the demultiplexer. This repository contains the new Python/Cython version of the demultiplexer.

For inquiries, please send an email to [email protected], or [email protected].

Demultiplexer

The demultiplexer takes as input a file of sequence reads and a file containing the true barcodes and their additional properties. Each read is searched for a matching barcode; if one is found, the read is augmented with the barcode and the attributes of that barcode. The demultiplexer is written in Cython and is parallelized to make full use of all machine cores.
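To make the idea of a kmer-based search with indel tolerance concrete, here is a minimal sketch in pure Python. It is not TagGD's actual implementation: the function names are hypothetical, the scoring is a plain Levenshtein distance rather than one of TagGD's metrics, and `k` and `max_edit_distance` only borrow their names from the command-line options. The sketch indexes every kmer of every true barcode, collects the barcodes that share at least one kmer with the observed sequence, and keeps the candidates with the lowest edit distance.

```python
from collections import defaultdict


def build_kmer_index(barcodes, k):
    """Map every k-mer to the set of barcodes that contain it."""
    index = defaultdict(set)
    for barcode in barcodes:
        for i in range(len(barcode) - k + 1):
            index[barcode[i:i + k]].add(barcode)
    return index


def edit_distance(a, b):
    """Plain Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            ))
        previous = current
    return previous[-1]


def best_matches(observed, index, k, max_edit_distance):
    """Return (distance, [barcodes]) for the closest candidates, or None."""
    candidates = set()
    for i in range(len(observed) - k + 1):
        candidates |= index.get(observed[i:i + k], set())
    scored = [(edit_distance(observed, c), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d <= max_edit_distance]
    if not scored:
        return None
    best = min(d for d, _ in scored)
    return best, sorted(c for d, c in scored if d == best)


barcodes = ["ATTGGGCACAGACGCAGACCTCGTACG", "AATCAAAGTTAATTATGCATTGCGGTT"]
index = build_kmer_index(barcodes, k=7)

# The first barcode with one internal base deleted (an indel).
observed = "ATTGGGCACAGACCAGACCTCGTACG"
print(best_matches(observed, index, k=7, max_edit_distance=3))
```

In this sketch, a result with more than one barcode in the list would correspond to an ambiguous match, and `None` to an unmatched read.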

Input

The input reads file can be in FASTQ, FASTA, SAM or BAM format. The true barcodes file must be tab-delimited and header-less, with the barcodes in the first column followed by an arbitrary number of attribute columns, e.g. here with two additional integer fields:

ATTGGGCACAGACGCAGACCTCGTACG    106   194
AATCAAAGTTAATTATGCATTGCGGTT    108   194
TACACGCCTTGTCTTGTTAACATATTT    110   194
AATCCTACTCTCAAGAACTTTGGCTCT    114   194
GCTCCTGTACTGTCAGCCTCACTGCCC    114   194
...
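If you want to sanity-check a barcodes file before running the demultiplexer, a small reader like the following may help. It is a sketch only: `read_true_barcodes` is a hypothetical helper, not part of TagGD, and rejecting duplicate barcodes is an assumption based on the barcodes being described as unique.

```python
import csv
import io


def read_true_barcodes(handle):
    """Return {barcode: [attribute, ...]} from a header-less tab-delimited file."""
    barcodes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        barcode = row[0].strip()
        attributes = [field.strip() for field in row[1:]]
        if barcode in barcodes:
            # Assumption: every barcode should appear only once.
            raise ValueError(f"duplicate barcode: {barcode}")
        barcodes[barcode] = attributes
    return barcodes


example = io.StringIO(
    "ATTGGGCACAGACGCAGACCTCGTACG\t106\t194\n"
    "AATCAAAGTTAATTATGCATTGCGGTT\t108\t194\n"
)
table = read_true_barcodes(example)
print(table["ATTGGGCACAGACGCAGACCTCGTACG"])
```

For a real file, pass an open file handle instead of the `io.StringIO` example, e.g. `open("true_barcodes.tsv", newline="")`.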

Output

TagGD writes output reads in the same format as the input file. Every read is written to one of three files, e.g. for FASTQ:

outname.matched.fq -- perfectly or unambiguously matched reads. The corresponding barcode and its attributes are appended to the read description.
outname.ambiguous.fq -- ambiguously matched reads. Each read appears once for every barcode it matched, augmented with that barcode.
outname.unmatched.fq -- unmatched reads.

Each barcode is appended SAM-style, e.g. B0:Z:ACGTTCGAGTTC, while the attributes are appended as B1:Z:myattrib, B2:Z:anotherattrib, and so on. For instance, for a FASTQ file:

@M00275:102:000000000-A33TB:1:1101:13977:2680/2 B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296
AATGCAGTATATAGCCCTTGAGCTCTTTTTTTAAAACTACACCTCATTTTCGAGATTGTAAAGGGAGGTTTTGTGAAGTTCTAAAAGGTTCTAGTTTGAAGGTCGGCCTTGTAGATTAAAACGAAGGTTAC
+
<????/<75<<?BB?BFFC>F;/;>AA>EF>>CHD09/9CA?E/7D?GGDF<55=@@95<AAEEE+@C66AE,[email protected]@>D7C>-555AC=-5-5C-*++5*-55C
@M00275:102:000000000-A33TB:1:1101:13977:2680/2 B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296
AATGCAGTATATAGCCCTTGAGCTCTTTTTTTAAAACTACACCTCATTTTCGAGATTGTAAAGGGAGGTTTTGTGAAGTTCTAAAAGGTTCTAGTTTGAAGGTCGGCCTTGTAGATTAAAACGAAGGTTAC
+
<????/<75<<?BB?BFFC>F;/;>AA>EF>>CHD09/9CA?E/7D?GGDF<55=@@95<AAEEE+@C66AE,[email protected]@>D7C>-555AC=-5-5C-*++5*-55C
...
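Downstream scripts often need to recover the barcode and attributes from a matched read. The following sketch splits a read description into its name, the B0 barcode and the remaining B-tags. `parse_barcode_tags` is a hypothetical helper, and it assumes the tags are separated from the read name by whitespace, as in the example above.

```python
def parse_barcode_tags(description):
    """Split a read description into (read_name, barcode, attributes)."""
    name, *fields = description.lstrip("@>").split()
    barcode = None
    attributes = {}
    for field in fields:
        parts = field.split(":", 2)
        if len(parts) != 3 or parts[1] != "Z" or not parts[0].startswith("B"):
            continue  # not a B<n>:Z:<value> tag
        tag, _, value = parts
        if tag == "B0":
            barcode = value
        else:
            attributes[tag] = value
    return name, barcode, attributes


header = ("@M00275:102:000000000-A33TB:1:1101:13977:2680/2 "
          "B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296")
name, barcode, attributes = parse_barcode_tags(header)
print(name, barcode, attributes)
```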

In addition, a tab-delimited result file with information from the search algorithm is produced, outname.results.tsv, e.g.:

Annotation	Match_result	Barcode	Edit_distance	Ambiguous_top_hits	Qualified_candidates	Raw_candidates	Last_position	Approx_insertions	Approx_deletions
M00275:102:000000000-A33TB:1:1101:13977:2680/2	MATCHED_UNAMBIGUOUSLY	ACGCAGGTCTTGATAGGCCCTTGAACT	7	1	1	87	23	0	4
M00275:102:000000000-A33TB:1:1101:18822:2563/2	UNMATCHED	-	-	0	0	135	-	-	-
M00275:102:000000000-A33TB:1:1101:18822:2563/2	UNMATCHED	-	-	0	0	135	-	-	-
M00275:102:000000000-A33TB:1:1101:13977:2680/2	MATCHED_UNAMBIGUOUSLY	ACGCAGGTCTTGATAGGCCCTTGAACT	7	1	1	87	23	0	4
...
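A quick way to see how a run went is to count the reads per match result. The sketch below does this with the standard library; `summarize_results` is a hypothetical helper, and it assumes the header row of the results file is tab-delimited and contains a `Match_result` column as in the example above. The inline data is a shortened, made-up excerpt used only to keep the example self-contained.

```python
import csv
import io
from collections import Counter


def summarize_results(handle):
    """Count reads per Match_result value in a tab-delimited results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Match_result"] for row in reader)


# Shortened illustrative excerpt; a real file has more columns.
example = io.StringIO(
    "Annotation\tMatch_result\tBarcode\tEdit_distance\n"
    "read1\tMATCHED_UNAMBIGUOUSLY\tACGCAGGTCTTGATAGGCCCTTGAACT\t7\n"
    "read2\tUNMATCHED\t-\t-\n"
    "read3\tUNMATCHED\t-\t-\n"
)
counts = summarize_results(example)
for result, n in counts.most_common():
    print(result, n)
```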

Building and running TagGD demultiplexer

See the INSTALL file for more information. If you encounter errors while building, make sure Cython is installed in your environment. A typical installation:

python setup.py build
python setup.py install

To display the user options:

taggd_demultiplex.py --help

A typical run:

taggd_demultiplex.py --k 7 --max_edit_distance 7 --overhang 3 --metric Subglobal ./true_barcodes.tsv ./reads.fq ./output_prefix
