Home
TagGD is a tool to create ("multiplex") and search for ("demultiplex") unique genetic barcodes. Its main strength is that the demultiplexing algorithm can cope with a large number of simultaneous barcodes and also allows for indels when demultiplexing. It uses a k-mer-based algorithm internally, and the user can tune several parameters.
TagGD has been published in PLoS One. The multiplexer is available as C++ code from another GitHub repository, along with the old C++ version of the demultiplexer. This repository contains the new Python/Cython version of the demultiplexer.
For inquiries, please send an email to [email protected], or [email protected].
The demultiplexer takes as input a file with sequence reads and a file with the true barcodes and their additional properties. Each read is searched for a matching barcode; if one is found, the read is augmented with the barcode and the attributes of that barcode. The demultiplexer is written in Cython and is parallelized to make full use of all machine cores.
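To give a feel for what a k-mer-based barcode search looks like, here is a minimal pure-Python sketch. It is not TagGD's implementation: the function names, the choice of plain Levenshtein distance, and the "matched"/"ambiguous"/"unmatched" labels are assumptions made for illustration only. The idea is to index every barcode by its k-mers, use shared k-mers to shortlist candidates for a read, and then rank the shortlist by edit distance.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def build_index(barcodes, k):
    """Map each k-mer to the set of barcodes that contain it."""
    index = defaultdict(set)
    for barcode in barcodes:
        for kmer in kmers(barcode, k):
            index[kmer].add(barcode)
    return index


def assign(read_prefix, index, k, max_dist):
    """Return (label, best_barcodes) for the barcode region of a read."""
    candidates = set()
    for kmer in kmers(read_prefix, k):
        candidates |= index.get(kmer, set())
    scored = sorted((edit_distance(read_prefix, bc), bc) for bc in candidates)
    if not scored or scored[0][0] > max_dist:
        return "unmatched", []
    best = [bc for dist, bc in scored if dist == scored[0][0]]
    return ("matched" if len(best) == 1 else "ambiguous"), best


# Hypothetical toy barcodes, much shorter than real ones.
index = build_index(["ACGTACGTAC", "TTGCATGCAA"], k=4)
hit = assign("ACGTACTTAC", index, k=4, max_dist=2)   # one substitution away
miss = assign("GGGGGGGGGG", index, k=4, max_dist=2)  # shares no k-mer
print(hit, miss)
```

The k-mer shortlist keeps the expensive edit-distance computation away from barcodes that share nothing with the read, which is what makes this kind of search practical for large barcode sets.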
The input reads file can be in FASTQ, FASTA, SAM or BAM format. The true barcodes file should be tab-delimited and header-less, with the barcodes in the first column followed by an arbitrary number of attribute columns, e.g. here with two additional integer fields:
ATTGGGCACAGACGCAGACCTCGTACG 106 194
AATCAAAGTTAATTATGCATTGCGGTT 108 194
TACACGCCTTGTCTTGTTAACATATTT 110 194
AATCCTACTCTCAAGAACTTTGGCTCT 114 194
GCTCCTGTACTGTCAGCCTCACTGCCC 114 194
...
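As a rough illustration of the barcode file layout described above, the following sketch loads such a file into a dictionary. The function name read_true_barcodes is hypothetical and not part of TagGD; the sketch assumes only what is stated here, namely tab-delimited, header-less rows with the barcode in the first column.

```python
import csv
import io


def read_true_barcodes(handle):
    """Return {barcode: [attribute, ...]} from a header-less, tab-delimited file."""
    barcodes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        barcode, attributes = row[0].strip(), row[1:]
        if barcode in barcodes:
            raise ValueError(f"duplicate barcode: {barcode}")
        barcodes[barcode] = attributes
    return barcodes


# Two rows from the example above, held in memory instead of a file.
example = io.StringIO(
    "ATTGGGCACAGACGCAGACCTCGTACG\t106\t194\n"
    "AATCAAAGTTAATTATGCATTGCGGTT\t108\t194\n"
)
barcodes = read_true_barcodes(example)
print(barcodes["ATTGGGCACAGACGCAGACCTCGTACG"])
```

Attributes are kept as strings because the format allows arbitrary attribute values, not only integers.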
TagGD writes reads in the same format as the input file. Every read is written to one of three files, e.g. for FASTQ:
- outname.matched.fq -- perfectly or unambiguously matched reads. The corresponding barcode and its attributes are appended to the read description.
- outname.ambiguous.fq -- ambiguously mapped reads. Each read appears once for every barcode it matched, augmented with that barcode.
- outname.unmatched.fq -- unmatched reads.
The barcode is appended SAM-style, e.g. B0:Z:ACGTTCGAGTTC, and the attributes are appended in the same way, e.g. B1:Z:myattrib, B2:Z:anotherattrib. For instance, for a FASTQ file:
@M00275:102:000000000-A33TB:1:1101:13977:2680/2 B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296
AATGCAGTATATAGCCCTTGAGCTCTTTTTTTAAAACTACACCTCATTTTCGAGATTGTAAAGGGAGGTTTTGTGAAGTTCTAAAAGGTTCTAGTTTGAAGGTCGGCCTTGTAGATTAAAACGAAGGTTAC
+
<????/<75<<?BB?BFFC>F;/;>AA>EF>>CHD09/9CA?E/7D?GGDF<55=@@95<AAEEE+@C66AE,[email protected]@>D7C>-555AC=-5-5C-*++5*-55C
@M00275:102:000000000-A33TB:1:1101:13977:2680/2 B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296
AATGCAGTATATAGCCCTTGAGCTCTTTTTTTAAAACTACACCTCATTTTCGAGATTGTAAAGGGAGGTTTTGTGAAGTTCTAAAAGGTTCTAGTTTGAAGGTCGGCCTTGTAGATTAAAACGAAGGTTAC
+
<????/<75<<?BB?BFFC>F;/;>AA>EF>>CHD09/9CA?E/7D?GGDF<55=@@95<AAEEE+@C66AE,[email protected]@>D7C>-555AC=-5-5C-*++5*-55C
...
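Downstream tools often need to recover the barcode and attributes from the augmented description line. A minimal sketch, assuming the tags are whitespace-separated and follow the B0:Z:, B1:Z:, ... pattern shown above (the helper name split_description is hypothetical):

```python
def split_description(description):
    """Split an augmented description into (read_name, barcode, attributes)."""
    name_parts, tags = [], {}
    for field in description.lstrip("@>").split():
        parts = field.split(":", 2)
        is_tag = (len(parts) == 3 and parts[0].startswith("B")
                  and parts[0][1:].isdigit() and parts[1] == "Z")
        if is_tag:
            tags[int(parts[0][1:])] = parts[2]  # B<n>:Z:<value>
        else:
            name_parts.append(field)
    barcode = tags.pop(0, None)                  # B0 holds the barcode
    attributes = [tags[i] for i in sorted(tags)]  # B1, B2, ... in order
    return " ".join(name_parts), barcode, attributes


header = ("@M00275:102:000000000-A33TB:1:1101:13977:2680/2 "
          "B0:Z:ACGCAGGTCTTGATAGGCCCTTGAACT B1:Z:200 B2:Z:296")
name, barcode, attributes = split_description(header)
print(name, barcode, attributes)
```

For a read from the unmatched file, which carries no tags, the same function returns None for the barcode and an empty attribute list.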
In addition, a tab-delimited result file with information from the search algorithm is produced, outname.results.tsv, e.g.:
Annotation Match_result Barcode Edit_distance Ambiguous_top_hits Qualified_candidates Raw_candidates Last_position Approx_insertions Approx_deletions
M00275:102:000000000-A33TB:1:1101:13977:2680/2 MATCHED_UNAMBIGUOUSLY ACGCAGGTCTTGATAGGCCCTTGAACT 7 1 1 87 23 0 4
M00275:102:000000000-A33TB:1:1101:18822:2563/2 UNMATCHED - - 0 0 135 - - -
M00275:102:000000000-A33TB:1:1101:18822:2563/2 UNMATCHED - - 0 0 135 - - -
M00275:102:000000000-A33TB:1:1101:13977:2680/2 MATCHED_UNAMBIGUOUSLY ACGCAGGTCTTGATAGGCCCTTGAACT 7 1 1 87 23 0 4
...
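The results file lends itself to quick summaries. The sketch below counts reads per Match_result and collects the edit distances of matched reads. It assumes the header row and column names shown above, tab separation, and "-" as the placeholder for missing values; the function name summarize_results is hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_results(handle):
    """Return (Counter of Match_result values, list of integer edit distances)."""
    counts = Counter()
    distances = []
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["Match_result"]] += 1
        if row["Edit_distance"] != "-":
            distances.append(int(row["Edit_distance"]))
    return counts, distances


# Header and two rows from the example above, joined with tabs.
header = ["Annotation", "Match_result", "Barcode", "Edit_distance",
          "Ambiguous_top_hits", "Qualified_candidates", "Raw_candidates",
          "Last_position", "Approx_insertions", "Approx_deletions"]
rows = [
    ["M00275:102:000000000-A33TB:1:1101:13977:2680/2", "MATCHED_UNAMBIGUOUSLY",
     "ACGCAGGTCTTGATAGGCCCTTGAACT", "7", "1", "1", "87", "23", "0", "4"],
    ["M00275:102:000000000-A33TB:1:1101:18822:2563/2", "UNMATCHED",
     "-", "-", "0", "0", "135", "-", "-", "-"],
]
text = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"
counts, distances = summarize_results(io.StringIO(text))
print(dict(counts), distances)
```

With a real run, replace the in-memory example with open("outname.results.tsv", newline="").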
See the INSTALL file for more information. If you encounter errors while building, try installing Cython in your environment. A typical installation:
python setup.py build
python setup.py install
To display the user options:
taggd_demultiplex.py --help
A typical run:
taggd_demultiplex.py --k 7 --max_edit_distance 7 --overhang 3 --metric Subglobal ./true_barcodes.tsv ./reads.fq ./output_prefix