Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(STARsolo's) UR and UB SAM flags differ in length #1909

Open
johnmoll opened this issue Jul 24, 2023 · 1 comment
Open

(STARsolo's) UR and UB SAM flags differ in length #1909

johnmoll opened this issue Jul 24, 2023 · 1 comment
Labels

Comments

@johnmoll
Copy link

Hi,

First of all, thanks for your amazing tool(s), I find STAR and STARsolo extraordinarily useful. This is to report that STARsolo's UB and UR don't have the same length when processing BDRhapsody data. The manual reads (page 13):

CR CY UR UY : sequences and quality scores of cell barcodes and UMIs for the solo* demulti-
plexing, not error corrected.
CB UB : error-corrected cell barcodes and UMIs for solo* demultiplexing.

so I expected UR and UB to share length, perhaps I misunderstood. If they were intended to share length, I might have found a bug. The snippet below shows how to generate an output where their lengths don't match.

STAR 2.7.10b, compiled from source, running on Ubuntu 18.04.6 LTS.

#!/bin/bash

NTHREADS=5

mkdir sandbox; cd $_

## these are whitelists (BDRhapsody)
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS1.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS2.txt
wget https://teichlab.github.io/scg_lib_structs/data/BD_CLS3.txt

## fake, short genome
cat << EOF > genome.fa
>ERCC-00002
TCCAGATTACTTCCATTTCCGCCCAAGCTGCTCACAGTATACGGGCGTCG
GCATCCAGACCGTCGGCTGATCGTGGTTTTACTAGGCTAGACTAGCGTAC
GAGCACTATGGTCAGTAATTCCTGGAGGAATAGGTACCAAGAAAAAAACG
AACCTTTGGGTTCCAGAGCTGTACGGTCGCACTGAACTCGGATAGGTCTC
AGAAAAACGAAATATAGGCTTACGGTAGGTCCGAATGGCACAAAGCTTGT
TCCGTTAGCTGGCATAAGATTCCATGCCTAGATGTGATACACGTTTCTGG
AAACTGCCTCGTCATGCGACTGTTCCCCGGGGTCAGGGCCGCTGGTATTT
GCTGTAAAGAGGGGCGTTGAGTCCGTCCGACTTCACTGCCCCCTTTCAGC
CTTTTGGGTCCTGTATCCCAATTCTCAGAGGTCCCGCCGTACGCTGAGGA
CCACCTGAAACGGGCATCGTCGCTCTTCGTTGTTCGTCGACTTCTAGTGT
GGAGACGAATTGCCAGAATTATTAACTGCGCAGTTAGGGCAGCGTCTGAG
GAAGTTTGCTGCGGTTTCGCCTTGACCGCGGGAAGGAGACATAACGATAG
CGACTCTGTCTCAGGGGATCTGCATATGTTTGCAGCATACTTTAGGTGGG
CCTTGGCTTCCTTCCGCAGTCAAAACCGCGCAATTATCCCCGTCCTGATT
TACTGGACTCGCAACGTGGGTCCATCAGTTGTCCGTATACCAAGACGTCT
AAGGGCGGTGTACACCCTTTTGAGCAATGATTGCACAACCTGCGATCACC
TTATACAGAATTATCAATCAAGCTCCCCGAGGAGCGGACTTGTAAGGACC
GCCGCTTTCGCTCGGGTCTGCGGGTTATAGCTTTTCAGTCTCGACGGGCT
AGCACACATCTGGTTGACTAGGCGCATAGTCGCCATTCACAGATTTGCTC
GGCAATCAGTACTGGTAGGCGTTAGACCCCGTGACTCGTGGCTGAACGGC
CGTACAACTCGACAGCCGGTGCTTGCGTTTTACCCTTAAAAAAAAAAAAA
AAAAAAAAAAA
>ERCC-00003
CAGCAGCGATTAAGGCAGAGGCGTTTGTATCTGCCATTATAAAGAAGTTT
CCTCCAGCAACTCCTTTCTTAATTCCAAACTTAGCTTCAGTTATAAATTC
CCCTCCCATGATTGGGATTTTATAAACTTTTCTTCCATATAATTCATCTT
TCTTCTCATAACCGTCTCCGAAAAACTTCAACTTAAATCCAACCTTTAAC
TGCTCATCAGCCATGTCTCCCACAGCATCAAAAATAGCAGTTGTTGGACA
TGTTAAGACACACTGCCCCAATCTCTCTAACATTTGATGCTCTAACTCTG
ACTTTTTAGGGTGGCATATCTGTATTATAAATCCTGGTCTTCCATCTGGT
GTTTTTGATGGAGGGACATATTTCTCAATTCCTGCTTCTGCTGGACACAT
TATAACTGAACAACCAAAACCTGTTGCCTCTGTAGCTGCAATCTTAGCCC
ACTTCTTTGTAGCTGCTGTTATTAAAACTCTTGAAACCCATATTGGGAAT
GCTTCTGCAAATGTATCTTCAATATATACTCCATTTATTTCCATAGTTTC
CCTCCATTAAGATTTTAACAATTATAGTTTATCTTAGGGGCTATTAATAT
CTTATCATTTGGTTTTTAATATTCGATAAATCCATAAATAAAAATATATC
AACAATAATTTTAAATAATCTAAGTATAGGTAATATAACAATTAAAAAGA
TTTAGAGGGATAGAATTGAACGGCATTAGGAGAATTGTTTTAGATATATT
GAAGCCGCATGAGCCAAAAATAACAGATATGGCATTAAAATTAACATCAT
TATCAAACATTGATGGGGTTAATATTACAGTCTATGAAATAGATAAAGAG
ACTGAGAATGTTAAAGTTACAATTGAAGGGAATAATTTAGATTTTGATGA
GATTCAGGAAATTATTGAAAGTTTGGGAGGGACTATTCACAGTATAGATG
AGGTTGTTGCAGGTAAAAAGATTATTGAAGAGTTAGAACACCACAAGATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00004
TCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCC
CACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTA
AATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT
ATCTAAGGTAAAGTGCTTCTCAATAACATCCGCTCCTAAGGCAACAGAAA
CTACTGGGGCGAGTATTCCCAATGTATGGTCAGAATATCCCACAGGGATA
TTGAATATACTTTTCAAGGTTTTAATAGCGTTTAAATTGACATCTTCATA
AGGGGTTGGGTAAGATGAAATACAATGCAATAAAATAATATCCCTGCATC
CATTATTTTCTAAAACTTTAACTGCTTCCCAAATTTCCCCAATATCAGAC
ATTCCTGTAGATAAAATCACCGGCTTGCCTGTTTTTGCCACTTTTTCTAA
TAAGGGATAAAAGGTTAAATCACCAGAGGCAATTTTAAATCAGGCACATA
AAAAAAAAAAAAAAAAAAAAAAA
>ERCC-00009
CAATGATAGGCTAGTCTCGCGCAGTACATGGTAGTTCAGCCAATAGATGC
CTAGTACGCTGACGGCATTCAGAGTACGCTGATCGGCTTATGACGTATGT
GACGCAGCTCTTAGCGCAATGTATGTGCTGTTATCGAAGCCTATGGCTGA
GTATGTAACGCTATGGCGTGCTAGTCGTCTCATATACGTCTGATGACCTC
GTATCATGTTATAGGGCTGCGAACTGTCGATGATGGTCACGACTCTGTCG
ATAGCTGTGTGACTCATTCAGAAGGTGTGCAGCCTATATGATACGCAGTC
GCATCCTATCTTACGTGTCAGTACTATGTGTGAGTGCTCCGCCCTAGTGC
TGATGTATGCCCCATAGTGCTCAGTGGAGTCTCTCTTAGCATAGTGTCCG
CTCATACATTAGATGGACGGCTCATTAGTATCATCGTCGGCTGATATAGG
TCGTGGCTCCCTGTATATCGAGGTGAGTCTATCTGGATCAACGTCGCACT
ATGATGTGCAAAGTGTCGTCCATGTATAGACAGTGCGCGTATCATATAGG
ATGCGGCGATCTCATACAGCGTTACGGTCGCTGCGTACTGTATAAGGATG
CTCTGTGAACTGTCATCGGTCCGATCAATTAGTCTAGTGTGCGTTATTCA
GATCGAGTGAGTACATGATTCGTCAGTGTGGATCAATTACAGTTAGGCCG
CTGACACATTAGTAACGTCGGCAAGCACTTAGTCGTGTCGTAAGCCAGTG
TGTCGTGTCTTAGACGACTGTGTGTGATTCTCGAGCGATTTATACATCCG
TGACAGCGCTTATAGTGTGCTGACAGACTGGTTGGTTATCCAATGATCGA
CCTGGAGTCTAATATCTGACCACGCCTTGTAATCGTATGACACGCGCTTG
ACACGACTGAATCCAGCTTAAGAGCCCTGCAACGCGATATACAGGCGCTG
CTACCGATATAAAAAAAAAAAAAAAAAAAAAAAA
EOF

## fake, short canonical transcripts
echo -e 'ERCC-00002\tERCC\texon\t1\t1061\t.\t+\t.\tgene_id "ERCC-00002"; transcript_id "DQ459430";
ERCC-00003\tERCC\texon\t1\t1023\t.\t+\t.\tgene_id "ERCC-00003"; transcript_id "DQ516784";
ERCC-00004\tERCC\texon\t1\t523\t.\t+\t.\tgene_id "ERCC-00004"; transcript_id "DQ516752";
ERCC-00009\tERCC\texon\t1\t984\t.\t+\t.\tgene_id "ERCC-00009"; transcript_id "DQ668364";' > genome.gtf

## index
STAR --runThreadN $NTHREADS \
     --runMode genomeGenerate \
     --sjdbGTFfile genome.gtf \
     --genomeDir a_genome \
     --genomeSAindexNbases 4 \
     --sjdbOverhang 100 \
     --genomeFastaFiles genome.fa
     
## generate 100 R1s (cDNA) and R2s (CB + UMI)
seq 1 100 | awk ' {print "@"$0"\nTCTTGCTTCAACAATAACGTCTCTTTCAGAAGGCATTGGTATCTTTTCCCCACTTCCAAGCATTTTTTCAACTAATCTTATGTTATTAACCATTTCCTTAAATTCTTCTGGGTCTGCTGACAAAGCATGATCAGGACCTTCCATATTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r1.fq.gz

seq 1 100 | awk ' {print "@"$0"\nCTTGTACTAGTGATGTTCTCCAGACAGGCTACAGATTTGATGGTTTTTTTTTTTTTTTTT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF";}' | gzip -c > r2.fq.gz

## run starsolo; UB and UR reported explicitly, among other flags
STAR --runThreadN 5 \
     --genomeDir a_genome \
     --readFilesCommand zcat \
     --outFileNamePrefix mapping_test/ \
     --readFilesIn  r1.fq.gz r2.fq.gz  \
     --soloType CB_UMI_Complex \
     --soloAdapterSequence GTGANNNNNNNNNGACA \
     --soloCBposition 2_-9_2_-1 2_4_2_12 2_17_2_25 \
     --soloUMIposition 3_10_3_17 \
     --soloCBwhitelist BD_CLS1.txt BD_CLS2.txt BD_CLS3.txt \
     --soloCBmatchWLtype 1MM \
     --soloCellFilter None \
     --soloCellReadStats Standard \
     --sjdbOverhang 100 \
     --outSAMattributes sS sM CB UB CR CY UR UY \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --sjdbGTFfile genome.gtf

## UB looks like UR, but with a prepended AA dinucleotide
samtools view mapping_test/Al*bam | cut -f16- | head
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG
# UR:Z:TTTGATGG   UY:Z:FFFFFFFF   CB:Z:CTTGTACTA_TGTTCTCCA_GGCTACAGA      UB:Z:AATTTGATGG

Thanks again for your time,

John

@alexdobin alexdobin added the bug label Jul 24, 2023
@alexdobin
Copy link
Owner

Hi John,

This is indeed a bug; thanks a lot for the detailed test example.
The length of the UB tag for complex barcodes is mistakenly taken from --soloUMIlen (default = 10).
I will fix it in the next release, for now, as a workaround, please specify the correct length --soloUMIlen 8.

alexdobin added a commit that referenced this issue Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants