Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mapping to repetitive regions #32

Open
mdshw5 opened this issue Sep 22, 2016 · 1 comment
Open

Mapping to repetitive regions #32

mdshw5 opened this issue Sep 22, 2016 · 1 comment

Comments

@mdshw5
Copy link

mdshw5 commented Sep 22, 2016

Hi Rob. I'm using the --writeMappings option in Salmon 0.7.2, but reporting this issue here since I think the code base is (will be) the same. I have a few genes that we've noticed (based on biological intuition) are called expression when they shouldn't be. The mapping data has really helped to diagnose these transcripts! Here's one example:

IGV mapped data

This transcript has some repetitive sequence near the 3' end, which is masked by RepeatMasker:

>ENSMUST00000034602.7
GAATACCCATGAGTCTGTTGGCGGAGGATAGTGCTAACCCTGGCTGATCATCCTGTGGCTTGCCTCTATTTGTTGCATGATTCCCTCTGGAAGATGGAACACAGCGGGATTCTGGCTAGTCTGATACTGATTGCTGTTCTCCCCCAAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTAGATGGAACGGTGGAAGGATGGTTTGCAAAGAATAAAACACTCAACTTGGGCAAAGGCGTTCTGGACCCACGAGGGATATATCTGTGTAATGGGACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCGGGCACCATGGCTGGTGTCATCTTCATTGACCTCATCGCAACTCTGCTCCTGGCTTTGGGCGTCTACTGCTTTGCAGGACATGAGACCGGAAGGCCTTCTGGGGCTGCTGAGGTTCAAGCACTGCTGAAGAATGAGCAGCTGTATCAGCCTCTTCGAGATCGTGAAGATACCCAGTACAGCCGTCTTGGAGGGAACTGGCCCCGGAACAAGAAATCTTAAGATGGAGGTTTCTCAAAGTGCCATAGCAACGGCGCCTTCCCCTGTGATCAACCAATAAAGACGTTTCCTTCCGTCGGCAGGCGCCTGTGTTTTCCAAAGCATGGATTGAGAGTTGTCTTAGCTTGGCTGAGTTCTGCCCACCCGTGGCCCCTCACTTCTGGCTTCCTCTCCTCAGACCCAGACAGGCAGTACGAGGGACAGCTTCCCATCCTGGCCCCTTGGCATTTCTAACCCTGTGGGTTTCCTGTGGGCAGGGTCAGCCCAGCCACTCTGCTTCATTGCCTGTTGGTTCCAGTCCCTATCATAACAACTTCCGGGATTTTGCTGCCTTACCACAGCCCAAAGCAGGCTGTGGTGTTTAGAAGCTGAGCCACAAAGAAGTTTCCATGACATCATGAATGGGGGTGGCAGAGAAGAATATTGGGGCTCAGAGGgtgtgtgtgtgtgtgtgtgtgtgtgtgtgagtgagagagagagagagagagagagagagagagagagagagagacagacagagagacagagagacagaaagacagaaagacagaaagacagagacagagctgtcAAGaattttcttctcacttcgtgggttctggggattgtactaaggtcatgaggcttgtgtggcaagcacccttacccactgagccatcttgcctgcccTCATCCTCaaattaattaaaaataaagaacatgatgaattcc

The IGV screenshot has two tracks: the top track is using the transcript sequence as is, and the bottom is converting all of the RepeatMasker soft-masked bases to hard-masking ("N"). You can see that the mapping is greatly improved. I'm not sure other transcripts will benefit so much (or might even be harmed) by masking repetitive elements, but have you considered any way to deal with this issue? The reads mapping to this transcript are:

1       163 ENSMUST00000034602.7    146 1   80M =   319 253 AAGGGAGCCCCTTCAAGATACAAGTGACCGAATATGAGGACAAAGTATTTGTGACCTGCAATACCAGCGTCATGCATCTA    *   NH:i:1
2       83  ENSMUST00000034602.7    319 1   80M =   146 -253    ACAGAGCAGCTGGCAAAGGTGGTGTCTTCTGTGCAAGTCCATTACCGAATGTGCCAGAACTGTGTGGAGCTAGACTCCGG    *   NH:i:1
3       121 ENSMUST00000034602.7    976 1   125M    =   976 0   CCAGTTGGTTGTCTAGCACCAAATAGCTCAGAAAATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA   *   NH:i:1
4       377 ENSMUST00000034602.7    988 1   125M    =   988 0   CCCTGTATTATAATGACACTGGGATGTGAAGCCAGCTCTTCTCACTCCAGCCTGCTCTTTGTCTGCTTCGGCACACCTCTTATCACCCCTCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG   *   NH:i:15
5       121 ENSMUST00000034602.7    1007    1   80M =   1007    0   GAACCGGGATTGGTCTATTTGTCCTTTTAGAGGCCAGACAACTGCATATGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAG    *   NH:i:1
6       163 ENSMUST00000034602.7    1010    1   80M =   1018    88  ATACACATAAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGA    *   NH:i:1
7       121 ENSMUST00000034602.7    1017    1   80M =   1017    0   CAAATACTTTTAAAGCAGTCCAAGTTTATTTTCATTCGTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAG    *   NH:i:1
8       83  ENSMUST00000034602.7    1018    1   80M =   1010    -88 AAGTAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGC    *   NH:i:1
9       121 ENSMUST00000034602.7    1021    1   80M =   1021    0   TAATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACA    *   NH:i:1
10      163 ENSMUST00000034602.7    1022    1   125M    =   1056    159 AATGTTGTACACACTGAGCAGGTTATATGTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAA   *   NH:i:1
11      105 ENSMUST00000034602.7    1024    1   80M =   1024    0   CCCGGCATAGTCTGATTTATTTATTTTTCATTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:1
12      153 ENSMUST00000034602.7    1030    1   80M =   1030    0   GTTTCTTAATCTCTCTTTCTCTTTCTGTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGTCTTTTCTATACCAGTTTATA    *   NH:i:1
13      361 ENSMUST00000034602.7    1033    1   80M =   1033    0   GTCAGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
14      393 ENSMUST00000034602.7    1035    1   80M =   1035    0   TAATAATTCTCATAATTTTTTTCTCCTTACCATGTGCATGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:2
15      377 ENSMUST00000034602.7    1035    1   80M =   1035    0   GCAGAAAGTTAAATAAAAGGTAAGTACGTGTGTGTGGCCAAATCAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAG    *   NH:i:15
16      361 ENSMUST00000034602.7    1036    1   80M =   1036    0   AGCGCTAAGAAGTCCTGTTAGACCGTGTGCGTGTGCGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:2
17      361 ENSMUST00000034602.7    1044    1   80M =   1044    0   CGGAAATGTGAAAGTTCTGAGACAGACATATAGACAGTGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    *   NH:i:7
18      153 ENSMUST00000034602.7    1050    1   80M =   1050    0   GTTTAGAAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACAT    *   NH:i:1
19      99  ENSMUST00000034602.7    1054    1   125M    =   1065    136 GTGTGAGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCT   *   NH:i:1
20      83  ENSMUST00000034602.7    1056    1   125M    =   1022    -159    AAGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAACGCACATGTATATGTATGTACATACATATGCACATAACAGCAATTAATGAAAAAAGAGGACATGAATTTAAAAGAGCAAGGAGGGG   *   NH:i:1
21      105 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGCGCGTGAGAGAGAGCGGGCGAGCACACTCTGTGCCTCTCAGGAAG    *   NH:i:1
22      121 ENSMUST00000034602.7    1056    1   80M =   1056    0   GTGTGTGTGTGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCAAGAGCGAGCGAGAGCGAAAGCATGTTCAGGT    *   NH:i:1
23      153 ENSMUST00000034602.7    1058    1   80M =   1058    0   GAGAAAATGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGAGAGAGCGAGCGCAGGCTGGTGCCTTTTTCAGTAACTTCCC    *   NH:i:1
24      147 ENSMUST00000034602.7    1065    1   125M    =   1054    -136    TGTGTGTGTGTGTGAGTGAGAGAGAGAGAGAGACTAAGTTAAATTTAAGAGTTCCATCTGACTTGAATGAGCAGCACAAACTTAACACTAACCACAAAATACCTGGGGCTTTCTCAGTCAGAAAC   *   NH:i:1
25      361 ENSMUST00000034602.7    1086    1   80M =   1086    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGTGTGCAAATGATGCAAAGCCTACTGCCTAAGTTGTATTT    *   NH:i:22
26      409 ENSMUST00000034602.7    1094    1   80M =   1094    0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGACAGAAACGTGACTTCCTCTTTTTCTATTTGTAGAACATTTATTTTCTTCC    *   NH:i:27
27      83  ENSMUST00000034602.7    1134    1   125M    =   1137    128 GACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG   *   NH:i:1
28      163 ENSMUST00000034602.7    1137    1   125M    =   1134    -128    AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAA   *   NH:i:1
29      99  ENSMUST00000034602.7    1137    1   80M =   1139    82  AGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGAC    *   NH:i:1
30      147 ENSMUST00000034602.7    1139    1   80M =   1137    -82 AAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAGAAAGACAG    *   NH:i:1
31      105 ENSMUST00000034602.7    1141    1   125M    =   1141    0   AGACAGAAAGACAGAAAGACAGAAAGACAGAAACACATAGAGGTGTACTGTACAACTGGTAAGAAAAGGCATATTTTTTAAATATAGAAGTGCCTACATATAGAGTGTACTTGGGAAAGAAAAGA   *   NH:i:1
32      169 ENSMUST00000034602.7    1225    1   80M =   1225    0   TAACATCATCAGGCTTGGTGGCAAGCACCCTTACCCACTGAGCCATCTAATAGCTCTTCTCTTGAATTTTTTTAAACACC    *   NH:i:1
@rob-p
Copy link
Contributor

rob-p commented Sep 22, 2016

Hi @mdshw5,

Yes, we have been thinking about this issue. I can't say we have the perfect solution, but I do think we have some ideas. In general, one of the distinctions between the ultrafast mapping approaches and traditional alignment is that there is typically no "validation" step in the mapping approaches. Relatedly, we're typically looking for the single best match between the query and the reference, and the best match may not be a good match. For example, if a single k-mer from the query matches to a set of transcripts, and no other k-mer from the query matches anywhere, then this mapping is the best match (when, in reality, it may be better to leave this read completely unmapped).

One of the recent things we've been working on in RapMap and salmon 0.7.3 are some options for dealing with this. Particularly, something like this — an option which allows one to set some minimum coverage threshold for a read to be considered mappable. Relatedly, there are some internal variables in RapMap that it would be possible to expose that would allow preventing a match to be predicated entirely upon highly repeated sequence. For example, given the suffix array interval for a match (either a k-mer or a longer maximum matchable prefix), we know how many occurrences of this string exist in the reference. One could consider counting these matches toward coverage, but disallowing them from being "anchors" for a mapping (i.e. if the only matches for anchoring an alignment occur more than x times, then don't count them as anchors). Like I said, variables for these quantities already exist inside RapMap, but the challenge would be in terms of (1) exposing them and (2) deciding on reasonable and general heuristics to apply them.

To this end, I'd actually be very interested to hear feedback. Do these options sound useful? Which ones might be best? Are there other (reasonably efficient) filters that would also help? For example, we can trivially enforce co-linearity of the matches between the query and the reference, but non-colinear chains don't seem to actually show up to much for the best mappings. The coverage may be a different story though.

mdshw5 referenced this issue in COMBINE-lab/salmon Sep 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants