Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__gen_read routine does not reverse insertions on reverse strand #3

Open
bredeson opened this issue Aug 9, 2012 · 3 comments
Open

Comments

@bredeson
Copy link

bredeson commented Aug 9, 2012

I'm not a C programmer, but (perl is similar enough and) I produced a 'git diff' of the changes I made to fix the issue. Perhaps there is a better way of doing this (insertion size is limited to length of longest end read), but tested it and it works:

git diff wgsim.old.c wgsim.c

diff --git a/wgsim.c b/wgsim.c
index 5c82192..faf31c1 100644
--- a/wgsim.c
+++ b/wgsim.c
@@ -312,11 +312,20 @@ void wgsim_core(FILE *fpout1, FILE *fpout2, const char *fn, int is_hap, uint64_t
tmp_seq[x][k++] = c & 0xf;
if (mut_type == SUBSTITUTE) ++n_sub[x];
} else { \

  •                                           int n, ins;                                                                             \
    
  •                                     int n, ins, y;                \
    
  •                                     uint8_t open_c = c & 0xf;     \
                                              ++n_indel[x];                                                                   \
    
  •                                           tmp_seq[x][k++] = c & 0xf;                                              \
    
  •                                           for (n = mut_type>>12, ins = c>>4; n > 0 && k < s[x]; --n, ins >>= 2) \
    
  •                                                   tmp_seq[x][k++] = ins & 0x3;                            \
    
  •                                           if (x) {                \
    
  •                                             uint8_t insert[s[x]]; \
    
  •                                             for (y = 0, n = mut_type>>12, ins = c>>4; n > 0 && k+y < s[x]-1; --n, ins >>= 2) \
    
  •                                               insert[y++] = ins & 0x3; \
    
  •                                             for (; y > 0; --y) tmp_seq[x][k++] = insert[y-1]; \
    
  •                                             tmp_seq[x][k++] = open_c; \
    
  •                                           } else {                \
    
  •                                               tmp_seq[x][k++] = open_c; \
    
  •                                             for (n = mut_type>>12, ins = c>>4; n > 0 && k < s[x]; --n, ins >>= 2) \
    
  •                                               tmp_seq[x][k++] = ins & 0x3; \
    
  •                                           }                       \
                                      }  
                              }  
                              if (k != s[x]) ext_coor[x] = -10;                                               \
    
@delocalizer
Copy link

Thanks for finding this problem. Another obvious (not neatest, nor most performant) fix in a few lines is to generate both reads in forward direction and reverse & complement tmp_seq[1] afterwards:

diff --git a/wgsim.c b/wgsim.c
index 55b8daf..85eef7d 100644
--- a/wgsim.c
+++ b/wgsim.c
@@ -323,8 +323,10 @@ void wgsim_core(FILE *fpout1, FILE *fpout2, const char *fn, int is_hap, uint64_t
                        } while (0)

                        __gen_read(0, pos, ++i);
-                       __gen_read(1, pos + d - 1, --i);
-                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = tmp_seq[1][k] < 4? 3 - tmp_seq[1][k] : 4; // complement
+                       __gen_read(1, pos + d - s[1], ++i);
+                       uint8_t revcomp[s[1]];
+                       for (k = 0; k < s[1]; ++k) revcomp[k] = tmp_seq[1][s[1] - k - 1] < 4? 3 - tmp_seq[1][s[1] - k - 1] : 4; 
+                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = revcomp[k];
                        if (ext_coor[0] < 0 || ext_coor[1] < 0) { // fail to generate the read(s)
                                --ii;
                                continue;

@delocalizer
Copy link

Addition to the fix proposed above to make sure ext_coor is correct:

@@ -319,12 +319,15 @@ void wgsim_core(FILE *fpout1, FILE *fpout2, const char *fn, int is_hap, uint64_t
                                                        tmp_seq[x][k++] = ins & 0x3;                            \
                                        }                                                                                                       \
                                }                                                                                                               \
+                               if (x) ext_coor[x] = i - 1;                                                                     \
                                if (k != s[x]) ext_coor[x] = -10;                                               \
                        } while (0)

                        __gen_read(0, pos, ++i);
-                       __gen_read(1, pos + d - 1, --i);
-                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = tmp_seq[1][k] < 4? 3 - tmp_seq[1][k] : 4; // complement
+                       __gen_read(1, pos + d - s[1], ++i);
+                       uint8_t revcomp[s[1]];
+                       for (k = 0; k < s[1]; ++k) revcomp[k] = tmp_seq[1][s[1] - k - 1] < 4? 3 - tmp_seq[1][s[1] - k - 1] : 4; 
+                       for (k = 0; k < s[1]; ++k) tmp_seq[1][k] = revcomp[k];
                        if (ext_coor[0] < 0 || ext_coor[1] < 0) { // fail to generate the read(s)
                                --ii;
                                continue;

@delocalizer
Copy link

There is a subtle problem (or really 2 problems) with bredeson's original fix as it stands - reverse reads that terminate at or within insertions are incorrect. A reverse read that terminates exactly at the location of an insertion will have a last base equal to the reference rather than the first base of the (r.c.) of the insertion, and reverse reads that terminate in the middle of an insertion will have a truncated insertion.

e.g. with the following input

>test
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT
ACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTT

the following command (bredeson.wgsim compiled from wgsim.c with bredeson's
changes)

bredeson.wgsim -e 0 -d 200 -S 1367125828 -N500 -R 1.0 -X 0.7  -r 0.005 test.fa br.1.fq br.2.fq

generates the following single heterozygous insertion:

[wgsim] seed = 1367125828
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 400
test    213 -   TC  -

and the following reads with (nominally) no errors:

[crl@IMB10-02998 sim]$ grep -A1 _282_0: *.fq
br.1.fq:@test_52_282_0:0:0_0:0:1_116/1
br.1.fq-TAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTA
--
br.1.fq:@test_85_282_0:0:0_0:0:1_1c5/1
br.1.fq-GTAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTT
--
br.2.fq:@test_52_282_0:0:0_0:0:1_116/2
br.2.fq-GTAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTT
--
br.2.fq:@test_85_282_0:0:0_0:0:1_1c5/2
br.2.fq-AACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGG

note that the second and third reads are in reverse direction from 282, and
the last 3 bases in the reads are TTT. However their last base is at position
213, the location of the insertion, and so the last base should be the first
base of the r.c. of the insertion i.e. a 'G'. It isn't, it's the reference
base of T.

compare some reads 1 bp upstream from the same output:

[crl@IMB10-02998 sim]$ grep -A1 _281_0: *.fq
br.1.fq:@test_127_281_0:0:0_0:0:1_19/1
br.1.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT
--
br.1.fq:@test_130_281_0:0:0_0:0:1_f1/1
br.1.fq-GTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTT
--
br.1.fq:@test_61_281_0:0:0_0:0:1_12a/1
br.1.fq-GTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGG
--
br.2.fq:@test_127_281_0:0:0_0:0:1_19/2
br.2.fq-CCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGG
--
br.2.fq:@test_130_281_0:0:0_0:0:1_f1/2
br.2.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT
--
br.2.fq:@test_61_281_0:0:0_0:0:1_12a/2
br.2.fq-TAAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTAT

The first, fifth and sixth reads are in reverse direction from 281 and here we
see that the last 4 bases are TTAT, which is also incorrect, they should be
TTGA. This is actually the simplest way to tell that something is wrong in this
case - given the input and the single mutation, the motif TTAT should never
appear in any reads.

finally, compare some reads 1 more bp upstream:

[crl@IMB10-02998 sim]$ grep -A1 _280_0: *.fq
br.1.fq:@test_66_280_0:0:0_0:0:1_92/1
br.1.fq-AAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAA
--
br.1.fq:@test_78_280_0:0:0_0:0:1_11d/1
br.1.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.1.fq:@test_26_280_0:0:0_0:0:1_191/1
br.1.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.2.fq:@test_66_280_0:0:0_0:0:1_92/2
br.2.fq-AAAACCCCGGGGTTTTAAACCCGGGTTTAACCGGTTACGTAAAACCCCGGGGTTTTAAACCCGGGTTGAT
--
br.2.fq:@test_78_280_0:0:0_0:0:1_11d/2
br.2.fq-TTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAA
--
br.2.fq:@test_26_280_0:0:0_0:0:1_191/2
br.2.fq-AAACCCCGGGGTTTTACGTAACCGGTTAAACCCGGGTTTAAAACCCCGGGGTTTTACGTAACCGGTTAAA

and we see here at last the reads over the insertion (2nd, 3rd, 4th) are good,
ending in TTGAT.

The trickiness is representation in wgsim of insertion as sequence *after* a coordinate doesn't work as easily in reverse direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants