Unhandled exception when running witty.er with vcf from sniffles and HG002 #9
Yeah, sniffles vcfs are not very compatible with witty.er unfortunately. We use sniffles in the company as well, and we had to modify their SVTYPE of TRA to be BND, as the vcf spec requires, along with other modifications. Could you try running just the truth vs itself and sniffles vs itself to see if it's the truth vcf or the sniffles vcf? For the truth vcf, I assume it's a public vcf? If so, can you link me to the one you have and I can try debugging. The error message doesn't describe the failing issue very well unfortunately; we can add better error messaging in the future. Thanks.
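For anyone hitting the same problem, a minimal sketch of that TRA-to-BND rewrite, assuming a bgzipped Sniffles vcf and that only the INFO SVTYPE tag needs changing (filenames are placeholders; Sniffles output may need further edits beyond this):

# Rewrite SVTYPE=TRA to SVTYPE=BND in the INFO column; header lines pass through untouched.
zcat sniffles.output.vcf.gz \
  | awk 'BEGIN{FS=OFS="\t"} /^#/ {print; next} {gsub(/SVTYPE=TRA/, "SVTYPE=BND", $8); print}' \
  | bgzip > sniffles.output.tra2bnd.vcf.gz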
Here is the truth: Here is the code I used to run the container:

./runWitty.er.sh ./HG002/HG002_SVs_Tier1_v0.6.vcf.gz_wittyerFiltered.vcf ./HG002/HG002_SVs_Tier1_v0.6.vcf.gz_wittyerFiltered.vcf ./HG002/benchmark_benchmark

Overall Stats:
For the sniffles vs. sniffles comparison I get the following output:
I used the following command and script to generate it:
runWitty.er.sh:
I have already filtered out all SVTYPEs not in the following list:
Your help in elucidating the problem would be greatly appreciated. Thanks in advance!
Ok, yeah, looks like it's the sniffles vcf. Could you run vcf-validator on the filtered sniffles vcf? It might give you a hint of the problem. If it passes, could you send me the vcf?
Alright. I used the
$ vcf-validator -u sniffles.output.vcf.gz
The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at 1:1285489 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
INFO field at 1:25186263 .. Could not validate the float [-nan]
INFO field at 1:203935829 .. Could not validate the float [-nan],Could not validate the float [-nan]
INFO field at 8:43093591 .. INFO tag [SUPTYPE=AL,SR,NR] expected different number of values (expected 1, found 3)
------------------------
Summary:
1524 errors total
1362 .. INFO field at 1:1285489 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
138 .. INFO field at 1:25186263 .. Could not validate the float [-nan]
17 .. INFO field at 8:43093591 .. INFO tag [SUPTYPE=AL,SR,NR] expected different number of values (expected 1, found 3)
6 .. INFO field at 1:203935829 .. Could not validate the float [-nan],Could not validate the float [-nan]
1 .. The header tag 'reference' not present. (Not required but highly recommended.)
After filtering, I get many fewer errors:

[ec2-user@ip-172-31-16-65 HG002]$ vcf-validator -u sniffles.output.vcf.gz_wittyerFiltered.vcf
The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at 2:307140 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
INFO field at 2:1938079 .. Could not validate the float [nan]
------------------------
Summary:
92 errors total
79 .. INFO field at 2:307140 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
12 .. INFO field at 2:1938079 .. Could not validate the float [nan]
1 .. The header tag 'reference' not present. (Not required but highly recommended.)

I guess this means that I need to change the INFO tag SUPTYPE=AL,SR,NR to only one value? Do you have a recommendation? Do you know what the nan value should be changed to? Thanks
the |
Here we go:

vcf-validator -u sniffles.output.vcf.gz_wittyerFiltered.vcf
The header tag 'reference' not present. (Not required but highly recommended.)
INFO field at 7:1312741 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
INFO field at 7:6096244 .. Could not validate the float [nan]
------------------------
Summary:
84 errors total
77 .. INFO field at 7:1312741 .. INFO tag [SUPTYPE=AL,SR] expected different number of values (expected 1, found 2)
6 .. INFO field at 7:6096244 .. Could not validate the float [nan]
1 .. The header tag 'reference' not present. (Not required but highly recommended.)
[ec2-user@ip-172-31-16-65 HG002]$ cat sniffles.output.vcf.gz_wittyerFiltered.vcf | grep 6096244
7 6096244 13497 N AAAATATATATATATATATATATATATATAT . PASS CHR2=7;END=6096278;RE=12;PRECISE;SVLEN=31;SVMETHOD=Snifflesv1.0.11;SVTYPE=INS;STD_quant_start=0.0;STD_quant_stop=2.25832;Kurtosis_quant_start=nan;Kurtosis_quant_stop=-0.943684;SUPTYPE=AL;STRANDS=+- GT:DR:DV ./.:.:12
Nope, that nan is on a non-required field, so that's not the problem. Could you send me the vcf? If you don't want to post it here, you could email it to me; you can find the address in the readme.
Ok, I found it. There's a bug in witty.er, kinda (more like a bug in the vcf spec). Entries like these:

1 68628999 36520 N N[GL000195.1:0[ . PASS RE=25;PRECISE;SVLEN=0;SVMETHOD=Snifflesv1.0.11;SVTYPE=BND;STD_quant_start=0.0;STD_quant_stop=0.0;Kurtosis_quant_start=7.13381;Kurtosis_quant_stop=1.96099;SUPTYPE=SR;STRANDS=+- GT:DR:DV ./.:.:25

have a 0-position as the breakend point (N[GL000195.1:0[).
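As an illustration only (not necessarily the tip referred to below), a minimal sketch for spotting and dropping breakend records whose mate position is 0, assuming the ALT column uses the standard bracketed breakend notation and filenames are placeholders:

# List BND records whose mate breakend position is 0.
zcat sniffles.output.vcf.gz | awk -F'\t' '!/^#/ && $5 ~ /[][][^:]+:0[][]/'

# Or drop them, keeping the header and all other records.
zcat sniffles.output.vcf.gz \
  | awk -F'\t' '/^#/ || $5 !~ /[][][^:]+:0[][]/' \
  | bgzip > sniffles.output.no_zero_bnd.vcf.gz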
Okay, I'll try this out and let you know. Thanks for the tip.
Just ran into an uncaught exception:
We are getting results when we compare our benchmark (HG002) to Sniffles and SVIM, but the comparison with Delly2 is giving us the above error. Is this the reason why we have 0 precision and recall? I would be very grateful for your thoughts on this. Best regards, Matt
Yeah, this one looks like the error message was actually accurate. Looks like CN is -1, which is invalid. Maybe manually update that entry? Looks like it's a DEL, so CN should be 1.

You might also want to check the CN of DUPs and REFs to make sure they are using CN correctly. Witty.er assumes CN is an absolute copy number, and maybe this tool expresses it as a relative copy number. You need to make sure your truth also follows the correct convention.

Actually, since it's 1/1, the CN should be 0? A CN of 1 should correspond to 0/1.
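A minimal sketch of that CN sanity check, assuming bcftools is available, a single-sample vcf, and that CN is a per-sample FORMAT field next to GT (if CN is an INFO tag instead, the query string needs adjusting; the filename is a placeholder):

# Show SVTYPE, GT and CN side by side so invalid values (like -1) or a
# relative-CN convention are easy to spot.
bcftools query -f '%CHROM\t%POS\t%INFO/SVTYPE[\t%GT\t%CN]\n' delly2_query.vcf.gz | head

# Flag entries whose CN is negative.
bcftools query -f '%CHROM\t%POS\t%INFO/SVTYPE[\t%GT\t%CN]\n' delly2_query.vcf.gz \
  | awk -F'\t' '$5 != "." && $5 < 0'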
So, yeah, I took the one row out, and I still get absolutely no overlap for my Delly2 vcf. Have you tested with Delly2? We use truvari to do the comparison and we are getting around 90% recall, for example. Do you have any further thoughts as to why we aren't seeing anything with witty.er? Once again, I am very grateful for any help you may be able to offer in this regard.
Not sure, could you send over the truth and query vcf? If not, can you paste the stdout output so we can see whether it actually detected the entries from truth or query at all?
Here are the two files: https://drive.google.com/drive/folders/1Zy1hlu1OzM88nTiw_I0dMYWg5pvl__T7?usp=sharing
Ok, I'll get back to you.
Hello, has there been a resolution of this issue? I encountered the same error when running witty.er on a vcf file generated with sniffles.
Yes, for sniffles, you need to filter out all SVTYPEs except ["DEL", "INS", "DUP", "INV", "CNV"] and then also replace all
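A minimal sketch of that filtering step, assuming bcftools is available and the Sniffles vcf is bgzipped (filenames are placeholders; the truncated "replace all" step above is not covered here):

# Keep only records whose SVTYPE is one of the types witty.er supports.
bcftools view \
  -i 'INFO/SVTYPE="DEL" || INFO/SVTYPE="INS" || INFO/SVTYPE="DUP" || INFO/SVTYPE="INV" || INFO/SVTYPE="CNV"' \
  sniffles.output.vcf.gz -Oz -o sniffles.output.filtered.vcf.gz
tabix -p vcf sniffles.output.filtered.vcf.gz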
Thank you very much, this resolved some issues, but witty.er cannot compare my two vcf files - it reads all variants for the query dataset, but 0 for the truth dataset in all categories. However, when I compare the truth against itself, there is no issue and it reads all categories.
Yes, please send them to me. That sounds really weird. Also, drdivine, I will try to find time today. I've been pretty busy recently. Thanks.
Thank you very much, I sent the files to the emails listed on the witty.er page. Any input will be welcome!
So it looks like your truth file is double gzipped, which is weird. Just FYI. Anyway, after running the tool, I looked at the annotated output vcf. The WHY field shows the following reasons why no match was found (different reasons for different entries), along with the number of entries that had each reason (for both truth and query):

/// Not Assessed because the entry had an excluded or non-included filter
FilteredBySettings: 61327
/// Not Assessed because the included filter included PASS and Sample FT was not PASS
FailedSampleFilter: 3969
/// False because there was no overlap
NoOverlap: 36239
/// Not Assessed because the entry was a reference call on a type other than CNV/DUP/DEL
UnsupportedRefCall: 18114

I tried adding --includedFilters="" to the command line to see if I could get more hits, but no luck. I looked at the categorization of the entries and it looks like they are different. Here's the breakdown:

CopyNumberGain in Query: 177
CopyNumberGain in Truth: 0
CopyNumberLoss in Query: 14740
CopyNumberLoss in Truth: 0
CopyNumberReference in Query: 10861
CopyNumberReference in Truth: 0
Deletion in Query: 0
Deletion in Truth: 5464
Insertion in Query: 0
Insertion in Truth: 7281

So it looks like the problem is that your query is using the CN field, so its entries are categorized as CNVs, while your truth is representing things as Deletions. Witty.er supports this type of matching and calls it cross-type matching. So I added --evaluationMode=cts and reran it. Here are the results:

--------------------------------
Overall Stats:
Overall EventPrecision: 14.951 %
Overall EventRecall: 30.035 %
Overall EventFscore: 19.964 %
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore BaseQueryTotal BaseQueryTp BaseQueryFp BasePrecision BaseTruthTotal BaseTruthTp BaseTruthFn BaseRecall BaseFscore
20 HG002 25778 3854 21924 14.951 % 12745 3828 8917 30.035 % 19.964 % 4784875 2119209 2665666 44.290 % 3737937 2119209 1618728 56.695 % 49.730 %
--------------------------------
CopyNumberReference
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore
20 HG002 10861 0 10861 0.000 % 0 0 0 NaN NaN
--------------------------------
Deletion
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore
20 HG002 14740 3854 10886 26.147 % 5464 3828 1636 70.059 % 38.081 %
--------------------------------
Duplication
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore
20 HG002 177 0 177 0.000 % 0 0 0 NaN NaN
--------------------------------
Insertion
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore
20 HG002 0 0 0 NaN 7281 0 7281 0.000 % NaN

As you can see, now there are matches, but still very poor performance (Deletions are the best, with 26.147% Precision and 70.059% Recall). There could be multiple reasons. For example, the CN could be wrong (like I mentioned in previous messages, they could be relative CNs or just incorrect CNs, which will throw off the program). Fixing that could also fix the CopyNumberReferences, which have a huge number. I don't think you can save the Insertions in the truth since you don't have insertions in the query. Hope that helps.
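For reference, a hypothetical re-run with cross-type matching enabled; this assumes the runWitty.er.sh wrapper from earlier in the thread forwards extra arguments through to witty.er, and the angle-bracketed arguments are placeholders to adjust:

# Hypothetical: append the cross-type matching flag to however you normally invoke witty.er.
./runWitty.er.sh <truth.vcf> <query.vcf> <output_dir> --evaluationMode=cts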
Just to compare, I tried w/o |
You can think of the overlapping method kinda like a centered reciprocal-overlap matching. So bpd is 0.25 I think by default, which would correspond to 50% reciprocal overlap. The difference is that it must be 50% AND the two events must have their centers be close enough. If there's 50% overlap but one event has its left end really far away from the other event's left end, then they might not be considered a match.

Also, the interaction of PD and BPD is special. For small events, PD dominates: it's a percentage of the size of the event that determines how well it overlaps. As the event's size increases, the number of basepairs it has as leeway increases until it matches the number of basepairs specified in BPD. After that point, it doesn't matter how large the event gets, it'll still use BPD. Of course, CIPOS and CIEND also influence this. The readme goes into this in more depth; you should read it there, and if you have specific questions, feel free to ask.

The different modes are these:
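One way to read the PD/BPD interaction described above, stated as a rough rule of thumb rather than witty.er's actual code: the breakpoint leeway for an event of length L is roughly min(pd x L, bpd), so small events are governed by the percent distance and large events by the fixed basepair distance, with CIPOS/CIEND able to widen it further.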
For the truth vs. truth, I don't know the sources and everything, so I can't say whether the relatively low score is expected or not. I would think that if they're the same truth, it should be higher, but maybe they differ somehow? For your specific case of really low scores, I'll have to look at the results themselves. Are they the same ones you sent me before, or new results? Using a pd of 0.5+ is almost like a 1 bp minimum overlap or something, which should get higher scores, heh. One thing to try is the
The compatibility issues with Delly calls were due to different definitions of the CN tag, see https://groups.google.com/g/delly-users/c/HNvy1ypH7EA
Wow, awesome. Glad Delly was able to improve their vcfs as a result!
First of all, thanks for the code. We are using it to compare vcfs.
I get an unhandled exception when trying to run a comparison.
It happened with SVIM too; after I filtered out all breakends, it went through. Now I am running it with HG002 from the 1000 Genomes project and a vcf from my pipeline that uses the Sniffles caller, and I get this issue.
Any help on figuring out where it comes from and how to debug would be helpful.
Thanks