What is true, thank you, earnestly. A large variant evaluation tool upgraded from wit.ty
Developers: Yinan Wan, Kent Ho
Release Notes: Witty.er release notes
- witty.er is built using C# dotnet 6.0 framework or docker.
- This repo can be cloned and compiled using MS Build or run directly with dotnet.
- To run in MS Windows or linux, just call dotnet Wittyer.dll.
- Alternatively, Docker can be used. See Quick Start
- IMPORTANT NOTE: This doesn't work on Mac M1, for that case, please use unzipped files only and docker.
To compare a query VCF's passing variants against a truth VCF, use the following commmand line:
# first clone the repo
git clone https://github.com/Illumina/witty.er.git
# compile the repo
cd witty.er/Wittyer
dotnet publish . -c Release
# run the dll
dotnet bin/Release/netcoreapp2.0/Wittyer.dll -i input.vcf -t truth.vcf -o outputdir
# Or if Docker is installed and running (example is for Mac or Linux)
# NOTE: Mac Docker has a default memory limit of 2GB, but the program needs more memory than that.
# Please see https://docs.docker.com/docker-for-mac/#advanced to set the docker container memories to a minimum of 4GB.
git clone https://github.com/Illumina/witty.er.git
# Note: Mac M1 needs `--platform linux/x86_64` to be appended after `docker build`
docker build witty.er -t wittyer
# Note: Mac M1 needs `--platform linux/x86_64` to be appended after `docker run`
docker run --rm \
--workdir $(pwd) \
-v $(pwd):$(pwd) \
wittyer \
-i input.vcf -t truth.vcf -o outputdir
- Compatibility notes:
- Delly
- Update to latest delly which has this bug fixed or Remove all the CN annotations
- PBSV
- It has illegal characters (non-VCF spec legal) in reference. Things like
R
,K
,W
etc.
- It has illegal characters (non-VCF spec legal) in reference. Things like
- sniffles
- Filter out all SVTYPEs except ["DEL", "INS", "DUP", "INV", "CNV"]
- HGSVC truth set
- CIEND and CIPOS that are 0,. or .,0 or what not crashes since the intention is unclear. Please filter out entries with these values for these tags.
- dipcall
- It has illegal characters (non-VCF spec legal) in DNA sequences. Things like
R
,K
,W
etc. - contains multi-allelic alts, probably split and take largest one or first one
- lacks SVLEN INFO tag (might not be fully needed, but it helps)
- It has illegal characters (non-VCF spec legal) in DNA sequences. Things like
- Delly
- Currently witty.er does SV comparison on a per-type basis, and only supports the following SV types:
- Deletions (DEL), Insertions (INS), Duplications (DUP), Inversions (INV), Breakends (BND, which are split into TranslocationBreakends and IntraChromosomeBreakend types), and Copy Number Variants (CNV, which are split into CopyNumberGain, CopyNumberLoss, and CopyNumberReference types)
- Witty.er requires SVTYPE to be specified in the vcf INFO field, except for Reference sites.
- Inserted segment is not used in comparison for Insertion
- Insertion is considered as a breakend and evaluated the same way
- CNVs require the CN sample tag in order to be processed.
- CNVs' reference copy number is based on the GT tag. e.g. if it's 0/1 (diploid), reference copy is assumed to be 2.
- DUP and DELs that have CN sample tags are processed as CNVs
- CNV entries without GT is assumed to be diploid.
- Complex type comparisons are currently not supported (e.g insertions in a deletion)
- Limited support for Multi-allelic entries due to vcf annotation is overly complicated
- Multi-alts with different SVTYPEs are annotated as NotAssessed(N) and completely left out from stats
- For multi-alt CNV gains, witty.er currently only using total CN for evaluation without regard to CN number of each individual ALT.
- Only paired Breakends are supported (single or triple, etc. breakends are not supported)
- Ref sites and Ref call samples (samples with GT of 0/0, 0, or 0|0, etc) and Reference Copy Number entries are evaluated as a CopyNumberReference type
- By default Genotypes match is required in order to be considered as TP, but there is a simpleCounting mode.
- For witty.er to ignore genotype information, use simpleCounting evaluationMode by specifying -em sc
- By default witty.er only takes PASS filter entries, for multi-sample vcf, if sample tag FT exist, it has to be PASS as well for the query vcf only.
- Include/Exclude filters will be supported
- If you are only interested in performance for certain regions (like confidence regions), witty.er supports passing in a bed file per SVTYPE via command line or config file.
-
The evaluation criteria of Deletions, Duplications and Inversions are controlled by the following factors:
- Percentage distance (PD) - The distance between both boundaries should respect a number that's proportional to total SV length.
- This is a distance defined by percentage of the variant length (by default 25%, so a 5kb DEL the distance cutoff is 250bp)
- Basepair distance boundary (BPD) - By default 500bp for all types except Insertions, which are set at 100bp
- CIPOS/CIEND
- Percentage distance (PD) - The distance between both boundaries should respect a number that's proportional to total SV length.
-
If there's any overlap in between the intervals defined by PD and BPD around BOTH the POS and END of the query and the target truth variant, only then is it considered TP.
-
For the ultimate boundary distance, the cutoff is defined in two different ways based on SV length
-
When PD < BPD (so if we use default parameter, the transition point is 10kb)
- The cutoff for boundary distance is strictly PD
-
When PD > BPD
- The cutoff for boundary distance is the merged interval of CIPOS/CIEND and BPD
- This is the same as how we evaluate breakend in the wit.ty
-
A typical example of what it meant by PD and BPD
-
-
In general, the cutoff trend will look like this:
note this is not using default parameters
- Without CIPOS/CIEND:
-
Including CIPOS/CIEND:
-
Real examples:
IntraChromosomeBreakends is evaluated just like DEL/DUP, but TranslocationBreakends are going to be evaluated by breakend distance. Note the comparison and stats are still within type (so Insertion always compare with only Insertion).
- Breakends
IntraChromosomeBreakends are using the same acceptance criteria as DEL/DUP, except that PD will be based on the distance between the two breakends:- In the following example, with 5% PD = (2001-1)*5%=100
1 1 TRUTH_BND1 G [1:2001[G 10 PASS SVTYPE=BND
- For TranslocationBreakend (aka inter-chromosome breakend), PD is considered to be infinite, so BPD/CIPOS/CIEND is always used.
- Two BND entries (one breakend pair) is considered as TP of 1
- If genotype matching turned on, genotype of the interpreted breakend pair has to match
- IntraChromosomeBreakend stats will now be binned by the distance between two breakend in a pair.
- TranslocationBreakend will be in their own category with no bins.
-
Insertions
Insertions are evaluated only by distance to insertion site.- PD does not apply to Insertion (since they are inserted into 1 place spanning 0 reference bases)
- Only BPD will be used since that will be the distance to insertion site.
- Inserted segment sequence is not taken into consideration.
- Insertion stats will be binned, though it'll be based on Insertion size/SVLEN.
- Insertions of unknown size will always fall into the largest bin.
- Event overlapping will use the same way as DEL/DUP evaluation.
- Evaluation criteria is percentage distance (25% by default) or basepair distance (500bp by default) depending on the SV size (see above explanations)
- Additional evaluation criteria includes
- CN sample tag match (cross type would be directional match)
- By default GT has to match (turn on simpleCounting mode --em sc to ignore GT match)
- Base level stats does not take event-level true or false into consideration
- All bases overlapped and considered true (i.e. same CN/GT, etc) is considered TP
- All bases overlapped and considered false (i.e. different CN or GT, etc) is considered FP
- All bases that are not overlapped, but is present in truth is considered FN
- All bases that are not overlapped, but is present in query is considered FP
- All bases that are not in Truth or Query will NOT be assessed and not included in any stats
- Base level overlap for each entry will be reported in each vcf entry (WOW tag, see below), regardless if the event is True or False
- Turned on my using "-em cts" (EvaluationMode CrossTypeAndSimpleCounting)
- CopyNumberGain will be treated as DUP and CopyNumberLoss will be treated as DEL, evaluation process is thus the same as DEL/DUP evaluation.
- The stats will be reported under "Deletion" and "Duplication" section.
- When --crossMatching mode is turned on, CNV gain stats will be merged with other DUP stats and CNV loss stats will be merged with DEL stats.
- CopyNumberReference will still be evaluated as normal
Note:
- For cross type matching, we do not support the situation of multi-allellic CNV with at least one DEL and one DUP, due to the fact merging wittyer vcf annotation tags are over complicated.
- Stats will be reported in json format
- Reported as per sample pair, per variant type, per bin
- There will be overall event and base level stats for the sample pair (across all variant types) and for each variant type (across all bins)
- IMPORTANT NOTE: Because entries from different variant types and different bins can overlap positional coordinates, the base-level stat totals may not add up to the overall stats
- OverallStats are the accurate stats to use for things like callability, deriving these stats by adding the numbers will result in overestimations.
- IMPORTANT NOTE: Because entries from different variant types and different bins can overlap positional coordinates, the base-level stat totals may not add up to the overall stats
- Stats for Insertions and Breakends that are within the same chromosome (IntraChromosomeBreakends) ARE binned.
- Stats for Breakends that connect two chromosomes (aka TranslocationBreakends) are not binned.
- There will be overall event and base level stats for the sample pair (across all variant types) and for each variant type (across all bins)
- Insertions, Inversions and IntraChromosomeBreakends will only have event level stats reported
- Deletion, Duplication, CopyNumberGain, CopyNumberLoss, and CopyNumberReference types will have both event and base level stats reported
- The results of certain bins can be ignored in the calculating and reporting of stats by prepending them with an
!
in the--binSizes
argument. See--binSizes
in other parameters. - Example:
{
"VariantType": "CopyNumberGain",
"OverallStats": [
{
"StatsType": "Event",
"TruthTpCount": 0,
"TruthFnCount": 89,
"TruthTotalCount": 89,
"Recall": 0.0,
"QueryTpCount": 0,
"QueryFpCount": 138,
"QueryTotalCount": 138,
"Precision": 0.0,
"Fscore": "NaN"
},
{
"StatsType": "Base",
"TruthTpCount": 0,
"TruthFnCount": 151822,
"TruthTotalCount": 151822,
"Recall": 0.0,
"QueryTpCount": 0,
"QueryFpCount": 113471,
"QueryTotalCount": 113471,
"Precision": 0.0,
"Fscore": "NaN"
}
],
"PerBinStats": [
{
"Bin": "[1, 10000)",
"BasicStats": [
{
"StatsType": "Event",
"TruthTpCount": 0,
"TruthFnCount": 0,
"TruthTotalCount": 0,
"Recall": "NaN",
"QueryTpCount": 0,
"QueryFpCount": 0,
"QueryTotalCount": 0,
"Precision": "NaN",
"Fscore": "NaN"
},
{
"StatsType": "Base",
"TruthTpCount": 0,
"TruthFnCount": 0,
"TruthTotalCount": 0,
"Recall": "NaN",
"QueryTpCount": 0,
"QueryFpCount": 0,
"QueryTotalCount": 0,
"Precision": "NaN",
"Fscore": "NaN"
}
]
},
{
"Bin": "10000+",
"BasicStats": [
{
"StatsType": "Event",
"TruthTpCount": 0,
"TruthFnCount": 89,
"TruthTotalCount": 89,
"Recall": 0.0,
"QueryTpCount": 0,
"QueryFpCount": 138,
"QueryTotalCount": 138,
"Precision": 0.0,
"Fscore": "NaN"
},
{
"StatsType": "Base",
"TruthTpCount": 0,
"TruthFnCount": 151822,
"TruthTotalCount": 151822,
"Recall": 0.0,
"QueryTpCount": 0,
"QueryFpCount": 113471,
"QueryTotalCount": 113471,
"Precision": 0.0,
"Fscore": "NaN"
}
]
}
]
}
- Example vcf lines:
1 1 TRUTH1 N <CN0> 4.56 PASS SVTYPE=CNV;END=20000;CNVLEN=20000;CIPOS=0,400;CIEND=-800,1200 GT:CN:WHAT:WIT:WHY:WOW:WHO:WIN:WHERE .:.:.:.:.:.:.:.:. 0/1:1:am:FP:Multiple:100,19800|19900,200000:QUERY1,QUERY2:CopyNumberLoss|10000-49999:50|300|500|700,19850|19200|1300|500
1 101 QUERY1 N <CN0> 4.56 PASS SVTYPE=CNV;END=19800;CNVLEN=19200;CIPOS=-50,600;CIEND=-100,700 GT:CN:WHAT:WIT:WHY:WOW:WHO:WIN:WHERE 1/1:0:am:FP:GTmismatch:101,19800:TRUTH1:CopyNumberLoss|10000-49999:50|300|500|700 .:.:.:.:.:.:.:.:.
1 19901 QUERY2 N <CN0> 4.56 PASS SVTYPE=CNV;END=21000;CNVLEN=1100 GT:CN:WHAT:WIT:WHY:WOW:WHO:WIN:WHERE 0/1:1:lm:FP:FailedBoundary:19901,20000:TRUTH1:CopyNumberLoss|1000-10000:19850|19200|1300|500 .:.:.:.:.:.:.:.:.
- Witty.er provides following tags for annotation purpose:
- WHO A list of unique IDs representing the top ten matching events (could be less if there are less matches) in the same order as WHAT and WHERE. The ID is default to be the position of truth but if there are collisions, the ID will be the truth position incremented to the first unique number.
- WHAT A top ten list of match types in the same order as WHO and WHERE, which could be either lm, am, lgm, or agm (local match, allele match, local genotype match, allele genotype match, respectively).
- A local match means it has at least 1bp overlap with a target.
- An allele match means it has overlap AND has matches the minimimum match criteria (BPD/PD)
- A genotype match means that it also has a match in their Genotype (GT) sample fields.
- WIT for witty.er decision based on the best match. Potential values are (TP/FP/FN/N). N means not assessed.
- WHY as an extension of WIT to explain reasons for FP and N.
- WOW witty.er overlap window of DEL/DUP/CNV/REF to support base level stats
- WHERE A top ten list of border distances in the same order as WHAT and WHO, which consist of four numbers separated by pipes (|) describing the boundary distances between the entry and the match
- WIN Additional match status information for stratification (e.g. Insertion|1-999 or Insertion|1000-9999 or TranslocationBreakend, which has no bins).
- Detailed vcf spec can be found here
Users can pass in a bed file (via the command line or on a per-SVTYPE basis via the config file) to have witty.er only assess TP/FP/FN decisions on entries that are completely contained within the included bed regions. Note that even if an event is completely outside or only partially inside the bed region, it will still be matched against other entries and have the correct annotations assigned except for WIT will become NotAssessed and the first WHY that would've been Missing Value (.) because it was supposed to be a TP will have OutsideBedRegion as the NotAssessed Reason. Therefore, it is possible that there is an entry that is completely in the included bed region that is assigned TP/FP/FN, but its best match(es) are considered NotAssessed because the partner is not completely within the included bed region. CIPOS and CIEND are taken into consideration when determining if an entry is included allowing CIPOS and CIEND to make the bed region more permissive. For Paired Breakends, both breakends must be within the included bed region to be assessed. Also note that base-level stats will only count bases within the included bed regions as well.
Note that providing an include bed file can result in different values for QueryTP and TruthTP. This is because QueryTP is calculated on entries in the query that are completely inside the bed regions, while TruthTP is calculated on entries in the truth that are completely inside the bed regions.
- Turn on simple counting mode
## for multi-sample vcf, if either or both vcfs are multi-sample, sample pair information is required. If -sp not present, it means to just compare the first columns
## to compare truthA with queryA and truthB with queryB
dotnet Wittyer.dll -i input.vcf -t truth.vcf -o outputDir -em sc -sp truthA:queryA,truthB:queryB
## to compare truthA with queryA, you could leave out truthA:queryA from the parameters
dotnet Wittyer.dll -i input.vcf -t truth.vcf -o outputDir -em sc
- Turn on cross-type matching mode (will be forced to be simpleCounting, see here)
dotnet Wittyer.dll -i input.vcf -t truth.vcf -o outputDir -em cts
- Only consider query entry to be TP when it is 100% overlap with truth entry.
dotnet Wittyer.dll -i input.vcf -t truth.vcf --po 1.0 --bpd 0 -o outputDir
- Update the overlap interval for Breakend to be 100 to be consider as truth
dotnet Wittyer.dll -i input.vcf -t truth.vcf --bpd 100
- Cacluate total stats for all entries (PASS and non-PASS filter entries) with no binning performed
dotnet Wittyer.dll -i input.vcf -t truth.vcf --bs 0 --if
- Calculate stats for entries with L10kb filters
dotnet Wittyer.dll -i input.vcf -t truth.vcf --if L10kb
- Calculate stats for entries with L10kb filters, but NOT with LowGQX filter
dotnet Wittyer.dll -i input.vcf -t truth.vcf --if L10kb -ef LowGQX
-i, --inputVcf=VALUE
Query vcf file (only support one file for now)
-t, --truthVcf=VALUE
Truth vcf file (currently only support one file)
--bpd, --bpDistance=VALUE
Upper bound of boundary distance when
comparing truth and query.
By default it is 500bp for all types
except for Insertions, which are 100bp.
Please note that if you set this value
in the command line, it overrides all
the defaults, so Insertions and other
types will have the same bpd. If
you want customization, please use the
-c config file option.
--pd, --percentDistance=VALUE
In order to consider truth and query to be the
same, the distance between both boundaries
should be within a number that's proportional
to total SV length. Input this as a decimal,
by default is 0.25.
--bs, --binSizes=VALUE
Comma separated list of bin sizes. Default is
1000, 10000 which means there are 3 bins: [1,
1000), [1000,10000), [10000, >10000).
You can ignore certain bins in the calculation of
performance statistics by prepending them with
an '!'. For example, "!1,1000,5000,!10000"
will ignore classifications in the [1, 1000)
and [10000+) bins when calculating and reporting
statistics. Calls will still be made in these bins
in the Wittyer vcf though.
--if, --includedFilters=VALUE
Comma separated list. Only variants contain these
filters will be considered. By default is PASS.
Use Empty String ("") to include all filters.
--ef, --excludedFilters=VALUE
Comma separated list. Variants with any of these
filters will be excluded in comparison. If any
variants have filters conflicting with those in
the included filters, excluded filters will take
priority.
-o, --outputDirectory=VALUE
Directory where all output files located
--sp, --samplePair=VALUE
Optional unless either or both query and truth
vcfs have more than one sample column.Comma
separated list of truth to query sample
mappings using colon (:) as the delimiter. For
convenience, if you just want the first column
compared, you can just provide this option with
empty contents instead.For example, Truth1:
Query1,NA12878:NA1278_S1
--em, --evaluationMode=VALUE
Choose your evaluation mode, options are 'Default'('d')
, 'SimpleCounting'('sc'), '
CrossTypeAndSimpleCounting'('cts'), by default it'
s using 'Default' mode, which does comparison
by SvType and requires genotyping match
-vt, --variantTypes=VALUE
Comma separated list of variant types to include in analysis.
Acceptable variant types are CopyNumberGain, CopyNumberLoss,
TranslocationBreakend, CopyNumberReference, Inversion,
IntraChromosomeBreakend, Insertion, Duplication, and Deletion.
All variant types are included by default. **Case-sensitive**.
-c, --configFile=VALUE
Config file used to specify per variant type settings. Used in place of
bin sizes, basepair distance, percent distance, included filters,
excluded filters, variant types and include bed arguments.
See [Configuration File](#configuration-file).
-b, --includeBed=VALUE
Bed file used to specify regions included in the analysis. Variants not
completely within bed file regions will be marked as not assessed.
This parameter is optional, and by default all variants will be analyzed.
See [Include Bed File](#include-bed-file)
Using command line parameter would use the same bed file for all SVTYPes.
If you want a different one per SVTYPE, use the ConfigFile Option.
See [Configuration File](#configuration-file).
-v, --version
witty.er version information
-h, --help
Show this message and exit
Instead of global settings for basepair distance, percent distance, bin sizes, included filters, excluded filters, and include bed file, a json-format configuration file can be used to specify per variant type settings.
- Pass in a configuration file via the '-c,--configFile=' command line argument. - The '-c,--configFile=' argument cannot be used in combination with arguments for --bpDistance, --percentDistance, --binSizes, --includedFilters, --excludedFilters, --variantTypes, or --includeBed.
- Only the variant types listed in the configuration file will be analyzed.
- There are no default settings; for each variant type listed in the configuration file, all the relevant settings must be explicitly set.
- Json property names should be camel case and are case sensitive.
- The configuration used for analysis will be output in the result folder even if no config file was specified on the commandline, you can then rerun using that config file to ensure the same settings are used. - If a path to a nonexistent file is used for the config file, Wittyer will generate a default configuration file at the specified empty path for future use.
- Default configuration file, equivalent to running Wittyer without specifying settings for basepair distance, percent distance, bin sizes, included filters, excluded filters, or variant types:
[
{
"variantType": "Duplication",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "Deletion",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "CopyNumberGain",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "CopyNumberLoss",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "CopyNumberReference",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "Inversion",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "IntraChromosomeBreakend",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 500,
"percentDistance": 0.25,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "Insertion",
"includeBed": "",
"binSizes": "1000,10000",
"bpDistance": 100,
"includedFilters": "PASS",
"excludedFilters": ""
},
{
"variantType": "TranslocationBreakend",
"includeBed": "",
"bpDistance": 500,
"includedFilters": "PASS",
"excludedFilters": ""
}
]
Currently witty.er provides:
- An overall per-sample-pair stats summary. An example of overall stats summary from stdout:
Overall Stats:
Overall Precision: 19.471 %
Overall Recall: 78.768 %
Overall Fscore: 31.223 %
--------------------------------
QuerySample TruthSample QueryTotal QueryTp QueryFp Precision TruthTotal TruthTp TruthFn Recall Fscore
NA12877 CephFather-12877 10091 1992 8099 19.7% 2550 1997 553 78.3% 31.479%
NA12878 CephMother-12878 10422 2002 8420 19.2% 2532 2006 526 79.2% 30.907%