forked from samtools/hts-specs
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathVCFv4.1.tex
1482 lines (1204 loc) · 92.4 KB
/
VCFv4.1.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[8pt]{article}
\usepackage{enumerate}
\usepackage{graphicx}
\usepackage{lscape}
\usepackage[margin=0.75in]{geometry}
\usepackage[pdfborder={0 0 0}]{hyperref}
\begin{document}
\input{VCFv4.1.ver}
\title{The Variant Call Format (VCF) Version 4.1 Specification}
\date{\headdate}
\maketitle
\begin{quote}\small
The master version of this document can be found at
\url{https://github.com/samtools/hts-specs}.\\
This printing is version~\commitdesc\ from that repository,
last modified on the date shown above.
\end{quote}
\vspace*{1em}
\section{The VCF specification}
VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome.
The format also has the ability to contain genotype information on samples for each position.
\subsection{An example}
\scriptsize
\begin{verbatim}
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
\end{verbatim}
\normalsize
This example shows (in order): a good simple SNP, a possible SNP that has been filtered out because its quality is below 10, a site at which two alternate alleles are called, with one of them (T) being ancestral (possibly a reference sequencing error), a site that is called monomorphic reference (i.e. with no alternate alleles), and a microsatellite with two alternative alleles, one a deletion of 2 bases (TC), and the other an insertion of one base (T). Genotype data are given for three samples, two of which are phased and the third unphased, with per sample genotype quality, depth and haplotype qualities (the latter only for the phased samples) given as well as the genotypes. The microsatellite calls are unphased.
\subsection{Meta-information lines}
File meta-information is included after the \#\# string and must be key=value pairs. It is strongly encouraged that information lines describing the INFO, FILTER and FORMAT entries used in the body of the VCF file be included in the meta-information section. Although they are optional, if these lines are present then they must be completely well-formed.
\subsubsection{File format}
A single `fileformat' field is always required, must be the first line in the file, and details the VCF format version number. For example, for VCF version 4.1, this line should read:
\begin{verbatim}
##fileformat=VCFv4.1
\end{verbatim}
\subsubsection{Information field format}
INFO fields should be described as follows (all keys are required):
\begin{verbatim}
##INFO=<ID=ID,Number=number,Type=type,Description="description">
\end{verbatim}
Possible Types for INFO fields are: Integer, Float, Flag, Character, and String. The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be $1$; if the INFO field describes a pair of numbers, then this value should be $2$ and so on. If the field has one value per alternate allele then this value should be `A'; if the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be `G'. If the number of possible values varies, is unknown, or is unbounded, then this value should be `.'. The `Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be $0$ in this case. The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash $\backslash$ and backslash as $\backslash\backslash$.
\subsubsection{Filter field format}
FILTERs that have been applied to the data should be described as follows:
\begin{verbatim}
##FILTER=<ID=ID,Description="description">
\end{verbatim}
\subsubsection{Individual format field format}
Likewise, Genotype fields specified in the FORMAT field should be described as follows:
\begin{verbatim}
##FORMAT=<ID=ID,Number=number,Type=type,Description="description">
\end{verbatim}
Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field).
\subsubsection{Alternative allele field format}
Symbolic alternate alleles for imprecise structural variants:
\begin{verbatim}
##ALT=<ID=type,Description=description>
\end{verbatim}
The ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive strings and may not contain whitespace or angle brackets. The first level type must be one of the following:
\begin{itemize}
\item DEL Deletion relative to the reference
\item INS Insertion of novel sequence relative to the reference
\item DUP Region of elevated copy number relative to the reference
\item INV Inversion of reference sequence
\item CNV Copy number variable region (may be both deletion and duplication)
\end{itemize}
The CNV category should not be used when a more specific category can be applied. Reserved subtypes include:
\begin{itemize}
\item DUP:TANDEM Tandem duplication
\item DEL:ME Deletion of mobile element relative to the reference
\item INS:ME Insertion of a mobile element relative to the reference
\end{itemize}
In addition, it is highly recommended (but not required) that the header include tags describing the reference and contigs backing the data contained in the file. These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above).
\subsubsection{Assembly field format}
Breakpoint assemblies for structural variations may use an external file:
\begin{verbatim}
##assembly=url
\end{verbatim}
The URL field specifies the location of a fasta file containing breakpoint assemblies referenced in the VCF records for structural variants via the BKPTID INFO key.
\subsubsection{Contig field format}
As with chromosomal sequences it is highly recommended (but not required) that the header include tags describing the contigs referred to in the VCF file. This furthermore allows these contigs to come from different files. The format is identical to that of a reference sequence, but with an additional URL tag to indicate where that sequence can be found. For example:.
\begin{verbatim}
##contig=<ID=ctg1,URL=ftp://somewhere.org/assembly.fa,...>
\end{verbatim}
\subsubsection{Sample field format}
It is possible to define sample to genome mappings as shown below:
\small
\begin{verbatim}
##SAMPLE=<ID=S_ID,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2; ...;SK>
\end{verbatim}
\normalsize
\subsubsection{Pedigree field format}
It is possible to record relationships between genomes using the following syntax:
\begin{verbatim}
##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>
\end{verbatim}
or a link to a database:
\begin{verbatim}
##pedigreeDB=<url>
\end{verbatim}
\subsection{Header line syntax}
The header line names the 8 fixed, mandatory columns. These columns are as follows:
\begin{enumerate}
\item \#CHROM
\item POS
\item ID
\item REF
\item ALT
\item QUAL
\item FILTER
\item INFO
\end{enumerate}
If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs. Duplicate sample IDs are not allowed. The header line is tab-delimited.
\subsection{Data lines}
\subsubsection{Fixed fields}
There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (`.'). Fixed fields are:
\begin{enumerate}
\item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf. the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required).
\item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)
\item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted)
\item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
\item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
\item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric)
\item FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted)
\item INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data]. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):
\begin{itemize}
\item AA : ancestral allele
\item AC : allele count in genotypes, for each ALT allele, in the same order as listed
\item AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
\item AN : total number of alleles in called genotypes
\item BQ : RMS base quality at this position
\item CIGAR : cigar string describing how to align an alternate allele to the reference allele
\item DB : dbSNP membership
\item DP : combined depth across samples, e.g. DP=154
\item END : end position of the variant described in this record (for use with symbolic alleles)
\item H2 : membership in hapmap2
\item H3 : membership in hapmap3
\item MQ : RMS mapping quality, e.g. MQ=52
\item MQ0 : Number of MAPQ == 0 reads covering this record
\item NS : Number of samples with data
\item SB : strand bias at this position
\item SOMATIC : indicates that the record is a somatic mutation, for cancer genomics
\item VALIDATED : validated by follow-up experiment
\item 1000G : membership in 1000 Genomes
\end{itemize}
\end{enumerate}
The exact format of each INFO sub-field should be specified in the meta-information (as described above).
Example for an INFO field: DP=154;MQ=52;H2. Keys without corresponding values are allowed in order to indicate group membership (e.g. H2 indicates the SNP is found in HapMap 2). It is not necessary to list all the properties that a site does NOT have, by e.g. H2=0. See below for additional reserved INFO sub-fields used to encode structural variants.
\subsubsection{Genotype fields}
If genotype information is present, then the same types of data must be present for all samples. First a FORMAT field is given specifying the data types and order (colon-separated alphanumeric String). This is followed by one field per sample, with the colon-separated data in this field corresponding to the types specified in the format. The first sub-field must always be the genotype (GT) if it is present. There are no required sub-fields.
As with the INFO field, there are several common, reserved keywords that are standards across the community:
\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item GT : genotype, encoded as allele values separated by either of $/$ or $\mid$. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be $0/1$, $1\mid0$, or $1/2$, etc. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like $0/0/1$. If a call cannot be made for a sample at a given locus, `.' should be specified for each missing allele in the GT field (for example `$./.$' for a diploid genotype and `.' for haploid genotype). The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes):
\begin{itemize}
\item $/$ : genotype unphased
\item $\mid$ : genotype phased
\end{itemize}
\item DP : read depth at this position for this sample (Integer)
\item FT : sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semi-colon separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no white-space or semi-colons permitted)
\item GL : genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
\item GLE : genotype likelihoods of heterogeneous ploidy, used in presence of uncertain copy number. For example: GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53 (String)
\item PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers)
\item GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)
\item GQ : conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
\item HQ : haplotype qualities, two comma separated phred qualities (Integers)
\item PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
\item PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer)
\item EC : comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field (typically used in association analyses) (Integers)
\item MQ : RMS mapping quality, similar to the version in the INFO field. (Integer)
\end{itemize}
If any of the fields is missing, it is replaced with the missing value. For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing. Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field).
See below for additional genotype fields used to encode structural variants. Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed.
\section{Understanding the VCF format and the haplotype representation}
VCF records use a single general system for representing genetic variation data composed of:
\begin{itemize}
\item Allele: representing single genetic haplotypes (A, T, ATC).
\item Genotype: an assignment of alleles for each chromosome of a single named sample at a particular locus.
\item VCF record: a record holding all segregating alleles at a locus (as well as genotypes, if appropriate, for multiple individuals containing alleles at that locus).
\end{itemize}
VCF records use a simple haplotype representation for REF and ALT alleles to describe variant haplotypes at a locus. ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele.
\section{INFO keys used for structural variants}
When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).
The following INFO keys are reserved for encoding structural variants.
\footnotesize
\begin{verbatim}
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
\end{verbatim}
\normalsize
For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.
\footnotesize
\begin{verbatim}
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
\end{verbatim}
\normalsize
Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering.
\footnotesize
\begin{verbatim}
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
\end{verbatim}
\normalsize
One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values.
\footnotesize
\begin{verbatim}
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
\end{verbatim}
\normalsize
For precise variants, the consensus sequence of the alternate allele assembly is derivable from the REF and ALT fields. However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs.
\footnotesize
\begin{verbatim}
##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY">
##INFO=<ID=METRANS,Number=4,Type=String,Description="Mobile element transduction info of the form CHR,START,END,POLARITY">
##INFO=<ID=DGVID,Number=1,Type=String,Description="ID of this element in Database of Genomic Variation">
##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in DBVAR">
##INFO=<ID=DBRIPID,Number=1,Type=String,Description="ID of this element in DBRIP">
##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends">
##INFO=<ID=PARID,Number=1,Type=String,Description="ID of partner breakend">
##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend">
##INFO=<ID=CILEN,Number=2,Type=Integer,Description="Confidence interval around the inserted material between breakends">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend">
##INFO=<ID=DPADJ,Number=.,Type=Integer,Description="Read Depth of adjacency">
##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number of segment containing breakend">
##INFO=<ID=CNADJ,Number=.,Type=Integer,Description="Copy number of adjacency">
##INFO=<ID=CICN,Number=2,Type=Integer,Description="Confidence interval around copy number for the segment">
##INFO=<ID=CICNADJ,Number=.,Type=Integer,Description="Confidence interval around copy number for the adjacency">
\end{verbatim}
\normalsize
\section{FORMAT keys used for structural variants}
\footnotesize
\begin{verbatim}
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events">
##FORMAT=<ID=NQ,Number=1,Type=Integer,Description="Phred style probability score that the variant is novel">
##FORMAT=<ID=HAP,Number=1,Type=Integer,Description="Unique haplotype identifier">
##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype">
\end{verbatim}
\normalsize
These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong). CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.
\section{Representing variation in VCF records}
\subsection{Creating VCF entries for SNPs and small indels}
\subsubsection{Example 1}
For example, suppose we are looking at a locus in the genome:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & a t C g a & C is the reference base \\ \hline
1 & a t G g a & C base is a G in some individuals \\ \hline
2 & a t \ - \ g a & C base is deleted w.r.t. the reference sequence\\ \hline
3 & a t CAg a & A base is inserted w.r.t. the reference sequence \\ \hline
\end{tabular}
\vspace{0.3cm}
Representing these as VCF records would be done as follows:
\begin{enumerate}
\item A SNP polymorphism of C/G $\rightarrow \{C,G\} \rightarrow$ C is the reference allele
\item A single base deletion of C $\rightarrow \{tC,t\} \rightarrow$ tC is the reference allele
\item A single base insertion of A $\rightarrow \{tC,tCA\} \rightarrow$ tC is the reference allele
\end{enumerate}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $3$ & . & C & G & . & PASS & DP=100 \\
$20$ & $2$ & . & TC & T & . & PASS & DP=100 \\
$20$ & $2$ & . & TC & TCA & . & PASS & DP=100 \\
\end{tabular}
\subsubsection{Example 2}
Suppose I see a the following in a population of individuals and want to represent these three segregating alleles:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & a t C g a & C is the reference base \\ \hline
$1$ & a t G g a & C base is a G in some individuals \\ \hline
$2$ & a t \ - \ g a & C base is deleted w.r.t. the reference sequence \\ \hline
\end{tabular}
\vspace{0.3cm}
In this case there are three segregating alleles: $\{tC,tG,t\}$ with a corresponding VCF record:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $2$ & . & TC & TG,T & . & PASS & DP=100 \\
\end{tabular}
\subsubsection{Example 3}
Now suppose I have this more complex example:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & a t C g a & C is the reference base \\ \hline
$1$ & a t \ - \ g a & C base is is deleted w.r.t. the reference sequence \\ \hline
$2$ & a t \ - - \ a & C and G bases are deleted w.r.t. the reference sequence\\ \hline
$3$ & a t CAg a & A base is inserted w.r.t. the reference sequence \\ \hline
\end{tabular}
\vspace{0.3cm}
There are actually four segregating alleles: $\{tCg,tg,t,tCAg\}$ over bases 2-4. This complex set of allele is represented in VCF as:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $2$ & . & TCG & TG,T,TCAG & . & PASS & DP=100 \\
\end{tabular}
\vspace{0.3cm}
Note that in VCF records, the molecular equivalence explicitly listed above in the per-base alignment is discarded, so the actual placement of equivalent g isn't retained. For completeness, VCF records are dynamically typed, so whether a VCF record is a SNP, Indel, Mixed, or Reference site depends on the properties of the alleles in the record.
\subsection{Decoding VCF entries for SNPs and small indels}
\subsubsection{SNP VCF record}
Suppose I receive the following VCF record:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $3$ & . & C & T & . & PASS & DP=100 \\
\end{tabular}
\vspace{0.3cm}
This is a SNP since its only single base substitution and there are only two alleles so I have the two following segregating haplotypes:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & \verb|a t C g a| & C is the reference base \\ \hline
$1$ & \verb|a t T g a| & C base is a T in some individuals \\ \hline
\end{tabular}
\subsubsection{Insertion VCF record}
Suppose I receive the following VCF record:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $3$ & . & C & CTAG & . & PASS & DP=100 \\
\end{tabular}
\vspace{0.3cm}
This is a insertion since the reference base C is being replaced by C [the reference base] plus three insertion bases TAG. Again there are only two alleles so I have the two following segregating haplotypes:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & \verb|a t C - - - g a| & C is the reference base \\ \hline
$1$ & \verb|a t C T A G g a| & following the C base is an insertion of 3 bases \\ \hline
\end{tabular}
\subsubsection{Deletion VCF record}
Suppose I receive the following VCF record:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $2$ & . & TCG & T & . & PASS & DP=100 \\
\end{tabular}
\vspace{0.3cm}
This is a deletion of two reference bases since the reference allele TCG is being replaced by just the T [the reference base]. Again there are only two alleles so I have the two following segregating haplotypes:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & \verb|a T C G a| & T is the (first) reference base \\ \hline
$1$ & \verb|a T - - a| & following the T base is a deletion of 2 bases \\ \hline
\end{tabular}
\subsubsection{Mixed VCF record for a microsatellite}
Suppose I receive the following VCF record:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l}
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$20$ & $4$ & . & GCG & G,GCGCG & . & PASS & DP=100 \\
\end{tabular}
\vspace{0.3cm}
This is a mixed type record containing a 2 base insertion and a 2 base deletion. There are are three segregating alleles so I have the three following haplotypes:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & \verb|a t c G C G - - a| & G is the (first) reference base \\ \hline
$1$ & \verb|a t c G - - - - a| & following the G base is a deletion of 2 bases \\ \hline
$2$ & \verb|a t c G C G C G a| & following the G base is a insertion of 2 bases \\ \hline
\end{tabular}
\vspace{0.3cm}
Note that in all of these examples dashes have been added to make the haplotypes clearer but of course the equivalence among bases isn't provided by the VCF. Technically the following is an equivalent alignment:
\vspace{0.3cm}
\begin{tabular}{ | l | l | l | }
\hline
Example & Sequence & Alteration \\ \hline
Ref & \verb|a t c G - - C G a| & G is the (first) reference base \\ \hline
$1$ & \verb|a t c G - - - - a| & following the G base is a deletion of 2 bases \\ \hline
$2$ & \verb|a t c G C G C G a| & following the G base is a insertion of 2 bases \\ \hline
\end{tabular}
\subsection{Encoding Structural Variants}
The following page contains examples of structural variants encoded in VCF:
\pagebreak
\footnotesize
\begin{landscape}
\begin{verbatim}
VCF STRUCTURAL VARIANT EXAMPLE
##fileformat=VCFv4.1
##fileDate=20100501
##reference=1000GenomesPilot-NCBI36
##assembly=ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/sv/breakpoint_assemblies.fasta
##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=DEL:ME:L1,Description="Deletion of L1 element">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=DUP:TANDEM,Description="Tandem Duplication">
##ALT=<ID=INS,Description="Insertion of novel sequence">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:L1,Description="Insertion of L1 element">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype quality">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
1 2827694 rs2376870 CGTGGATGCGGGGAC C . PASS SVTYPE=DEL;END=2827708;HOMLEN=1;HOMSEQ=G;SVLEN=-14 GT:GQ 1/1:14
2 321682 . T <DEL> 6 PASS SVTYPE=DEL;END=321887;SVLEN=-205;CIPOS=-56,20;CIEND=-10,62 GT:GQ 0/1:12
2 14477084 . C <DEL:ME:ALU> 12 PASS SVTYPE=DEL;END=14477381;SVLEN=-297;CIPOS=-22,18;CIEND=-12,32 GT:GQ 0/1:12
3 9425916 . C <INS:ME:L1> 23 PASS SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22 GT:GQ 1/1:15
3 12665100 . A <DUP> 14 PASS SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500 GT:GQ:CN:CNQ ./.:0:3:16.2
4 18665128 . T <DUP:TANDEM> 11 PASS SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10 GT:GQ:CN:CNQ ./.:0:5:8.3
\end{verbatim}
\end{landscape}
\pagebreak
\normalsize
The example shows in order:
\begin{enumerate}
\item A precise deletion with known breakpoint, a one base micro-homology, and a sample that is homozygous for the deletion.
\item An imprecise deletion of approximately 105 bp.
\item An imprecise deletion of an ALU element relative to the reference.
\item An imprecise insertion of an L1 element relative to the reference.
\item An imprecise duplication of approximately 21Kb. The sample genotype is copy number 3 (one extra copy of the duplicated sequence).
\item An imprecise tandem duplication of 76bp. The sample genotype is copy number 5 (but the two haplotypes are not known).
\end{enumerate}
\subsection{Specifying complex rearrangements with breakends}
An arbitrary rearrangement event can be summarized as a set of novel \textbf{adjacencies}. Each adjacency ties together $2$ \textbf{breakends}. The two breakends at either end of a novel adjacency are called \textbf{mates}.
There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tag ``SYTYPE=BND'' in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This ``breakend replacement'' has three parts:
\begin{enumerate}
\item The string t that replaces places s. The string t may be an extended version of s if some novel bases are inserted during the formation of the novel adjacency.
\item The position p of the mate breakend, indicated by a string of the form ``chr:pos''. This is the location of the first mapped base in the piece being joined at this novel adjacency.
\item The direction that the joined sequence continues in, starting from p. This is indicated by the orientation of square brackets surrounding p.
\end{enumerate}
These 3 elements are combined in 4 possible ways to create the ALT. In each of the 4 cases, the assertion is that s is replaced with t, and then some piece starting at position p is joined to t. The cases are:
\vspace{0.3cm}
\begin{tabular}{ l l l }
REF & ALT & Meaning \\
s & t$[$p$[$ & piece extending to the right of p is joined after t \\
s & t$]$p$]$ & reverse comp piece extending left of p is joined after t \\
s & $]$p$]$t & piece extending to the left of p is joined before t \\
s & $[$p$[$t & reverse comp piece extending right of p is joined before t \\
\end{tabular}
\vspace{0.3cm}
The example in Figure 1 shows a 3-break operation involving 6 breakends. It exemplifies all possible orientations of breakends in adjacencies. Notice how the ALT field expresses the orientation of the breakends.
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=2.96in]{img/all_orientations-400x296.png}
\caption{All possible orientations of breakends}
\end{figure}
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l }
\#CHROM &POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$2$ & $321681$ & bnd\_W & G & G$]17$:$198982]$ & $6$ & PASS & SVTYPE=BND \\
$2$ & $321682$ & bnd\_V & T & $]$13:123456$]$T & 6 & PASS & SVTYPE=BND \\
$13$ & $123456$ & bnd\_U & C & C$[$2:321682$[$ & 6 & PASS & SVTYPE=BND \\
$13$ & $123457$ & bnd\_X & A & $[$17:198983$[$A & 6 & PASS & SVTYPE=BND \\
$17$ & $198982$ & bnd\_Y & A & A$]$2:321681$]$ & 6 & PASS & SVTYPE=BND \\
$17$ & $198983$ & bnd\_Z & C & $[$13:123457$[$C & 6 & PASS & SVTYPE=BND \\
\end{tabular}
\subsubsection{Inserted Sequence}
Sometimes, as shown in Figure 2, some bases are inserted between the two breakends, this information is also carried in the ALT column:
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=1.89in]{img/inserted_sequence-400x189.png}
\caption{Inserted sequence between breakends}
\end{figure}
\vspace{0.3cm}
\footnotesize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$2$ & $321682$ & bnd\_V & T & $]13:123456]$AGTNNNNNCAT & $6$ & PASS & SVTYPE=BND;MATEID=bnd\_U \\
$13$ & $123456$ & bnd\_U & C & CAGTNNNNNCA$[2:321682[$ & $6$ & PASS & SVTYPE=BND;MATEID=bnd\_V \\
\end{tabular}
\normalsize
\vspace{0.3cm}
\subsubsection{Large Insertions}
If the insertion is too long to be conveniently stored in the ALT column, as in the 329 base insertion shown in Figure 3, it can be represented by a contig from the assembly file:
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=2.47in]{img/inserted_contig-400x247.png}
\caption{Inserted contig}
\end{figure}
\vspace{0.3cm}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$13$ & $123456$ & bnd\_U & C & C$[<$ctg1$>:1[$ & $6$ & PASS & SVTYPE=BND \\
$13$ & $123457$ & bnd\_V & A & $]<$ctg$1>:329]$A & $6$ & PASS & SVTYPE=BND \\
\end{tabular}
\normalsize
\vspace{0.3cm}
\textbf{Note}: In the special case of the complete insertion of a sequence between two base pairs, it is recommended to use the shorthand notation described above:
\vspace{0.3cm}
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$13$ & $321682$ & INS0 & T & C$<$ctg$1>$ & $6$ & PASS & SVTYPE=INS \\
\end{tabular}
\vspace{0.3cm}
If only a portion of $<$ctg$1>$, say from position $7$ to position $214$, is inserted, the VCF would be:
\vspace{0.3cm}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$13$ & $123456$ & bnd\_U & C & C$[<$ctg1$>:7[$ & $6$ & PASS & SVTYPE=BND \\
$13$ & $123457$ & bnd\_V & A & $]<$ctg$1>:214]$A & $6$ & PASS & SVTYPE=BND \\
\end{tabular}
\normalsize
\vspace{0.3cm}
If $<$ctg$1>$ is circular and a segment from position 229 to position 45 is inserted, i.e. continuing from position 329 on to position 1, this is represented by adding a circular adjacency:
\vspace{0.3cm}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$13$ & $123456$ & bnd\_U & C & C$[<$ctg$1>:229[$ & 6 & PASS & SVTYPE=BND \\
$13$ & $123457$ & bnd\_V & A & $]<$ctg$1>:45]$A & 6 & PASS & SVTYPE=BND \\
$<$ctg$1>$ & 1 & bnd\_X & A & $]<$ctg$1>:329]$A & 6 & PASS & SVTYPE=BND \\
$<$ctg$1>$ & 329 & bnd\_Y & T & T$[<$ctg$1>:1[$ & 6 & PASS & SVTYPE=BND \\
\end{tabular}
\normalsize
\subsubsection{Multiple mates}
If a breakend has multiple mates such as in Figure 4 (either because of breakend reuse or of uncertainty in the measurement), these alternate adjacencies are treated as alternate alleles:
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=2.80in]{img/multiple_mates-400x280.png}
\caption{Breakend with multiple mates}
\end{figure}
\footnotesize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
$2$ & $321682$ & bnd\_V & T & $]13:123456]$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U \\
$13$ & $123456$ & bnd\_U & C & C$[2:321682[$,C$[17:198983[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V,bnd\_Z \\
$17$ & $198983$ & bnd\_Z & A & $]13:123456]$A & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U \\
\end{tabular}
\normalsize
\subsubsection{Explicit partners}
Two breakends which are connected in the reference genome but disconnected in the variants are called partners. Each breakend only has one partner, typically one basepair left or right. However, it is not uncommon to observe loss of a few basepairs during the rearrangement. It is then possible to explicitly name a breakend's partner, such as in Figure 5.:
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=2.11in]{img/erosion-400x211.png}
\caption{Partner breakends}
\end{figure}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321681 & bnd\_W & G & G$[13:123460[$ & 6 & PASS & PARID=bnd\_V;MATEID=bnd\_X \\
2 & 321682 & bnd\_V & T & $]13:123456]$T & 6 & PASS & PARID=bnd\_W;MATEID=bnd\_U \\
13 & 123456 & bnd\_U & C & C$[2:321682[$ & 6 & PASS & PARID=bnd\_X;MATEID=bnd\_V \\
13 & 123460 & bnd\_X & A & $]2:321681]$A & 6 & PASS & PARID=bnd\_U;MATEID=bnd\_W \\
\end{tabular}
\normalsize
\subsubsection{Telomeres}
For a rearrangement involving the telomere end of a reference chromosome, we define a virtual telomeric breakend that serves as a breakend partner for the breakend at the telomere. That way every breakend has a partner. If the chromosome extends from position 1 to N, then the virtual telomeric breakends are at positions 0 and N+1. For example, to describe the reciprocal translocation of the entire chromosome 1 into chromosome 13, as illustrated in Figure 6:
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=2.51in]{img/telomere-400x251.png}
\caption{Telomeres}
\end{figure}
the records would look like:
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
1 & 0 & bnd\_X & N & $.[13:123457[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V \\
1 & 1 & bnd\_Y & T & $]13:123456]$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U \\
13 & 123456 & bnd\_U & C & C$[1:1[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_Y \\
13 & 123457 & bnd\_V & A & $]1:0]$A & 6 & PASS & SVTYPE=BND;MATEID=bnd\_X \\
\end{tabular}
\normalsize
\subsubsection{Event modifiers}
As mentioned previously, a single rearrangement event can be described as a set of novel adjacencies. For example, a reciprocal rearrangement such as in Figure 7:
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=1.92in]{img/reciprocal_rearrangement-400x192.png}
\caption{Rearrangements}
\end{figure}
would be described as:
\vspace{0.3cm}
\footnotesize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321681 & bnd\_W & G & G$[13:123457[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_X;EVENT=RR0 \\
2 & 321682 & bnd\_V & T & $]13:123456]$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U;EVENT=RR0 \\
13 & 123456 & bnd\_U & C & C$[2:321682[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V;EVENT=RR0 \\
13 & 123457 & bnd\_X & A & $]2:321681]$A & 6 & PASS & SVTYPE=BND;MATEID=bnd\_W;EVENT=RR0 \\
\end{tabular}
\normalsize
\subsubsection{Inversions}
Similarly an inversion such as in Figure 8:
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=0.95in]{img/inversion-400x95.png}
\caption{Inversion}
\end{figure}
can be described equivalently in two ways. Either one uses the short hand notation described previously (recommended for simple cases):
\vspace{0.3cm}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321682 & INV0 & T & $<$INV$>$ & 6 & PASS & SVTYPE=INV;END=421681 \\
\end{tabular}
\normalsize
\vspace{0.3cm}
or one describes the breakends:
\vspace{0.3cm}
\footnotesize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321681 & bnd\_W & G & G$]2:421681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U;EVENT=INV0 \\
2 & 321682 & bnd\_V & T & $[2:421682[$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_X;EVENT=INV0 \\
2 & 421681 & bnd\_U & A & A$]2:321681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_W;EVENT=INV0 \\
2 & 421682 & bnd\_X & C & $[2:321682[$C & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V;EVENT=INV0 \\
\end{tabular}
\normalsize
\subsubsection{Uncertainty around breakend location}
It sometimes is difficult to determine the exact position of a break, generally because of homologies between the sequences being modified, such as in Figure 9. The breakend is then placed arbitrarily at the left most position, and the uncertainty is represented with the CIPOS tag. The ALT string is then constructed assuming this arbitrary breakend choice.
\begin{figure}[h]
\centering
\includegraphics[width=4in,height=2.48in]{img/microhomology-400x248.png}
\caption{Homology}
\end{figure}
The figure above represents a nonreciprocal translocation with microhomology. Even if we know that breakend U is rearranged with breakend V, actually placing these breaks can be extremely difficult. The red and green dashed lines represent the most extreme possible recombination events which are allowed by the sequence evidence available. We therefore place both U and V arbitrarily within the interval of possibility:
\vspace{0.3cm}
\footnotesize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321681 & bnd\_V & T & T$]13:123462]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U;CIPOS=0,6 \\
13 & 123456 & bnd\_U & A & A$]2:321687]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V;CIPOS=0,6 \\
\end{tabular}
\normalsize
\vspace{0.3cm}
Note that the coordinate in breakend U's ALT string does not correspond to the designated position of breakend V, but to the position that V would take if U's position were fixed (and vice-versa). The CIPOS tags describe the uncertainty around the positions of U and V.
The fact that breakends U and V are mates is preserved thanks to the MATEID tags. If this were a reciprocal translocation, then there would be additional breakends X and Y, say with X the partner of V on Chr 2 and Y the partner of U on Chr 13, and there would be two more lines of VCF for the XY novel adjacency. Depending on which positions are chosen for the breakends X and Y, it might not be obvious that X is the partner of V and Y is the partner of U from their locations alone. This partner relationship can be specified explicitly with the tag PARID=bnd\_X in the VCF line for breakend V and PARID=bnd\_Y in the VCF line for breakend U, and vice versa.
\subsubsection{Single breakends}
We allow for the definition of a breakend that is not part of a novel adjacency, also identified by the tag SVTYPE=BND. We call these single breakends, because they lack a mate. Breakends that are unobserved partners of breakends in observed novel adjacencies are one kind of single breakend. For example, if the true situation is known to be either as depicted back in Figure 1, and we only observe the adjacency (U,V), and no adjacencies for W, X, Y, or Z, then we cannot be sure whether we have a simple reciprocal translocation or a more complex 3-break operation. Yet we know the partner X of U and the partner W of V exist and are breakends. In this case we can specify these as single breakends, with unknown mates. The 4 lines of VCF representing this situation would be:
\vspace{0.3cm}
\small
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
2 & 321681 & bnd\_W & G & G. & 6 & PASS & SVTYPE=BND \\
2 & 321682 & bnd\_V & T & $]13:123456]$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U \\
13 & 123456 & bnd\_U & C & C$[2:321682[$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V \\
13 & 123457 & bnd\_X & A & .A & 6 & PASS & SVTYPE=BND \\
\end{tabular}
\normalsize
\vspace{0.3cm}
On the other hand, if we know a simple reciprocal translocation has occurred as in Figure 7, then even if we have no evidence for the (W,X) adjacency, for accounting purposes an adjacency between W and X may also be recorded in the VCF file. These two breakends W and X can still be crossed-referenced as mates. The 4 VCF records describing this situation would look exactly as below, but perhaps with a special quality or filter value for the breakends W and X.
Another possible reason for calling single breakends is an observed but unexplained change in copy number along a chromosome.
\vspace{0.3cm}
\scriptsize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
3 & 12665 & bnd\_X & A & .A & 6 & PASS & SVTYPE=BND;CIPOS=-50,50 \\
3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & SVTYPE=DUP;END=13686;CIPOS=-50,50;CIEND=-50,50 \\
3 & 13686 & bnd\_Y & T & T. & 6 & PASS & SVTYPE=BND;CIPOS=-50,50 \\
\end{tabular}
\normalsize
\vspace{0.3cm}
Finally, if an insertion is detected but only the first few base-pairs provided by overhanging reads could be assembled, then this inserted sequence can be provided on that line, in analogy to paired breakends:
\vspace{0.3cm}
\scriptsize
\begin{tabular}{ l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO \\
3 & 12665 & bnd\_X & A & .TGCA & 6 & PASS & SVTYPE=BND;CIPOS=-50,50 \\
3 & 12665 & . & A & $<$DUP$>$ & 14 & PASS & SVTYPE=DUP;END=13686;CIPOS=-50,50;CIEND=-50,50 \\
3 & 13686 & bnd\_Y & T & TCC. & 6 & PASS & SVTYPE=BND;CIPOS=-50,50 \\
\end{tabular}
\normalsize
\subsubsection{Sample mixtures}
It may be extremely difficult to obtain clinically perfect samples, with only one type of cell. Let's imagine that two samples are taken from a cancer patient: healthy blood, and some tumor tissue with an estimated 30\% stromal contamination. This would then be expressed in the header as:
\footnotesize
\begin{verbatim}
##SAMPLE=<ID=Blood,Genomes=Germline,Mixture=1.,Description="Patient germline genome">
##SAMPLE=<ID=TissueSample,Genomes=Germline;Tumor,Mixture=.3;.7,Description="Patient germline genome;Patient tumor genome">
\end{verbatim}
\normalsize
Because of this distinction between sample and genome, it is possible to express the data along both distinctions. For example, in a first pass, a structural variant caller would simply report counts per sample. Using the example of the inversion just above, the VCF code could become:
\vspace{0.3cm}
\tiny
\begin{flushleft}
\begin{tabular}{ l l l l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & Blood & TissueSample\\
2 & 321681 & bnd\_W & G & G$]2:421681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U & GT:DPADJ & 0:32 & $0|1:9,21$ \\
2 & 321682 & bnd\_V & T & $[2:421682[$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_X & GT:DPADJ & 0:29 & $0|1:11,25$ \\
13 & 421681 & bnd\_U & A & A$]2:321681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_W & GT:DPADJ & 0:34 & $0|1:10,23$ \\
13 & 421682 & bnd\_X & C & $[2:321682[$C & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V & GT:DPADJ & 0:31 & $0|1:8,20$ \\
\end{tabular}
\end{flushleft}
\normalsize
\vspace{0.3cm}
However, a more evolved algorithm could attempt actually deconvolving the two genomes and generating copy number estimates based on the raw data:
\vspace{0.3cm}
\tiny
\begin{flushleft}
\begin{tabular}{ l l l l l l l l l l l }
\#CHROM & POS & ID & REF & ALT & QUAL & FILTER & INFO & FORMAT & Blood & TumorSample \\
2 & 321681 & bnd\_W & G & G$]2:421681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_U & GT:CNADJ & 0:1 & 1:1 \\
2 & 321682 & bnd\_V & T & $[2:421682[$T & 6 & PASS & SVTYPE=BND;MATEID=bnd\_X & GT:CNADJ & 0:1 & 1:1 \\
13 & 421681 & bnd\_U & A & A$]2:321681]$ & 6 & PASS & SVTYPE=BND;MATEID=bnd\_W & GT:CNADJ & 0:1 & 1:1 \\
13 & 421682 & bnd\_X & C & $[2:321682[$C & 6 & PASS & SVTYPE=BND;MATEID=bnd\_V & GT:CNADJ & 0:1 & 1:1 \\
\end{tabular}
\end{flushleft}
\normalsize
\subsubsection{Clonal derivation relationships}
In cancer, each VCF file represents several genomes from a patient, but one genome is special in that it represents the germline genome of the patient. This genome is contrasted to a second genome, the cancer tumor genome. In the simplest case the VCF file for a single patient contains only these two genomes. This is assumed in most of the discussion of the sections below.
In general there may be several tumor genomes from the same patient in the VCF file. Some of these may be secondary tumors derived from an original primary tumor. We suggest the derivation relationships between genomes in a cancer VCF file be represented in the header with PEDIGREE tags.
Analogously, there might also be several normal genomes from the same patient in the VCF (typically double normal studies with blood and solid tissue samples). These normal genomes are then considered to be derived from the original germline genome, which has to be inferred by parsimony.
The general format of a PEDIGREE line describing asexual, clonal derivation is:
\begin{verbatim}
PEDIGREE=<Derived=ID2,Original=ID1>
\end{verbatim}
This line asserts that the DNA in genome is asexually or clonally derived with mutations from the DNA in genome . This is the asexual analog of the VCF format that has been proposed for family relationships between genomes, i.e. there is one entry per of the form:
\begin{verbatim}
PEDIGREE=<Child=CHILD-GENOME-ID,Mother=MOTHER-GENOME-ID,Father=FATHER-GENOME-ID>
\end{verbatim}
Let's consider a cancer patient VCF file with 4 genomes: germline, primary\_tumor, secondary\_tumor1, and secondary\_tumor2 as illustrated in Figure 10. The primary\_tumor is derived from the germline and the secondary tumors are each derived independently from the primary tumor, in all cases by clonal derivation with mutations. The PEDIGREE lines would look like:
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=2.67in]{img/derivation-400x267.png}
\caption{Pedigree example}
\end{figure}
\begin{verbatim}
##PEDIGREE=<Derived=PRIMARY-TUMOR-GENOME-ID,Original=GERMLINE-GENOME-ID>
##PEDIGREE=<Derived=SECONDARY1-TUMOR-GENOME-ID,Original=PRIMARY-TUMOR-GENOME-ID>
##PEDIGREE=<Derived=SECONDARY2-TUMOR-GENOME-ID,Original=PRIMARY-TUMOR-GENOME-ID>
\end{verbatim}
Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:
\begin{verbatim}
##pedigreeDB=<url>
\end{verbatim}
The most general form of a pedigree line is:
\begin{verbatim}
##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>
\end{verbatim}
This means that the genome Name\_0 is derived from the N $\ge$ 1 genomes Name\_1, ..., Name\_N. Based on these derivation relationships two new pieces of information can be specified.
Firstly, we wish to express the knowledge that a variant is novel to a genome, with respect to its parent genome. Ideally, this could be derived by simply comparing the features on either genomes. However, insufficient data or sample mixtures might prevent us from clearly determining at which stage a given variant appeared. This would be represented by a mutation quality score.
Secondly, we define a \textbf{haplotype} as a set of variants which are known to be on the same chromosome in the germline genome. Haplotype identifiers must be unique across the germline genome, and are conserved along clonal lineages, regardless of mutations, rearrangements, or recombination. In the case of the duplication of a region within a haplotype, one copy retains the original haplotype identifier, and the others are considered to be novel haplotypes with their own unique identifiers. All these novel haplotypes have in common their \textbf{haplotype ancestor} in the parent genome.
\subsubsection{Phasing adjacencies in an aneuploid context}
In a cancer genome, due to duplication followed by mutation, there can in principle exist any number of haplotypes in the sampled genome for a given location in the reference genome. We assume each haplotype that the user chooses to name is named with a numerical haplotype identifier. Although it is difficult with current technologies to associate haplotypes with novel adjacencies, it might be partially possible to deconvolve these connections in the near future. We therefore propose the following notation to allow haplotype-ambiguous as well as haplotype-unambiguous connections to be described. The general term for these haplotype-specific adjacencies is \textbf{bundles}.
The diagram in Figure 11 will be used to support examples below:
\begin{figure}[ht]
\centering
\includegraphics[width=4in,height=2.59in]{img/phasing-400x259.png}
\caption{Phasing}
\end{figure}
In this example, we know that in the sampled genome:
\begin{enumerate}
\item A reference bundle connects breakend U, haplotype 5 on chr13 to its partner, breakend X, haplotype 5 on chr13,
\item A novel bundle connects breakend U, haplotype 1 on chr13 to its mate breakend V, haplotype 11 on chr2, and finally,
\item A novel bundle connects breakend U, haplotypes 2, 3 and 4 on chr13 to breakend V, haplotypes 12, 13 or 14 on chr2 without any explicit pairing.
\end{enumerate}
These three are the bundles for breakend U. Each such bundle is referred to as a haplotype of the breakend U. Each allele of a breakend corresponds to one or more haplotypes. In the above case there are two alleles: the 0 allele, corresponding to the adjacency to the partner X, which has haplotype (1), and the 1 allele, corresponding to the two haplotypes (2) and (3) with adjacency to the mate V.
For each haplotype of a breakend, say the haplotype (2) of breakend U above, connecting the end of haplotype 1 on a segment of Chr 13 to a mate on Chr 2 with haplotype 11, in addition to the list of haplotype-specific adjacencies that define it, we can also specify in VCF several other quantities. These include:
\begin{enumerate}
\item The depth of reads on the segment where the breakend occurs that support the haplotype, e.g. the depth of reads supporting haplotype 1 in the segment containing breakend U
\item The estimated copy number of the haplotype on the segment where the breakend occurs
\item The depth of paired-end or split reads that support the haplotype-specific adjacencies, e.g. that support the adjacency between haplotype 1 on Chr 13 to haplotype 11 on Chr 2
\item The estimated copy number of the haplotype-specific adjacencies
\item An overall quality score indicating how confident we are in this asserted haplotype
\end{enumerate}
These are specified using the using the DP, CN, BDP, BCN, and HQ subfields, respectively. The total information available about the three haplotypes of breakend U in the figure above may be visualized in a table as follows.
\vspace{0.3cm}
\begin{tabular}{ l l l l }
Allele & 1 & 1 & 0 \\
Haplotype & 1$>$11 & 2,3,4$>$12,13,14 & 5$>$5 \\
Segment Depth & 5 & 17 & 4 \\
Segment Copy Number & 1 & 3 & 1 \\
Bundle Depth & 4 & 0 & 3 \\
Bundle Copy Number & 1 & 3 & 1 \\
Haplotype quality & 30 & 40 & 40 \\
\end{tabular}
\pagebreak
\section{BCF specification}
VCF is very expressive, accommodates multiple samples, and is widely used
in the community. Its biggest drawback is that it is big and slow.
Files are text and therefore require a lot of space on disk. A normal batch
of \~100~exomes is a few GB, but large-scale VCFs with thousands of exome
samples quickly become hundreds of GBs. Because the file is text, it is
extremely slow to parse.
Overall, the idea behind is BCF2 is simple. BCF2 is a binary, compressed
equivalent of VCF that can be indexed with tabix and can be efficiently decoded
from disk or streams. For efficiency reasons BCF2 only supports a subset
of~VCF, in that all info and genotype fields must have their full types
specified. That is, BCF2 requires that if e.g. an info field {\tt AC} is
present then it must contain an equivalent VCF header line noting that {\tt AC}
is an allele indexed array of type integer.
\subsection{Overall file organization}
A BCF2 file is composed of a mandatory header, followed by a series of BGZF
compressed blocks of binary BCF2 records. The BGZF blocks allow BCF2 files
to be indexed with tabix.
BGZF blocks are composed of a VCF header with a few additional records and a
block of records. Following the last BGZF BCF2 record block is an empty
BGZF block (a block containing zero type of data), indicating that the
records are done.
A BCF2 header follows exactly the specification as VCF, with a few
extensions/restrictions:
\begin{itemize}
\item All BCF2 files must have fully specified contigs definitions.
No record may refer to a contig not present in the header itself.
\item All INFO and GENOTYPE fields must be fully typed in the BCF2 header to
enable type-specific encoding of the fields in records. An error should be
thrown when converting a VCF to BCF2 when an unknown or not fully specified
field is encountered in the records.
\end{itemize}
\subsection{Header}
The BCF2 header begins with the ``BCF2 magic'' 5 bytes that encode
{\tt BCF\em XY} where {\em X} and {\em Y} are bytes indicating the major
number (currently 2) and the minor number (currently 1). This magic can be
used to quickly examine the file to determine that it's a BCF2 file.
Immediately following the BCF2 magic is the standard VCF header lines in text
format, beginning with \verb|##fileformat=VCFvX.Y|. Because the type is