Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add reference blocksize and checkpointing to VCF #435

Closed
wants to merge 3 commits into from

Conversation

yfarjoun
Copy link
Contributor

This is the reference-block part of #420

@hts-specs-bot
Copy link

Changed PDFs as of bd61dee: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of fd669b4: VCFv4.3 (diff).

@mlin
Copy link
Member

mlin commented Jul 27, 2019

@yfarjoun thanks for making the split!
@lbergelson here's my recommendation in a more straightforward form:

In those infrequent cases where a reference block ends and no downstream reference block (or variant) abuts it, leave an explicit indication of "no data" on the next line, rather than having RBS subtly imply it.

The current proposal might write,

POS  FORMAT     Alice
100  GT:DP:RBS  0/0:30:250
200  GT:DP:RBS  .
300  GT:DP:RBS  .
400  GT:DP:RBS  .
500  GT:DP:RBS  .
600  GT:DP:RBS  .

indicating Alice is confidently 0/0 at positions 200 and 300, with no information at the remaining sites. I'd have us write instead,

POS  FORMAT  Alice
100  GT:DP   0/0:30
200  GT:DP   .
300  GT:DP   .
400  GT:DP   ./.:0
500  GT:DP   .
600  GT:DP   .

Reading this is simpler and less error-prone: the interpretation of an empty cell is to look up to the last filled-in cell, with no position check needed. Usually, another reference block (or variant site) picks up where the last one left off, so these "no data" markers would be rare and wouldn't materially inflate the representation. Actually I think it'd be smaller without the RBS values, but we needn't forbid them, of course.


Here are two non-GVCF-merging use cases I think we shouldn't disenfranchise:

  1. Streaming through a sorted BAM file to genotype a given list of variant sites
  2. Converting an existing, non-sparse, multi-sample VCF to the sparse form.

RBS is less convenient for the first application because of the need to see well downtream of the current position being processed. The same is true in the second application and in that one, the RBS information may not even be completely there in the input. I believe the suggestion above simplifies both of these cases as well as the reader/decoder, without really hurting the GVCF merging use case.


In the case of GVCF merging, the RBS information could be useful for incremental/hierarchical merging, so I'm not saying it should be banished, merely not required in all spec-compliant implementations. As a nitpick, GVCF provides an END position for each reference block, so why switch to block size at this late stage?

@yfarjoun
Copy link
Contributor Author

First, regarding the nit: END is an info field, not format, so this isn't a "change". In particular, RBS will be much smaller numbers and will thus compress much better than END unless we define a special delta encoding for it so instead I thought it would be better (space wise) to use delta directly.

Regarding your proposal, I don't understand how to interpret the '.' in position 600, can you elaborate? how do I know if it's a missing, or a ref block?

@kevinjacobs-progenity
Copy link

This PR looks like it is adding an interesting feature. However, nowhere is the term "reference block" clearly defined. RBS is defined, but only in terms of a reference block.

I'll speculate and try to put the puzzle this out from the text...

Downstream positions that are covered by the reference block should be missing (‘.’)

This raises a few questions since I'm not grasping the intent here:

  1. Why is it called a reference block if the spanning interval is to be treated as missing (as opposed to reference (with GT being a list of zeros to indicate ploidy)?
  2. So is the intent of FORMAT/RBS to be a mechanism to specify gVCF-like reference assertions in a multi-sample-friendly manner by providing a per-sample mechanism to specify a reference interval? Or am I digging in the wrong spot?

Missing genotypes (‘.’) that are not covered by a reference block are to be interpreted
as truly missing.

Apologies again for being dense, but what is the difference between "missing" and "truly missing"?

@mlin
Copy link
Member

mlin commented Jul 27, 2019

@yfarjoun position 600 adopts the contents of the first nonempty cell found above it, which is the ./.:0 on position 400. By leaving this marker in position 400 saying "the previous reference block no longer applies", we've obviated the need for readers to look at RBS values and therefore for writers to generate them if inconvenient. They can still be present as extra information from GVCF pipelines of course. Does this make sense?

This is just how spVCF uses the double-quote marker (@kevinjacobs-progenity that link may help provide historical context, albeit slightly different as we're now reconciling parallel efforts). I like . instead of " to the extent it minimizes the format syntax delta we're introducing. (" is cute though as it reflects everyday usage to fill out a form sparsely)

@kevinjacobs-progenity
Copy link

@mlin: Thanks for the background reading. I see the intent now. I also understand the aesthetic appeal of using '.' as a minimal indicator of sparse content that doesn't require adding any new encoding methods. I still think it is a bad choice because '.' already means nocall and 0 means reference in VCF. Would it be a terrible compromise to require a true reference assertion with proper ploidy? i.e. 0, 0/0, ..., ., ./., ..., etc.? Uncompressed sparse files would be very slightly larger, but the overhead would be negligible when compressed. More so, many more existing tools will sensibly interpret such records, because they are making a coherent assertion according to current VCF semantics. Thus, sparse encoding would be a semantic convention that doesn't require explicit support in the specification.

Here is what your example above would look like based on my suggestion:

POS  FORMAT     Alice
100  GT:DP  0/0:30
200  GT:DP  0/0
300  GT:DP  0/0
400  GT:DP  ./.:0
500  GT:DP  ./.
600  GT:DP  ./.

I don't have a strong opinion on whether the explicit RBS tag is sufficiently valuable to include. It should be easy enough to implement a coherent random-access scheme either way.

@mlin
Copy link
Member

mlin commented Aug 2, 2019

The compatibility opportunity of @kevinjacobs-progenity idea is difficult to dispute!

Uncompressed sparse files would be very slightly larger, but the overhead would be negligible when compressed.

The compressed overhead would be slightly more than 'negligible' I think, because of relatively increased entropy across the row (flipping between ./. and 0/0 GT). However I do think the low-hanging fruit would remain picked.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Aug 12, 2019

I'm not convinced.

  1. @mlin: I don't understand. I don't see a huge difference between . and ./. except that ./. seems to be trying to say something about ploidy. but in the presence of RBS, I think it can be overcome. Without RBS, you are lets to the whims of the implementation which may put . or ./. when there's no data. the RBS field forces an ambiguous call to mean something else.
  2. I think that using 0/0 is a really bad idea since that can be interpreted as a confident reference call whereas the confidence of the reference blocks can be low or high. I would be concerned that naive readers of 0/0 would be unaware of the RBS and assume a confident reference call.
  3. A missing genotype field is independent of the actual values in the format field. requiring a 0/0 means that there has to be a GT field, creating even more dependency between the records.

EDIT: Having thought about it more, I think that using 0/0 has enough upsides that it should be used. thanks @kevinjacobs-progenity for the suggestion and patience.

EDIT of the EDIT: the 0/0 suggestion causes problems in the implementation since it requires a GT FORMAT field, but it isn't clear if we will end up having a GT or a LGT (Local GenoType) and thus this will create a mess in terms of dependencies between records.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Aug 12, 2019

  • add discussion about end tag for this suggestion

@yfarjoun
Copy link
Contributor Author

  • add an optional flag for the end of a reference block.
  • add option to have unknown RBS value

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Aug 12, 2019

  • Add paragraph describing a reference block.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Aug 12, 2019

  • pseudo-code for parsing out the reference-block information

@mlin
Copy link
Member

mlin commented Aug 12, 2019

@lbergelson pressed me on a good corner case of a haploid, GT-only VCF: then it's unclear how to write the end-of-reference-block marker. Maybe it needs a dedicated FORMAT field?

@mlin
Copy link
Member

mlin commented Sep 8, 2019

To explicate the last idea that arose from the discussion with @lbergelson last month,

The 'reference block no longer applies here' marker could be a FORMAT flag, perhaps "EOR" for "end of run":

POS  FORMAT      Alice
100  GT:DP:EOR   0/0:30
200  GT:DP:EOR   .
300  GT:DP:EOR   .
400  GT:DP:EOR   .:.:1
500  GT:DP:EOR   .
600  GT:DP:EOR   .

The interpretation of a . entry is to look up the column to the previous non-missing entry, or if we see an EOR flag first, then we really have no data. The EOR markers would appear only rarely, because usually a variant record or downstream reference block picks up just where the last one leaves off.

And let me reiterate I think it's a very good idea for GVCF pipelines to include the RBS or equivalent information; but with the above, I suggest it shouldn't otherwise be required to write or read it. The lively discussion on #437 should highlight a bit of awkwardness in standardizing a form of combined GVCF, when the basic GVCF way of supplying reference identity isn't really standardized itself.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Sep 9, 2019

note: filtering the EOR will make the reference "block" longer.

@d-cameron
Copy link
Contributor

As a general principle, as a specifications we should err on the side of data loss, rather than data corruption. In this particular instance, EOR will result in data corruption when the VCF is filtered thus I strongly prefer the representation that encodes the start and end in the same record.

@mlin
Copy link
Member

mlin commented Sep 10, 2019

To illustrate how RBS overfits GVCF pipelines, I'd like to focus the use case of making a sparse version of an existing project VCF file, such as this one:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/CCDG_13607_B01_GRM_WGS_2019-02-19_chr22.recalibrated_variants.vcf.gz

Given this input, we do not know the GVCF reference blocks and it will be awkward to write RBS without implications that weren't present originally. That's because the project VCF does not, at face value, supply the reference coverage "in between" its sites.

What we do have to work with, are repetitive runs of reference-homozygous calls down the column for each sample. I've approached our current juncture from the perspective of succinctly encoding these runs as they appear in the text of an existing multi-sample VCF file. I believe RBS approaches from a different perspective, merging a multitude of GVCF files into a new kind of "combined" GVCF. My proposed encoding is quite awkward when viewed from the latter perspective, and the reverse is also true.

(Note: the degree of repetition in the public file with "only" N=2.5K samples is much less than fonud in the N>100K datasets many of us work behind the scenes. Nonetheless we can find many examples of entries exactly equal to the one above, including all FORMAT values. Furthermore we can ask what FORMAT value fluctuations are worth keeping at huge scale.)

The distinction is further illustrated by the action item stated above to "Add paragraph describing reference block," needed because that concept does not exist in the VCF spec (though the defined fields accommodate it). In this PR we'd essentially standardize (the main concept of) GVCF and a new kind of combined GVCF — with the curiosity that one de facto uses END and the other uses RBS.

I think my suggested perspective carries less baggage, as it's just based on repetition found plainly in the existing multi-sample VCF format, such as the above-linked file. But I'm prepared to agree to disagree on which perspective the group should move ahead with, so long as whomever calls the play has thoughtfully considered both. And if we're to proceed with standardizing GVCF concepts, then I'd think that would involve a wider interest group.


@d-cameron It's true that filtering the run-encoded form is trickier. It's not surprising that deleting stuff from a run-encoded format, without trying to fix up the encoding, leads to corruption. This would be a fine reason to strongly prefer a length encoding all other things being equal — but I hope the above clarifies that there are other key differences right now, especially the concept of "reference block." In fact, from my perspective we could look at ways to use a length encoding for "next L lines in this file" as opposed to "next D bp of reference genome."

@hts-specs-bot
Copy link

Changed PDFs as of c1f092c: VCFv4.3 (diff).

@@ -410,8 +421,8 @@ \subsubsection{Genotype fields}
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PS & 1 & Integer & Phase set \\
RBS & 1 & Integer & Reference Block Size\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RBS & 1 & Integer & Reference Block Size\\
RBS & 1 & Integer & Reference block size\\

@@ -253,6 +253,17 @@ \subsubsection{Pedigree field format}
##pedigreeDB=URL
\end{verbatim}


\subsubsection{Reference block checkpoint}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be placed after the final sentence (“See [PedigreeInDetail] for details”) of the preceding section.

Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block.
To enable this, one may specify a Reference Block Checkpoint scheme:
\begin{verbatim}
##REFERENCE_BLOCK=<CHECKPOINT=1000>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is invalid as a structured meta-information line, because it does not have an ID=… subfield.

Are there plans to expand this in future with additional subfields within the <…>? If not, I'd suggest reworking this as a plain unstructured meta-information line, e.g.,

##REFERENCE_BLOCK_CHECKPOINT=1000

@d-cameron
Copy link
Contributor

Notes for Future of VCF review meeting:

  • Does this do anything other than being a per-sample <*> encoding?
    • What (if anything) is stopping the auto-conversion from gVCF to RBS? Is RBS->gVCF lossless?
  • In gVCF, END is used for indexing to ensure we get all the relevant records for a given position (i.e. differentiate no-call from REF). We could do the same with RBS and have END implicitly calculated as POS+max(RBS)
    • We'd need to require no RBS record be <*> for the implicit END but that's ok if we can auto-convert.

Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block.
To enable this, one may specify a Reference Block Checkpoint scheme:
\begin{verbatim}
##REFERENCE_BLOCK=<CHECKPOINT=1000>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
##REFERENCE_BLOCK=<CHECKPOINT=1000>
##REFERENCE_BLOCK_CHECKPOINT=1000

just reifying this as a one-click GitHub suggestion.


For example (with CHROM, ID, REF, ALT, QUAL, FILTER, INFO fields/columns removed for brevity \& clarity):

\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
\#\#REFERENCE\_BLOCK\_CHECKPOINT=1000\\

See above comment by jmarshall

To disambiguate a `.' between being truly missing and part of a reference block, one would therefore need to "look up" and find the previous RBS FORMAT value in that sample.
In addition, any non-missing value (including `.:.' or `./.') would effectively break a reference block, and should be treated as a violation of the specification if RBS is specified, or an implicit end of the block if RBS is unknown.
When reading the file from top to bottom, an implementation can simply remember what the RBS is for each sample, however when using the index to ``seek" to a particular point of the reference, one may need to seek to an unknown location in the file.
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
To assist in seeking, the \verb!##REFERENCE_BLOCK_CHECKPOINT! header line defines a multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.

@d-cameron
Copy link
Contributor

Proposing #758 instead of this. Please comment as to whether #758 fulfills all the requirements/use cases that this PR was designed to address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

9 participants