Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define Local Alleles in VCF to allow for sparser format #434

Closed
wants to merge 6 commits into from

Conversation

yfarjoun
Copy link
Contributor

This is the local-alleles part of #420

@hts-specs-bot
Copy link

Changed PDFs as of 970a87b: VCFv4.3 (diff).

VCFv4.3.tex Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Local Alternate Alleles\footnotemark[1]\\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only.\footnotemark[1]\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't LPL be sorted after LAD?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both LPL and LAD should be sorted after REF+LAA, the same way as REF+ALT define the order of PL and AD.

@hts-specs-bot
Copy link

Changed PDFs as of 4bc4299: VCFv4.3 (diff).

VCFv4.3.tex Outdated
\end{itemize}
\item LAA (and LAD and LPL (*):
For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD.
This can cause the file-sizes to grow super-linearly with the number of samples.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Different take on the previous two sentences; feel free to use/recombine

In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. Usually, few of these alleles are actually observed in any one sample, but each sample must supply fields like PL and AD for all of the alleles -- a very inefficient representation as PL's size is quadratic in the allele count.

VCFv4.3.tex Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
LAA & . & Integer & Local Alternate Alleles the strictly-increasing, 1-based indices into the alternate alleles indicating which are relevant for the current sample.\\

Copy link
Member

@cyenyxe cyenyxe Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe:

Suggested change
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM but let's say the indices have to be strictly increasing; obviously that's a pedantic constraint but without it, the ordering of LPL could be unclear as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion updated.

Copy link
Member

@mlin mlin Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dwelling on this further -- @yfarjoun @lbergelson -- I'm worried just slightly about confusion over the LPL ordering. Lets say the joint VCF has REF=A ALT=G,C,T and the sample GVCF has ALT=T,G. so LAA=1,3 if we require the indices to be in order, but this order is flipped with respect to the GVCF.

The merging algo then has to be really careful to permute the GVCF PL values in just the right way to match that. In the wild, I'm nervous this is going to lead to difficult-to-detect bugs with subtle swapping of probability values. Did you have any experience with this in developing the idea? Should we permit the merger to write an out-of-order LAA (this might just shift the potential confusion to readers)? I don't have a better idea right now, it just seems to me like a significant pitfall at least warranting explication where LPL is discussed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a concern. The only actual implementation so far of a format with local alleles decided to store the allele string itself instead of the index because they didn't want to deal with the reordering problem. So unordered [T, G] instead of [1, 3]. Move the difficulties from the merge step to the reading step in that now if you want to look up non-local values associated with an allele you have to do a search in the INFO allele values to find the corresponding index. I

My thinking was that it was better to require the indexes as numbers and do the reordering, but I think there's a strong argument the other way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll spawn a background thread on this. BTW, if anyone knows a better formula than the following to go from PL index gt to the indices (i,j) of the constituent alleles, I would like to know. This is seriously the best I could come up with, holy sh**.

    unsigned j = (unsigned)((sqrt(8*gt+1)-1.0)/2.0);
    unsigned i = gt - j*(j+1)/2;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like @cyenyxe's suggestion, it solves a possible confusion. It also allows to determine the LPL ordering per allele-set, samples with the same set of local alleles are guaranteed to use the same order.

@mlin Note that the proposed formula is for the special case of diploid genotypes. In my experience, in practice it is sufficient to pre-compute the indexes using nested/recursive loops as suggested in the description of GL, section 1.6.2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do the indexes need to be strictly increasing?

One of the main points is to lessen the burden on the creation of these pVCFs...so that when the alleles change the only change that needs to happen in the format field is the LAA. if we require strictly increasing order, that goes out the window.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If LAA are strictly increasing, LPL's are guaranteed to have the same order as 1) the original PLs and 2) samples with the same set of alleles.

Editing of VCFs is much less frequent operation than read-only processing, therefore it is more important to optimize the speed and convenience of the latter.

VCFv4.3.tex Outdated
@@ -503,7 +505,15 @@ \subsubsection{Genotype fields}
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
If the genotype in the GT field is unphased, the corresponding PS field is ignored.
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
\end{itemize}
\item LAA (and LAD and LPL (*):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would follow the existing of one item per field, otherwise it may be confusing to readers searching for LAD and LPL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I leave the paragraph as is, and add entries for LAD and LPL pointing to LAA?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AD, ADF, ADR paragraph also groups related items into a single paragraph. Fortuitously in both cases the three entries are adjacent alphabetically.

(Though the whole LAA/LAD/LPL paragraph wants to move up to between HQ and MQ, and this label (LAA (and LAD and LPL (*)) wants less ‘and’ and better balanced parentheses!)

VCFv4.3.tex Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
Copy link
Member

@cyenyxe cyenyxe Jul 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe:

Suggested change
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\

VCFv4.3.tex Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\
LAD & . & Integer & Read depth for the reference and each of the local alternate alleles listed in LAA\\

VCFv4.3.tex Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
GQ & 1 & Integer & Conditional genotype quality \\
GT & 1 & String & Genotype \\
HQ & 2 & Integer & Haplotype quality \\
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA\\

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA, in the genotype order implied by LAA such that LPL would be a subsequence of PL were the latter present.\\

@lbergelson @pd3 following the previous discussion, I feel it's worth leaving no possible question about the LPL order -- here's my attempt, better suggestions welcome.

For GVCF merging, we're already responsible for carefully permuting the GVCF PL values into the target joint vector, the new wrinkle will be that each sample can have a different target. I think this is okay, it was just tempting to want to copy in the GVCF PL verbatim.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to speak about ordering only and not require a strict subsequence: it should be possible to rescale the LPL likelihoods.

@mlin
Copy link
Member

mlin commented Sep 8, 2019

Hi all, I was motivated to come back here after another collaborator just run aground on the htslib int overflow when it tries to allocate a 2GB+ PL vector 😄 😭

It seems like we are not far; the open items are

  1. Must the LAA allele indices for each sample be strictly increasing? I advocate yes, to make reading LPL less tricky, but the trickiness burden is shifted to the writer.
  2. Precise language for specifying the order of LPL values, if anyone agrees with me it's worthwhile to leave no doubt.
  3. Move the item describing LAA, LAD, and LPL into alphabetical position within the itemized list. (+ misc wording and punctuation edits)

@jmarshall
Copy link
Member

jmarshall commented Sep 9, 2019

I think this is a nice new feature, but it took me many minutes of very close reading to figure out just what the feature is! If I understand it correctly, it could perhaps be better described as:

  • The table of reserved keys is just a brief taster / memory jogger of what each field is about, so: LAA List of relevant Local Alternate Alleles / LAD Read depth for each local allele / LPL Phred-scaled local genotype likelihoods rounded to the closest integer

  • The text in the bulleted paragraph motivating this doesn't mention that the point of all this is to only list “interesting” alleles on a per-sample basis because each sample has a different small subset of alt alleles that are present for that sample. The text suggested in Define Local Alleles in VCF to allow for sparser format #434 (comment) is rather better.

  • It needs a complete example so you can see what is happening. Something like

… REF  ALT        …  FORMAT          SAMPLE1                       SAMPLE2
… G    A,C,T,<*>  …  GT:LAA:LAD:LPL  1/1:2,4:20,30,10:1,2,3,4,5,6  1/1:3:15,25:1,2,3

thus demonstrating the different subsets per sample.

  • The numbers are all represented as ., but the paragraph text should explain the relationships between them. So:

LAA is a list of n integers, where 1 <= n <= the number of ALT alleles, giving the (1-based) indices within ALT of the alleles that are observed in the sample. [And a sentence about the ordering if so decided.] (Uh — can n be 0?)

LAD is a list of n+1 integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA.

LPL is a list of n+1 choose P (or whatever the combinatoric calculation is!) integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. [And refer to the precise ordering defined in the GL paragraph.]

(Writing out all that for each field might mean they would be better as separate paragraphs (as per #434 (comment)) or it might be better to keep them together to collaborate on the same example better…)

@hts-specs-bot
Copy link

Changed PDFs as of 4f131de: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of 0c65b48: VCFv4.3 (diff).

@mlin
Copy link
Member

mlin commented Oct 7, 2019

Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.

This is a good point; we should probably clarify how one should write LAA & LPL for samples reporting no evidence for any particular alternate allele.

@hts-specs-bot
Copy link

Changed PDFs as of 605e8ec: VCFv4.3 (diff).

@hts-specs-bot
Copy link

Changed PDFs as of fbf87d3: VCFv4.3 (diff).

VCFv4.3.tex Outdated Show resolved Hide resolved
@lbergelson
Copy link
Member

Some misc things I've been thinking about while implementing this:

  1. ./. calls are tricky to convert to local alleles in a joined gvcf because we don't know which subset of alleles is relevant, in worst case you may have to include every allele in LAA or analyze the PL's to see which ones are not actually meaningful.
  2. Requiring LAA to be sorted is annoying when merging files.
  3. Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site? Benefits: potentially space saving, makes single sample files in local allele format cleaner. Cons: easier to make mistakes when merging files.

VCFv4.3.tex Outdated
2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\
3&C&G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 4:0/0:30,1:0,30,80\\
3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.1:0,.,.,.,.,.,.,.,.,.,30,.,.,.,80\\
4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& :0/0:30:0\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here: we either need to remove the LAA field and indicate that it can be left out if there are no non-ref local alleles, or we need to add a leading .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a typo: the intention is that LAA is a 0-length list of integers (indicating that this record is homozygous ref for this sample and that none of the ALT alleles are in play), not an unknown MISSING value.

This may be a problem, as it's not clear that 0-length lists are valid syntax in VCF or BCF. Omitting LAA is probably not the right workaround, as presence of LAA is the simple indicator that local alleles are being used in a VCF record. Setting it to . (unknown) is somewhat misleading, as it is known…

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example contradicts the current text, which says “LAA is a sorted list of n distinct integers, where n ≥ 1 […]”. So the text will need adjustment if this example record is deemed valid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bcftools converts the zero-length list into a missing value. This comes from bcftools treating everything as BCF internally, and BCF says zero-length lists are missing.

Possible fixes would be to say missing LAA means REF; to explicitly put 0 in LAA for REF-only; or use GT in these rows, but allow use of LPL, LAD etc. in the specific case that GT is REF-only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to use the missing value . and state it explicitly that it means REF. I can't think of a case where this could cause problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Petr here. While the BCF specification states that a zero length array is defined to be a MISSING array, hence ., we could add an explicit statement to LAA saying that missing LAAs means that it matches REF. What we gain is a solution to this problem. What we lose is the ability to store an unknown LAA list. I'm unsure what the latter practically means. Why would you need to explicitly state it's missing instead of just omitting it. If it's missing then we're not using LAA (and if the PL, AD etc are missing we're not defining it that way either).

PS. Technically even if it's . this still contradicts "LAA is a sorted list of n distinct integers" as . isn't an integer. However that's nit-picking and maybe somewhere else VCF already defines . as being an integer of unknown value? I don't know what we do elsewhere when we have this technical disparty. Is it just assumed that "." is a polymophic type that matches whatever the item wants it to be?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the BCF spec says is that a type descriptor byte with ‘size’=0 denotes MISSING. It does not say what a type descriptor byte with ‘size’=15 followed by a typed Integer with value 0 denotes, but that would be a natural representation of a 0-length list. (The spec does say that ‘size’=15 implies the list length is ≥15, but the typed Integer is not biased so no meanings are described for typed Integer values <15.) Thus BCF does not provide a binding answer on whether 0-length lists are valid or representable in VCF/BCF.

But I'm not going to advocate blessing 0-length lists in VCF or BCF, especially not just for this, even though SAM found them useful for regularity (cf #135).

IMHO the clearest way to represent this corner case would be

As a special case, the homozygous REF case in which no ALT alleles are observed in a particular sample is represented as LAA=[0].

(And we'd need to rework n in the length maths: e.g. by dropping n from the description of LAA, and adding “where n is the number of ALT alleles listed in LAA” to the others.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could get behind using LAA=0 for the special case of no alternates. It wouldn't need much wording alteration. It mentions 0 is implicit and that LAA goes from 1 to <= n, but we could add an explicit caveat for the empty list and an explanation of why it's necessary. I don't like the mix of LAA=2,4 and LAA=[2,4] in the text though. I find it confusing to have two different syntaxes mixed together. I get it's trying to imply an array in a textual sense, but it would be clearer IMO if the textual description used the actual text encoding.

However I'm not really sure I understand the necessity for a distinction between empty and missing. I appreciate they are infact different things, however pragmatically are we expecting the tools to treat them differently? If so how? Is it a confidence thing, where we want to be able to specify a likelihood for REF-only?

I don't know enough about the internals of bcftools and htsjdk though to know whether an LAA of 0 would cause breakage. Likewise the type / size shenanigans. My instinct tells me htslib would explode if the size is incorrect and I'd rather not be relying on robustness of weird undiscovered corner cases in BCF. It's a confusing enough format as it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To expand on my previous comment: the logic of LAA is to define a list of alleles that can be referred to in subsequent localized fields. If only the reference allele is relevant, then it makes sense to provide the missing value ., in this sense it behaves in exactly the same way as what we write in REF and ALT columns. Note also the naming of the tag stands for "Local Alternate Alleles". By using . we are not introducing a new concept.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parachuting in here a bit, so my apologies if I'm missing context, but defining LAA as an R-length array whose first element is always the number zero both avoids these nasal demons and avoids a branch [1] when connecting a local index into a global index.

Is it too late to consider an R-length LAA instead of an A-length one?

[1] Text processing cost probably dominates, but if you convert a VCF into some binary format without adjusting LAA, maybe this matters a bit.

@jmarshall
Copy link
Member

Re (3) Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site?

Wouldn't sites where all samples include all the alleles just use GT/AD/PL rather than LAA/LGT/LAD/LPL? (And this would obviously be allowed as there is no text saying a file has to have all or none of its variant lines using LAA, and the default assumption would be that a mixture is fine.)

Or do you mean: when some samples include all the alleles and others don't, instead of writing 1,2,3,…,N for LAA for the all-alleles samples perhaps allow some special value to be an abbreviation for that? e.g. . but frankly that would be an unexpected meaning for .

@yfarjoun yfarjoun added this to the VCF v4.4 milestone May 18, 2020
@hts-specs-bot
Copy link

Changed PDFs as of a5d41dc: VCFv4.3 (diff).

pd3 added a commit to samtools/bcftools that referenced this pull request Aug 14, 2020
This is a draft implementation of samtools/hts-specs#434
and haploid Number=G tags are not handled yet.
@h-2
Copy link

h-2 commented Jan 13, 2023

I was just reading through all of this, since I am implementing it for one of our tools, and I wondered: what's the point of LGT ? It doesn't save any space and just adds the indirection of parsing through LAA...

In our applications, the only point of doing this is to transform fields with number G (PL-values), although an argument could be made for transforming any/all fields with number A and R, as well. Is there a reason for only suggesting AD and not other A/R-fields?

@d-cameron
Copy link
Contributor

d-cameron commented Jul 10, 2023

My preference is to expand the scope of this and make it generic over all Type=R/A/G fields with the local notation implictly defined by the presence of LAA^. That is, PL, GT, etc. are redefined from indexes w.r.t ALT to indexes w.r.t ALT[LAA] with a missing LAA being interpreted as the identity function.

In this manner, converting between the short and long-hand notations can be done not with L-prefixed fields, but with the fields themselves based on the presence/absence of LAA. This would actually simplify VCF merging as LAA would act as an indirection/remapping marker with the only FORMAT change required when merging VCFs is the addition of the LAA field.

Another benefit of this approach is that is allows fields such as PL to preserve their original meaning.

^ AI (alt index) is probably a better name than LAA as no L-prefixed fields would be required

@pd3
Copy link
Member

pd3 commented Jul 11, 2023

@d-cameron That's a terrible idea, it would break existing tools.

Also, having implemented this in bcftools merge, I don't see how it would help with merging VCFs at all, or what do you mean by your comment that it would help preserve PL its original meaning?

Perhaps better is to apply this idea to the LXX tags, rather than changing the semantics of existing tags. (That is, allow other LXX tags even when LAA is not present.)

@olest
Copy link

olest commented Oct 6, 2023

Hi,

we are using local alleles in DRAGEN gVCF Genotyper. More specifically, we write FORMAT/LPL, LAD, LGT and LAA in the multi-sample VCF output.

In particular FORMAT/LPL gives a significant space saving compared to FORMAT/PL which scales quadratically with the number of alleles at a variant site.

This becomes important for an msVCF representing a large cohort because multi-allelic sites with hundreds or even thousand of alleles will become more frequent.

In these cases, the msVCF line can exceed the size limit of the htslib parser as @mlin mentions above.

I would love to see this change to become part of the VCF spec. If there is anything needed to support this, please let me know.

VCFv4.3.tex Outdated Show resolved Hide resolved
@jmarshall
Copy link
Member

Since this PR was started, v4.4 has become the canonical VCF version. We may wish to adjust this PR to add the new fields to the VCFv4.4.tex document instead (or perhaps to both, but primarily to the current one, v4.4).

I have drafted PR #745 applying the local alleles changes to the VCFv4.4 document on top of current master and fixing some formatting etc. It is not intended to split the conversation from this PR, but to provide a preview PDF and demonstration of how these changes applies to current v4.4.

If @yfarjoun and the VCF maintainers agree, and wish to apply the local alleles additions to v4.4 instead of v4.3 (or perhaps to both), then we can force push the new PR's branch to yfarjoun:yf_local_alleles, close #745, and continue the conversation on this original PR.

@jkbonfield
Copy link
Contributor

Background into this, that came up in Future of VCF. All Of Us use VDS natively (Hail Variant Dataset). I think this is likely the origin of this PR and #435.

https://support.researchallofus.org/hc/en-us/articles/14927774297620-The-new-VariantDataset-VDS-format-for-All-of-Us-short-read-WGS-data
https://hail.is/docs/0.2/vds/index.html

One difference there is VDS uses LA and not LAA,and LA includes 0 for REF.

clrpackages pushed a commit to clearlinux-pkgs/htslib that referenced this pull request Dec 28, 2023
Andrew Whitwham (12):
      Uodate copyright for winter release.
      January 2023 NEWS update.
      Keeping the NEWS file up-to-date.
      More additions and improvements.
      Added htscodecs update to v1.4.0
      More space.
      Switched back to openssl for Alpine.
      Amalgamate multiple CIGAR ops into single entry. (#1607)
      Stop the overwriting of the end value.
      Summer 2023 copyright update.
      Speed up removal of lines in large headers.
      Winter News 2023 (PR #1703)

Bergur Ragnarsson (3):
      draft fix
      fix memory leak
      exit early on error

David Seifert (1):
      Use POSIX `grep`

Fabian Klötzl (1):
      improve parsing performance

Fangrui Song (1):
      Apply the packed attribute to uint*_u types for Clang

James Bonfield (86):
      Update man SEE ALSO sections from .BR to .IR so the website uses URLs
      bgzip text compression mode
      Make the bgzip -g option less opaque.
      Make tabix support CSI indices with large positions.
      Prevent crash when only FASTA entry has no sequence.
      Add an fai_line_length function.
      Check for invalid BC tags in fastq output.
      Warn if ref file is given but it doesn't contain the refs we need.
      Fix buffer read-overrun in bam_plp_insertion_mod.
      Fix ref fix from c91804c
      Make it easier to modify shared library permissions during install
      Add CRAM SQ/M5 header checking when specifying a fasta file. (PR #1522)
      Speed up load_ref_portion.
      Expand CRAM API a bit to cope with new samtools cram_size command. (PR #1546)
      Merges neighbouring I and D ops into one op within pileup. (PR #1552)
      Improve API docs for bgzf_mt
      FIx a bug in the codec learning algorithm for TOKA
      Fix a bug with multi-threading and embed_ref=2 on name sorted data
      Use non-ref mode when all else fails for CRAM encoding
      Add some documentation on cram encoder code structure
      Fix Cram compression container substitution matrix generation.
      Tweak the CRAM_SUBST_MATRIX table.
      Prevent spurious and random system errors from test_bgzf.c
      Fix cram_index_query_last function
      Avoid deeply nested containment list on old CRAM indices.
      Permit fastq output to create empty FASTQ records for seq "*".
      Fix a couple small VCF auto-indexing bugs.
      Backport attractivechaos/klib#78 to htslib.
      Slightly speed up various cram decoding functions (#1580)
      Remove CRAM 3.1 warning.
      Trivial fix to expr, removing "^".
      Add MZ:i tag as a check for base modification validity. (#1590)
      Fix typo in kh_int_hash_func2 macro.
      Rename aux tag MZ to MN.
      Protect against overly large containers.
      Don't create overly large CRAM blocks.
      Add a missing break statement in cram_codec_to_id. (#1614)
      Fix fd_seek on pipes on modern MinGW releases.
      Change bounds checking in probaln_glocal
      Fix a containment bug in cram_index_last.
      Migrate base modification code out of sam.c
      Correct base modification implicit / explicit status when mixed together.
      Add a bam_mods_queryi interface.
      Add bam_parse_basemod2 API with additional flags argument.
      Add more internal sam_mods.c documentation
      Update bam_next_basemod too to cope with HTS_MOD_REPORT_UNCHECKED.
      Fix decompress_peek_gz to cope with files starting on empty gzip blocks.
      Fix to 2e672f33 decompress_peek_gz change.
      Add fai_thread_pool interface.
      NEWS updates for pending release
      Makes bam_parse_cigar able to modify existing BAM records rather than
      Fix cut and paste errors in bam_aux2f documentation
      The first stage of vcf_parse_format speed improvements.
      Further VCF reading speeds optimisations.
      Revert most of the vcf_parse_info improvements.
      Add an hclen SAM filter function.
      Fix a minor memory leak in malformed CRAM EXTERNAL blocks. (#1671)
      Enable auto-vectorisation in CRAM 3.1 codecs.
      Cache key header lengths.
      Allow vcf_format to work on packed data, plus bcf_fmt_array improvements.
      Speed up kputd by approx 130%.
      Minor tweaks to bcf_fmt_array and bcf_str_missing usage.
      Fix a cram decode hang from block_resize.
      Always do the CRAM mutex lock/destroys.
      Add C++ casts for external headers.
      Enable optimisation level -O3 for SAM QUAL+33 formatting.
      Avoid a NULL pointer dereference while building CRAM embedded ref.
      Avoid a NULL pointer deref when erroring writing CRAM to stdout
      Make CRAM internal data structures use hts_pos_t.
      Prevent CRAM 3 from attempting to write out out-of-bounds ref positions
      Remove memory leak when cram_encode_container fails during a close.
      Fix out by one error on extend_ref memory allocation.
      Don't call cram_ref_decr on consensus-based references.
      Check for 64-bit values in BETA codec initialiser.
      Protect against CRAM slice end going beyond end of reference.
      Prevent extend_ref from making huge mallocs on very sparse data.
      Improve the fuzzer to write BAM/CRAM and BCF too.
      Fix memory leaks on failed CRAM encode.
      Permit embed_ref=2 mode to be reenabled after using no_ref.
      Tighten memory constraints for cram_decode.
      Further rewrite of the fuzz test harness
      Reduce maxmimum BCF header len in fuzzer.
      Fix buffer read overrun in cram_encode_aux.
      Disable hts_set_fai_filename call in hts_open_fuzzer.
      Avoid undefined behaviour integer overflow in extend_ref
      Fix integer overflow in cram_compress_block2

John Marshall (19):
      Add bam_aux_first()/bam_aux_next() tagged aux field iterator API
      Document that bam_aux_del()'s `s` parameter must be non-NULL (& reformat)
      Add symbol versioning to the ELF shared-object file
      Mention in INSTALL that using plugins may need -rdynamic
      Set _XOPEN_SOURCE in configure if it's not already set
      Add missing Makefile dependencies [minor]
      Make last_in a pointer to const [minor]
      Add "uncompressed" in hts_format_description() where appropriate
      Add hclose()-doesn't-close-fd option and use it for hopen("-")
      Take advantage of shared hopen("-") in htsfile.c
      Explicitly fclose(stdout) in test/test_view.c too
      Fix hfile_libcurl small seek bug
      Apply memchr optimisation in the general delimiter case too
      Add test case compiling the public headers as C++
      Remove remnants of HTS_HAVE_NEON, unused since PR #1587
      Remove NUMERIC_VERSION, unused since PR #1226
      Document primarily MM/ML and fix base-mod-related typos
      Compare as off_t (not size_t) and fix printf specifiers
      Install annot-tsv.1 man page and alphabetise annot-tsv rules

Lilian Janin (2):
      Fix error code 0 returned by bcftools after error
      Make bcftools return an error code != 0 after [E::bgzf_read_block] Invalid BGZF header at offset xxx

Petr Danecek (15):
      Remove variable redeclaration warnings from perl test script
      Make bcf_hdr_seqnames() work with gapped chromosome ids
      Make bcf_hdr_idinfo_exists() more robust
      Check if VCF POS column could be fully parsed
      Allow repeated calls of bcf_sr_set_regions (PR #1624)
      An attempt to parse malformatted region such as {1:1}-2 should fail
      Add new annot-tsv program
      Output full diff on failing tests
      Output full diff on failing tests
      Make qsort in regidx order-reproducible across platforms
      Provide a nill (dot) value when the field is empty
      Use EXIT_SUCCESS with -h and EXIT_FAILURE on errors
      Address various comments
      VCF parsing fix
      Allow renaming of the default -a annotations (#1709)

Rob Davies (65):
      Fix n-squared complexity in sample line with many adjacent tabs
      Switch to building libdeflate with cmake
      Add faidx_seq_len64() and fai_adjust_region() interfaces
      Rework / add new faidx tests
      Fix build on ancient versions of gcc
      Ensure strings in config_vars.h are escaped correctly
      Switch MacOS CI tests to an ARM-based image
      Cut down the number of embed_ref=2 tests that get run
      Cap bgzf_getline return value to INT_MAX
      Make tbx_parse1 work for lines longer than 2Gbytes
      Use correct type for ret in vcf_write()
      Don't error when making an iterator on a tid not in the index
      Happy New Year
      Catch errors from bgzf_getline() in hts_readlist, hts_readlines
      Add configure (enable|disable)-versioned-symbols options and tests
      Add Makefile rule to update the symbol version file
      Strip out symbol versions from shlib-exports-so.txt
      Update to htscodecs v1.4.0
      Minor NEWS adjustment and additonal item
      Switch to CURLINFO_CONTENT_LENGTH_DOWNLOAD_T for newer libcurl
      Fix crypt4gh redirection
      Remove use of sprintf() from HTSlib source
      Make SIMD tests work when building multiarch binaries
      Make MacOS tests build a multiarch version of the library
      Fix bug where bin number could overflow when looking for max_off
      Make reg2bins faster on whole-chromosome queries
      Make reg2intervals() faster on whole-chromosome queries
      Update to latest htscodecs
      Don't set _POSIX_C_SOURCE for htscodecs tests
      Fix trailing space in config.h made by configure
      Ignore generated config_vars.h file in copyright check
      Switch to `/usr/bin/env perl` for all perl scripts
      Adjust comments in probaln_glocal()
      Expand test-bcf-sr.c capabilities
      Add synced reader region tests, and move no-index tests
      Fix possible double frees in bcf_hdr_add_hrec() error handling
      Prevent dangling hrec pointer after bcf_hdr_add_hrec() failure
      Remove items from hdict in bcf_hdr_remove()
      Ensure number of modifications is always set in bam_parse_basemod2()
      Ensure simple_test_driver.sh cleans up its temporary files
      Ensure base mod test result is noticed by the Makefile
      Improve test/test_mod.c
      Switch to htscodecs 1.5.1
      Skip CRC checks when fuzzing
      Prevent out-of-memory reports when fuzzing
      Add missing dependency on libhts.a for hts_open_fuzzer
      Fix virtual offset adjustment when indexing on-the-fly
      Fix BCF/VCF on-the-fly indexing issues
      Fix crypt4gh redirection
      Update htscodecs to v1.5.2 (2aca18b3)
      Simplify run_test function
      Fix test/test.pl -F dependencies
      Simplify test/test.pl
      Move all annot_tsv tests into their own function
      Fix up some annot-tsv error checks / reports
      Reformulate man page for house style
      Make compiler flag detection work with zig cc
      Fix a couple of unused value warnings when built with NDEBUG
      Fix @pg linking when records make a loop
      Fix possible shift of negative value in cram_encode_aux()
      Move typecast to the right place
      Ensure no-stored-sequence reads are counted in container size
      Make output on unrecognised options go to stderr
      Update to latest htscodecs
      Update htscodecs to v1.6.0

kojix2 (2):
      Fix a typo in sam.h documentation
      Fix example in docs for sam_hdr_add_line

pd3 (4):
      Remove a bottleneck in VCF header processing
      Add support for non-standard chromosome names containing [:-] characters
      Clarify usage, include output example
      Temporary workaround when excessive memory is required by FORMAT fields

vasudeva8 (10):
      Adds bcf_strerror method (PR #1510)
      Ensure NUL termination of Z/H data in sam_format_aux1; fix base mod state reuse
      Changes to avoid segfault with uncompressed bam (PR #1632)
      Demonstration of htslib/sam api usage.
      formatting update
      bgzf_useek fails when offset is above block limits
      Add support for multiple files to bgzip
      Limited usage text to 80 chars per line
      man update for -a option (PR #1716)
      Fix example column names and explain core column renaming

Étienne Mollier (2):
      cram/cram_external.c: fix external htscodecs include
      htslib-s3-plugin.7: fix whatis entry

Noteworthy changes in release 1.19 (12th December 2023)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Updates
-------

* A temporary work-around has been put in the VCF parser so that it is
  less likely to fail on rows with a large number of ALT alleles,
  where Number=G tags like PL can expand beyond the 2Gb limit enforced
  by HTSlib.  For now, where this happens the offending tag will be dropped
  so the data can be processed, albeit without the likelihood data.

  In future work, the library will instead convert such tags into their
  local alternatives (see samtools/hts-specs#434).

(NEWS truncated at 15 lines)
Copy link

github-actions bot commented Jan 2, 2024

Changed PDFs as of 4e4bae4: VCFv4.4 (diff — see pp10–13).

@jmarshall
Copy link
Member

As discussed in today's meeting, this PR has been updated to apply to the v4.4 document instead (and some formatting is improved).

Some further work needed was discussed in the meeting:

  • We may wish to apply this to an upcoming v4.5 draft instead;
  • The new genotype fields are listed as Number=., which may wish to be updated to Number=P (for some of them at least) once we have settled on a suitable definition of P (cf vcf: Define Number value P #705).

@danking
Copy link
Contributor

danking commented Jan 31, 2024

@h-2

I was just reading through all of this, since I am implementing it for one of our tools, and I wondered: what's the point of LGT ? It doesn't save any space and just adds the indirection of parsing through LAA...
In our applications, the only point of doing this is to transform fields with number G (PL-values), although an argument could be made for transforming any/all fields with number A and R, as well. Is there a reason for only suggesting AD and not other A/R-fields?

We (Hail) also questioned the value of LGT until we discovered a footgun. Suppose:

  1. You have a VCF that uses local alleles.
  2. At locus chr1:1, the alternate allele array is A,T,G and there are two samples whose genotypes are: 0/3 and 1/2.
  3. You decide to remove the second sample.

It is not unreasonable to also remove the now unused alternate alleles from the alternate allele array, but now your GT is wrong. So, the options when filtering samples are:

  1. Always keep the "dead" alternate alleles.
  2. Remove the "dead" alternate alleles and also update any affected GTs.
  3. Use LGT.

For Hail's purposes, using LGT seems best because it allows sample subsetting without O(N_SAMPLES_KEPT) edits. It also avoids the footgun of "whoopsie I changed the meaning of all my GTs!".

@danking
Copy link
Contributor

danking commented Jan 31, 2024

@jmarshall

The new genotype fields are listed as Number=., which may wish to be updated to Number=P (for some of them at least) once we have settled on a suitable definition of P (cf #705).

Do I understand correctly that

one for each allele specified in the GT

means that P is the ploidy of the GT? Concretely, 1/1, 1/2, and 2/2 all have P equal to 2?

I'd prefer three new numbers (all of which are distinct from the ploidy):

  • LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC.
  • LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD.
  • LOCAL-G. One element for each possible genotype. If $N_{LA}$ is the number of local alleles, LOCAL-G is the triangle number of $N_{LA}$. Concretely: if diploid, $\frac{1}{2}N_{LA} (N_{LA}+1)$, and, if haploid, $N_{LA}$. For example: LPL.

I think P is sometimes equal to LOCAL-A, sometimes equal to LOCAL-R, and sometimes neither:

GT P LOCAL-A LOCAL-R
0/0 2 0 1
0/1 2 1 2
1/2 2 2 3

@LeeTL1220
Copy link

LeeTL1220 commented Jan 31, 2024

All Of Us use VDS natively (Hail Variant Dataset).

Apologies, if I am being nitpicky...

@jkbonfield FYI: I don't believe that "natively" is quite accurate, since it implies that we do our joint calling, conversions, and search functionality using Hail/VDS. We use Genomic Variant Store (GVS) for all of that, except for some conversions. We use GVS to render a VDS (also Hail for this one), VCF, and pgen. We do use the Hail/VDS to render bgen and plinkbed, because GVS does not support those. Note that GVS is a variant of SVCR (no pun intended).

@d-cameron
Copy link
Contributor

Based onto VCF 4.5. Please continue discussion on #758.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.