-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define Local Alleles in VCF to allow for sparser format #434
Conversation
VCFv4.3.tex
Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields} | |||
GQ & 1 & Integer & Conditional genotype quality \\ | |||
GT & 1 & String & Genotype \\ | |||
HQ & 2 & Integer & Haplotype quality \\ | |||
LAA & . & Integer & Local Alternate Alleles\footnotemark[1]\\ | |||
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only.\footnotemark[1]\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't LPL be sorted after LAD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both LPL and LAD should be sorted after REF+LAA, the same way as REF+ALT define the order of PL and AD.
VCFv4.3.tex
Outdated
\end{itemize} | ||
\item LAA (and LAD and LPL (*): | ||
For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD. | ||
This can cause the file-sizes to grow super-linearly with the number of samples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different take on the previous two sentences; feel free to use/recombine
In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. Usually, few of these alleles are actually observed in any one sample, but each sample must supply fields like PL and AD for all of the alleles -- a very inefficient representation as PL's size is quadratic in the allele count.
VCFv4.3.tex
Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields} | |||
GQ & 1 & Integer & Conditional genotype quality \\ | |||
GT & 1 & String & Genotype \\ | |||
HQ & 2 & Integer & Haplotype quality \\ | |||
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ | |
LAA & . & Integer & Local Alternate Alleles the strictly-increasing, 1-based indices into the alternate alleles indicating which are relevant for the current sample.\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe:
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ | |
LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM but let's say the indices have to be strictly increasing; obviously that's a pedantic constraint but without it, the ordering of LPL could be unclear as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dwelling on this further -- @yfarjoun @lbergelson -- I'm worried just slightly about confusion over the LPL ordering. Lets say the joint VCF has REF=A ALT=G,C,T and the sample GVCF has ALT=T,G. so LAA=1,3 if we require the indices to be in order, but this order is flipped with respect to the GVCF.
The merging algo then has to be really careful to permute the GVCF PL values in just the right way to match that. In the wild, I'm nervous this is going to lead to difficult-to-detect bugs with subtle swapping of probability values. Did you have any experience with this in developing the idea? Should we permit the merger to write an out-of-order LAA (this might just shift the potential confusion to readers)? I don't have a better idea right now, it just seems to me like a significant pitfall at least warranting explication where LPL is discussed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is a concern. The only actual implementation so far of a format with local alleles decided to store the allele string itself instead of the index because they didn't want to deal with the reordering problem. So unordered [T, G] instead of [1, 3]. Move the difficulties from the merge step to the reading step in that now if you want to look up non-local values associated with an allele you have to do a search in the INFO allele values to find the corresponding index. I
My thinking was that it was better to require the indexes as numbers and do the reordering, but I think there's a strong argument the other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll spawn a background thread on this. BTW, if anyone knows a better formula than the following to go from PL index gt
to the indices (i,j)
of the constituent alleles, I would like to know. This is seriously the best I could come up with, holy sh**.
unsigned j = (unsigned)((sqrt(8*gt+1)-1.0)/2.0);
unsigned i = gt - j*(j+1)/2;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like @cyenyxe's suggestion, it solves a possible confusion. It also allows to determine the LPL ordering per allele-set, samples with the same set of local alleles are guaranteed to use the same order.
@mlin Note that the proposed formula is for the special case of diploid genotypes. In my experience, in practice it is sufficient to pre-compute the indexes using nested/recursive loops as suggested in the description of GL, section 1.6.2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do the indexes need to be strictly increasing?
One of the main points is to lessen the burden on the creation of these pVCFs...so that when the alleles change the only change that needs to happen in the format field is the LAA. if we require strictly increasing order, that goes out the window.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If LAA are strictly increasing, LPL's are guaranteed to have the same order as 1) the original PLs and 2) samples with the same set of alleles.
Editing of VCFs is much less frequent operation than read-only processing, therefore it is more important to optimize the speed and convenience of the latter.
VCFv4.3.tex
Outdated
@@ -503,7 +505,15 @@ \subsubsection{Genotype fields} | |||
All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. | |||
If the genotype in the GT field is unphased, the corresponding PS field is ignored. | |||
The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). | |||
\end{itemize} | |||
\item LAA (and LAD and LPL (*): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would follow the existing of one item per field, otherwise it may be confusing to readers searching for LAD and LPL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I leave the paragraph as is, and add entries for LAD and LPL pointing to LAA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The AD, ADF, ADR paragraph also groups related items into a single paragraph. Fortuitously in both cases the three entries are adjacent alphabetically.
(Though the whole LAA/LAD/LPL paragraph wants to move up to between HQ and MQ, and this label (LAA (and LAD and LPL (*)
) wants less ‘and’ and better balanced parentheses!)
VCFv4.3.tex
Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields} | |||
GQ & 1 & Integer & Conditional genotype quality \\ | |||
GT & 1 & String & Genotype \\ | |||
HQ & 2 & Integer & Haplotype quality \\ | |||
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe:
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ | |
LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\ |
VCFv4.3.tex
Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields} | |||
GQ & 1 & Integer & Conditional genotype quality \\ | |||
GT & 1 & String & Genotype \\ | |||
HQ & 2 & Integer & Haplotype quality \\ | |||
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ | |||
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\ | |
LAD & . & Integer & Read depth for the reference and each of the local alternate alleles listed in LAA\\ |
VCFv4.3.tex
Outdated
@@ -406,12 +406,14 @@ \subsubsection{Genotype fields} | |||
GQ & 1 & Integer & Conditional genotype quality \\ | |||
GT & 1 & String & Genotype \\ | |||
HQ & 2 & Integer & Haplotype quality \\ | |||
LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\ | |||
LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\ | |||
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\ | |
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\ | |
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA, in the genotype order implied by LAA such that LPL would be a subsequence of PL were the latter present.\\ |
@lbergelson @pd3 following the previous discussion, I feel it's worth leaving no possible question about the LPL order -- here's my attempt, better suggestions welcome.
For GVCF merging, we're already responsible for carefully permuting the GVCF PL values into the target joint vector, the new wrinkle will be that each sample can have a different target. I think this is okay, it was just tempting to want to copy in the GVCF PL verbatim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to speak about ordering only and not require a strict subsequence: it should be possible to rescale the LPL likelihoods.
Hi all, I was motivated to come back here after another collaborator just run aground on the htslib int overflow when it tries to allocate a 2GB+ PL vector 😄 😭 It seems like we are not far; the open items are
|
I think this is a nice new feature, but it took me many minutes of very close reading to figure out just what the feature is! If I understand it correctly, it could perhaps be better described as:
thus demonstrating the different subsets per sample.
LAA is a list of n integers, where 1 <= n <= the number of ALT alleles, giving the (1-based) indices within ALT of the alleles that are observed in the sample. [And a sentence about the ordering if so decided.] (Uh — can n be 0?) LAD is a list of n+1 integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA. LPL is a list of n+1 choose P (or whatever the combinatoric calculation is!) integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. [And refer to the precise ordering defined in the GL paragraph.] (Writing out all that for each field might mean they would be better as separate paragraphs (as per #434 (comment)) or it might be better to keep them together to collaborate on the same example better…) |
This is a good point; we should probably clarify how one should write LAA & LPL for samples reporting no evidence for any particular alternate allele. |
Some misc things I've been thinking about while implementing this:
|
VCFv4.3.tex
Outdated
2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\ | ||
3&C&G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 4:0/0:30,1:0,30,80\\ | ||
3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.1:0,.,.,.,.,.,.,.,.,.,30,.,.,.,80\\ | ||
4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& :0/0:30:0\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo here: we either need to remove the LAA field and indicate that it can be left out if there are no non-ref local alleles, or we need to add a leading .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a typo: the intention is that LAA is a 0-length list of integers (indicating that this record is homozygous ref for this sample and that none of the ALT alleles are in play), not an unknown MISSING value.
This may be a problem, as it's not clear that 0-length lists are valid syntax in VCF or BCF. Omitting LAA is probably not the right workaround, as presence of LAA is the simple indicator that local alleles are being used in a VCF record. Setting it to .
(unknown) is somewhat misleading, as it is known…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example contradicts the current text, which says “LAA is a sorted list of n distinct integers, where n ≥ 1 […]”. So the text will need adjustment if this example record is deemed valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bcftools
converts the zero-length list into a missing value. This comes from bcftools treating everything as BCF internally, and BCF says zero-length lists are missing.
Possible fixes would be to say missing LAA
means REF; to explicitly put 0 in LAA
for REF-only; or use GT
in these rows, but allow use of LPL
, LAD
etc. in the specific case that GT
is REF-only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to use the missing value .
and state it explicitly that it means REF. I can't think of a case where this could cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Petr here. While the BCF specification states that a zero length array is defined to be a MISSING array, hence .
, we could add an explicit statement to LAA saying that missing LAAs means that it matches REF. What we gain is a solution to this problem. What we lose is the ability to store an unknown LAA list. I'm unsure what the latter practically means. Why would you need to explicitly state it's missing instead of just omitting it. If it's missing then we're not using LAA (and if the PL, AD etc are missing we're not defining it that way either).
PS. Technically even if it's .
this still contradicts "LAA is a sorted list of n distinct integers" as .
isn't an integer. However that's nit-picking and maybe somewhere else VCF already defines .
as being an integer of unknown value? I don't know what we do elsewhere when we have this technical disparty. Is it just assumed that "." is a polymophic type that matches whatever the item wants it to be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What the BCF spec says is that a type descriptor byte with ‘size’=0 denotes MISSING. It does not say what a type descriptor byte with ‘size’=15 followed by a typed Integer with value 0 denotes, but that would be a natural representation of a 0-length list. (The spec does say that ‘size’=15 implies the list length is ≥15, but the typed Integer is not biased so no meanings are described for typed Integer values <15.) Thus BCF does not provide a binding answer on whether 0-length lists are valid or representable in VCF/BCF.
But I'm not going to advocate blessing 0-length lists in VCF or BCF, especially not just for this, even though SAM found them useful for regularity (cf #135).
IMHO the clearest way to represent this corner case would be
As a special case, the homozygous REF case in which no ALT alleles are observed in a particular sample is represented as LAA=[0].
(And we'd need to rework n in the length maths: e.g. by dropping n from the description of LAA, and adding “where n is the number of ALT alleles listed in LAA” to the others.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could get behind using LAA=0
for the special case of no alternates. It wouldn't need much wording alteration. It mentions 0 is implicit and that LAA goes from 1 to <= n, but we could add an explicit caveat for the empty list and an explanation of why it's necessary. I don't like the mix of LAA=2,4 and LAA=[2,4] in the text though. I find it confusing to have two different syntaxes mixed together. I get it's trying to imply an array in a textual sense, but it would be clearer IMO if the textual description used the actual text encoding.
However I'm not really sure I understand the necessity for a distinction between empty and missing. I appreciate they are infact different things, however pragmatically are we expecting the tools to treat them differently? If so how? Is it a confidence thing, where we want to be able to specify a likelihood for REF-only?
I don't know enough about the internals of bcftools and htsjdk though to know whether an LAA of 0 would cause breakage. Likewise the type / size shenanigans. My instinct tells me htslib would explode if the size is incorrect and I'd rather not be relying on robustness of weird undiscovered corner cases in BCF. It's a confusing enough format as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To expand on my previous comment: the logic of LAA is to define a list of alleles that can be referred to in subsequent localized fields. If only the reference allele is relevant, then it makes sense to provide the missing value .
, in this sense it behaves in exactly the same way as what we write in REF and ALT columns. Note also the naming of the tag stands for "Local Alternate Alleles". By using .
we are not introducing a new concept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parachuting in here a bit, so my apologies if I'm missing context, but defining LAA as an R-length array whose first element is always the number zero both avoids these nasal demons and avoids a branch [1] when connecting a local index into a global index.
Is it too late to consider an R-length LAA instead of an A-length one?
[1] Text processing cost probably dominates, but if you convert a VCF into some binary format without adjusting LAA, maybe this matters a bit.
Re (3) Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site? Wouldn't sites where all samples include all the alleles just use GT/AD/PL rather than LAA/LGT/LAD/LPL? (And this would obviously be allowed as there is no text saying a file has to have all or none of its variant lines using LAA, and the default assumption would be that a mixture is fine.) Or do you mean: when some samples include all the alleles and others don't, instead of writing |
This is a draft implementation of samtools/hts-specs#434 and haploid Number=G tags are not handled yet.
I was just reading through all of this, since I am implementing it for one of our tools, and I wondered: what's the point of In our applications, the only point of doing this is to transform fields with number G (PL-values), although an argument could be made for transforming any/all fields with number A and R, as well. Is there a reason for only suggesting |
My preference is to expand the scope of this and make it generic over all In this manner, converting between the short and long-hand notations can be done not with Another benefit of this approach is that is allows fields such as ^ |
@d-cameron That's a terrible idea, it would break existing tools. Also, having implemented this in Perhaps better is to apply this idea to the LXX tags, rather than changing the semantics of existing tags. (That is, allow other LXX tags even when LAA is not present.) |
Hi, we are using local alleles in DRAGEN gVCF Genotyper. More specifically, we write FORMAT/LPL, LAD, LGT and LAA in the multi-sample VCF output. In particular FORMAT/LPL gives a significant space saving compared to FORMAT/PL which scales quadratically with the number of alleles at a variant site. This becomes important for an msVCF representing a large cohort because multi-allelic sites with hundreds or even thousand of alleles will become more frequent. In these cases, the msVCF line can exceed the size limit of the htslib parser as @mlin mentions above. I would love to see this change to become part of the VCF spec. If there is anything needed to support this, please let me know. |
Since this PR was started, v4.4 has become the canonical VCF version. We may wish to adjust this PR to add the new fields to the VCFv4.4.tex document instead (or perhaps to both, but primarily to the current one, v4.4). I have drafted PR #745 applying the local alleles changes to the VCFv4.4 document on top of current master and fixing some formatting etc. It is not intended to split the conversation from this PR, but to provide a preview PDF and demonstration of how these changes applies to current v4.4. If @yfarjoun and the VCF maintainers agree, and wish to apply the local alleles additions to v4.4 instead of v4.3 (or perhaps to both), then we can force push the new PR's branch to yfarjoun:yf_local_alleles, close #745, and continue the conversation on this original PR. |
Background into this, that came up in Future of VCF. All Of Us use VDS natively (Hail Variant Dataset). I think this is likely the origin of this PR and #435. https://support.researchallofus.org/hc/en-us/articles/14927774297620-The-new-VariantDataset-VDS-format-for-All-of-Us-short-read-WGS-data One difference there is VDS uses LA and not LAA,and LA includes 0 for REF. |
Andrew Whitwham (12): Uodate copyright for winter release. January 2023 NEWS update. Keeping the NEWS file up-to-date. More additions and improvements. Added htscodecs update to v1.4.0 More space. Switched back to openssl for Alpine. Amalgamate multiple CIGAR ops into single entry. (#1607) Stop the overwriting of the end value. Summer 2023 copyright update. Speed up removal of lines in large headers. Winter News 2023 (PR #1703) Bergur Ragnarsson (3): draft fix fix memory leak exit early on error David Seifert (1): Use POSIX `grep` Fabian Klötzl (1): improve parsing performance Fangrui Song (1): Apply the packed attribute to uint*_u types for Clang James Bonfield (86): Update man SEE ALSO sections from .BR to .IR so the website uses URLs bgzip text compression mode Make the bgzip -g option less opaque. Make tabix support CSI indices with large positions. Prevent crash when only FASTA entry has no sequence. Add an fai_line_length function. Check for invalid BC tags in fastq output. Warn if ref file is given but it doesn't contain the refs we need. Fix buffer read-overrun in bam_plp_insertion_mod. Fix ref fix from c91804c Make it easier to modify shared library permissions during install Add CRAM SQ/M5 header checking when specifying a fasta file. (PR #1522) Speed up load_ref_portion. Expand CRAM API a bit to cope with new samtools cram_size command. (PR #1546) Merges neighbouring I and D ops into one op within pileup. (PR #1552) Improve API docs for bgzf_mt FIx a bug in the codec learning algorithm for TOKA Fix a bug with multi-threading and embed_ref=2 on name sorted data Use non-ref mode when all else fails for CRAM encoding Add some documentation on cram encoder code structure Fix Cram compression container substitution matrix generation. Tweak the CRAM_SUBST_MATRIX table. Prevent spurious and random system errors from test_bgzf.c Fix cram_index_query_last function Avoid deeply nested containment list on old CRAM indices. Permit fastq output to create empty FASTQ records for seq "*". Fix a couple small VCF auto-indexing bugs. Backport attractivechaos/klib#78 to htslib. Slightly speed up various cram decoding functions (#1580) Remove CRAM 3.1 warning. Trivial fix to expr, removing "^". Add MZ:i tag as a check for base modification validity. (#1590) Fix typo in kh_int_hash_func2 macro. Rename aux tag MZ to MN. Protect against overly large containers. Don't create overly large CRAM blocks. Add a missing break statement in cram_codec_to_id. (#1614) Fix fd_seek on pipes on modern MinGW releases. Change bounds checking in probaln_glocal Fix a containment bug in cram_index_last. Migrate base modification code out of sam.c Correct base modification implicit / explicit status when mixed together. Add a bam_mods_queryi interface. Add bam_parse_basemod2 API with additional flags argument. Add more internal sam_mods.c documentation Update bam_next_basemod too to cope with HTS_MOD_REPORT_UNCHECKED. Fix decompress_peek_gz to cope with files starting on empty gzip blocks. Fix to 2e672f33 decompress_peek_gz change. Add fai_thread_pool interface. NEWS updates for pending release Makes bam_parse_cigar able to modify existing BAM records rather than Fix cut and paste errors in bam_aux2f documentation The first stage of vcf_parse_format speed improvements. Further VCF reading speeds optimisations. Revert most of the vcf_parse_info improvements. Add an hclen SAM filter function. Fix a minor memory leak in malformed CRAM EXTERNAL blocks. (#1671) Enable auto-vectorisation in CRAM 3.1 codecs. Cache key header lengths. Allow vcf_format to work on packed data, plus bcf_fmt_array improvements. Speed up kputd by approx 130%. Minor tweaks to bcf_fmt_array and bcf_str_missing usage. Fix a cram decode hang from block_resize. Always do the CRAM mutex lock/destroys. Add C++ casts for external headers. Enable optimisation level -O3 for SAM QUAL+33 formatting. Avoid a NULL pointer dereference while building CRAM embedded ref. Avoid a NULL pointer deref when erroring writing CRAM to stdout Make CRAM internal data structures use hts_pos_t. Prevent CRAM 3 from attempting to write out out-of-bounds ref positions Remove memory leak when cram_encode_container fails during a close. Fix out by one error on extend_ref memory allocation. Don't call cram_ref_decr on consensus-based references. Check for 64-bit values in BETA codec initialiser. Protect against CRAM slice end going beyond end of reference. Prevent extend_ref from making huge mallocs on very sparse data. Improve the fuzzer to write BAM/CRAM and BCF too. Fix memory leaks on failed CRAM encode. Permit embed_ref=2 mode to be reenabled after using no_ref. Tighten memory constraints for cram_decode. Further rewrite of the fuzz test harness Reduce maxmimum BCF header len in fuzzer. Fix buffer read overrun in cram_encode_aux. Disable hts_set_fai_filename call in hts_open_fuzzer. Avoid undefined behaviour integer overflow in extend_ref Fix integer overflow in cram_compress_block2 John Marshall (19): Add bam_aux_first()/bam_aux_next() tagged aux field iterator API Document that bam_aux_del()'s `s` parameter must be non-NULL (& reformat) Add symbol versioning to the ELF shared-object file Mention in INSTALL that using plugins may need -rdynamic Set _XOPEN_SOURCE in configure if it's not already set Add missing Makefile dependencies [minor] Make last_in a pointer to const [minor] Add "uncompressed" in hts_format_description() where appropriate Add hclose()-doesn't-close-fd option and use it for hopen("-") Take advantage of shared hopen("-") in htsfile.c Explicitly fclose(stdout) in test/test_view.c too Fix hfile_libcurl small seek bug Apply memchr optimisation in the general delimiter case too Add test case compiling the public headers as C++ Remove remnants of HTS_HAVE_NEON, unused since PR #1587 Remove NUMERIC_VERSION, unused since PR #1226 Document primarily MM/ML and fix base-mod-related typos Compare as off_t (not size_t) and fix printf specifiers Install annot-tsv.1 man page and alphabetise annot-tsv rules Lilian Janin (2): Fix error code 0 returned by bcftools after error Make bcftools return an error code != 0 after [E::bgzf_read_block] Invalid BGZF header at offset xxx Petr Danecek (15): Remove variable redeclaration warnings from perl test script Make bcf_hdr_seqnames() work with gapped chromosome ids Make bcf_hdr_idinfo_exists() more robust Check if VCF POS column could be fully parsed Allow repeated calls of bcf_sr_set_regions (PR #1624) An attempt to parse malformatted region such as {1:1}-2 should fail Add new annot-tsv program Output full diff on failing tests Output full diff on failing tests Make qsort in regidx order-reproducible across platforms Provide a nill (dot) value when the field is empty Use EXIT_SUCCESS with -h and EXIT_FAILURE on errors Address various comments VCF parsing fix Allow renaming of the default -a annotations (#1709) Rob Davies (65): Fix n-squared complexity in sample line with many adjacent tabs Switch to building libdeflate with cmake Add faidx_seq_len64() and fai_adjust_region() interfaces Rework / add new faidx tests Fix build on ancient versions of gcc Ensure strings in config_vars.h are escaped correctly Switch MacOS CI tests to an ARM-based image Cut down the number of embed_ref=2 tests that get run Cap bgzf_getline return value to INT_MAX Make tbx_parse1 work for lines longer than 2Gbytes Use correct type for ret in vcf_write() Don't error when making an iterator on a tid not in the index Happy New Year Catch errors from bgzf_getline() in hts_readlist, hts_readlines Add configure (enable|disable)-versioned-symbols options and tests Add Makefile rule to update the symbol version file Strip out symbol versions from shlib-exports-so.txt Update to htscodecs v1.4.0 Minor NEWS adjustment and additonal item Switch to CURLINFO_CONTENT_LENGTH_DOWNLOAD_T for newer libcurl Fix crypt4gh redirection Remove use of sprintf() from HTSlib source Make SIMD tests work when building multiarch binaries Make MacOS tests build a multiarch version of the library Fix bug where bin number could overflow when looking for max_off Make reg2bins faster on whole-chromosome queries Make reg2intervals() faster on whole-chromosome queries Update to latest htscodecs Don't set _POSIX_C_SOURCE for htscodecs tests Fix trailing space in config.h made by configure Ignore generated config_vars.h file in copyright check Switch to `/usr/bin/env perl` for all perl scripts Adjust comments in probaln_glocal() Expand test-bcf-sr.c capabilities Add synced reader region tests, and move no-index tests Fix possible double frees in bcf_hdr_add_hrec() error handling Prevent dangling hrec pointer after bcf_hdr_add_hrec() failure Remove items from hdict in bcf_hdr_remove() Ensure number of modifications is always set in bam_parse_basemod2() Ensure simple_test_driver.sh cleans up its temporary files Ensure base mod test result is noticed by the Makefile Improve test/test_mod.c Switch to htscodecs 1.5.1 Skip CRC checks when fuzzing Prevent out-of-memory reports when fuzzing Add missing dependency on libhts.a for hts_open_fuzzer Fix virtual offset adjustment when indexing on-the-fly Fix BCF/VCF on-the-fly indexing issues Fix crypt4gh redirection Update htscodecs to v1.5.2 (2aca18b3) Simplify run_test function Fix test/test.pl -F dependencies Simplify test/test.pl Move all annot_tsv tests into their own function Fix up some annot-tsv error checks / reports Reformulate man page for house style Make compiler flag detection work with zig cc Fix a couple of unused value warnings when built with NDEBUG Fix @pg linking when records make a loop Fix possible shift of negative value in cram_encode_aux() Move typecast to the right place Ensure no-stored-sequence reads are counted in container size Make output on unrecognised options go to stderr Update to latest htscodecs Update htscodecs to v1.6.0 kojix2 (2): Fix a typo in sam.h documentation Fix example in docs for sam_hdr_add_line pd3 (4): Remove a bottleneck in VCF header processing Add support for non-standard chromosome names containing [:-] characters Clarify usage, include output example Temporary workaround when excessive memory is required by FORMAT fields vasudeva8 (10): Adds bcf_strerror method (PR #1510) Ensure NUL termination of Z/H data in sam_format_aux1; fix base mod state reuse Changes to avoid segfault with uncompressed bam (PR #1632) Demonstration of htslib/sam api usage. formatting update bgzf_useek fails when offset is above block limits Add support for multiple files to bgzip Limited usage text to 80 chars per line man update for -a option (PR #1716) Fix example column names and explain core column renaming Étienne Mollier (2): cram/cram_external.c: fix external htscodecs include htslib-s3-plugin.7: fix whatis entry Noteworthy changes in release 1.19 (12th December 2023) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Updates ------- * A temporary work-around has been put in the VCF parser so that it is less likely to fail on rows with a large number of ALT alleles, where Number=G tags like PL can expand beyond the 2Gb limit enforced by HTSlib. For now, where this happens the offending tag will be dropped so the data can be processed, albeit without the likelihood data. In future work, the library will instead convert such tags into their local alternatives (see samtools/hts-specs#434). (NEWS truncated at 15 lines)
a5d41dc
to
4e4bae4
Compare
As discussed in today's meeting, this PR has been updated to apply to the v4.4 document instead (and some formatting is improved). Some further work needed was discussed in the meeting:
|
We (Hail) also questioned the value of LGT until we discovered a footgun. Suppose:
It is not unreasonable to also remove the now unused alternate alleles from the alternate allele array, but now your
For Hail's purposes, using LGT seems best because it allows sample subsetting without O(N_SAMPLES_KEPT) edits. It also avoids the footgun of "whoopsie I changed the meaning of all my GTs!". |
Do I understand correctly that
means that P is the ploidy of the GT? Concretely, I'd prefer three new numbers (all of which are distinct from the ploidy):
I think
|
Apologies, if I am being nitpicky... @jkbonfield FYI: I don't believe that "natively" is quite accurate, since it implies that we do our joint calling, conversions, and search functionality using Hail/VDS. We use Genomic Variant Store (GVS) for all of that, except for some conversions. We use GVS to render a VDS (also Hail for this one), VCF, and pgen. We do use the Hail/VDS to render bgen and plinkbed, because GVS does not support those. Note that GVS is a variant of SVCR (no pun intended). |
Based onto VCF 4.5. Please continue discussion on #758. |
This is the local-alleles part of #420