-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Representation of cross-join mappings on circular reference sequences #405
Conversation
I've rebased my draft of the same thing on top of yours. If your footnote stays (the original style in this table would be to have the bare minimum of information (cf |
SAMv1.tex
Outdated
@@ -658,6 +658,7 @@ \section{Recommended Practice for the SAM Format} | |||
should be present. | |||
\item The {\tt NM} tag should be present. | |||
\end{enumerate} | |||
\item At present, this specification does not describe how mappings across the ``join'' in a circular reference sequence should be represented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be controversial to just say "use supplementary alignments"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using supplementary alignments (so representing the linear alignment (§1.2) in two separate alignment records?!) is one way to represent it, but other ways exist and (ignoring downstream considerations for the moment) this one doesn't seem like the best way to represent this in SAM.
As for the representation of mappings across the join, this draft just adds a note to §2 Recommended Practice punting on defining that. Instead of that note, IMHO we should change ¶4.3 just above there (If POS plus the sum of lengths of M/=/X/D/N operations … exceeds LN …) so that it only applies to non- It seems to me that this would be sufficient and a more natural way to represent these things in SAM than e.g. splitting the mapping into two separate alignment records using a supplementary record. This may need a little work in downstream tools, but wouldn't that be the case however such alignments were represented? (We could just add |
So choices for this PR are:
|
@jmarshall I am all for merging as is now, with a slight preference to say you can just use supplementary for now, but with a long-term preference for a single alignment record (but how do we pileup efficiently? need to modify htslib/htsjdk to support off-the-end mappings?). Perhaps merge now and you or I can start a new issue about the representation of cross-join mappings? |
+1 for having a single alignment that simply wraps around. Tools need to have coped with alignments off the end of sequences anyway as some buggy aligners produced such data, so the reasonably worse scenario here isn't buffer overflow but rejection of some data as illegal. Upgrading analysis tools to do modulo arithmetic shouldn't be excessive, but I'd expect take up of TP to be gradual so we have time. Maybe... |
+1 for splitting alignment into primary & secondary
…On Thu, 2 May 2019 at 16:01, James Bonfield ***@***.***> wrote:
+1 for having a single alignment that simply wraps around.
Tools need to have coped with alignments off the end of sequences anyway
as some buggy aligners produced such data, so the reasonably worse scenario
here isn't buffer overflow but rejection of some data as illegal. Upgrading
analysis tools to do modulo arithmetic shouldn't be excessive, but I'd
expect take up of TP to be gradual so we have time. Maybe...
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#405 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AALRZ6UBS7NZMASRBJSCO6LPTKNWXANCNFSM4HJWXRPA>
.
|
There are no significant htslib/htsjdk changes necessary for sequential access to a single-record-spanning-beyond-LN representation. (I'm arbitrarily claiming any htsjdk validation changes would be insignificant.) OTOH for random access implementations would need to arrange for So that's probably the reason why @nh13 thinks using a supplementary alignment representation is a pragmatic approach — as it can be processed by existing versions of tools. It's indexed and iterated correctly because the aligner has already done this work and emitted an extra copy of the read at POS=1. However it seems to me that it's an awkward representation for tools wishing to analyse such mappings. The location of POS=1 is just an artifact, and these really are just 50M mappings like any others — like those at POS=8284 (halfway around the MT circle) for example. Doesn't the two separate alignment records spuriously look like the read is spanning a rearrangement? Tools actually analysing the data would need special cases to cope with this, and it's not obvious to me that that's better than making indexing/iterating capable of coping with the wrapping-around representation. Hence IMHO the spec should be agnostic on this representation question until some experience has been gained and the trade-offs can be measured. So the spec should for now allow either representation to be used. Are there (bacterial, mitochondrial) people working with such mappings who have experience of the trade-offs of these representations? What are they doing at the moment? Do they like it, or are they just shoehorning their data into what the existing tools like? @colinhercus: It would be helpful if you would explain why you support splitting alignments into primary & secondary. |
I now regret making that comment before the PR is merged. Any objection to taking the alignment representation to a different issue and getting this PR merged? |
We will discuss it at the meeting this afternoon and almost certainly merge the header field. However I would like the spec to be agnostic on the representation and at present the Recommended Practice section somewhat accidentally precludes one of the representations. |
I should have said primary & supplementary, I'd had a busy day :)
The reason being that it should work with existing variant callers &
pileups and if these were circular aware they could stitch the two
alignments back together. If primary and supplementary can be joined across
ends of a circular chromosome then it's not an SV. Pretty easy to detect!
This allows us to implement the changes in aligners and SAM while variant
callers play catchup. Don't forget there are a lot of them.
Allowing both forms would also be acceptable and then the aligner could
have an option to control which format it produces.
…On Thu, 2 May 2019 at 19:22, John Marshall ***@***.***> wrote:
There are no significant htslib/htsjdk changes necessary for *sequential*
access to a single-record-spanning-beyond-LN representation. (I'm
arbitrarily claiming any htsjdk validation changes would be insignificant.)
OTOH for *random access* implementations would need to arrange for
chrMT:1-10 to bring back alignment records with POS=16550, which would
require htslib/htsjdk changes in indexing (relatively easy) and iterators
(perhaps harder). Once these changes were made, pileup/etc would follow
fairly easily, but the iterator changes needed would not be insignificant.
So that's probably the reason why @nh13 <https://github.com/nh13> thinks
using a supplementary alignment representation is a pragmatic approach — as
it can be processed by existing versions of tools. It's indexed and
iterated correctly because the aligner has already done this work and
emitted an extra copy of the read at POS=1.
However it seems to me that it's an awkward representation for tools
wishing to analyse such mappings. The location of POS=1 is just an
artifact, and these really are just 50M mappings like any others — like
those at POS=8284 (halfway around the MT circle) for example. Doesn't the
two separate alignment records spuriously look like the read is spanning a
rearrangement? Tools actually analysing the data would need special cases
to cope with this, and it's not obvious to me that that's better than
making indexing/iterating capable of coping with the wrapping-around
representation.
Hence IMHO the spec should be agnostic on this representation question
until some experience has been gained and the trade-offs can be measured.
So the spec should for now allow either representation to be used.
Are there (bacterial, mitochondrial) people working with such mappings who
have experience of the trade-offs of these representations? What are they
doing at the moment? Do they like it, or are they just shoehorning their
data into what the existing tools like?
@colinhercus <https://github.com/colinhercus>: It would be helpful if you
would explain *why* you support splitting alignments into primary &
secondary.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#405 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AALRZ6USUYITAARCMJEJE7DPTLFJDANCNFSM4HJWXRPA>
.
|
This is to support annotating reference sequences as circular, e.g., for bacterial organisms or the human mitochondrial chromosome. [Summarise @nh13's footnote text so it fits on one line, so `@RG-SM` is not pushed off to the next page as an orphan. Remove now unneeded pagebreak hint.] Fixes samtools#403.
Agreed at the meeting to proceed in two stages. Okay I've reset the branch to include the commit (2a12b7e) I plan to push by itself to master once @mlin has done some stuff in the repository, if he's planning to. That'll leave this PR open, and I reckon we should reuse it for stage 2 (cross-join representation) as there's already some useful discussion here. |
One thing is troubling me: If two people have a two different versions of a reference sequence, one which is TP:circular and the other is not, they will have the same md5 which will mean that refGet will clash, and other sanity checks will fail. Should we redefine md5 for TP:circular in some way to avoid this? For example:
|
Hmm, I'd say not. What other than CRAM is using the md5 as a way of obtaining the reference? The sequence itself isn't changing anyway, just our interpretation of it. Unless we modified the CRAM spec to take into account the circular statement with regards to reference based compression, then it'll make no difference there anyway (which is a good reason for not doing that). Right now bases hanging off the end of the reference just get encoded verbatim. I don't know what other tools will make of this, but it seems to me as long as the sequence doesn't change (which it doesn't), then it's simply the meta-data in the header which could differ per file. I don't really see how using two different md5 calculation methods helps us here. It's more likely to break things as old code that doesn't honour TP:circular generates the wrong md5. |
Actually, that's not true for Picard. It swallows CRAM files made with Scramble that have CIGARs off the end of the reference and it converts back to SAM properly, however it cannot create such a file itself.
Somewhere in that verbal diarrhoea is the key bit: "Read CIGAR M operator maps off end of reference". So it looks like it is forbidden within htsjdk. How do you deal with all that legacy broken bwa output? Just flat out refuse to accept it? |
^ this is fantastic The legacy broken output is dealt with by Picard's MergeBamAlignment tool. I've also made a PR to htsjdk so that if the |
I think you are missing my point about obtaining a reference (from via refGet for example). For example, we now have in the "immutable" record a chrM sequence without |
But refget Similar information is returned by the I see your point that it would be nice if circularity was something that you could inquire from the oracle given just an MD5 id token, but IMHO the answer is one of (a) that's part of the metadata or (b) it's too late to change what goes into the MD5, too much incompatibility. To be sure, TP is new and you are proposing only to change the MD5 value for reference sequences that specify this new TP value. So the compatibility issue might be surmountable. Then it comes down to: how are you proposing that refget or similar should report the TP value to the client? |
Ah yeas..I forgot that we only get back the sequence and not the metadata as well. I rescind my objection. |
The Let's keep this PR open and continue to discuss the eventual stage 2 (cross-join representation) here. |
I was looking at this a few years back, see https://sourceforge.net/p/samtools/mailman/message/28372847/ etc. I was mapping to circular references (mitochondria). Given the tool limitations at the time, I ending up running my mappings using doubled-linear-references (giving mappings in the range 0 < POS <= 2*LENGTH), post processed to ensure 0 < POS <= LENGTH with a little modulo arithmetic (if a read chanced to map in the duplicated region, move it into the first half). i.e. I was in effect using a single primary alignment for origin wrapping reads (as @jkbonfield suggests above), with additional work needed for computing coverage near the origin (having to look at reads spilling off the right and end, and doing modulo arithmetic). This also had implications for the index bins too - but things seemed to work. For example, if you were doing SNP calling near the origin, you'd need to look in the region 0 < POS < WINDOW as now, and also off the right hand end in region LENGTH < POS < LENGTH+WINDOW. For visualisation, in increasing utility: With the new Note that with small plasmids or other circular structures, it is not out of the question that a modern long read could circle the origin more than once (i.e. sequence a PCR product longer than the circle's length). In this case again a single mapping entry would work, with 0 < POS < LENGTH and an appropriate long CIGAR string. |
@peterjc I like your suggestion of recommending having 0 < POS < LENGTH and an appropriate long CIGAR string, with the requirement that the I'd be happy to write a short tool in https://github.com/fulcrumgenomics/fgbio that would convert the former single alignment to the latter (primary+supplementaries) for folks to enable use with existing variant callers. |
@jmarshall I have tried to add a recommended practice for cross-joins. I tried to add language similar to the rest of the spec, but feel free to change as you see fit. Once we agree, I'll write a quick tool to split records that map across the origin of a circular reference sequence into primer+supplementaries. |
Change "should" to "may" when talking about the alignment to a circular reference as the sentence doesn't limit itself to describing only alignments that spans the end/start boundary. Other than that, approved. |
I agree with the should -> may (since most reads will not wrap the origin). We have the pain of SAM vs BAM counting, but should this be clearer what the exact bounds of POS are? i.e. 1, ..., N for mapped reads with SAM, equivalently 0, ..., N-1 for mapped reads with BAM counting. Otherwise the text looks good. |
@jmarshall @peterjc I changed two of the three "should"s to "may". I think the first should remain (I am guessing that's what you both meant). |
SAMv1.tex
Outdated
@@ -658,6 +659,19 @@ \section{Recommended Practice for the SAM Format} | |||
should be present. | |||
\item The {\tt NM} tag should be present. | |||
\end{enumerate} | |||
\item Alignment to circular reference sequences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a latex newline here?
SAMv1.tex
Outdated
This applies to circular reference sequences that are annotated with | ||
{\tt TP:circular} in the {\tt @SQ} header line. | ||
\begin{enumerate}[label=\arabic*] | ||
\item (preferred) {\sf POS} should be between 0 and the length specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
POS is described as 1-based in this document.
SAMv1.tex
Outdated
in {\tt LN} field of the {\tt @SQ} header line with an {\tt SN} equal to | ||
{\sf RNAME}. {\sf POS} plus the sum of lengths of {\tt M/=/X/D/N} | ||
operations in {\sf CIGAR} may exceed the length specified in the {\tt LN} | ||
field of the {\tt @SQ} header line (if exists) with an SN equal to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it must exist in this case, since we are saying that the @SN
line has a TP:circular
....so let's clean this up a little.
SAMv1.tex
Outdated
operations in {\sf CIGAR} may exceed the length specified in the {\tt LN} | ||
field of the {\tt @SQ} header line (if exists) with an SN equal to | ||
{\sf RNAME}. | ||
\item (alternative) The alignment may be split across multiple records, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we should explicitly say that one of the record should "start" at POS=1 and another should "end" at POS=LN. otherwise, we are not really saying much in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a CDS where the origin is in an intron? In that case a GenBank/EMBL style CDS record would be multi-p
Update: I thought I discarded this half written comment, such already multi-part records don't need to do anything new, so the comment above about POS=1 and LN stands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some clarifications required.
Yes. As discussed in previous meetings, I've already been fixing up all those issues. Good point about the split representation though. |
Non-trivially reworded, and in particular I've added details of how the two representations are intended to be interpreted. I think that's ready for merging (unless others think the |
|
||
\item | ||
Alternatively, such alignments may be split across several records: one record representing the initial portion of the segment ending at~{\tt LN}, one representing the final portion starting from~1, and any other records representing additional portions in between spanning the entire reference sequence. | ||
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small suggestion based on line 678
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set. | |
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} {\sf FLAG} bit (0x800) set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shrug… I was following the similar lowercase ‘flag bit’ phrasing in footnote 12 and in the Unmapped reads and Multiple mapping sections just above…
…part 2) Add recommended practice for records spanning the origin of circular reference sequences. [Added details of what the two representations *mean* -- JM] Co-authored-by: John Marshall <[email protected]>
Squashed and merged as discussed in the last meeting. Closing this PR this time as it's getting unwieldy and we're now more or less “feature complete” though further tweaks will probably be needed as experience is gained. |
This is to support annotating reference sequences as circular, for example for bacterial organisms or the human mitochondrial chromosome.
This is my attempt to keep the momentum for issue #403 going.
@jmarshall @yfarjoun