Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representation of cross-join mappings on circular reference sequences #405

Merged
merged 1 commit into from
Jan 6, 2020

Conversation

nh13
Copy link
Member

@nh13 nh13 commented May 1, 2019

This is to support annotating reference sequences as circular, for example for bacterial organisms or the human mitochondrial chromosome.

This is my attempt to keep the momentum for issue #403 going.

@jmarshall @yfarjoun

@hts-specs-bot
Copy link

Changed PDFs as of 0fb8f61: SAMv1 (diff).

@jmarshall
Copy link
Member

I've rebased my draft of the same thing on top of yours.

If your footnote stays (the original style in this table would be to have the bare minimum of information (cf SP Species 😄), but I guess details are helpful), it should note that the usual thing for linear chromosomes will be the status quo of just omitting TP.

SAMv1.tex Outdated
@@ -658,6 +658,7 @@ \section{Recommended Practice for the SAM Format}
should be present.
\item The {\tt NM} tag should be present.
\end{enumerate}
\item At present, this specification does not describe how mappings across the ``join'' in a circular reference sequence should be represented.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be controversial to just say "use supplementary alignments"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using supplementary alignments (so representing the linear alignment (§1.2) in two separate alignment records?!) is one way to represent it, but other ways exist and (ignoring downstream considerations for the moment) this one doesn't seem like the best way to represent this in SAM.

@jmarshall
Copy link
Member

As for the representation of mappings across the join, this draft just adds a note to §2 Recommended Practice punting on defining that.

Instead of that note, IMHO we should change ¶4.3 just above there (If POS plus the sum of lengths of M/=/X/D/N operations … exceeds LN …) so that it only applies to non-circular reference sequences. This would allow mappings stretching past LN on circular chromosomes to be considered to have wrapped around to position 1 and following.

It seems to me that this would be sufficient and a more natural way to represent these things in SAM than e.g. splitting the mapping into two separate alignment records using a supplementary record. This may need a little work in downstream tools, but wouldn't that be the case however such alignments were represented?

(We could just add with an SN equal to RNAME [and a TP not equal to circular] but really the whole sentence wants to be rewritten for clarity.)

@jmarshall jmarshall added the sam label May 1, 2019
@jmarshall
Copy link
Member

So choices for this PR are:

  • Merge the SQ-TP header now and leave the representation of cross-join mappings for later

    • either with or without an explicit note disclaiming a description of that representation
  • Or also now try to reach consensus on the recommended representation, which could be

    • using supplementary alignment records
    • using a single alignment record that spans beyond LN
    • something else

@nh13
Copy link
Member Author

nh13 commented May 1, 2019

@jmarshall I am all for merging as is now, with a slight preference to say you can just use supplementary for now, but with a long-term preference for a single alignment record (but how do we pileup efficiently? need to modify htslib/htsjdk to support off-the-end mappings?). Perhaps merge now and you or I can start a new issue about the representation of cross-join mappings?

@jkbonfield
Copy link
Contributor

+1 for having a single alignment that simply wraps around.

Tools need to have coped with alignments off the end of sequences anyway as some buggy aligners produced such data, so the reasonably worse scenario here isn't buffer overflow but rejection of some data as illegal. Upgrading analysis tools to do modulo arithmetic shouldn't be excessive, but I'd expect take up of TP to be gradual so we have time. Maybe...

@colinhercus
Copy link

colinhercus commented May 2, 2019 via email

@jmarshall
Copy link
Member

There are no significant htslib/htsjdk changes necessary for sequential access to a single-record-spanning-beyond-LN representation. (I'm arbitrarily claiming any htsjdk validation changes would be insignificant.)

OTOH for random access implementations would need to arrange for chrMT:1-10 to bring back alignment records with POS=16550, which would require htslib/htsjdk changes in indexing (relatively easy) and iterators (perhaps harder). Once these changes were made, pileup/etc would follow fairly easily, but the iterator changes needed would not be insignificant.

So that's probably the reason why @nh13 thinks using a supplementary alignment representation is a pragmatic approach — as it can be processed by existing versions of tools. It's indexed and iterated correctly because the aligner has already done this work and emitted an extra copy of the read at POS=1.

However it seems to me that it's an awkward representation for tools wishing to analyse such mappings. The location of POS=1 is just an artifact, and these really are just 50M mappings like any others — like those at POS=8284 (halfway around the MT circle) for example. Doesn't the two separate alignment records spuriously look like the read is spanning a rearrangement? Tools actually analysing the data would need special cases to cope with this, and it's not obvious to me that that's better than making indexing/iterating capable of coping with the wrapping-around representation.

Hence IMHO the spec should be agnostic on this representation question until some experience has been gained and the trade-offs can be measured. So the spec should for now allow either representation to be used.

Are there (bacterial, mitochondrial) people working with such mappings who have experience of the trade-offs of these representations? What are they doing at the moment? Do they like it, or are they just shoehorning their data into what the existing tools like?

@colinhercus: It would be helpful if you would explain why you support splitting alignments into primary & secondary.

@nh13
Copy link
Member Author

nh13 commented May 2, 2019

I now regret making that comment before the PR is merged. Any objection to taking the alignment representation to a different issue and getting this PR merged?

@jmarshall
Copy link
Member

We will discuss it at the meeting this afternoon and almost certainly merge the header field.

However I would like the spec to be agnostic on the representation and at present the Recommended Practice section somewhat accidentally precludes one of the representations.

@colinhercus
Copy link

colinhercus commented May 3, 2019 via email

@jmarshall jmarshall self-assigned this May 3, 2019
jmarshall pushed a commit to nh13/hts-specs that referenced this pull request May 3, 2019
This is to support annotating reference sequences as circular,
e.g., for bacterial organisms or the human mitochondrial chromosome.
[Summarise @nh13's footnote text so it fits on one line, so `@RG-SM`
is not pushed off to the next page as an orphan. Remove now unneeded
pagebreak hint.] Fixes samtools#403.
@jmarshall
Copy link
Member

Agreed at the meeting to proceed in two stages.

Okay I've reset the branch to include the commit (2a12b7e) I plan to push by itself to master once @mlin has done some stuff in the repository, if he's planning to. That'll leave this PR open, and I reckon we should reuse it for stage 2 (cross-join representation) as there's already some useful discussion here.

@yfarjoun
Copy link
Contributor

yfarjoun commented May 6, 2019

One thing is troubling me: If two people have a two different versions of a reference sequence, one which is TP:circular and the other is not, they will have the same md5 which will mean that refGet will clash, and other sanity checks will fail. Should we redefine md5 for TP:circular in some way to avoid this?

For example:

For the purposes of calculating md5, a TP:circular reference will be appended a '$' sign at the end. Example:
>chr1
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of 4ac2dee8e6ac06422295df0320981c70
but 
>chr1 TP:circular 
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of de8fd4b0606b285e11251b538b27ce76

@jkbonfield
Copy link
Contributor

jkbonfield commented May 7, 2019

Hmm, I'd say not. What other than CRAM is using the md5 as a way of obtaining the reference? The sequence itself isn't changing anyway, just our interpretation of it.

Unless we modified the CRAM spec to take into account the circular statement with regards to reference based compression, then it'll make no difference there anyway (which is a good reason for not doing that). Right now bases hanging off the end of the reference just get encoded verbatim.

I don't know what other tools will make of this, but it seems to me as long as the sequence doesn't change (which it doesn't), then it's simply the meta-data in the header which could differ per file. I don't really see how using two different md5 calculation methods helps us here. It's more likely to break things as old code that doesn't honour TP:circular generates the wrong md5.

@jkbonfield
Copy link
Contributor

Actually, that's not true for Picard. It swallows CRAM files made with Scramble that have CIGARs off the end of the reference and it converts back to SAM properly, however it cannot create such a file itself.

@HD	VN:1.4	SO:coordinate
@SQ	SN:MT	LN:16569	UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz	AS:NCBI37	M5:c68f52674c9fb33aef52dcf399755519	SP:Human
@RG	ID:1#49	PL:ILLUMINA	PU:130508_HS25_09827_B_C1PJLACXX_2#49	LB:7098543	DS:ERP002385: 4X coverage for the Mandinka from Gambia Western Division (GWD). This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/ 	DT:2013-05-08T00:00:00+0100	SM:ERS220911	CN:SC
HS25_09827:2:2216:2067:86061#49	147	MT	16499	29	2S98M	=	16105	-464	CTATCTGGTTCCTCCTTCAGGGTCATACATCCTAAATAGCACACACGTTCTCTTCAAATAAGCCACCACGATCGATAACAGCTATTTCACCCTCTTAACA	(4:E5C:+.EF'>;*;FGE<5(=25F()(*48&+;(?G6=-&6;-):'-G,D'9',C(-1,1$:,$-'3*6A'%+8++C%D05+-,-8<H,/,(7B%'*'	MD:Z:11A13A1G10C9C1C1T7A2T6	RG:Z:1#49	XG:i:0	AM:i:29	NM:i:9	SM:i:29	XM:i:9	XO:i:0	XT:A:M
/usr/bin/java -jar ~/work/picard-2.20.0.jar SamFormatConverter I=MT.sam O=MT2.cram R=$HREF
INFO	2019-05-07 09:53:37	SamFormatConverter	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    SamFormatConverter -I MT.sam -O MT2.cram -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa
**********


09:53:38.054 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/users/nfs_j/jkb/work/picard-2.20.0.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue May 07 09:53:38 BST 2019] SamFormatConverter INPUT=MT.sam OUTPUT=MT2.cram REFERENCE_SEQUENCE=/nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue May 07 09:53:38 BST 2019] Executing as jkb@deskpro107386 on Linux 4.15.0-47-generic amd64; OpenJDK 64-Bit Server VM 11.0.2+9-Ubuntu-3ubuntu118.04.3; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.0-SNAPSHOT
[Tue May 07 09:53:38 BST 2019] picard.sam.SamFormatConverter done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=132120576
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name HS25_09827:2:2216:2067:86061#49, Read CIGAR M operator maps off end of reference
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:455)
	at htsjdk.samtools.SAMRecord.getCigar(SAMRecord.java:820)
	at htsjdk.samtools.SAMRecord.getCigarLength(SAMRecord.java:831)
	at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2002)
	at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:1883)
	at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:352)
	at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268)
	at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255)
	at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:574)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:553)
	at picard.sam.SamFormatConverter.convert(SamFormatConverter.java:93)
	at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:71)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Somewhere in that verbal diarrhoea is the key bit: "Read CIGAR M operator maps off end of reference". So it looks like it is forbidden within htsjdk. How do you deal with all that legacy broken bwa output? Just flat out refuse to accept it?

@nh13
Copy link
Member Author

nh13 commented May 7, 2019

@jkbonfield

Somewhere in that verbal diarrhoea is the key bit

^ this is fantastic

The legacy broken output is dealt with by Picard's MergeBamAlignment tool. I've also made a PR to htsjdk so that if the TP tag is present for a reference sequence the "off the end" error isn't thrown:
samtools/htsjdk#1363

jmarshall pushed a commit that referenced this pull request May 7, 2019
This is to support annotating reference sequences as circular,
e.g., for bacterial organisms or the human mitochondrial chromosome.
[Summarise @nh13's footnote text so it fits on one line, so `@RG-SM`
is not pushed off to the next page as an orphan. Remove now unneeded
pagebreak hint.] Fixes #403.
@yfarjoun
Copy link
Contributor

yfarjoun commented May 7, 2019

I think you are missing my point about obtaining a reference (from via refGet for example). For example, we now have in the "immutable" record a chrM sequence without TP:circular. However, it would be nice to also include a chrM with TP:circular. But those two sequences will have the same md5 and thus one would not be able to distinguish between them. I don't see how this is resolved without changing the definition of md5 in some way.

@jmarshall
Copy link
Member

But refget /sequence/<id> (by design) returns just the sequence text, which is 16569 characters that are identical regardless of what TP value some non-attached header might contain.

Similar information is returned by the /sequence/<id>/metadata endpoint, and that could be augmented to have a topology item in its response (there would be a conversation to be had about whether that should be nested under aliases or not), but that is not “immutable” as different servers can and will have different amounts of metadata.

I see your point that it would be nice if circularity was something that you could inquire from the oracle given just an MD5 id token, but IMHO the answer is one of (a) that's part of the metadata or (b) it's too late to change what goes into the MD5, too much incompatibility.

To be sure, TP is new and you are proposing only to change the MD5 value for reference sequences that specify this new TP value. So the compatibility issue might be surmountable. Then it comes down to: how are you proposing that refget or similar should report the TP value to the client?

@yfarjoun
Copy link
Contributor

yfarjoun commented May 7, 2019

Ah yeas..I forgot that we only get back the sequence and not the metadata as well. I rescind my objection.

jmarshall pushed a commit that referenced this pull request May 14, 2019
This is to support annotating reference sequences as circular,
e.g., for bacterial organisms or the human mitochondrial chromosome.
[Summarise @nh13's footnote text so it fits on one line, so `@RG-SM`
is not pushed off to the next page as an orphan. Remove now unneeded
pagebreak hint.] Fixes #403.
@hts-specs-bot
Copy link

Changed PDFs as of 690e758: SAMv1 (diff).

@jmarshall jmarshall changed the title Add the TP tag to the SQ in the SAM header Representation of cross-join mappings on circular reference sequences May 14, 2019
@jmarshall
Copy link
Member

The @SQ-TP header has now been merged to master and the published PDF has been updated.

Let's keep this PR open and continue to discuss the eventual stage 2 (cross-join representation) here.

@peterjc
Copy link
Contributor

peterjc commented May 14, 2019

I was looking at this a few years back, see https://sourceforge.net/p/samtools/mailman/message/28372847/ etc.

I was mapping to circular references (mitochondria). Given the tool limitations at the time, I ending up running my mappings using doubled-linear-references (giving mappings in the range 0 < POS <= 2*LENGTH), post processed to ensure 0 < POS <= LENGTH with a little modulo arithmetic (if a read chanced to map in the duplicated region, move it into the first half).

i.e. I was in effect using a single primary alignment for origin wrapping reads (as @jkbonfield suggests above), with additional work needed for computing coverage near the origin (having to look at reads spilling off the right and end, and doing modulo arithmetic).

This also had implications for the index bins too - but things seemed to work. For example, if you were doing SNP calling near the origin, you'd need to look in the region 0 < POS < WINDOW as now, and also off the right hand end in region LENGTH < POS < LENGTH+WINDOW.

For visualisation, in increasing utility:
(1) Crop reads spilling over the right hand edge
(2) Draw reads which spill off the right hand edge.
(3) Split reads spilling over the right hand edge, so their tail appears from the left edge.

With the new @SQ-TP header, I think this could all be made rigorous.

Note that with small plasmids or other circular structures, it is not out of the question that a modern long read could circle the origin more than once (i.e. sequence a PCR product longer than the circle's length). In this case again a single mapping entry would work, with 0 < POS < LENGTH and an appropriate long CIGAR string.

@jmarshall jmarshall removed their assignment May 29, 2019
@nh13
Copy link
Member Author

nh13 commented Jul 24, 2019

@peterjc I like your suggestion of recommending having 0 < POS < LENGTH and an appropriate long CIGAR string, with the requirement that the SQ entry in the header must have TP with value in circular. We can then say an alternative representation is to have primary with supplementary alignments (as @colinhercus suggested).

I'd be happy to write a short tool in https://github.com/fulcrumgenomics/fgbio that would convert the former single alignment to the latter (primary+supplementaries) for folks to enable use with existing variant callers.

@hts-specs-bot
Copy link

Changed PDFs as of acf3209: SAMv1 (diff).

@nh13
Copy link
Member Author

nh13 commented Aug 19, 2019

@jmarshall I have tried to add a recommended practice for cross-joins. I tried to add language similar to the rest of the spec, but feel free to change as you see fit. Once we agree, I'll write a quick tool to split records that map across the origin of a circular reference sequence into primer+supplementaries.

@jkbonfield
Copy link
Contributor

Change "should" to "may" when talking about the alignment to a circular reference as the sentence doesn't limit itself to describing only alignments that spans the end/start boundary.

Other than that, approved.

@peterjc
Copy link
Contributor

peterjc commented Aug 20, 2019

I agree with the should -> may (since most reads will not wrap the origin).

We have the pain of SAM vs BAM counting, but should this be clearer what the exact bounds of POS are? i.e. 1, ..., N for mapped reads with SAM, equivalently 0, ..., N-1 for mapped reads with BAM counting.

Otherwise the text looks good.

@nh13
Copy link
Member Author

nh13 commented Aug 21, 2019

@jmarshall @peterjc I changed two of the three "should"s to "may". I think the first should remain (I am guessing that's what you both meant).

@hts-specs-bot
Copy link

Changed PDFs as of 5de445b: SAMv1 (diff).

SAMv1.tex Outdated
@@ -658,6 +659,19 @@ \section{Recommended Practice for the SAM Format}
should be present.
\item The {\tt NM} tag should be present.
\end{enumerate}
\item Alignment to circular reference sequences
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a latex newline here?

SAMv1.tex Outdated
This applies to circular reference sequences that are annotated with
{\tt TP:circular} in the {\tt @SQ} header line.
\begin{enumerate}[label=\arabic*]
\item (preferred) {\sf POS} should be between 0 and the length specified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

POS is described as 1-based in this document.

SAMv1.tex Outdated
in {\tt LN} field of the {\tt @SQ} header line with an {\tt SN} equal to
{\sf RNAME}. {\sf POS} plus the sum of lengths of {\tt M/=/X/D/N}
operations in {\sf CIGAR} may exceed the length specified in the {\tt LN}
field of the {\tt @SQ} header line (if exists) with an SN equal to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it must exist in this case, since we are saying that the @SN line has a TP:circular....so let's clean this up a little.

SAMv1.tex Outdated
operations in {\sf CIGAR} may exceed the length specified in the {\tt LN}
field of the {\tt @SQ} header line (if exists) with an SN equal to
{\sf RNAME}.
\item (alternative) The alignment may be split across multiple records,
Copy link
Contributor

@yfarjoun yfarjoun Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we should explicitly say that one of the record should "start" at POS=1 and another should "end" at POS=LN. otherwise, we are not really saying much in this case.

Copy link
Contributor

@peterjc peterjc Nov 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about a CDS where the origin is in an intron? In that case a GenBank/EMBL style CDS record would be multi-p

Update: I thought I discarded this half written comment, such already multi-part records don't need to do anything new, so the comment above about POS=1 and LN stands.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some clarifications required.

@jmarshall jmarshall self-assigned this Nov 14, 2019
@jmarshall
Copy link
Member

Yes. As discussed in previous meetings, I've already been fixing up all those issues. Good point about the split representation though.

@hts-specs-bot
Copy link

Changed PDFs as of b8d7f0a: SAMv1 (diff).

@jmarshall
Copy link
Member

Non-trivially reworded, and in particular I've added details of how the two representations are intended to be interpreted.

I think that's ready for merging (unless others think the ((p-1) mod LN)+1 malarkey is too much…).


\item
Alternatively, such alignments may be split across several records: one record representing the initial portion of the segment ending at~{\tt LN}, one representing the final portion starting from~1, and any other records representing additional portions in between spanning the entire reference sequence.
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small suggestion based on line 678

Suggested change
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set.
One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} {\sf FLAG} bit (0x800) set.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shrug… I was following the similar lowercase ‘flag bit’ phrasing in footnote 12 and in the Unmapped reads and Multiple mapping sections just above…

…part 2)

Add recommended practice for records spanning the origin of circular
reference sequences.

[Added details of what the two representations *mean* -- JM]

Co-authored-by: John Marshall <[email protected]>
@hts-specs-bot
Copy link

Changed PDFs as of e5f1cf4: SAMv1 (diff).

@jmarshall jmarshall merged commit e5f1cf4 into samtools:master Jan 6, 2020
@github-pages github-pages bot temporarily deployed to github-pages January 6, 2020 10:09 Inactive
@jmarshall
Copy link
Member

Squashed and merged as discussed in the last meeting. Closing this PR this time as it's getting unwieldy and we're now more or less “feature complete” though further tweaks will probably be needed as experience is gained.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants