Representation of cross-join mappings on circular reference sequences #405

nh13 · 2019-05-01T19:41:19Z

This is to support annotating reference sequences as circular, for example for bacterial organisms or the human mitochondrial chromosome.

This is my attempt to keep the momentum for issue #403 going.

@jmarshall @yfarjoun

hts-specs-bot · 2019-05-01T22:55:29Z

Changed PDFs as of 0fb8f61: SAMv1 (diff).

jmarshall · 2019-05-01T23:01:07Z

I've rebased my draft of the same thing on top of yours.

If your footnote stays (the original style in this table would be to have the bare minimum of information (cf SP Species 😄), but I guess details are helpful), it should note that the usual thing for linear chromosomes will be the status quo of just omitting TP.

nh13 · 2019-05-01T23:13:50Z

SAMv1.tex

@@ -658,6 +658,7 @@ \section{Recommended Practice for the SAM Format}
    should be present.
  \item The {\tt NM} tag should be present.
  \end{enumerate}
+\item At present, this specification does not describe how mappings across the ``join'' in a circular reference sequence should be represented.


Would it be controversial to just say "use supplementary alignments"?

Using supplementary alignments (so representing the linear alignment (§1.2) in two separate alignment records?!) is one way to represent it, but other ways exist and (ignoring downstream considerations for the moment) this one doesn't seem like the best way to represent this in SAM.

jmarshall · 2019-05-01T23:19:58Z

As for the representation of mappings across the join, this draft just adds a note to §2 Recommended Practice punting on defining that.

Instead of that note, IMHO we should change ¶4.3 just above there (If POS plus the sum of lengths of M/=/X/D/N operations … exceeds LN …) so that it only applies to non-circular reference sequences. This would allow mappings stretching past LN on circular chromosomes to be considered to have wrapped around to position 1 and following.

It seems to me that this would be sufficient and a more natural way to represent these things in SAM than e.g. splitting the mapping into two separate alignment records using a supplementary record. This may need a little work in downstream tools, but wouldn't that be the case however such alignments were represented?

(We could just add with an SN equal to RNAME [and a TP not equal to circular] but really the whole sentence wants to be rewritten for clarity.)

jmarshall · 2019-05-01T23:31:29Z

So choices for this PR are:

Merge the SQ-TP header now and leave the representation of cross-join mappings for later
- either with or without an explicit note disclaiming a description of that representation
Or also now try to reach consensus on the recommended representation, which could be
- using supplementary alignment records
- using a single alignment record that spans beyond LN
- something else

nh13 · 2019-05-01T23:41:13Z

@jmarshall I am all for merging as is now, with a slight preference to say you can just use supplementary for now, but with a long-term preference for a single alignment record (but how do we pileup efficiently? need to modify htslib/htsjdk to support off-the-end mappings?). Perhaps merge now and you or I can start a new issue about the representation of cross-join mappings?

jkbonfield · 2019-05-02T08:01:45Z

+1 for having a single alignment that simply wraps around.

Tools need to have coped with alignments off the end of sequences anyway as some buggy aligners produced such data, so the reasonably worse scenario here isn't buffer overflow but rejection of some data as illegal. Upgrading analysis tools to do modulo arithmetic shouldn't be excessive, but I'd expect take up of TP to be gradual so we have time. Maybe...

colinhercus · 2019-05-02T10:27:44Z

+1 for splitting alignment into primary & secondary

…

On Thu, 2 May 2019 at 16:01, James Bonfield ***@***.***> wrote: +1 for having a single alignment that simply wraps around. Tools need to have coped with alignments off the end of sequences anyway as some buggy aligners produced such data, so the reasonably worse scenario here isn't buffer overflow but rejection of some data as illegal. Upgrading analysis tools to do modulo arithmetic shouldn't be excessive, but I'd expect take up of TP to be gradual so we have time. Maybe... — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#405 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AALRZ6UBS7NZMASRBJSCO6LPTKNWXANCNFSM4HJWXRPA> .

jmarshall · 2019-05-02T11:22:55Z

There are no significant htslib/htsjdk changes necessary for sequential access to a single-record-spanning-beyond-LN representation. (I'm arbitrarily claiming any htsjdk validation changes would be insignificant.)

OTOH for random access implementations would need to arrange for chrMT:1-10 to bring back alignment records with POS=16550, which would require htslib/htsjdk changes in indexing (relatively easy) and iterators (perhaps harder). Once these changes were made, pileup/etc would follow fairly easily, but the iterator changes needed would not be insignificant.

So that's probably the reason why @nh13 thinks using a supplementary alignment representation is a pragmatic approach — as it can be processed by existing versions of tools. It's indexed and iterated correctly because the aligner has already done this work and emitted an extra copy of the read at POS=1.

However it seems to me that it's an awkward representation for tools wishing to analyse such mappings. The location of POS=1 is just an artifact, and these really are just 50M mappings like any others — like those at POS=8284 (halfway around the MT circle) for example. Doesn't the two separate alignment records spuriously look like the read is spanning a rearrangement? Tools actually analysing the data would need special cases to cope with this, and it's not obvious to me that that's better than making indexing/iterating capable of coping with the wrapping-around representation.

Hence IMHO the spec should be agnostic on this representation question until some experience has been gained and the trade-offs can be measured. So the spec should for now allow either representation to be used.

Are there (bacterial, mitochondrial) people working with such mappings who have experience of the trade-offs of these representations? What are they doing at the moment? Do they like it, or are they just shoehorning their data into what the existing tools like?

@colinhercus: It would be helpful if you would explain why you support splitting alignments into primary & secondary.

nh13 · 2019-05-02T13:15:53Z

I now regret making that comment before the PR is merged. Any objection to taking the alignment representation to a different issue and getting this PR merged?

jmarshall · 2019-05-02T13:19:22Z

We will discuss it at the meeting this afternoon and almost certainly merge the header field.

However I would like the spec to be agnostic on the representation and at present the Recommended Practice section somewhat accidentally precludes one of the representations.

colinhercus · 2019-05-03T01:10:39Z

I should have said primary & supplementary, I'd had a busy day :) The reason being that it should work with existing variant callers & pileups and if these were circular aware they could stitch the two alignments back together. If primary and supplementary can be joined across ends of a circular chromosome then it's not an SV. Pretty easy to detect! This allows us to implement the changes in aligners and SAM while variant callers play catchup. Don't forget there are a lot of them. Allowing both forms would also be acceptable and then the aligner could have an option to control which format it produces.

…

On Thu, 2 May 2019 at 19:22, John Marshall ***@***.***> wrote: There are no significant htslib/htsjdk changes necessary for *sequential* access to a single-record-spanning-beyond-LN representation. (I'm arbitrarily claiming any htsjdk validation changes would be insignificant.) OTOH for *random access* implementations would need to arrange for chrMT:1-10 to bring back alignment records with POS=16550, which would require htslib/htsjdk changes in indexing (relatively easy) and iterators (perhaps harder). Once these changes were made, pileup/etc would follow fairly easily, but the iterator changes needed would not be insignificant. So that's probably the reason why @nh13 <https://github.com/nh13> thinks using a supplementary alignment representation is a pragmatic approach — as it can be processed by existing versions of tools. It's indexed and iterated correctly because the aligner has already done this work and emitted an extra copy of the read at POS=1. However it seems to me that it's an awkward representation for tools wishing to analyse such mappings. The location of POS=1 is just an artifact, and these really are just 50M mappings like any others — like those at POS=8284 (halfway around the MT circle) for example. Doesn't the two separate alignment records spuriously look like the read is spanning a rearrangement? Tools actually analysing the data would need special cases to cope with this, and it's not obvious to me that that's better than making indexing/iterating capable of coping with the wrapping-around representation. Hence IMHO the spec should be agnostic on this representation question until some experience has been gained and the trade-offs can be measured. So the spec should for now allow either representation to be used. Are there (bacterial, mitochondrial) people working with such mappings who have experience of the trade-offs of these representations? What are they doing at the moment? Do they like it, or are they just shoehorning their data into what the existing tools like? @colinhercus <https://github.com/colinhercus>: It would be helpful if you would explain *why* you support splitting alignments into primary & secondary. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#405 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AALRZ6USUYITAARCMJEJE7DPTLFJDANCNFSM4HJWXRPA> .

@nh13

This is to support annotating reference sequences as circular, e.g., for bacterial organisms or the human mitochondrial chromosome. [Summarise @nh13's footnote text so it fits on one line, so `@RG-SM` is not pushed off to the next page as an orphan. Remove now unneeded pagebreak hint.] Fixes samtools#403.

jmarshall · 2019-05-03T15:04:35Z

Agreed at the meeting to proceed in two stages.

Okay I've reset the branch to include the commit (2a12b7e) I plan to push by itself to master once @mlin has done some stuff in the repository, if he's planning to. That'll leave this PR open, and I reckon we should reuse it for stage 2 (cross-join representation) as there's already some useful discussion here.

yfarjoun · 2019-05-06T09:11:42Z

One thing is troubling me: If two people have a two different versions of a reference sequence, one which is TP:circular and the other is not, they will have the same md5 which will mean that refGet will clash, and other sanity checks will fail. Should we redefine md5 for TP:circular in some way to avoid this?

For example:

For the purposes of calculating md5, a TP:circular reference will be appended a '$' sign at the end. Example:
>chr1
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of 4ac2dee8e6ac06422295df0320981c70
but 
>chr1 TP:circular 
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of de8fd4b0606b285e11251b538b27ce76

jkbonfield · 2019-05-07T08:42:55Z

Hmm, I'd say not. What other than CRAM is using the md5 as a way of obtaining the reference? The sequence itself isn't changing anyway, just our interpretation of it.

Unless we modified the CRAM spec to take into account the circular statement with regards to reference based compression, then it'll make no difference there anyway (which is a good reason for not doing that). Right now bases hanging off the end of the reference just get encoded verbatim.

I don't know what other tools will make of this, but it seems to me as long as the sequence doesn't change (which it doesn't), then it's simply the meta-data in the header which could differ per file. I don't really see how using two different md5 calculation methods helps us here. It's more likely to break things as old code that doesn't honour TP:circular generates the wrong md5.

jkbonfield · 2019-05-07T08:55:13Z

Actually, that's not true for Picard. It swallows CRAM files made with Scramble that have CIGARs off the end of the reference and it converts back to SAM properly, however it cannot create such a file itself.

@HD	VN:1.4	SO:coordinate
@SQ	SN:MT	LN:16569	UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz	AS:NCBI37	M5:c68f52674c9fb33aef52dcf399755519	SP:Human
@RG	ID:1#49	PL:ILLUMINA	PU:130508_HS25_09827_B_C1PJLACXX_2#49	LB:7098543	DS:ERP002385: 4X coverage for the Mandinka from Gambia Western Division (GWD). This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/ 	DT:2013-05-08T00:00:00+0100	SM:ERS220911	CN:SC
HS25_09827:2:2216:2067:86061#49	147	MT	16499	29	2S98M	=	16105	-464	CTATCTGGTTCCTCCTTCAGGGTCATACATCCTAAATAGCACACACGTTCTCTTCAAATAAGCCACCACGATCGATAACAGCTATTTCACCCTCTTAACA	(4:E5C:+.EF'>;*;FGE<5(=25F()(*48&+;(?G6=-&6;-):'-G,D'9',C(-1,1$:,$-'3*6A'%+8++C%D05+-,-8<H,/,(7B%'*'	MD:Z:11A13A1G10C9C1C1T7A2T6	RG:Z:1#49	XG:i:0	AM:i:29	NM:i:9	SM:i:29	XM:i:9	XO:i:0	XT:A:M

/usr/bin/java -jar ~/work/picard-2.20.0.jar SamFormatConverter I=MT.sam O=MT2.cram R=$HREF
INFO	2019-05-07 09:53:37	SamFormatConverter	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    SamFormatConverter -I MT.sam -O MT2.cram -R /nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa
**********


09:53:38.054 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/users/nfs_j/jkb/work/picard-2.20.0.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue May 07 09:53:38 BST 2019] SamFormatConverter INPUT=MT.sam OUTPUT=MT2.cram REFERENCE_SEQUENCE=/nfs/srpipe_references/references/Human/1000Genomes_hs37d5/all/fasta/hs37d5.fa    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue May 07 09:53:38 BST 2019] Executing as jkb@deskpro107386 on Linux 4.15.0-47-generic amd64; OpenJDK 64-Bit Server VM 11.0.2+9-Ubuntu-3ubuntu118.04.3; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.0-SNAPSHOT
[Tue May 07 09:53:38 BST 2019] picard.sam.SamFormatConverter done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=132120576
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Read name HS25_09827:2:2216:2067:86061#49, Read CIGAR M operator maps off end of reference
	at htsjdk.samtools.SAMUtils.processValidationErrors(SAMUtils.java:455)
	at htsjdk.samtools.SAMRecord.getCigar(SAMRecord.java:820)
	at htsjdk.samtools.SAMRecord.getCigarLength(SAMRecord.java:831)
	at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:2002)
	at htsjdk.samtools.SAMRecord.isValid(SAMRecord.java:1883)
	at htsjdk.samtools.SAMLineParser.parseLine(SAMLineParser.java:352)
	at htsjdk.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:268)
	at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:255)
	at htsjdk.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:228)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:574)
	at htsjdk.samtools.SamReader$AssertingIterator.next(SamReader.java:553)
	at picard.sam.SamFormatConverter.convert(SamFormatConverter.java:93)
	at picard.sam.SamFormatConverter.doWork(SamFormatConverter.java:71)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Somewhere in that verbal diarrhoea is the key bit: "Read CIGAR M operator maps off end of reference". So it looks like it is forbidden within htsjdk. How do you deal with all that legacy broken bwa output? Just flat out refuse to accept it?

nh13 · 2019-05-07T09:26:24Z

@jkbonfield

Somewhere in that verbal diarrhoea is the key bit

^ this is fantastic

The legacy broken output is dealt with by Picard's MergeBamAlignment tool. I've also made a PR to htsjdk so that if the TP tag is present for a reference sequence the "off the end" error isn't thrown:
samtools/htsjdk#1363

@nh13

This is to support annotating reference sequences as circular, e.g., for bacterial organisms or the human mitochondrial chromosome. [Summarise @nh13's footnote text so it fits on one line, so `@RG-SM` is not pushed off to the next page as an orphan. Remove now unneeded pagebreak hint.] Fixes #403.

yfarjoun · 2019-05-07T12:57:13Z

I think you are missing my point about obtaining a reference (from via refGet for example). For example, we now have in the "immutable" record a chrM sequence without TP:circular. However, it would be nice to also include a chrM with TP:circular. But those two sequences will have the same md5 and thus one would not be able to distinguish between them. I don't see how this is resolved without changing the definition of md5 in some way.

jmarshall · 2019-05-07T13:14:47Z

But refget /sequence/<id> (by design) returns just the sequence text, which is 16569 characters that are identical regardless of what TP value some non-attached header might contain.

Similar information is returned by the /sequence/<id>/metadata endpoint, and that could be augmented to have a topology item in its response (there would be a conversation to be had about whether that should be nested under aliases or not), but that is not “immutable” as different servers can and will have different amounts of metadata.

I see your point that it would be nice if circularity was something that you could inquire from the oracle given just an MD5 id token, but IMHO the answer is one of (a) that's part of the metadata or (b) it's too late to change what goes into the MD5, too much incompatibility.

To be sure, TP is new and you are proposing only to change the MD5 value for reference sequences that specify this new TP value. So the compatibility issue might be surmountable. Then it comes down to: how are you proposing that refget or similar should report the TP value to the client?

yfarjoun · 2019-05-07T13:28:35Z

Ah yeas..I forgot that we only get back the sequence and not the metadata as well. I rescind my objection.

@nh13

This is to support annotating reference sequences as circular, e.g., for bacterial organisms or the human mitochondrial chromosome. [Summarise @nh13's footnote text so it fits on one line, so `@RG-SM` is not pushed off to the next page as an orphan. Remove now unneeded pagebreak hint.] Fixes #403.

hts-specs-bot · 2019-05-14T09:38:30Z

Changed PDFs as of 690e758: SAMv1 (diff).

jmarshall · 2019-05-14T09:49:51Z

The @SQ-TP header has now been merged to master and the published PDF has been updated.

Let's keep this PR open and continue to discuss the eventual stage 2 (cross-join representation) here.

peterjc · 2019-05-14T10:18:16Z

I was looking at this a few years back, see https://sourceforge.net/p/samtools/mailman/message/28372847/ etc.

I was mapping to circular references (mitochondria). Given the tool limitations at the time, I ending up running my mappings using doubled-linear-references (giving mappings in the range 0 < POS <= 2*LENGTH), post processed to ensure 0 < POS <= LENGTH with a little modulo arithmetic (if a read chanced to map in the duplicated region, move it into the first half).

i.e. I was in effect using a single primary alignment for origin wrapping reads (as @jkbonfield suggests above), with additional work needed for computing coverage near the origin (having to look at reads spilling off the right and end, and doing modulo arithmetic).

This also had implications for the index bins too - but things seemed to work. For example, if you were doing SNP calling near the origin, you'd need to look in the region 0 < POS < WINDOW as now, and also off the right hand end in region LENGTH < POS < LENGTH+WINDOW.

For visualisation, in increasing utility:
(1) Crop reads spilling over the right hand edge
(2) Draw reads which spill off the right hand edge.
(3) Split reads spilling over the right hand edge, so their tail appears from the left edge.

With the new @SQ-TP header, I think this could all be made rigorous.

Note that with small plasmids or other circular structures, it is not out of the question that a modern long read could circle the origin more than once (i.e. sequence a PCR product longer than the circle's length). In this case again a single mapping entry would work, with 0 < POS < LENGTH and an appropriate long CIGAR string.

nh13 · 2019-07-24T17:31:36Z

@peterjc I like your suggestion of recommending having 0 < POS < LENGTH and an appropriate long CIGAR string, with the requirement that the SQ entry in the header must have TP with value in circular. We can then say an alternative representation is to have primary with supplementary alignments (as @colinhercus suggested).

I'd be happy to write a short tool in https://github.com/fulcrumgenomics/fgbio that would convert the former single alignment to the latter (primary+supplementaries) for folks to enable use with existing variant callers.

hts-specs-bot · 2019-08-19T16:41:21Z

Changed PDFs as of acf3209: SAMv1 (diff).

nh13 · 2019-08-19T16:41:23Z

@jmarshall I have tried to add a recommended practice for cross-joins. I tried to add language similar to the rest of the spec, but feel free to change as you see fit. Once we agree, I'll write a quick tool to split records that map across the origin of a circular reference sequence into primer+supplementaries.

jkbonfield · 2019-08-20T08:25:54Z

Change "should" to "may" when talking about the alignment to a circular reference as the sentence doesn't limit itself to describing only alignments that spans the end/start boundary.

Other than that, approved.

peterjc · 2019-08-20T09:08:11Z

I agree with the should -> may (since most reads will not wrap the origin).

We have the pain of SAM vs BAM counting, but should this be clearer what the exact bounds of POS are? i.e. 1, ..., N for mapped reads with SAM, equivalently 0, ..., N-1 for mapped reads with BAM counting.

Otherwise the text looks good.

nh13 · 2019-08-21T16:54:44Z

@jmarshall @peterjc I changed two of the three "should"s to "may". I think the first should remain (I am guessing that's what you both meant).

hts-specs-bot · 2019-08-21T16:55:40Z

Changed PDFs as of 5de445b: SAMv1 (diff).

yfarjoun · 2019-11-14T12:40:08Z

SAMv1.tex

@@ -658,6 +659,19 @@ \section{Recommended Practice for the SAM Format}
    should be present.
  \item The {\tt NM} tag should be present.
  \end{enumerate}
+\item Alignment to circular reference sequences


need a latex newline here?

yfarjoun · 2019-11-14T12:41:21Z

SAMv1.tex

+  This applies to circular reference sequences that are annotated with
+  {\tt TP:circular} in the {\tt @SQ} header line.
+  \begin{enumerate}[label=\arabic*]
+  \item (preferred) {\sf POS} should be between 0 and the length specified 


POS is described as 1-based in this document.

yfarjoun · 2019-11-14T12:45:36Z

SAMv1.tex

+    in {\tt LN} field of the {\tt @SQ} header line with an {\tt SN} equal to
+    {\sf RNAME}. {\sf POS} plus the sum of lengths of {\tt M/=/X/D/N} 
+    operations in {\sf CIGAR} may exceed the length specified in the {\tt LN}
+    field of the {\tt @SQ} header line (if exists) with an SN equal to 


it must exist in this case, since we are saying that the @SN line has a TP:circular....so let's clean this up a little.

yfarjoun · 2019-11-14T12:47:07Z

SAMv1.tex

+    operations in {\sf CIGAR} may exceed the length specified in the {\tt LN}
+    field of the {\tt @SQ} header line (if exists) with an SN equal to 
+    {\sf RNAME}.
+  \item (alternative) The alignment may be split across multiple records,


here we should explicitly say that one of the record should "start" at POS=1 and another should "end" at POS=LN. otherwise, we are not really saying much in this case.

What about a CDS where the origin is in an intron? In that case a GenBank/EMBL style CDS record would be multi-p

Update: I thought I discarded this half written comment, such already multi-part records don't need to do anything new, so the comment above about POS=1 and LN stands.

yfarjoun

Some clarifications required.

jmarshall · 2019-11-14T12:52:26Z

Yes. As discussed in previous meetings, I've already been fixing up all those issues. Good point about the split representation though.

hts-specs-bot · 2019-12-09T16:41:52Z

Changed PDFs as of b8d7f0a: SAMv1 (diff).

jmarshall · 2019-12-09T16:43:33Z

Non-trivially reworded, and in particular I've added details of how the two representations are intended to be interpreted.

I think that's ready for merging (unless others think the ((p-1) mod LN)+1 malarkey is too much…).

nh13 · 2019-12-09T16:50:48Z

SAMv1.tex

+
+  \item
+Alternatively, such alignments may be split across several records: one record representing the initial portion of the segment ending at~{\tt LN}, one representing the final portion starting from~1, and any other records representing additional portions in between spanning the entire reference sequence.
+One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set.


small suggestion based on line 678

Suggested change

One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set.

One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} {\sf FLAG} bit (0x800) set.

Shrug… I was following the similar lowercase ‘flag bit’ phrasing in footnote 12 and in the Unmapped reads and Multiple mapping sections just above…

…part 2) Add recommended practice for records spanning the origin of circular reference sequences. [Added details of what the two representations *mean* -- JM] Co-authored-by: John Marshall <[email protected]>

hts-specs-bot · 2020-01-06T10:07:42Z

Changed PDFs as of e5f1cf4: SAMv1 (diff).

jmarshall · 2020-01-06T10:14:08Z

Squashed and merged as discussed in the last meeting. Closing this PR this time as it's getting unwieldy and we're now more or less “feature complete” though further tweaks will probably be needed as experience is gained.

nh13 commented May 1, 2019

View reviewed changes

jmarshall added the sam label May 1, 2019

nh13 mentioned this pull request May 2, 2019

Add support for the TP (topology) tag in the sequence dictionary samtools/htsjdk#1363

Open

jmarshall self-assigned this May 3, 2019

jmarshall force-pushed the nh_topology branch from 0fb8f61 to 1287d52 Compare May 3, 2019 14:56

jmarshall force-pushed the nh_topology branch from 1287d52 to 690e758 Compare May 14, 2019 09:36

jmarshall mentioned this pull request May 14, 2019

Circular chromosomes in @SQ header #403

Closed

jmarshall changed the title ~~Add the TP tag to the SQ in the SAM header~~ Representation of cross-join mappings on circular reference sequences May 14, 2019

jmarshall removed their assignment May 29, 2019

nh13 force-pushed the nh_topology branch from 690e758 to acf3209 Compare August 19, 2019 16:39

yfarjoun reviewed Nov 14, 2019

View reviewed changes

yfarjoun requested changes Nov 14, 2019

View reviewed changes

jmarshall self-assigned this Nov 14, 2019

nh13 commented Dec 9, 2019

View reviewed changes

Describe cross-join circular mapping representations (PR samtools#405 …

e5f1cf4

…part 2) Add recommended practice for records spanning the origin of circular reference sequences. [Added details of what the two representations *mean* -- JM] Co-authored-by: John Marshall <[email protected]>

jmarshall force-pushed the nh_topology branch from b8d7f0a to e5f1cf4 Compare January 6, 2020 10:05

jmarshall merged commit e5f1cf4 into samtools:master Jan 6, 2020

github-pages bot temporarily deployed to github-pages January 6, 2020 10:09 Inactive

tfenne mentioned this pull request Dec 14, 2020

Specify "is autosomal" or provide ploidy of a sequence record #535

Open

	One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} flag bit (0x800) set.
	One record (chosen arbitrarily) is considered primary and the remainder have their {\sf supplementary} {\sf FLAG} bit (0x800) set.

Representation of cross-join mappings on circular reference sequences #405

Representation of cross-join mappings on circular reference sequences #405

Conversation

nh13 commented May 1, 2019

hts-specs-bot commented May 1, 2019

jmarshall commented May 1, 2019

nh13 May 1, 2019

Choose a reason for hiding this comment

jmarshall May 1, 2019

Choose a reason for hiding this comment

jmarshall commented May 1, 2019

jmarshall commented May 1, 2019

nh13 commented May 1, 2019

jkbonfield commented May 2, 2019

colinhercus commented May 2, 2019 via email

jmarshall commented May 2, 2019

nh13 commented May 2, 2019

jmarshall commented May 2, 2019

colinhercus commented May 3, 2019 via email

jmarshall commented May 3, 2019

yfarjoun commented May 6, 2019

jkbonfield commented May 7, 2019 • edited Loading

jkbonfield commented May 7, 2019

nh13 commented May 7, 2019

yfarjoun commented May 7, 2019

jmarshall commented May 7, 2019

yfarjoun commented May 7, 2019

hts-specs-bot commented May 14, 2019

jmarshall commented May 14, 2019

peterjc commented May 14, 2019

nh13 commented Jul 24, 2019

hts-specs-bot commented Aug 19, 2019

nh13 commented Aug 19, 2019

jkbonfield commented Aug 20, 2019

peterjc commented Aug 20, 2019

nh13 commented Aug 21, 2019

hts-specs-bot commented Aug 21, 2019

yfarjoun Nov 14, 2019

Choose a reason for hiding this comment

yfarjoun Nov 14, 2019

Choose a reason for hiding this comment

yfarjoun Nov 14, 2019

Choose a reason for hiding this comment

yfarjoun Nov 14, 2019 • edited Loading

Choose a reason for hiding this comment

peterjc Nov 15, 2019 • edited Loading

Choose a reason for hiding this comment

yfarjoun left a comment

Choose a reason for hiding this comment

jmarshall commented Nov 14, 2019

hts-specs-bot commented Dec 9, 2019

jmarshall commented Dec 9, 2019

nh13 Dec 9, 2019

Choose a reason for hiding this comment

jmarshall Dec 9, 2019

Choose a reason for hiding this comment

hts-specs-bot commented Jan 6, 2020

jmarshall commented Jan 6, 2020

jkbonfield commented May 7, 2019 •

edited

Loading

yfarjoun Nov 14, 2019 •

edited

Loading

peterjc Nov 15, 2019 •

edited

Loading