Circular chromosomes in @SQ header #403
Comments
Sounds like a nice addition. I don't know offhand of other topologies that we would want to include...unless we want to eventually allow knot notations... (http://katlas.math.toronto.edu/wiki/The_Rolfsen_Knot_Table) I would be in favor of a boolean option. |
It's a nice idea and I don't see a problem with adding it, even if downstream tools don't yet support it. Add it and there's a chance that'll happen. Ancient history: way back in an earlier job we found this notion useful; e.g. see page 274 (actual 294) of http://nebc.nerc.ac.uk/bioinformatics/documentation/staden/doc/manual_unix.pdf. In gap4, if a contig reference was marked as circular then the editor would permit scrolling beyond the end and wrapping around to the start. You could also define where the starting point was. Both of these were feature requests from people studying mitochondrial genomes, for which the standard reference sequence at the time just happened to have base number 1 in the hypervariable region, so people often rotated the genome so that alignments to that variable region (which was the thing under study) worked. |
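The rotation and wrap-around behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not gap4's implementation; `rotate_circular` and `wrap_position` are hypothetical helper names, and coordinates are assumed to be 1-based.

```python
def rotate_circular(seq: str, new_start: int) -> str:
    """Rotate a circular sequence so 1-based position new_start becomes base 1."""
    if not 1 <= new_start <= len(seq):
        raise ValueError("new_start must lie within the sequence")
    offset = new_start - 1
    return seq[offset:] + seq[:offset]


def wrap_position(pos: int, length: int) -> int:
    """Map a 1-based coordinate that runs past either end back onto the circle."""
    return (pos - 1) % length + 1
```

Rotating moves the region of interest away from the artificial break at base 1, while `wrap_position` is the coordinate arithmetic an editor would need in order to scroll past the end.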
Can we use tp:[circular|linear] to be consistent with the GRCh38 FASTA tags?
>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion AS:GRCh38 tp:circular
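As a rough sketch of how a tool might read that topology tag from a FASTA header line, assuming whitespace-separated `KEY:value` fields after the sequence name (`parse_fasta_header` is a hypothetical helper, not part of any existing tool):

```python
def parse_fasta_header(line: str) -> tuple[str, dict[str, str]]:
    """Split '>name KEY:value ...' into the sequence name and a tag dictionary."""
    fields = line.lstrip(">").split()
    name, tags = fields[0], {}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if sep:  # ignore any field that is not KEY:value
            tags[key] = value
    return name, tags


header = ">chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion AS:GRCh38 tp:circular"
name, tags = parse_fasta_header(header)
# An absent tp tag is treated as linear here.
is_circular = tags.get("tp") == "circular"
```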
…On Thu, 25 Apr 2019 at 23:29, James Bonfield ***@***.***> wrote:
|
@colinhercus: I guess this is from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ which contains README_analysis_sets.txt describing the various tags used (see §4), several of which are SAM @SQ fields brought across to FASTA.
For a definition in SAM @SQ we'd want to make the tag uppercase,¹ but this is certainly motivation for the TP / Topology terminology.
¹ “Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.” |
Novoalign copies lower-case FASTA tags across to the @SQ line, so we already have
tp:circular.
…On Fri, 26 Apr 2019 at 18:56, John Marshall ***@***.***> wrote:
>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion AS:GRCh38 tp:circular
@colinhercus <https://github.com/colinhercus>: I guess this is from
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/
GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ which
contains README_analysis_sets.txt
<http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt>
describing the various tags used (see §4), several of which are SAM @SQ
fields brought across to FASTA:
…AC, gi, LN, rg, rl, M5, AS, hm…
tp: topology
- circular for chrM and chrEBV
- not present for linear chromosomes and scaffolds
For a definition in SAM @SQ we'd want to make the tag uppercase,¹ but
this is certainly motivation for the TP / Topology terminology.
------------------------------
¹ “Tags containing lowercase letters are reserved for local use and will
not be formally defined in any future version of this specification.”
|
I’d love this addition to be made to the spec. I think an upper-case tag is appropriate, as lower-case tags are reserved for local use (as @jmarshall cites from the spec). I have an aligner that I am working on for a very-long-read application that will need to map across the origin. I’ll leave any changes to the spec about how to represent alignments across the origin in a circular reference for later (currently I split the alignment into two). |
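The "split the alignment into two" workaround amounts to simple coordinate arithmetic. Below is a minimal sketch, assuming 1-based inclusive coordinates and an alignment that covers at most one full turn of the reference; `split_across_origin` is a hypothetical name and does not reflect any particular aligner's code.

```python
def split_across_origin(start: int, span: int, ref_len: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) segments for an alignment covering
    `span` reference bases from `start` on a circular reference of `ref_len` bases."""
    if not 1 <= start <= ref_len or not 1 <= span <= ref_len:
        raise ValueError("alignment must fit within one turn of the reference")
    end = start + span - 1
    if end <= ref_len:
        return [(start, end)]  # does not reach the origin
    # Crosses the origin: one piece up to the last base, one from base 1 onwards.
    return [(start, ref_len), (1, end - ref_len)]
```

For a 16569-base reference, a 100-base alignment starting at position 16500 would be reported as two pieces, 16500 to 16569 and 1 to 30.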
This is to support annotating reference sequences as circular, e.g., for bacterial organisms or the human mitochondrial chromosome. [Summarise @nh13's footnote text so it fits on one line, so `@RG-SM` is not pushed off to the next page as an orphan. Remove now unneeded pagebreak hint.] Fixes samtools#403.
I commented in the PR...but I'll comment here too: One thing is troubling me: if two people have two different versions of a reference sequence, one of which is TP:circular and the other is not, they will have the same md5, which means that refGet will clash and other sanity checks will fail. Should we redefine md5 for TP:circular in some way to avoid this? For example:
|
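To make the concern concrete, here is a small sketch of why the checksum cannot distinguish the two records. It assumes a simplified checksum (MD5 over the bases with whitespace removed and letters uppercased) and a made-up toy sequence; it illustrates the clash only and is not the redefinition proposed in the comment above.

```python
import hashlib


def sequence_md5(seq: str) -> str:
    """MD5 of the sequence with whitespace removed and letters uppercased."""
    normalised = "".join(seq.split()).upper()
    return hashlib.md5(normalised.encode("ascii")).hexdigest()


bases = "GATCACAGGTCTATCACCCT"  # made-up toy sequence, not a real reference

# Two header records for the same bases; only one declares a topology.
linear_record = {"SN": "toy", "M5": sequence_md5(bases)}
circular_record = {"SN": "toy", "M5": sequence_md5(bases), "TP": "circular"}

# The checksum depends only on the bases, so the records are indistinguishable by M5.
same_checksum = linear_record["M5"] == circular_record["M5"]
```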
@yfarjoun It seems that in the long term I'd want the metadata clash to be spotted and treated as an error. In the short term (since the circular tag will take a while to be adopted), could the sanity checker treat this as a warning only? Am I missing something (I'm not familiar with the implementation details of refGet)? |
Closed as the |
It is occasionally suggested that it would be useful to have a convention for annotating reference sequences as being circular. See for example this 2011/12 samtools-devel thread and this tweet. There are further questions about how to represent mappings across the “join” in a circular chromosome (as mentioned in that thread), but being able to represent the concept in SAM at all is a useful first step.
For example, this could be
for Circular—true, or perhaps more self-explanatorily and flexibly something along the lines of
for Molecule Topology—circular (which would have a default implied value of linear).
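A reader for the second form might look like the sketch below. It assumes the tag is spelled `TP` as discussed elsewhere in this thread, that `@SQ` fields are tab-separated `KEY:value` pairs, and that a missing tag means linear; `sq_topology` is a hypothetical helper name.

```python
def sq_topology(sq_line: str) -> str:
    """Return the topology from an @SQ header line, defaulting to 'linear'."""
    fields = sq_line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ header line")
    # Split each field on the first colon only, since values may contain colons.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return tags.get("TP", "linear")
```

With an implied default, existing headers that lack the tag keep their current meaning and only circular references need to be annotated.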