Write compound (segmented) sequence locations #1438
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1438      +/-   ##
==========================================
- Coverage   68.73%   68.65%   -0.08%
==========================================
  Files          69       69
  Lines        7551     7580      +29
  Branches     1851     1858       +7
==========================================
+ Hits         5190     5204      +14
- Misses       2083     2094      +11
- Partials      278      282       +4
41a1984 to 0c7a4d3
A note on GFF: our parsing of GFF files (i.e. upstream of the changes here) doesn't handle segmented features encoded like this example from the GFF3 spec:
When parsed by GFF works for genes which cross the origin (e.g. HBV) when encoded as a single CDS with coords |
Thanks @jameshadfield! It's good to know the translations are fine and we just needed to correct the annotations consumed by Auspice! Changes look good to me by inspection.
I think it'd be helpful to validate the annotations output by genome_features_to_auspice_annotation. Thoughts on always validating the annotations with validate_json within the function itself?
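As a rough illustration of what validating the annotations before writing them could involve, here is a hand-rolled sketch. It is not validate_json and does not use the real schema; the function name check_annotation and the specific rules (integer, one-based coordinates and a "+"/"-" strand) are assumptions made for the example.

```python
def check_annotation(name, annotation):
    """Return a list of problems found in one annotation (empty if none).

    Hypothetical helper: the rules below are assumptions for illustration,
    not the actual schema used by augur or Auspice.
    """
    problems = []
    # A segmented annotation carries a list of spans; a simple one is a single span.
    spans = annotation["segments"] if "segments" in annotation else [annotation]
    for span in spans:
        start, end = span.get("start"), span.get("end")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"{name}: start and end must be integers")
        elif start < 1:
            problems.append(f"{name}: coordinates are expected to be one-based")
    if annotation.get("strand") not in ("+", "-"):
        problems.append(f"{name}: strand must be '+' or '-'")
    return problems


segmented = {"segments": [{"start": 100, "end": 200}, {"start": 201, "end": 300}], "strand": "+"}
zero_based = {"start": 0, "end": 10, "strand": "+"}
print(check_annotation("V", segmented))
print(check_annotation("X", zero_based))
```

Returning a list of problems rather than raising on the first one makes it easier to report every malformed annotation at once.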
Given a SeqFeature with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333.
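A minimal sketch of the idea, not the code in this PR: a compound location is written as a list of segments, while a simple location keeps a single start/end pair. The Simple and Compound tuples are hypothetical stand-ins for BioPython's location classes so that the example runs without BioPython. The output keys ("segments", "start", "end", "strand") and the conversion to one-based, inclusive coordinates are assumptions about the target format, and the coordinates are made up.

```python
from collections import namedtuple

# Hypothetical stand-ins for BioPython's simple and compound locations.
# Coordinates follow BioPython's convention: zero-based start, exclusive end.
Simple = namedtuple("Simple", ["start", "end", "strand"])
Compound = namedtuple("Compound", ["parts", "strand"])

STRAND_SYMBOL = {1: "+", -1: "-"}


def location_to_annotation(location):
    """Convert a location to one-based, inclusive coordinates (assumed output shape)."""
    annotation = {}
    # Real code would check against Bio.SeqFeature.CompoundLocation instead.
    if isinstance(location, Compound):
        annotation["segments"] = [
            {"start": int(part.start) + 1, "end": int(part.end)}
            for part in location.parts
        ]
    else:
        annotation["start"] = int(location.start) + 1
        annotation["end"] = int(location.end)
    annotation["strand"] = STRAND_SYMBOL[location.strand]
    return annotation


# Illustrative coordinates only, not taken from any real genome.
two_part_cds = Compound(parts=[Simple(99, 200, 1), Simple(200, 300, 1)], strand=1)
print(location_to_annotation(two_part_cds))
print(location_to_annotation(Simple(0, 30, 1)))
```

Adding one to the start and leaving the end unchanged is what turns a zero-based, half-open interval into a one-based, inclusive one.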
0c7a4d3 to e099711
Force pushed to resolve conflicts in
Validation of the node-data we are about to write out seems sensible rather than (only) validation when we read the node-data file. Validation within the actual function added in this PR has some difficulties because of how it's used in |
Our current parsing for GenBank files would correctly read in complex CDSs (e.g. the V gene in measles) but we didn't correctly export this CDS in the output JSON. This is fixed here.
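To illustrate why a correctly parsed complex CDS yields the right sequence, extracting a multi-part feature on the forward strand roughly amounts to concatenating its parts in order. The sketch below uses plain strings and a hypothetical extract_parts helper rather than BioPython. The toy genome is made up, and the example wraps around the origin of a circular genome, similar in spirit to the HBV case.

```python
def extract_parts(genome, parts):
    """Concatenate forward-strand parts given as (start, end) pairs.

    Hypothetical helper for illustration; coordinates are zero-based
    with an exclusive end, and reverse-strand features are not handled.
    """
    return "".join(genome[start:end] for start, end in parts)


# Toy circular genome of 12 nt; the CDS starts near the end and wraps past the origin.
genome = "AAATAGCCCATG"
wrapping_cds = [(9, 12), (0, 6)]
print(extract_parts(genome, wrapping_cds))  # ATGAAATAG
```

Because the sequence is assembled from the parts, only the way the coordinates are written to the output JSON needed to change.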
Note that because we use BioPython's SeqFeature.extract for both VCF and JSON/FASTA inputs the actual translations (via augur translate) don't need updating.
Screenshot of Measles V gene:
Screenshot of HBV, with the Nextclade + GFF translations (top) and the new augur translate + GenBank translations (below). The data is identical except for the CDS names: