Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods to fix mate info on non-primaries and templates #204

Open
wants to merge 42 commits into
base: main
Choose a base branch
from

Conversation

clintval
Copy link
Member

@clintval clintval commented Dec 27, 2024

Closes #83
Closes #182

I need a set of functions to make it easier to set mate info on secondary, supplemental, "secondary supplemental", and templates. The work in this PR builds on the work in:

This PR introduces:

All "mate" fields are set using the primary of the opposite read ordinal.

However, I am seeking advice on how we should set:

  • is_proper_pair
  • template_length
For Secondary Alignments

For secondary alignments, I personally am curious about those values not as they are computed on the two primary alignments, but between the secondary alignment and the primary of the opposite ordinal (e.g. if this secondary was primary, then what would it look like paired with the primary of the opposite ordinal).

For Supplementary Alignments

For supplementary alignments of the primary alignment, it makes sense to have these two values point to the primary alignment of the mate. However, for secondary supplemental alignments, I think these values are undefined without knowing more about the specific alignment situation, so I am choosing to leave them as-is.

@clintval clintval requested review from nh13 and tfenne as code owners December 27, 2024 01:03
@clintval clintval changed the title Cv fix mates secondary supplementary @clintval Add methods to fix mate info on non-primaries and templates Dec 27, 2024
@clintval clintval marked this pull request as draft December 27, 2024 01:03
@clintval clintval changed the title @clintval Add methods to fix mate info on non-primaries and templates Add methods to fix mate info on non-primaries and templates Dec 27, 2024
@clintval clintval force-pushed the cv_fix_mates_secondary_supplementary branch 2 times, most recently from 30dbc8a to 1c4f647 Compare December 27, 2024 01:13
Copy link

codecov bot commented Dec 27, 2024

Codecov Report

Attention: Patch coverage is 94.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.44%. Comparing base (b0b4227) to head (0649c67).

Files with missing lines Patch % Lines
fgpyo/sam/__init__.py 94.00% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
+ Coverage   91.06%   91.44%   +0.37%     
==========================================
  Files          18       18              
  Lines        2283     2314      +31     
  Branches      337      349      +12     
==========================================
+ Hits         2079     2116      +37     
+ Misses        133      124       -9     
- Partials       71       74       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@clintval clintval force-pushed the cv_better_mate_info branch from 491777b to c755f6b Compare December 27, 2024 01:32
@clintval clintval force-pushed the cv_fix_mates_secondary_supplementary branch from 1c4f647 to b441d30 Compare December 27, 2024 01:33
@clintval clintval force-pushed the cv_better_mate_info branch from c755f6b to 56d9692 Compare December 27, 2024 01:37
@clintval clintval force-pushed the cv_fix_mates_secondary_supplementary branch from b441d30 to 32024b9 Compare December 27, 2024 01:46
@clintval clintval changed the base branch from cv_better_mate_info to cv_isize December 27, 2024 01:47
@clintval
Copy link
Member Author

If early review is favorable, I'll go and finish the PR with unit tests!

@clintval clintval force-pushed the cv_fix_mates_secondary_supplementary branch from 32024b9 to 8c69815 Compare December 27, 2024 02:51
@clintval clintval requested a review from tfenne December 27, 2024 22:08
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the code for setting values on supplementaries is a good addition. I'm still a little wary of the choices for secondaries ... if anyone is using an aligner that is better than bwa at reporting secondary alignments, this might override more useful information with arguably incorrect values.

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Base automatically changed from cv_isize to main December 28, 2024 14:48
@clintval clintval marked this pull request as ready for review December 30, 2024 15:34
@clintval clintval requested a review from tfenne December 30, 2024 15:34
Copy link
Contributor

coderabbitai bot commented Dec 30, 2024

Walkthrough

The changes enhance the handling of mate information in SAM (Sequence Alignment/Map) records. A new private function _set_common_mate_fields is added to centralize the logic for setting mate fields. The set_mate_info function is updated to use this helper, simplifying its implementation. Two new functions, set_mate_info_on_secondary and set_mate_info_on_supplementary, are introduced to manage mate information for secondary and supplementary alignments. The Template class is modified to include a set_mate_info method that resets mate information for all records within a template. These alterations improve the organization and validation of mate information handling across different alignment types.

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
fgpyo/sam/__init__.py (2)

867-895: Good extraction of common logic into _set_common_mate_fields.

This helper function is clean and well-documented. One minor point: consider clarifying in the docstring whether source may be unmapped or not, given that the function still sets mate-score tags unconditionally regardless of mapping status.


1239-1263: Template-level set_mate_info usage is well structured.

The stepwise approach for applying mate info to R1, R2, secondaries, and supplementals is logical. If you add additional flags or references in the future, consider factoring out repeated sections for clarity.

tests/fgpyo/sam/test_sam.py (1)

713-714: Minor spelling nitpick in the test name.

It should be "mismatched" rather than "mimatched" in test_set_mate_info_raises_mimatched_query_names.

-def test_set_mate_info_raises_mimatched_query_names() -> None:
+def test_set_mate_info_raises_mismatched_query_names() -> None:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd38898 and ec56350.

📒 Files selected for processing (3)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (3 hunks)
  • tests/fgpyo/sam/test_template_iterator.py (1 hunks)
🔇 Additional comments (16)
fgpyo/sam/__init__.py (3)

181-181: Imports approved.

Using Self and deprecated here is acceptable to maintain compatibility with older versions of Python while offering newer features.


897-935: Solid pair info update in set_mate_info.

The approach of calling _set_common_mate_fields twice (once per direction) and then setting TLEN and proper-pair bits is clear. No issues found with error handling and logic flow.


927-971: Supplementary and secondary info updates look good.

Both set_mate_info_on_secondary and set_mate_info_on_supplementary handle relevant checks thoroughly. The approach of disallowing a supplementary mate source is consistent with your docstring. No direct code issues spotted.

tests/fgpyo/sam/test_template_iterator.py (2)

222-224: No issues at these lines.


225-389: Thorough testing in test_template_set_mate_info.

Your test covers both primary and secondary, supplementary, and secondary-supp combos. The coverage looks comprehensive and confirms correctness of the new methods.

tests/fgpyo/sam/test_sam.py (11)

24-26: Import statements are fine.


689-698: Correct check for read ordinal mismatch in test_set_mate_info_raises_not_opposite_read_ordinals.

This test correctly ensures error-raising if both reads are flagged R1 or both R2.


700-708: Proper error test for supplementary second record.


832-860: test_set_mate_info_on_secondary successfully verifies mate info for a secondary record.


862-889: Robust negative tests in test_set_mate_info_on_secondary_raises_for_secondary_or_supp_rec2.

These do help confirm that we only set mate info from a primary mate. No issues spotted.


891-919: test_set_mate_info_on_supplementary logic mirrors the secondary approach.


921-948: Validation checks for a primary mate are consistent in test_set_mate_info_on_supplementary_raises_for_secondary_or_supp_rec2.


950-964: Additional field-setting for primary supplements is well-tested.


966-981: Secondary + supplementary check is thorough.

Ensures that no primary-based fields are overwritten on secondaries. Good.


983-994: test_set_pair_info_raises_exception_for_mismatched_query_names is a fine negative test.


996-1028: Verifies legacy set_pair_info for two mapped records.

Nicely ensures backward compatibility for older usage patterns.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/fgpyo/sam/test_sam.py (1)

832-889: Add test for mapped secondary alignment.

Current tests only cover unmapped case. Add test with mapped secondary alignment to ensure complete coverage.

def test_set_mate_info_on_secondary_mapped():
    """Test set_mate_info_on_secondary with mapped secondary alignment."""
    builder = SamBuilder()
    secondary, primary = builder.add_pair(chrom="chr1", start1=100, start2=200)
    secondary.is_secondary = True
    set_mate_info_on_secondary(secondary, primary)
    assert secondary.next_reference_start == primary.reference_start
    assert secondary.template_length == primary.template_length
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec56350 and 7116964.

📒 Files selected for processing (1)
  • tests/fgpyo/sam/test_sam.py (3 hunks)
🔇 Additional comments (2)
tests/fgpyo/sam/test_sam.py (2)

689-708: LGTM: Validation tests for set_mate_info are thorough.

Tests cover essential validation cases for read ordinals and supplementary alignments.


891-981: LGTM: Comprehensive supplementary alignment tests.

Tests cover both primary and secondary supplementary cases with proper validation of mate info fields.

tests/fgpyo/sam/test_sam.py Show resolved Hide resolved
@tfenne
Copy link
Member

tfenne commented Jan 6, 2025

@nh13 can you take a look through this one and give your opinions please?

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
dest.mate_is_mapped = source.is_mapped
dest.set_tag("MC", source.cigarstring)
dest.set_tag("MQ", source.mapping_quality)
dest.set_tag("ms", sum_of_base_qualities(source))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank-you!

Comment on lines +969 to 972
if not supp.is_secondary:
supp.is_proper_pair = mate_primary.is_proper_pair
supp.template_length = -mate_primary.template_length

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think template-length is correct, since the mate_primary.template_length is calculated from the primary relative to supp. Should we recalculate it, or should we set it to zero since it's not the primary-primary alignment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm having a hard time seeing how this is wrong. Really wish we could whiteboard!

Here, "mate primary" is the primary alignment of the supplemental's mate. What we want to do is have the supplement's template length be equal to the primary of the supplement (which we unfortunately do not have access to). However, the implementation in fgpyo (and elsewhere) flips the sign of the template length for mates. So, if we flip the sign of the supplemental's mate primary, then we should have the value that would have been set upon the supplemental's primary.

For secondary supplementals (and for secondaries in general) I am not setting template length anywhere because I feel the specification leaves this ambiguous, and a smarter aligner (or custom code) might set specific template lengths on the secondaries and I don't want to override that info. I feel the spec. is ambiguous for secondaries because I don't see reference to RNAME/RNEXT in the definition of TLEN.

Comment on lines +1250 to +1262
if self.r1 is not None and self.r2 is not None:
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
if self.r1 is not None:
for rec in self.r2_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
for rec in self.r2_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
if self.r2 is not None:
for rec in self.r1_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
for rec in self.r1_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
return self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this:

Suggested change
if self.r1 is not None and self.r2 is not None:
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
if self.r1 is not None:
for rec in self.r2_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
for rec in self.r2_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
if self.r2 is not None:
for rec in self.r1_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
for rec in self.r1_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
return self
if self.r1 is not None:
for rec in self.all_r2s:
_set_common_mate_fields(dest=rec, source=self.r1)
if self.r2 is not None:
for rec in self.all_r1s:
_set_common_mate_fields(dest=rec, source=self.r2)
return self

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my last commit, I still have two separate functions for secondaries and supplementaries. They aren't that different from _set_common_mate_fields except for:

  • Additional checking for if the record is secondary or supplementary for safety sake
  • Proper pair and template length settings on supplementaries (but only for non-secondary supplementaries).

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/fgpyo/sam/test_sam.py (4)

694-696: Make error message more descriptive.

Current message could be clearer about what "different read ordinals" means.

-        ValueError, match="mate_primary and dest records must be of different read ordinals!"
+        ValueError, match="mate_primary and dest records must be from different reads (R1/R2)!"

834-862: Add test cases for mapped records.

Current test only covers unmapped records. Add cases for mapped records to ensure proper handling of reference coordinates and flags.


894-922: Add test cases for mapped records.

Similar to secondary test, add cases for mapped records to verify reference coordinates and flags.


Line range hint 1246-1258: Consider alternative implementation.

Per PR comments, consider restructuring to:

  1. Process all R2s with R1 primary
  2. Process all R1s with R2 primary

This approach might be more maintainable.

-        if self.r1 is not None and self.r2 is not None:
-            set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
-        if self.r1 is not None:
-            for rec in self.r2_secondaries:
-                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
-            for rec in self.r2_supplementals:
-                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
-        if self.r2 is not None:
-            for rec in self.r1_secondaries:
-                set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
-            for rec in self.r1_supplementals:
-                set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
+        if self.r1 is not None:
+            for rec in self.all_r2s():
+                _set_common_mate_fields(dest=rec, source=self.r1)
+        if self.r2 is not None:
+            for rec in self.all_r1s():
+                _set_common_mate_fields(dest=rec, source=self.r2)
fgpyo/sam/__init__.py (1)

899-900: Add type hints for callback parameters.

Add return type hints for is_proper_pair and isize callbacks.

-    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
-    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
+    is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,  # type: -> bool
+    isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,  # type: -> int
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d2617c and 0649c67.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: testing (3.10)
  • GitHub Check: testing (3.9)
🔇 Additional comments (2)
tests/fgpyo/sam/test_sam.py (1)

965-967: Verify template length calculation.

Per discussion in PR comments, template length might need recalculation or zeroing since it's not a primary-primary alignment.

fgpyo/sam/__init__.py (1)

965-967: Review template length handling.

Verify if negating mate's template length is correct for supplementary alignments. Consider:

  1. Recalculating based on alignment positions
  2. Setting to zero for non-primary alignments

@clintval clintval requested a review from nh13 January 10, 2025 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a method on Template to fixmate information
3 participants