Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a function for determining read pair orientation #201

Merged
merged 13 commits into from
Dec 27, 2024
Merged

Conversation

clintval
Copy link
Member

@clintval clintval commented Dec 24, 2024

I have a project where I am re-building a complex alignment situation for a template and it involves an assessment of read pair orientation and whether the new pair is "proper" or not.

The functions in this PR allow for an optional R2 such as when you are iterating through a BAM in coordinate order and don't have easy access to the R2. When this happens, R1 is required to have enough mate information to derive the pair status and whether or not the read pair is "proper".

The implementation honors what is implemented in HTSJDK/fgbio but with:

  1. Fewer exceptions by using None instead of an exception, when possible
  2. Proper pair status also considers template length / insert size

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link

codecov bot commented Dec 24, 2024

Codecov Report

Attention: Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.31%. Comparing base (ae54c9e) to head (bcf321d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
fgpyo/sam/__init__.py 96.22% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #201      +/-   ##
==========================================
+ Coverage   90.32%   91.31%   +0.99%     
==========================================
  Files          18       18              
  Lines        2232     2281      +49     
  Branches      322      334      +12     
==========================================
+ Hits         2016     2083      +67     
+ Misses        144      126      -18     
  Partials       72       72              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@fulcrumgenomics fulcrumgenomics deleted a comment from coderabbitai bot Dec 24, 2024
Copy link
Contributor

coderabbitai bot commented Dec 27, 2024

Walkthrough

The pull request introduces a new enumeration PairOrientation for representing read pair orientations (FR, RF, TANDEM) in SAM file processing. It adds a class method from_recs to determine pair orientation based on two AlignedSegment instances, checking their mapping status and reference alignment. A new function, is_proper_pair, is implemented to evaluate if a read pair is properly paired based on alignment and orientation criteria. The set_pair_info function is modified to enhance error handling by replacing assertions with ValueError for mismatched query names and updating the handling of the mate_is_unmapped property. Additionally, new tests are added to validate these functionalities, covering various scenarios for determining read pair orientations and proper pairing conditions.

Tip

CodeRabbit's docstrings feature is now available as part of our Early Access Program! Simply use the command @coderabbitai generate docstrings to have CodeRabbit automatically generate docstrings for your pull request. We would love to hear your feedback on Discord.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
fgpyo/sam/__init__.py (2)

669-712: Consider addressing the TODO for insert size consistency.
You are already noting in the comments that using abs(isize(r1, r2)) might be more consistent than relying on template_length. It might help ensure correctness when set_pair_info hasn’t been invoked but the read pair is in FR orientation.


814-815: Add test coverage for different query names.
Currently, raising a ValueError if r1.query_name != r2.query_name is untested according to static analysis. Add a targeted test to ensure this error path is covered.

Would you like me to propose a new unit test to ensure coverage?

🧰 Tools
🪛 GitHub Check: codecov/patch

[warning] 815-815: fgpyo/sam/init.py#L815
Added line #L815 was not covered by tests

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3680b4f and 4ee5878.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (2 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 815-815: fgpyo/sam/init.py#L815
Added line #L815 was not covered by tests

🔇 Additional comments (4)
fgpyo/sam/__init__.py (3)

268-269: Logic looks consistent.
The approach to auto-detect the file type when it is None is clear and aligns with your design of delegating to SamFileType.from_path().


610-667: Orientation logic is clear and closely follows established conventions.
The PairOrientation enum and its build() method handle forward/reverse logic well, including fallback to None for unmapped or cross-chromosomal reads. The checks are clear and match typical orientation definitions.


830-832: Ensure MC and MQ tags are valid when copying from an unmapped mate.
When one read is unmapped, its cigar might be None. It’s good practice to confirm that src.cigarstring is valid before setting MC. Otherwise, consider skipping it or clearing the tag on the mate.

tests/fgpyo/sam/test_sam.py (1)

20-22: Imports look good.
Importing PairOrientation and is_properly_paired ensures that the new functionality is fully tested.

tests/fgpyo/sam/test_sam.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
fgpyo/sam/__init__.py (1)

669-711: Good logic in is_proper_pair.
Consider letting isize(r1, r2) accept an optional r2 to avoid using template_length directly if you want more consistency.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ee5878 and 8bca675.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_sam.py (2 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 815-815: fgpyo/sam/init.py#L815
Added line #L815 was not covered by tests

🔇 Additional comments (15)
tests/fgpyo/sam/test_sam.py (13)

20-20: Nice import.


22-22: Good addition to test coverage.


305-342: Solid test case for pair orientation with R2.


344-361: Good check for unmapped reads.


363-386: Broad coverage with missing R2.


388-408: Comprehensive scenario for unmapped reads without explicit R2.


410-419: Exception testing is well done.


421-432: Verifies proper pairing in typical alignment.


434-445: Checks proper pairing with only R1.


447-467: Correctly confirms improper pair status for mismatched orientation.


469-489: Extensions of the orientation checks with no R2.


491-497: Properly validates pairs that exceed distance limits.


Line range hint 814-815: Add a negative test for query-name mismatch.

The existing coverage is good, but there’s no scenario testing a mismatch in query names. Please add a test that calls set_pair_info with two different query names to confirm the ValueError is raised.

fgpyo/sam/__init__.py (2)

610-667: Clean enumeration for pair orientation.


830-832: Mate info tags updated properly.

fgpyo/sam/__init__.py Show resolved Hide resolved
Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved so long as you agree with my comments, otherwise let's discuss any differences!

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
fgpyo/sam/__init__.py (2)

624-668: Consider renaming build() method.

Previous discussion suggested alternatives to build(). Consider from_recs() for clarity.

The implementation is solid:

  • Handles missing R2 case
  • Validates MC tag presence
  • Follows HTSJDK logic

674-712: Consider implementing TODO for isize().

Function is well-implemented but could be improved:

  • Replace template_length check with isize() call
  • Would need isize() modification to handle optional R2

Want me to implement the isize() modification?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8bca675 and 4e64644.

📒 Files selected for processing (1)
  • fgpyo/sam/__init__.py (5 hunks)
🔇 Additional comments (4)
fgpyo/sam/__init__.py (4)

161-161: LGTM: Import needed for type hints.


611-622: LGTM: Well-documented enum for pair orientations.

Clear documentation of FR ("innie"), RF ("outie"), and TANDEM orientations.


670-671: LGTM: Clear default for proper pair orientation.


815-816: LGTM: Improved error handling and tag management.

  • Better error handling with ValueError
  • Added mate_is_unmapped property
  • Added MQ tag for mapping quality

Also applies to: 831-833

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
fgpyo/sam/__init__.py (1)

611-676: Enum logic is robust but somewhat dense.

You might consider splitting out orientation checks for clarity. It’s only a suggestion; the flow is already correct.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e64644 and dc17e0d.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (5 hunks)
  • tests/fgpyo/sam/test_sam.py (2 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 824-824: fgpyo/sam/init.py#L824
Added line #L824 was not covered by tests

🔇 Additional comments (18)
tests/fgpyo/sam/test_sam.py (14)

20-20: Import confirmation.

Thanks for importing PairOrientation. Looks good.


22-22: Separate import is clean.

Bringing in is_proper_pair is also fine. Nice separation of concerns.


305-336: Comprehensive test coverage for R1/R2 orientation.

The multiple orientation checks (FR, RF, TANDEM) offer solid confidence in correctness.


338-354: Overlapping reads scenario well-tested.

Tests confirm opposite directions produce an FR orientation. Good job.


356-363: Edge case for single base alignment.

This ensures minimal overlap is still recognized as FR. Nicely done.


365-388: Unmapped cases properly handled.

Verifies that None is returned when either read is unmapped. This is thorough.


390-413: Partial pair tests for R1 only.

This confirms orientation derivation from a single read plus embedded mate info. Useful coverage.


415-429: Non-existent MC tag leads to exception.

Excellent negative test verifying an exception when MC tag is missing. Good boundary check.


431-446: Ensures mate cigar tag is required for RF orientation.

You covered the raising of an error if MC is missing again. Nicely consistent.


448-459: Proving proper pairing for FR.

Testing the “retrieved” orientation correctness. Nice demonstration of actual real pairing logic.


461-472: Proper pairing check without explicit R2 reference.

Another valid scenario ensuring coverage. Straightforward and clear.


474-494: Invalid orientation test.

Good to see you confirm how incorrectly oriented reads fail is_proper_pair. No concerns here.


496-516: Orientation test with missing second read reference.

Same negative scenario for a single read. Great consistency.


518-524: Distance-based negative test.

Checks that large insert sizes violate “proper pair” status. This completes the coverage nicely.

fgpyo/sam/__init__.py (4)

161-161: New import for type hinting.

Using Collection clarifies the function signatures. Minimal overhead.


269-270: File type inference logic is concise.

Autodetection for read paths is tidy and remains consistent with the rest of the method’s flow.


678-720: Proper pair criteria well-defined.

Your authority checks and orientation gating are in line with standard SAM specs. No issues here.


Line range hint 823-841: Missing test coverage for different query names exception.

A separate test for the ValueError would boost coverage for line 824.
This was pointed out previously as well.

@clintval
Copy link
Member Author

@tfenne if you have time to re-review:

  • The function has more branches, but fetches the mate cigar as a last resort
  • I renamed the classmethod to from_recs(rec1: AlignedSegment, rec2: AlignedSegment | None)
  • I typed orientations as Collection instead of set
  • Instead of r1 and r2 I'm calling the new parameters rec1 and rec2
    • I did this because they don't even technically need to be mates

Copy link
Member

@tfenne tfenne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🔭 Outside diff range comments (1)
fgpyo/sam/__init__.py (1)

Test coverage exists, but not for query name validation

While set_pair_info is well tested for orientation, proper pairing, and mate info, the new query name validation is untested.

  • Add test in tests/fgpyo/sam/test_sam.py to verify ValueError is raised when query names differ
🔗 Analysis chain

Line range hint 823-841: Add test coverage for query name validation.

Line 824 raising ValueError for mismatched query names lacks test coverage.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check existing test coverage
rg -B 2 -A 2 "set_pair_info.*different.*names" tests/fgpyo/sam/test_sam.py

Length of output: 76


Script:

#!/bin/bash
# Search for test cases related to set_pair_info
rg -B 3 -A 3 "set_pair_info" tests/fgpyo/sam/test_sam.py

# Find all test files that might contain relevant tests
fd "test.*\.py" tests/fgpyo/sam/

# Check implementation details of set_pair_info
rg -B 3 -A 3 "def set_pair_info" fgpyo/sam/

Length of output: 5747

🧹 Nitpick comments (1)
fgpyo/sam/__init__.py (1)

682-720: Address TODO for insert size calculation.

The TODO suggests replacing template_length check with isize() function.

Would you like me to help implement the suggested isize() modification to handle optional R2?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc17e0d and bcf321d.

📒 Files selected for processing (1)
  • fgpyo/sam/__init__.py (4 hunks)
🧰 Additional context used
🪛 GitHub Check: codecov/patch
fgpyo/sam/__init__.py

[warning] 824-824: fgpyo/sam/init.py#L824
Added line #L824 was not covered by tests

🔇 Additional comments (4)
fgpyo/sam/__init__.py (4)

161-161: LGTM!

The Collection import is appropriately used for type hinting.


611-622: LGTM!

The enum clearly defines the three possible pair orientations with descriptive docstrings.


678-679: LGTM!

FR as default proper pair orientation aligns with Illumina standards.


624-676: Verify edge cases in from_recs implementation.

The implementation handles both single record and paired record cases correctly. Let's verify the edge cases:

✅ Verification successful

Let me check more test cases to verify the implementation thoroughly.


Edge cases in from_recs are properly handled

Tests confirm correct handling of:

  • Unmapped reads (both/either)
  • Missing R2 with mapped mate
  • Overlapping reads
  • FR/RF orientations
  • Missing mate CIGAR tag
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for tests covering edge cases
rg -A 5 "from_recs.*None" tests/fgpyo/sam/test_sam.py

Length of output: 911


Script:

#!/bin/bash
# Search for more test cases related to pair orientation
rg -A 10 "test_pair_orientation" tests/fgpyo/sam/test_sam.py

Length of output: 3391

@clintval clintval merged commit f9aa849 into main Dec 27, 2024
8 checks passed
@clintval clintval deleted the cv_properly_paired branch December 27, 2024 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants