Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a class for parsing secondary alignments from XA/XB #206

Closed
wants to merge 8 commits into from

Conversation

clintval
Copy link
Member

I have a project where I am "reconstituting" the secondary alignments from bwa into actual SAM records. I created a helper class to do this for data stored in the XB SAM tag and extended it to also parse the more common XA SAM tag.

Copy link

codecov bot commented Dec 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.48%. Comparing base (bd38898) to head (2b4197c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #206      +/-   ##
==========================================
+ Coverage   91.04%   91.48%   +0.44%     
==========================================
  Files          18       18              
  Lines        2277     2385     +108     
  Branches      336      350      +14     
==========================================
+ Hits         2073     2182     +109     
  Misses        133      133              
+ Partials       71       70       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
fgpyo/sam/__init__.py (1)

Line range hint 1265-1475: Suggestion for advanced read transformation
Handling hard-clipped segments is currently minimal, with the sequence assigned to *. If time allows, consider a more detailed approach that pads/clips sequences to reflect the new CIGAR.

tests/fgpyo/sam/test_secondary_alignment.py (1)

1-444: Potential extra test
Consider adding a specific test case with a 7-field part to confirm the code rejects unexpected field counts and fully covers boundary conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd38898 and 18ba007.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)
🔇 Additional comments (7)
fgpyo/sam/__init__.py (6)

156-180: Docstring clarity is good
This docstring offers clear examples of parsing XA and XB tags. Great job demonstrating real usage scenarios.


187-188: Check Python version requirements
dataclasses and cached_property imply Python 3.8+. Verify your library’s minimum Python version supports these.


212-212: Straightforward import
This import is concise and self-explanatory.


226-228: Constant usage
Defining NO_QUERY_BASES as a named constant is a neat approach for readability and consistency.


Line range hint 1265-1475: Validate read orientation reversal
In many_sam_from_rec, reversing the query sequence occurs only when rec.is_forward and not hit.is_forward. Consider also handling the case where the original record is reversed but the new alignment is forward.


Line range hint 1265-1475: Concise post-init field checks
The negative-value checks in __post_init__ are thorough. Helps avert subtle data inconsistencies.

tests/fgpyo/sam/test_secondary_alignment.py (1)

1-444: Excellent coverage
These tests thoroughly check invalid inputs, parsing correctness, and reconstituted alignments, reflecting robust unit test design.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/fgpyo/sam/test_secondary_alignment.py (4)

14-71: Add edge case tests for maximum values.

Consider adding test cases for maximum allowed values to ensure bounds checking.

# Add these test cases to the parametrize
(
    {
        "reference_name": "chr9",
        "reference_start": sys.maxsize,
        "is_forward": False,
        "cigar": Cigar.from_cigarstring("49M"),
        "edit_distance": 4,
        "alignment_score": 0,
        "mapq": 30,
    },
    "Start value too large",
),

73-133: Add test cases for empty strings and whitespace.

Add test cases for empty reference names and whitespace handling.

# Add these test cases to the parametrize
[
    # Test empty reference name
    ",-104599381,49M,4",
    # Test whitespace in reference name
    "chr 9,-104599381,49M,4",
]

135-151: Add tests for invalid CIGAR strings.

Add test cases for malformed CIGAR strings in the tag.

def test_many_from_tag_raises_for_invalid_cigar() -> None:
    """Test that we raise an exception when CIGAR string is invalid."""
    with pytest.raises(ValueError, match="Invalid CIGAR string"):
        SecondaryAlignment.from_tag_part("chr3,+1,49X,4")

153-183: Add tests for complex CIGAR strings.

Add test cases with more complex CIGAR patterns including skips and padding.

# Add to parametrize
(
    SecondaryAlignment(
        reference_name="chr3",
        reference_start=170653466,
        is_forward=True,
        cigar=Cigar.from_cigarstring("10M2N10M"),
        edit_distance=4,
    ),
    170653466 + 22,
),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18ba007 and aaa9132.

📒 Files selected for processing (1)
  • tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)
🔇 Additional comments (3)
tests/fgpyo/sam/test_secondary_alignment.py (3)

1-12: LGTM! Imports are well-organized.


185-233: LGTM! Tag parsing integration tests are comprehensive.


235-462: LGTM! Record integration tests are thorough and well-structured.

Tests cover all essential SAM record attributes and edge cases.

@fulcrumgenomics fulcrumgenomics deleted a comment from coderabbitai bot Dec 28, 2024
Copy link
Contributor

coderabbitai bot commented Dec 29, 2024

Walkthrough

The pull request introduces a new SecondaryAlignment class in the fgpyo.sam module to handle secondary alignments in SAM/BAM files. This class provides functionality for parsing and managing secondary alignment information from SAM records, specifically focusing on the XA and XB tags.

The implementation includes methods for creating secondary alignment instances from tag parts and records, with validation of input parameters. Key features include parsing alignment details like reference name, start position, orientation, CIGAR string, edit distance, alignment score, and mapping quality. The class also supports converting secondary alignments to SAM records and integrating them into existing template structures.

Additionally, a comprehensive suite of unit tests for the SecondaryAlignment class is added, covering validation, construction from tag parts, and handling of multiple secondary alignments.


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
fgpyo/sam/__init__.py (3)

1312-1330: Add validations for reference_name and mapq.

Add these validations:

  1. Ensure reference_name is not empty or NO_REF_NAME
  2. Ensure mapq doesn't exceed 255 (SAM spec limit)
 def __post_init__(self) -> None:
     """Perform post-initialization validation on this dataclass."""
+    if not self.reference_name or self.reference_name == NO_REF_NAME:
+        raise ValueError(f"Invalid reference name: {self.reference_name}")
     if self.reference_start < 0:
         raise ValueError(f"Start cannot be <0! Found: {self.reference_start}")
     if self.edit_distance < 0:
         raise ValueError(f"Edit distance cannot be <0! Found: {self.edit_distance}")
     if self.alignment_score is not None and self.alignment_score < 0:
         raise ValueError(f"Alignment score cannot be <0! Found: {self.alignment_score}")
     if self.mapq is not None and self.mapq < 0:
         raise ValueError(f"Mapping quality cannot be <0! Found: {self.mapq}")
+    if self.mapq is not None and self.mapq > 255:
+        raise ValueError(f"Mapping quality cannot be >255! Found: {self.mapq}")

1411-1415: Address TODO for hard clips handling.

Current implementation sets bases and quals to "*" when hard clips exist. Consider implementing the suggested padding/clipping logic.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?


1386-1462: Refactor long method for better maintainability.

Split many_sam_from_rec into smaller methods:

  1. Query sequence/quality handling
  2. SAM record creation
  3. Tag management

Example structure:

def _get_query_sequence_and_qualities(self, rec: AlignedSegment, hit: Self) -> tuple[str, Optional[array.array]]:
    """Handle query sequence and qualities based on hard clips and strand."""

def _create_secondary_alignment(self, rec: AlignedSegment, hit: Self) -> AlignedSegment:
    """Create the basic secondary alignment."""

def _set_alignment_tags(self, secondary: AlignedSegment, rec: AlignedSegment, hit: Self) -> None:
    """Set all the SAM tags on the secondary alignment."""
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 914379d and e8a1f75.

📒 Files selected for processing (1)
  • fgpyo/sam/__init__.py (4 hunks)
🔇 Additional comments (1)
fgpyo/sam/__init__.py (1)

1268-1310: Well-structured class declaration and documentation.

The frozen dataclass with comprehensive documentation, examples, and references provides a solid foundation.

Copy link
Contributor

@msto msto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice and I'd love to see it merged in. I've run into a few cases where it'd be helpful to have a SecondaryAlignment class in addition to the existing SupplementaryAlignment class.

The only blocking issue I've identified is the handling of rehydrated secondary and supplementary alignments. I'm not certain that supplementary alignments (from the XA tag) should be handled by this class, but if there are use cases where it would be helpful to do so, we should maintain the distinction (and correct SAM flags) between the two types of alignment when rehydrating. (edit: s/XA/SA/ 🙃 )

I'd also suggest that the SecondaryAlignment and SupplementaryAlignment classes standardize on attribute and method names, where possible. (e.g. SecondaryAlignment.many_from_rec() and SupplementaryAlignment.from_read() have analogous signatures and behaviors - IMO they should have the same name.) I don't have a strong opinion on what the standardized names are, just that they should match between the two classes.

fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Outdated Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
@clintval
Copy link
Member Author

I'd also suggest that the SecondaryAlignment and SupplementaryAlignment classes standardize on attribute and method names, where possible. (e.g. SecondaryAlignment.many_from_rec() and SupplementaryAlignment.from_read() have analogous signatures and behaviors - IMO they should have the same name.) I don't have a strong opinion on what the standardized names are, just that they should match between the two classes.

I agree, @tfenne I favor the new method names since it is more clear when "many" of a class are being built by a classmethod. I also like rec better than read, because I sometimes have SAM records which are not actually "reads" in the pedantic sense. Would you be OK if I added similarly named methods onto SupplementaryAlignment and deprecated the older methods with a @deprecated() annotation?

@clintval clintval requested a review from msto December 30, 2024 15:54
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
tests/fgpyo/sam/test_secondary_alignment.py (1)

452-467: TODO needs implementation.

The TODO comment indicates need for smarter handling of hard clips.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?

fgpyo/sam/__init__.py (2)

1272-1294: Documentation needs format examples.

Add examples showing complete tag values with multiple alignments.

Add these examples to the docstring:

+    Example of a complete `XA` tag:
+    ```
+    XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;
+    ```
+
+    Example of a complete `XB` tag:
+    ```
+    XB:Z:chr9,-104599381,49M,4,0,30;chr3,+170653467,49M,4,0,20;
+    ```

1433-1435: TODO needs implementation.

The TODO comment indicates need for smarter handling of hard clips.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8a1f75 and 2b4197c.

📒 Files selected for processing (2)
  • fgpyo/sam/__init__.py (4 hunks)
  • tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)
🔇 Additional comments (13)
tests/fgpyo/sam/test_secondary_alignment.py (12)

14-70: LGTM! Comprehensive validation test cases.

Test cases thoroughly validate input parameters with clear error messages.


73-132: LGTM! Thorough test coverage for tag parsing.

Test cases cover XA/XB formats, strands, positions, and edge cases.


135-140: LGTM! Clear error handling test.

Test validates error handling for malformed tag parts.


143-149: LGTM! Good error handling for invalid strands.

Test validates error handling for malformed strand indicators.


152-181: LGTM! Good coverage of reference_end calculations.

Test cases handle various CIGAR scenarios including hard clips and indels.


184-206: LGTM! Good test for XA tag parsing.

Test validates parsing of multiple XA alignments with trailing delimiters.


209-231: LGTM! Good test for XB tag parsing.

Test validates parsing of multiple XB alignments with all fields.


234-239: LGTM! Good edge case test.

Test validates handling of unmapped records.


242-253: LGTM! Good error context test.

Test ensures error messages include read name for debugging.


256-285: LGTM! Thorough test of XA to SAM conversion.

Test validates all SAM fields and tags in converted records.


390-449: LGTM! Thorough test of XB to SAM conversion.

Test validates all SAM fields and tags in converted records, including XB-specific fields.


469-483: LGTM! Good template integration test.

Test validates secondary alignment integration into templates.

fgpyo/sam/__init__.py (1)

1268-1496: LGTM! Well-implemented class.

The SecondaryAlignment class correctly handles:

  • XA/XB tag parsing
  • Coordinate conversion
  • Input validation
  • SAM record generation

@tfenne
Copy link
Member

tfenne commented Dec 30, 2024

What would you think about combining SupplementaryAlignment and SecondaryAlignment into a single class? They are nearly identical (as is the parsing code for the various tags). And then e.g. having SupplementaryAlignment being a deprecated sublcass of the common class?

I feel like the hardest part would just be in naming the single class, some suggestions:

  • AuxiliaryAlignment
  • AlignedSegmentLite
  • SamAlignment

@clintval
Copy link
Member Author

What would you think about combining SupplementaryAlignment and SecondaryAlignment into a single class?

I think I like the idea. It would definitely help clean up the two interfaces.

I might need to try a refactor and see what it looks like which shouldn't be too hard. I'll sketch something out with a PR on top of this PR (maybe later in the week though).

@clintval
Copy link
Member Author

clintval commented Jan 6, 2025

Draft alternate PR opened here (branched off this branch):

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on renaming the SupplementaryAlignment class to AlternativeAlignment, then making an empty sub-class of AlternativeAlignment called SupplementaryAlignment and deprecating the latter.

fgpyo/sam/__init__.py Show resolved Hide resolved
)

@classmethod
def many_from_tag(cls, value: str) -> list[Self]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If from_tag_part is made private, we can simplify this to

Suggested change
def many_from_tag(cls, value: str) -> list[Self]:
def from_tag(cls, value: str) -> list[Self]:

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, from_tag reads like it will only build one instance of the object. But we are building many instances. We don't have it codified anywhere yet but I really like the convention of using from_ for class methods that build a single instance and many_from_ for class methods that build a collection of the instance.

fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
Comment on lines +1398 to +1400
secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XA"))))
if rec.has_tag("XB"):
secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XB"))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we add an issue to pysam to add type-specific accessors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like:

  • from_tag_string
  • from_tag_int
  • etc. etc.

Or do you mean a generic function? Generic functions only got support in Python 3.12 so I doubt pysam will be quick to implement it. For example, this will soon be possible, but only if pysam drops old Python support.

def get_tag[T](self, tag: str) -> T:

fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
fgpyo/sam/__init__.py Show resolved Hide resolved
@clintval
Copy link
Member Author

I'm closing this PR since we certainly won't be going in this direction. I moved all your comments over:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants