Add a class for parsing secondary alignments from XA/XB #206

clintval · 2024-12-28T20:28:51Z

I have a project where I am "reconstituting" the secondary alignments from bwa into actual SAM records. I created a helper class to do this for data stored in the XB SAM tag and extended it to also parse the more common XA SAM tag.

codecov · 2024-12-28T20:29:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.48%. Comparing base (bd38898) to head (2b4197c).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #206      +/-   ##
==========================================
+ Coverage   91.04%   91.48%   +0.44%     
==========================================
  Files          18       18              
  Lines        2277     2385     +108     
  Branches      336      350      +14     
==========================================
+ Hits         2073     2182     +109     
  Misses        133      133              
+ Partials       71       70       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

fgpyo/sam/__init__.py (1)

Line range hint 1265-1475: Suggestion for advanced read transformation
Handling hard-clipped segments is currently minimal, with the sequence assigned to *. If time allows, consider a more detailed approach that pads/clips sequences to reflect the new CIGAR.

tests/fgpyo/sam/test_secondary_alignment.py (1)

1-444: Potential extra test
Consider adding a specific test case with a 7-field part to confirm the code rejects unexpected field counts and fully covers boundary conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd38898 and 18ba007.

📒 Files selected for processing (2)

fgpyo/sam/__init__.py (4 hunks)
tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)

🔇 Additional comments (7)

fgpyo/sam/__init__.py (6)

156-180: Docstring clarity is good
This docstring offers clear examples of parsing XA and XB tags. Great job demonstrating real usage scenarios.

187-188: Check Python version requirements
dataclasses and cached_property imply Python 3.8+. Verify your library’s minimum Python version supports these.

212-212: Straightforward import
This import is concise and self-explanatory.

226-228: Constant usage
Defining NO_QUERY_BASES as a named constant is a neat approach for readability and consistency.

Line range hint 1265-1475: Validate read orientation reversal
In many_sam_from_rec, reversing the query sequence occurs only when rec.is_forward and not hit.is_forward. Consider also handling the case where the original record is reversed but the new alignment is forward.

Line range hint 1265-1475: Concise post-init field checks
The negative-value checks in __post_init__ are thorough. Helps avert subtle data inconsistencies.

tests/fgpyo/sam/test_secondary_alignment.py (1)

1-444: Excellent coverage
These tests thoroughly check invalid inputs, parsing correctness, and reconstituted alignments, reflecting robust unit test design.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

tests/fgpyo/sam/test_secondary_alignment.py (4)

14-71: Add edge case tests for maximum values.

Consider adding test cases for maximum allowed values to ensure bounds checking.

# Add these test cases to the parametrize
(
    {
        "reference_name": "chr9",
        "reference_start": sys.maxsize,
        "is_forward": False,
        "cigar": Cigar.from_cigarstring("49M"),
        "edit_distance": 4,
        "alignment_score": 0,
        "mapq": 30,
    },
    "Start value too large",
),

73-133: Add test cases for empty strings and whitespace.

Add test cases for empty reference names and whitespace handling.

# Add these test cases to the parametrize
[
    # Test empty reference name
    ",-104599381,49M,4",
    # Test whitespace in reference name
    "chr 9,-104599381,49M,4",
]

135-151: Add tests for invalid CIGAR strings.

Add test cases for malformed CIGAR strings in the tag.

def test_many_from_tag_raises_for_invalid_cigar() -> None:
    """Test that we raise an exception when CIGAR string is invalid."""
    with pytest.raises(ValueError, match="Invalid CIGAR string"):
        SecondaryAlignment.from_tag_part("chr3,+1,49X,4")

153-183: Add tests for complex CIGAR strings.

Add test cases with more complex CIGAR patterns including skips and padding.

# Add to parametrize
(
    SecondaryAlignment(
        reference_name="chr3",
        reference_start=170653466,
        is_forward=True,
        cigar=Cigar.from_cigarstring("10M2N10M"),
        edit_distance=4,
    ),
    170653466 + 22,
),

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 18ba007 and aaa9132.

📒 Files selected for processing (1)

tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)

🔇 Additional comments (3)

tests/fgpyo/sam/test_secondary_alignment.py (3)

1-12: LGTM! Imports are well-organized.

185-233: LGTM! Tag parsing integration tests are comprehensive.

235-462: LGTM! Record integration tests are thorough and well-structured.

Tests cover all essential SAM record attributes and edge cases.

coderabbitai · 2024-12-29T15:39:20Z

Walkthrough

The pull request introduces a new SecondaryAlignment class in the fgpyo.sam module to handle secondary alignments in SAM/BAM files. This class provides functionality for parsing and managing secondary alignment information from SAM records, specifically focusing on the XA and XB tags.

The implementation includes methods for creating secondary alignment instances from tag parts and records, with validation of input parameters. Key features include parsing alignment details like reference name, start position, orientation, CIGAR string, edit distance, alignment score, and mapping quality. The class also supports converting secondary alignments to SAM records and integrating them into existing template structures.

Additionally, a comprehensive suite of unit tests for the SecondaryAlignment class is added, covering validation, construction from tag parts, and handling of multiple secondary alignments.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

fgpyo/sam/__init__.py (3)

1312-1330: Add validations for reference_name and mapq.

Add these validations:

Ensure reference_name is not empty or NO_REF_NAME
Ensure mapq doesn't exceed 255 (SAM spec limit)

 def __post_init__(self) -> None:
     """Perform post-initialization validation on this dataclass."""
+    if not self.reference_name or self.reference_name == NO_REF_NAME:
+        raise ValueError(f"Invalid reference name: {self.reference_name}")
     if self.reference_start < 0:
         raise ValueError(f"Start cannot be <0! Found: {self.reference_start}")
     if self.edit_distance < 0:
         raise ValueError(f"Edit distance cannot be <0! Found: {self.edit_distance}")
     if self.alignment_score is not None and self.alignment_score < 0:
         raise ValueError(f"Alignment score cannot be <0! Found: {self.alignment_score}")
     if self.mapq is not None and self.mapq < 0:
         raise ValueError(f"Mapping quality cannot be <0! Found: {self.mapq}")
+    if self.mapq is not None and self.mapq > 255:
+        raise ValueError(f"Mapping quality cannot be >255! Found: {self.mapq}")

1411-1415: Address TODO for hard clips handling.

Current implementation sets bases and quals to "*" when hard clips exist. Consider implementing the suggested padding/clipping logic.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?

1386-1462: Refactor long method for better maintainability.

Split many_sam_from_rec into smaller methods:

Query sequence/quality handling
SAM record creation
Tag management

Example structure:

def _get_query_sequence_and_qualities(self, rec: AlignedSegment, hit: Self) -> tuple[str, Optional[array.array]]:
    """Handle query sequence and qualities based on hard clips and strand."""

def _create_secondary_alignment(self, rec: AlignedSegment, hit: Self) -> AlignedSegment:
    """Create the basic secondary alignment."""

def _set_alignment_tags(self, secondary: AlignedSegment, rec: AlignedSegment, hit: Self) -> None:
    """Set all the SAM tags on the secondary alignment."""

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 914379d and e8a1f75.

📒 Files selected for processing (1)

fgpyo/sam/__init__.py (4 hunks)

🔇 Additional comments (1)

fgpyo/sam/__init__.py (1)

1268-1310: Well-structured class declaration and documentation.

The frozen dataclass with comprehensive documentation, examples, and references provides a solid foundation.

msto

This is really nice and I'd love to see it merged in. I've run into a few cases where it'd be helpful to have a SecondaryAlignment class in addition to the existing SupplementaryAlignment class.

The only blocking issue I've identified is the handling of rehydrated secondary and supplementary alignments. I'm not certain that supplementary alignments (from the XA tag) should be handled by this class, but if there are use cases where it would be helpful to do so, we should maintain the distinction (and correct SAM flags) between the two types of alignment when rehydrating. (edit: s/XA/SA/ 🙃 )

I'd also suggest that the SecondaryAlignment and SupplementaryAlignment classes standardize on attribute and method names, where possible. (e.g. SecondaryAlignment.many_from_rec() and SupplementaryAlignment.from_read() have analogous signatures and behaviors - IMO they should have the same name.) I don't have a strong opinion on what the standardized names are, just that they should match between the two classes.

fgpyo/sam/__init__.py

clintval · 2024-12-30T15:54:44Z

I'd also suggest that the SecondaryAlignment and SupplementaryAlignment classes standardize on attribute and method names, where possible. (e.g. SecondaryAlignment.many_from_rec() and SupplementaryAlignment.from_read() have analogous signatures and behaviors - IMO they should have the same name.) I don't have a strong opinion on what the standardized names are, just that they should match between the two classes.

I agree, @tfenne I favor the new method names since it is more clear when "many" of a class are being built by a classmethod. I also like rec better than read, because I sometimes have SAM records which are not actually "reads" in the pedantic sense. Would you be OK if I added similarly named methods onto SupplementaryAlignment and deprecated the older methods with a @deprecated() annotation?

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

tests/fgpyo/sam/test_secondary_alignment.py (1)

452-467: TODO needs implementation.

The TODO comment indicates need for smarter handling of hard clips.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?
fgpyo/sam/__init__.py (2)
1272-1294: Documentation needs format examples.

Add examples showing complete tag values with multiple alignments.

Add these examples to the docstring:
+    Example of a complete `XA` tag:
+    ```
+    XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4;
+    ```
+
+    Example of a complete `XB` tag:
+    ```
+    XB:Z:chr9,-104599381,49M,4,0,30;chr3,+170653467,49M,4,0,20;
+    ```
1433-1435: TODO needs implementation.

The TODO comment indicates need for smarter handling of hard clips.

Would you like me to help implement the padding/clipping logic for hard-clipped sequences?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8a1f75 and 2b4197c.

📒 Files selected for processing (2)

fgpyo/sam/__init__.py (4 hunks)
tests/fgpyo/sam/test_secondary_alignment.py (1 hunks)

🔇 Additional comments (13)

tests/fgpyo/sam/test_secondary_alignment.py (12)

14-70: LGTM! Comprehensive validation test cases.

Test cases thoroughly validate input parameters with clear error messages.

73-132: LGTM! Thorough test coverage for tag parsing.

Test cases cover XA/XB formats, strands, positions, and edge cases.

135-140: LGTM! Clear error handling test.

Test validates error handling for malformed tag parts.

143-149: LGTM! Good error handling for invalid strands.

Test validates error handling for malformed strand indicators.

152-181: LGTM! Good coverage of reference_end calculations.

Test cases handle various CIGAR scenarios including hard clips and indels.

184-206: LGTM! Good test for XA tag parsing.

Test validates parsing of multiple XA alignments with trailing delimiters.

209-231: LGTM! Good test for XB tag parsing.

Test validates parsing of multiple XB alignments with all fields.

234-239: LGTM! Good edge case test.

Test validates handling of unmapped records.

242-253: LGTM! Good error context test.

Test ensures error messages include read name for debugging.

256-285: LGTM! Thorough test of XA to SAM conversion.

Test validates all SAM fields and tags in converted records.

390-449: LGTM! Thorough test of XB to SAM conversion.

Test validates all SAM fields and tags in converted records, including XB-specific fields.

469-483: LGTM! Good template integration test.

Test validates secondary alignment integration into templates.

fgpyo/sam/__init__.py (1)

1268-1496: LGTM! Well-implemented class.

The SecondaryAlignment class correctly handles:

XA/XB tag parsing
Coordinate conversion
Input validation
SAM record generation

tfenne · 2024-12-30T16:48:47Z

What would you think about combining SupplementaryAlignment and SecondaryAlignment into a single class? They are nearly identical (as is the parsing code for the various tags). And then e.g. having SupplementaryAlignment being a deprecated sublcass of the common class?

I feel like the hardest part would just be in naming the single class, some suggestions:

AuxiliaryAlignment
AlignedSegmentLite
SamAlignment

clintval · 2024-12-30T16:55:12Z

What would you think about combining SupplementaryAlignment and SecondaryAlignment into a single class?

I think I like the idea. It would definitely help clean up the two interfaces.

I might need to try a refactor and see what it looks like which shouldn't be too hard. I'll sketch something out with a PR on top of this PR (maybe later in the week though).

clintval · 2025-01-06T17:08:10Z

Draft alternate PR opened here (branched off this branch):

DRAFT of AuxAlignment for parsing secondary/supplementary alignments #208

nh13

Agreed on renaming the SupplementaryAlignment class to AlternativeAlignment, then making an empty sub-class of AlternativeAlignment called SupplementaryAlignment and deprecating the latter.

fgpyo/sam/__init__.py

nh13 · 2025-01-10T03:55:22Z

fgpyo/sam/__init__.py

+        )
+
+    @classmethod
+    def many_from_tag(cls, value: str) -> list[Self]:


If from_tag_part is made private, we can simplify this to

Suggested change

def many_from_tag(cls, value: str) -> list[Self]:

def from_tag(cls, value: str) -> list[Self]:

To me, from_tag reads like it will only build one instance of the object. But we are building many instances. We don't have it codified anywhere yet but I really like the convention of using from_ for class methods that build a single instance and many_from_ for class methods that build a collection of the instance.

fgpyo/sam/__init__.py

nh13 · 2025-01-10T03:57:46Z

fgpyo/sam/__init__.py

+                secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XA"))))
+            if rec.has_tag("XB"):
+                secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XB"))))


Perhaps we add an issue to pysam to add type-specific accessors?

Do you mean like:

from_tag_string

from_tag_int

etc. etc.

Or do you mean a generic function? Generic functions only got support in Python 3.12 so I doubt pysam will be quick to implement it. For example, this will soon be possible, but only if pysam drops old Python support.

def get_tag[T](self, tag: str) -> T:

fgpyo/sam/__init__.py

clintval · 2025-01-10T14:34:19Z

I'm closing this PR since we certainly won't be going in this direction. I moved all your comments over:

DRAFT of AuxAlignment for parsing secondary/supplementary alignments #208 (comment)

Add a class for parsing secondary alignments from XA/XB

18ba007

clintval assigned nh13 and tfenne Dec 28, 2024

clintval requested review from nh13 and tfenne as code owners December 28, 2024 20:28

coderabbitai bot reviewed Dec 28, 2024

View reviewed changes

coderabbitai bot approved these changes Dec 28, 2024

View reviewed changes

chore: write last few unit tests

aaa9132

coderabbitai bot reviewed Dec 28, 2024

View reviewed changes

clintval added 2 commits December 28, 2024 15:42

chore: write last few unit tests

53130cf

docs: fix small docs typo

619b147

fulcrumgenomics deleted a comment from coderabbitai bot Dec 28, 2024

chore: tiny fixups

914379d

chore: type using self

e8a1f75

coderabbitai bot reviewed Dec 29, 2024

View reviewed changes

msto requested changes Dec 30, 2024

View reviewed changes

clintval requested a review from msto December 30, 2024 15:54

chore: fixup from review

5cb47f5

msto approved these changes Dec 30, 2024

View reviewed changes

chore: add one more test

2b4197c

coderabbitai bot reviewed Dec 30, 2024

View reviewed changes

clintval assigned clintval and unassigned nh13 and tfenne Dec 30, 2024

clintval mentioned this pull request Dec 31, 2024

DRAFT of AuxAlignment for parsing secondary/supplementary alignments #208

Closed

nh13 reviewed Jan 10, 2025

View reviewed changes

clintval closed this Jan 10, 2025

clintval mentioned this pull request Jan 10, 2025

Add AuxAlignment for parsing secondary/supplementary alignments in SAM tags #209

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a class for parsing secondary alignments from XA/XB #206

Add a class for parsing secondary alignments from XA/XB #206

clintval commented Dec 28, 2024

codecov bot commented Dec 28, 2024 •

edited

Loading

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot commented Dec 29, 2024 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

msto left a comment •

edited

Loading

clintval commented Dec 30, 2024

coderabbitai bot left a comment

tfenne commented Dec 30, 2024

clintval commented Dec 30, 2024

clintval commented Jan 6, 2025

nh13 left a comment

nh13 Jan 10, 2025

clintval Jan 10, 2025

nh13 Jan 10, 2025

clintval Jan 10, 2025

clintval commented Jan 10, 2025

	def many_from_tag(cls, value: str) -> list[Self]:
	def from_tag(cls, value: str) -> list[Self]:

Add a class for parsing secondary alignments from XA/XB #206

Add a class for parsing secondary alignments from XA/XB #206

Conversation

clintval commented Dec 28, 2024

codecov bot commented Dec 28, 2024 • edited Loading

Codecov Report

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot commented Dec 29, 2024 • edited Loading

Walkthrough

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

msto left a comment • edited Loading

Choose a reason for hiding this comment

clintval commented Dec 30, 2024

coderabbitai bot left a comment

Choose a reason for hiding this comment

tfenne commented Dec 30, 2024

clintval commented Dec 30, 2024

clintval commented Jan 6, 2025

nh13 left a comment

Choose a reason for hiding this comment

nh13 Jan 10, 2025

Choose a reason for hiding this comment

clintval Jan 10, 2025

Choose a reason for hiding this comment

nh13 Jan 10, 2025

Choose a reason for hiding this comment

clintval Jan 10, 2025

Choose a reason for hiding this comment

clintval commented Jan 10, 2025

codecov bot commented Dec 28, 2024 •

edited

Loading

coderabbitai bot commented Dec 29, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

msto left a comment •

edited

Loading