-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a class for parsing secondary alignments from XA/XB #206
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #206 +/- ##
==========================================
+ Coverage 91.04% 91.48% +0.44%
==========================================
Files 18 18
Lines 2277 2385 +108
Branches 336 350 +14
==========================================
+ Hits 2073 2182 +109
Misses 133 133
+ Partials 71 70 -1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
fgpyo/sam/__init__.py (1)
Line range hint
1265-1475
: Suggestion for advanced read transformation
Handling hard-clipped segments is currently minimal, with the sequence assigned to*
. If time allows, consider a more detailed approach that pads/clips sequences to reflect the new CIGAR.tests/fgpyo/sam/test_secondary_alignment.py (1)
1-444
: Potential extra test
Consider adding a specific test case with a 7-field part to confirm the code rejects unexpected field counts and fully covers boundary conditions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fgpyo/sam/__init__.py
(4 hunks)tests/fgpyo/sam/test_secondary_alignment.py
(1 hunks)
🔇 Additional comments (7)
fgpyo/sam/__init__.py (6)
156-180
: Docstring clarity is good
This docstring offers clear examples of parsing XA
and XB
tags. Great job demonstrating real usage scenarios.
187-188
: Check Python version requirements
dataclasses
and cached_property
imply Python 3.8+. Verify your library’s minimum Python version supports these.
212-212
: Straightforward import
This import is concise and self-explanatory.
226-228
: Constant usage
Defining NO_QUERY_BASES
as a named constant is a neat approach for readability and consistency.
Line range hint 1265-1475
: Validate read orientation reversal
In many_sam_from_rec
, reversing the query sequence occurs only when rec.is_forward and not hit.is_forward
. Consider also handling the case where the original record is reversed but the new alignment is forward.
Line range hint 1265-1475
: Concise post-init field checks
The negative-value checks in __post_init__
are thorough. Helps avert subtle data inconsistencies.
tests/fgpyo/sam/test_secondary_alignment.py (1)
1-444
: Excellent coverage
These tests thoroughly check invalid inputs, parsing correctness, and reconstituted alignments, reflecting robust unit test design.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/fgpyo/sam/test_secondary_alignment.py (4)
14-71
: Add edge case tests for maximum values.Consider adding test cases for maximum allowed values to ensure bounds checking.
# Add these test cases to the parametrize ( { "reference_name": "chr9", "reference_start": sys.maxsize, "is_forward": False, "cigar": Cigar.from_cigarstring("49M"), "edit_distance": 4, "alignment_score": 0, "mapq": 30, }, "Start value too large", ),
73-133
: Add test cases for empty strings and whitespace.Add test cases for empty reference names and whitespace handling.
# Add these test cases to the parametrize [ # Test empty reference name ",-104599381,49M,4", # Test whitespace in reference name "chr 9,-104599381,49M,4", ]
135-151
: Add tests for invalid CIGAR strings.Add test cases for malformed CIGAR strings in the tag.
def test_many_from_tag_raises_for_invalid_cigar() -> None: """Test that we raise an exception when CIGAR string is invalid.""" with pytest.raises(ValueError, match="Invalid CIGAR string"): SecondaryAlignment.from_tag_part("chr3,+1,49X,4")
153-183
: Add tests for complex CIGAR strings.Add test cases with more complex CIGAR patterns including skips and padding.
# Add to parametrize ( SecondaryAlignment( reference_name="chr3", reference_start=170653466, is_forward=True, cigar=Cigar.from_cigarstring("10M2N10M"), edit_distance=4, ), 170653466 + 22, ),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fgpyo/sam/test_secondary_alignment.py
(1 hunks)
🔇 Additional comments (3)
tests/fgpyo/sam/test_secondary_alignment.py (3)
1-12
: LGTM! Imports are well-organized.
185-233
: LGTM! Tag parsing integration tests are comprehensive.
235-462
: LGTM! Record integration tests are thorough and well-structured.
Tests cover all essential SAM record attributes and edge cases.
WalkthroughThe pull request introduces a new The implementation includes methods for creating secondary alignment instances from tag parts and records, with validation of input parameters. Key features include parsing alignment details like reference name, start position, orientation, CIGAR string, edit distance, alignment score, and mapping quality. The class also supports converting secondary alignments to SAM records and integrating them into existing template structures. Additionally, a comprehensive suite of unit tests for the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fgpyo/sam/__init__.py (3)
1312-1330
: Add validations for reference_name and mapq.Add these validations:
- Ensure
reference_name
is not empty orNO_REF_NAME
- Ensure
mapq
doesn't exceed 255 (SAM spec limit)def __post_init__(self) -> None: """Perform post-initialization validation on this dataclass.""" + if not self.reference_name or self.reference_name == NO_REF_NAME: + raise ValueError(f"Invalid reference name: {self.reference_name}") if self.reference_start < 0: raise ValueError(f"Start cannot be <0! Found: {self.reference_start}") if self.edit_distance < 0: raise ValueError(f"Edit distance cannot be <0! Found: {self.edit_distance}") if self.alignment_score is not None and self.alignment_score < 0: raise ValueError(f"Alignment score cannot be <0! Found: {self.alignment_score}") if self.mapq is not None and self.mapq < 0: raise ValueError(f"Mapping quality cannot be <0! Found: {self.mapq}") + if self.mapq is not None and self.mapq > 255: + raise ValueError(f"Mapping quality cannot be >255! Found: {self.mapq}")
1411-1415
: Address TODO for hard clips handling.Current implementation sets bases and quals to "*" when hard clips exist. Consider implementing the suggested padding/clipping logic.
Would you like me to help implement the padding/clipping logic for hard-clipped sequences?
1386-1462
: Refactor long method for better maintainability.Split
many_sam_from_rec
into smaller methods:
- Query sequence/quality handling
- SAM record creation
- Tag management
Example structure:
def _get_query_sequence_and_qualities(self, rec: AlignedSegment, hit: Self) -> tuple[str, Optional[array.array]]: """Handle query sequence and qualities based on hard clips and strand.""" def _create_secondary_alignment(self, rec: AlignedSegment, hit: Self) -> AlignedSegment: """Create the basic secondary alignment.""" def _set_alignment_tags(self, secondary: AlignedSegment, rec: AlignedSegment, hit: Self) -> None: """Set all the SAM tags on the secondary alignment."""
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fgpyo/sam/__init__.py
(4 hunks)
🔇 Additional comments (1)
fgpyo/sam/__init__.py (1)
1268-1310
: Well-structured class declaration and documentation.
The frozen dataclass with comprehensive documentation, examples, and references provides a solid foundation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really nice and I'd love to see it merged in. I've run into a few cases where it'd be helpful to have a SecondaryAlignment
class in addition to the existing SupplementaryAlignment
class.
The only blocking issue I've identified is the handling of rehydrated secondary and supplementary alignments. I'm not certain that supplementary alignments (from the XA tag) should be handled by this class, but if there are use cases where it would be helpful to do so, we should maintain the distinction (and correct SAM flags) between the two types of alignment when rehydrating. (edit: s/XA/SA/
🙃 )
I'd also suggest that the SecondaryAlignment
and SupplementaryAlignment
classes standardize on attribute and method names, where possible. (e.g. SecondaryAlignment.many_from_rec()
and SupplementaryAlignment.from_read()
have analogous signatures and behaviors - IMO they should have the same name.) I don't have a strong opinion on what the standardized names are, just that they should match between the two classes.
I agree, @tfenne I favor the new method names since it is more clear when "many" of a class are being built by a classmethod. I also like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
tests/fgpyo/sam/test_secondary_alignment.py (1)
452-467
: TODO needs implementation.The TODO comment indicates need for smarter handling of hard clips.
Would you like me to help implement the padding/clipping logic for hard-clipped sequences?
fgpyo/sam/__init__.py (2)
1272-1294
: Documentation needs format examples.Add examples showing complete tag values with multiple alignments.
Add these examples to the docstring:
+ Example of a complete `XA` tag: + ``` + XA:Z:chr9,-104599381,49M,4;chr3,+170653467,49M,4; + ``` + + Example of a complete `XB` tag: + ``` + XB:Z:chr9,-104599381,49M,4,0,30;chr3,+170653467,49M,4,0,20; + ```
1433-1435
: TODO needs implementation.The TODO comment indicates need for smarter handling of hard clips.
Would you like me to help implement the padding/clipping logic for hard-clipped sequences?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fgpyo/sam/__init__.py
(4 hunks)tests/fgpyo/sam/test_secondary_alignment.py
(1 hunks)
🔇 Additional comments (13)
tests/fgpyo/sam/test_secondary_alignment.py (12)
14-70
: LGTM! Comprehensive validation test cases.
Test cases thoroughly validate input parameters with clear error messages.
73-132
: LGTM! Thorough test coverage for tag parsing.
Test cases cover XA/XB formats, strands, positions, and edge cases.
135-140
: LGTM! Clear error handling test.
Test validates error handling for malformed tag parts.
143-149
: LGTM! Good error handling for invalid strands.
Test validates error handling for malformed strand indicators.
152-181
: LGTM! Good coverage of reference_end calculations.
Test cases handle various CIGAR scenarios including hard clips and indels.
184-206
: LGTM! Good test for XA tag parsing.
Test validates parsing of multiple XA alignments with trailing delimiters.
209-231
: LGTM! Good test for XB tag parsing.
Test validates parsing of multiple XB alignments with all fields.
234-239
: LGTM! Good edge case test.
Test validates handling of unmapped records.
242-253
: LGTM! Good error context test.
Test ensures error messages include read name for debugging.
256-285
: LGTM! Thorough test of XA to SAM conversion.
Test validates all SAM fields and tags in converted records.
390-449
: LGTM! Thorough test of XB to SAM conversion.
Test validates all SAM fields and tags in converted records, including XB-specific fields.
469-483
: LGTM! Good template integration test.
Test validates secondary alignment integration into templates.
fgpyo/sam/__init__.py (1)
1268-1496
: LGTM! Well-implemented class.
The SecondaryAlignment class correctly handles:
- XA/XB tag parsing
- Coordinate conversion
- Input validation
- SAM record generation
What would you think about combining I feel like the hardest part would just be in naming the single class, some suggestions:
|
I think I like the idea. It would definitely help clean up the two interfaces. I might need to try a refactor and see what it looks like which shouldn't be too hard. I'll sketch something out with a PR on top of this PR (maybe later in the week though). |
Draft alternate PR opened here (branched off this branch): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed on renaming the SupplementaryAlignment
class to AlternativeAlignment
, then making an empty sub-class of AlternativeAlignment
called SupplementaryAlignment
and deprecating the latter.
) | ||
|
||
@classmethod | ||
def many_from_tag(cls, value: str) -> list[Self]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If from_tag_part
is made private, we can simplify this to
def many_from_tag(cls, value: str) -> list[Self]: | |
def from_tag(cls, value: str) -> list[Self]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, from_tag
reads like it will only build one instance of the object. But we are building many instances. We don't have it codified anywhere yet but I really like the convention of using from_
for class methods that build a single instance and many_from_
for class methods that build a collection of the instance.
secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XA")))) | ||
if rec.has_tag("XB"): | ||
secondaries.extend(cls.many_from_tag(cast(str, rec.get_tag("XB")))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we add an issue to pysam to add type-specific accessors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean like:
from_tag_string
from_tag_int
- etc. etc.
Or do you mean a generic function? Generic functions only got support in Python 3.12 so I doubt pysam will be quick to implement it. For example, this will soon be possible, but only if pysam drops old Python support.
def get_tag[T](self, tag: str) -> T:
I'm closing this PR since we certainly won't be going in this direction. I moved all your comments over: |
I have a project where I am "reconstituting" the secondary alignments from
bwa
into actual SAM records. I created a helper class to do this for data stored in theXB
SAM tag and extended it to also parse the more commonXA
SAM tag.