Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SamBuilder.add_single() cannot create a record without sequence or base qualities #211

Open
msto opened this issue Jan 13, 2025 · 5 comments

Comments

@msto
Copy link
Contributor

msto commented Jan 13, 2025

Summary

When attempting to construct a test record with no query sequence or qualities, I found that SamBuilder.add_single() constructs a record with the default base/quality values instead of a record with None in these attributes.

In [1]: from fgpyo.sam.builder import SamBuilder

In [2]: builder = SamBuilder()

In [3]: record = builder.add_single(bases=None, quals=None)

In [4]: record.query_sequence
Out[4]: 'AGGGCGACCTTCGATTCGGATGTGACATTTCATTACATTACGCTCAGGACTGCGAACGAAAGATTAAGAATGCTTAACCCGGTACCTAACCCATCTGATT'

In [5]: record.query_qualities
Out[5]: array('B', [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30])

Suggested solution

I think this originates from using None as a sentinel value for "use the defaults" in this method.

Instead, we could:

  • Change the default value for these parameters to the actual default query sequence and qualities, rather than None
  • When None is specified as a value for these parameters, return a record where query_sequence and query_qualities are None
@clintval
Copy link
Member

@msto I've been confused by this too. Although the SAM spec has a "null value" for these attributes (*), pysam additionally allows setting them as None. I haven't yet found any documentation indicating if * or None mean different things.

Screenshot 2025-01-13 at 5 59 59 AM

@msto
Copy link
Contributor Author

msto commented Jan 13, 2025

I have never seen pysam create an AlignedSegment with the string literal "*" in either the query_sequence or query_qualities attribute. Additionally, the query_qualities attribute is typed as array | None, so I would not expect to see a string literal in that attribute.

Though not well documented, my understanding is that pysam represents missing base qualities or sequences as None, rather than the ASCII character *.

I can verify this behavior with an example read that lacks both sequence and qualities:

In [1]: from pysam import AlignmentFile

In [2]: sam = AlignmentFile("51510.sam")

In [3]: record = next(sam)

In [4]: record.query_qualities is None
Out[4]: True

There are a number of differences between the in-memory representation of an alignment record and the ASCII rendering of a record in a SAM or BAM file, and I believe this is one of them.

(FWIW I find this to be sensible behavior, otherwise there would be an excessive amount of boilerplate and easy-to-miss bugs, e.g. len(record.query_sequence) would become unsafe)

@msto
Copy link
Contributor Author

msto commented Jan 13, 2025

Here's the code in question.

Query sequence

https://github.com/pysam-developers/pysam/blob/0eae5be21ac3ab3ac7aa770a3931e2977e37b909/pysam/libcalignedsegment.pyx#L544-L545

If the length of the query sequence on the corresponding struct is zero, query_sequence is set to None

Query qualities

https://github.com/pysam-developers/pysam/blob/0eae5be21ac3ab3ac7aa770a3931e2977e37b909/pysam/libcalignedsegment.pyx#L567-L569

As far as I can tell (I'll admit to having some trouble tracing everything through htslib), if the first element of the query qualities array is the byte 0xFF, this is a sentinel value indicating missing query qualities, and query_qualities is also set to None

https://github.com/pysam-developers/pysam/blob/0eae5be21ac3ab3ac7aa770a3931e2977e37b909/htslib/htslib/sam.h#L302-L306

@msto
Copy link
Contributor Author

msto commented Jan 13, 2025

And to the question at hand, I like the suggestion that instead of None we could pass "*" to the builder to get the desired outcome. But that doesn't appear to be supported at the moment

In [4]: record = builder.add_single(bases="*")

In [5]: record.query_sequence is None
Out[5]: False

In [6]: record.query_sequence
Out[6]: '*'

In [7]: record.query_qualities
Out[7]: array('B', [30])

@clintval
Copy link
Member

clintval commented Jan 13, 2025

^ Writing a array('B', [9]) to a pysam alignment that is then written to the SAM/BAM file will roundtrip back as a None. I've accidentally blended the concepts in my in-progress PR here (which I now need to fix):

fgpyo/fgpyo/sam/__init__.py

Lines 1468 to 1469 in 7055cb5

query_sequence = NO_QUERY_BASES
query_qualities = None

Personally, in the future I'll probably be defensive when accessing these attributes and probably do both a None-check and then a *-check (in case someone manually constructs a pysam alignment with a *). For example, this commit: e271e6c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants