Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Name tokenizer codec clarification #802

Open
cmnbroad opened this issue Dec 10, 2024 · 1 comment · May be fixed by #803
Open

Name tokenizer codec clarification #802

cmnbroad opened this issue Dec 10, 2024 · 1 comment · May be fixed by #803
Labels

Comments

@cmnbroad
Copy link
Contributor

I'm trying to finish up Yash's CRAM 3.1 codecs for htsjdk, and I came across an ambiguity in the name tokenizer spec that could use clarification. It currently states "The serialised data stream starts with two unsigned little endian 32-bit integers holding the total size of uncompressed name buffer and the number of read names".

Based on the values I see when inspecting samtools CRAMs/code, the uncompressed name buffer size assumes that the uncompressed names are formatted into a buffer the way htslib formats them, i.e., on decode, it would match the length of the read name data that is reconstructed from the stream(s), PLUS one byte for a separator for each read name, including a terminal separator. I don't think that's stated anywhere in the spec, so if correct, it would be useful to state explicitly.

@cmnbroad cmnbroad added the cram label Dec 10, 2024
@jkbonfield
Copy link
Contributor

Thanks for this. That's a good point and it's worth being explicit given in CRAM they could have been encoded with either a BYTE_ARRAY_LEN having an explicit second data series to mark the ends of each read name, or with a BYTE_ARRAY_STOP where an inline termination symbol is used (like C-strings).

I'll double check the code and make a PR to clarify accordingly.

jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
@jkbonfield jkbonfield added the sam label Jan 7, 2025
@jkbonfield jkbonfield moved this to New items in GA4GH File Formats Jan 7, 2025
@jkbonfield jkbonfield removed the sam label Jan 7, 2025
jkbonfield added a commit to jkbonfield/hts-specs that referenced this issue Jan 7, 2025
This includes all visible read name bytes plus 1 termination byte per
name (e.g. '\0').

Fixes samtools#802
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: New items
Development

Successfully merging a pull request may close this issue.

2 participants