.crai index improvements #137

Open
jkbonfield opened this issue Mar 15, 2016 · 2 comments
@jkbonfield
Contributor

This is just a list of things that could be improved whenever we next revise the format, so that we don't forget any. I'm not suggesting an immediate update, just gathering ideas in one place.

  • Magic number with version string.
  • Add number of reads / bases as columns. This will make very approximate coverage plots trivial as well as improve tools like samtools idxstats so they work on both BAM and CRAM. What else in idxstats needs replicating?
  • A generation UUID. If coupled with an identical UUID in the SAM header then we can use this to spot cases where the CRAM file has been updated without rebuilding the index. (We want to add this same feature to .BAI and .CSI too.)
  • Check the utility of the container size column. I think currently it is the number of remaining bytes after decoding the container header (and perhaps the compression header?). More useful for random slicing would simply be the size of the entire container.
  • Consider whether gzipped text is the right format. We could provide for random access on a compressed index by self-indexing the index, but that's a far larger change.
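
To make the "number of reads" idea concrete, here is a rough Python sketch, not a proposal for the actual layout. It reads the current index as gzipped, tab-separated text with the six integer columns described in the CRAM specification, and treats an optional seventh column as a per-slice read count. That seventh column, and the names `CraiEntry`, `parse_crai_lines`, `read_crai` and `reads_per_reference`, are assumptions made up for illustration only.

```python
import gzip
from collections import defaultdict
from typing import Iterable, Iterator, NamedTuple, Optional


class CraiEntry(NamedTuple):
    ref_id: int            # reference sequence id
    aln_start: int         # alignment start
    aln_span: int          # alignment span
    container_offset: int  # absolute byte offset of the container header
    slice_offset: int      # slice header offset, relative to the end of the container header
    slice_size: int        # slice size in bytes
    n_reads: Optional[int] = None  # HYPOTHETICAL extra column, not in the current format


def parse_crai_lines(lines: Iterable[str]) -> Iterator[CraiEntry]:
    """Parse tab-separated index lines; a seventh column is optional."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = [int(x) for x in line.split("\t")]
        if len(fields) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(fields)}")
        n_reads = fields[6] if len(fields) > 6 else None
        yield CraiEntry(*fields[:6], n_reads)


def read_crai(path: str) -> Iterator[CraiEntry]:
    """Read a gzipped .crai file from disk."""
    with gzip.open(path, "rt") as fh:
        yield from parse_crai_lines(fh)


def reads_per_reference(entries: Iterable[CraiEntry]) -> dict:
    """idxstats-like totals; only possible if the hypothetical column exists."""
    totals = defaultdict(int)
    for e in entries:
        if e.n_reads is not None:
            totals[e.ref_id] += e.n_reads
    return dict(totals)
```

With a column like this, something along the lines of `reads_per_reference(read_crai("example.cram.crai"))` would give per-reference read counts without touching the CRAM file itself (the file name is only a placeholder). Entries without the extra column are skipped rather than counted as zero reads.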
@droazen

droazen commented Mar 24, 2016

@jkbonfield This proposal seems very relevant to the following crai-related bug report in htsjdk: samtools/htsjdk#531

@jkbonfield
Contributor Author

For completeness' sake, and so we don't forget, also consider adding the "missing" meta-information to CRAM indices. Re: pysam-developers/pysam#556.

I say "missing" because at the time of writing CRAM, those extra fields in BAI were non-standard and undocumented anyway.

(NB: No planned timeline for this.)
