MRG: Enable merged sigs, sequence range selection in urlsketch #161

Merged
merged 62 commits into from
Jan 11, 2025

Conversation

bluegenes
Collaborator

@bluegenes bluegenes commented Dec 18, 2024

Adds the following to urlsketch:

  • merged signatures: when multiple URLs are provided in the url column, separated by ';', download each URL and sketch them together into a single merged signature. If saving the FASTA files, write all data to the single download_filename provided in the input CSV.
  • range: allow sketching of a range within one or more FASTA entries. If providing multiple URLs, the number of ranges must equal the number of URLs. If there is more than one contig/read in the incoming FASTA, the range is applied to all contigs/reads. This matches the behavior of SeqKit subseq. The range is applied both for sketching and for saving the downloaded file(s).
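The pairing rule for merged, ranged entries can be sketched in Python as follows. This is only an illustration, not the plugin's actual parser: the function name `parse_row` is hypothetical, and the `start-end` range syntax is an assumption made for the example.

```python
def parse_row(url_field, range_field=""):
    """Split ';'-separated URLs and ranges from one CSV row and pair them up.

    Assumption: each range is written as 'start-end' (1-based, inclusive).
    """
    urls = [u.strip() for u in url_field.split(";") if u.strip()]
    if not urls:
        raise ValueError("no URLs provided")

    # No range column value: every URL is sketched in full.
    if not range_field.strip():
        return [(u, None) for u in urls]

    ranges = []
    for item in range_field.split(";"):
        start_text, end_text = item.strip().split("-")
        start, end = int(start_text), int(end_text)
        if start < 1 or end < start:
            raise ValueError(f"invalid range: {item!r}")
        ranges.append((start, end))

    # The number of ranges must equal the number of URLs.
    if len(ranges) != len(urls):
        raise ValueError(
            f"got {len(urls)} URLs but {len(ranges)} ranges; counts must match"
        )
    return list(zip(urls, ranges))
```

Rejecting a count mismatch up front means a malformed row fails before any download starts, which keeps the failure report tied to the row rather than to a partially written file.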

To generate the FASTA range test files:
seqkit subseq --region 1:50000 GCA_000175535.1_ASM17553v1_genomic.fna -o GCA_000175535.1_ASM17553v1_genomic.1-50000.fna

seqkit subseq --region 50000:100000 GCA_000175535.1_ASM17553v1_genomic.fna -o GCA_000175535.1_ASM17553v1_genomic.50000-100000.fna
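For reference, `seqkit subseq --region` uses 1-based, inclusive coordinates, so the two test files above share position 50000. A minimal Python sketch of the same slicing, applied to every record, is shown here; the helper names `read_fasta` and `subseq_all` are hypothetical and this is not the code used by urlsketch.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subseq_all(records, start, end):
    """Apply a 1-based, inclusive [start, end] range to every record."""
    for header, seq in records:
        # Convert to Python's 0-based, end-exclusive slice.
        yield header, seq[start - 1:end]


fasta = ">contig1\nACGTACGTAC\n>contig2\nTTTTGGGGCC\n"
trimmed = list(subseq_all(read_fasta(fasta), 3, 6))
```

With the toy input above, the same range is cut from both contigs, which mirrors the "applied to all contigs/reads" behavior described in the PR.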

to do:

  • add docs
  • update failure handling so that failure files list all md5sums/URLs/ranges when multiple are provided
  • bug to handle: now that FASTA files are opened in append mode, we append to the file if it already exists --> make sure we can't append to a file built during a previous run of urlsketch
  • additional testing:
    • parse multiple md5sum
    • test parsing multiple ranges
    • test failure reporting for merged, ranged entries
    • test failure conditions for multi-url/md5sum/range parsing
    • test --keep-fasta for merged
    • bug to handle: if passing --keep-fasta and a directory, we check for and/or create that directory, but if the download filename includes a subdirectory path that doesn't exist within that directory, we currently get an error. Fix by creating the path as needed; add a test
  • remove parentheses from range specification
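The two file-handling bugs in the list above (stale files under append mode, and missing subdirectories in the download filename) could be addressed along these lines. This is a hedged Python sketch of the general idea, not the fix that was merged; `prepare_fasta_path` and `append_records` are hypothetical names.

```python
import tempfile
from pathlib import Path


def prepare_fasta_path(out_dir, download_filename):
    """Return a fresh output path, creating any missing parent directories."""
    path = Path(out_dir) / download_filename
    # Build out nested directories named in the filename, if any.
    path.parent.mkdir(parents=True, exist_ok=True)
    # Remove a file left over from a previous run so appends start clean.
    if path.exists():
        path.unlink()
    return path


def append_records(path, text):
    """Append FASTA text; called once per downloaded URL in a merged entry."""
    with open(path, "a") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    for _run in range(2):  # simulate two separate runs
        out = prepare_fasta_path(tmp, "nested/dir/merged.fna")
        append_records(out, ">a\nACGT\n")
        append_records(out, ">b\nTTTT\n")
    final_text = out.read_text()
```

Clearing the file once per entry, before the first append, keeps append mode usable for merging several downloads while preventing data from an earlier run from leaking into the new file.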

@bluegenes bluegenes changed the base branch from main to mod-permits December 21, 2024 18:08
@bluegenes bluegenes changed the base branch from mod-permits to skipmer December 21, 2024 22:28
Base automatically changed from skipmer to main December 27, 2024 02:09
@ctb
Contributor

ctb commented Dec 28, 2024

curious - what's the use case for ranges?

@bluegenes
Collaborator Author

curious - what's the use case for ranges?

Some of the virus genomes in the ICTV VMR are provided as ranges of larger genomes.

Example: the Salmonella phage Fels2 genomic reference is specified as AE006468 (2844298.2877981)

The Fels2 sequence lies within the Salmonella enterica genome (https://www.ncbi.nlm.nih.gov/nuccore/AE006468.2/).

here is the online feature description for Fels2 in that genome:

2844431..2879237
/organism="Salmonella virus Fels2"
/mol_type="genomic DNA"
/db_xref="taxon:194701"

The range in the GenBank record doesn't exactly match the range given in the ICTV VMR, but I assume the VMR range was chosen deliberately.
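As an illustration of how such an entry might be split into an accession and a range, here is a small Python sketch. The pattern is inferred only from the single example quoted above, `AE006468 (2844298.2877981)`, so the format handling is an assumption; `parse_vmr_accession` is a hypothetical helper, not part of urlsketch.

```python
import re

# Assumed layout: ACCESSION, optionally followed by "(start.end)".
VMR_ENTRY = re.compile(
    r"^\s*(?P<acc>[A-Za-z0-9_.]+)\s*"
    r"(?:\(\s*(?P<start>\d+)\.(?P<end>\d+)\s*\))?\s*$"
)


def parse_vmr_accession(entry):
    """Return (accession, (start, end)) or (accession, None) if no range."""
    match = VMR_ENTRY.match(entry)
    if match is None:
        raise ValueError(f"unrecognized entry: {entry!r}")
    if match.group("start") is None:
        return match.group("acc"), None
    start, end = int(match.group("start")), int(match.group("end"))
    if start < 1 or end < start:
        raise ValueError(f"invalid range in entry: {entry!r}")
    return match.group("acc"), (start, end)


accession, span = parse_vmr_accession("AE006468 (2844298.2877981)")
```

Stripping the parentheses at parse time leaves a plain pair of integers that can be written into the range column in whatever syntax the tool expects.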

@ctb
Contributor

ctb commented Dec 29, 2024

cool, thx!

@bluegenes bluegenes changed the title from "WIP: Enable merged sigs, sequence range selection in urlsketch" to "MRG: Enable merged sigs, sequence range selection in urlsketch" Jan 9, 2025
@bluegenes
Collaborator Author

@ctb - not sure if you want to look at this again, a lot of testing + a few bugfixes happened since you reviewed

@ctb
Contributor

ctb commented Jan 11, 2025

@ctb - not sure if you want to look at this again, a lot of testing + a few bugfixes happened since you reviewed

all good :)

@bluegenes bluegenes merged commit f3b063a into main Jan 11, 2025
1 check passed
@bluegenes bluegenes deleted the merged-sigs branch January 11, 2025 20:00
2 participants