MRG: Enable merged sigs, sequence range selection in urlsketch #161

Merged Jan 11, 2025 (62 commits)
62 commits
5bf9113
make permits modifiable
bluegenes Dec 6, 2024
9b258bd
add permit option; use BuildCollection directly to avoid clone
bluegenes Dec 10, 2024
d2d90f0
integrate multiselect, etc change from branchwater buildutils work
bluegenes Dec 10, 2024
bd24b26
upd batch info to better reflect method
bluegenes Dec 10, 2024
7ac3650
fix #s
bluegenes Dec 10, 2024
ae1bf4e
try adding xz-utils
bluegenes Dec 10, 2024
bfd3c37
clean up
bluegenes Dec 10, 2024
57dadd7
try again
bluegenes Dec 10, 2024
71d699f
another!
bluegenes Dec 10, 2024
4df4fdd
try pyo3 default
bluegenes Dec 17, 2024
075d798
revert pyo3 version
bluegenes Dec 17, 2024
3879514
v3 cargo lock
bluegenes Dec 17, 2024
0d45ca9
Merge branch 'main' into mod-permits
bluegenes Dec 17, 2024
6190609
add skipmers; update as needed for core changes
bluegenes Dec 17, 2024
927916c
add skipmer tests
bluegenes Dec 17, 2024
85b2212
allow merging urls into single sketch
bluegenes Dec 18, 2024
7b0e8b7
allow specification of a range to sketch within a sequence file
bluegenes Dec 18, 2024
0bfd0a1
add test for range; test with subseqs generated via seqkit, then sket…
bluegenes Dec 19, 2024
681b795
add range to urlsketch download
bluegenes Dec 20, 2024
2fddefb
try adding liblzma-devel
bluegenes Dec 21, 2024
27238bb
take out rust tests for now
bluegenes Dec 21, 2024
c403bac
upd sourmash
bluegenes Dec 21, 2024
04efc39
comment out tests that need to change, just to asses ci
bluegenes Dec 21, 2024
9f6ba34
add back md5sum row counting; clippy fixes
bluegenes Dec 21, 2024
9894545
Merge branch 'main' into mod-permits
bluegenes Dec 21, 2024
c5708df
Merge branch 'mod-permits' into skipmer
bluegenes Dec 21, 2024
7e0d20a
Merge branch 'skipmer' into merged-sigs
bluegenes Dec 21, 2024
33eaf83
Merge branch 'main' into mod-permits
bluegenes Dec 21, 2024
85a6694
Merge branch 'main' into mod-permits
bluegenes Dec 22, 2024
e5407fe
upd smash
bluegenes Dec 23, 2024
f0045bb
fix restart bug in urlsketch!
bluegenes Dec 24, 2024
bc6a499
fix restart
bluegenes Dec 24, 2024
a323229
rustfmt
bluegenes Dec 24, 2024
02f9405
rm unused line
bluegenes Dec 24, 2024
be43b73
changes needed for zipfile fixes, buildcollection change
bluegenes Dec 24, 2024
f9abbbd
Merge branch 'main' into mod-permits
bluegenes Dec 24, 2024
fb70fb5
add n downloads to usage
bluegenes Dec 24, 2024
d0160a2
Merge branch 'mod-permits' into skipmer
bluegenes Dec 24, 2024
189386a
commit cargo lock
bluegenes Dec 24, 2024
c2fe82a
Merge branch 'skipmer' into merged-sigs
bluegenes Dec 24, 2024
319ab3b
upd readme
bluegenes Dec 24, 2024
51a145e
Merge branch 'skipmer' into merged-sigs
bluegenes Dec 24, 2024
00559f6
return empty for any fail
bluegenes Dec 24, 2024
365b7aa
add test keep fasta merged
bluegenes Dec 24, 2024
3d1a6f8
create file subdirs if needed + test
bluegenes Dec 24, 2024
96f54e7
test merge with md5sums
bluegenes Dec 24, 2024
fd60a01
Merge branch 'main' into skipmer
bluegenes Dec 27, 2024
3a038db
add version to pyproject toml
bluegenes Dec 27, 2024
a46cdf9
Merge branch 'skipmer' into merged-sigs
bluegenes Dec 27, 2024
985581f
Merge branch 'main' into merged-sigs
bluegenes Dec 27, 2024
fa17b75
init test merged md5sum fail
bluegenes Dec 27, 2024
767387e
fix fasta append issue
bluegenes Jan 8, 2025
f86e964
update failure reporting for merged files, ranged files
bluegenes Jan 8, 2025
aeaef3b
if checksum file not provided, make FailedDownload directly
bluegenes Jan 8, 2025
bb3b713
for merged files, write to main failures file even if checksum failure
bluegenes Jan 8, 2025
3fc1e11
simplify range specification
bluegenes Jan 8, 2025
5a80d5c
test url,md5sum,range parsing
bluegenes Jan 9, 2025
03e853c
allow empty range vals
bluegenes Jan 9, 2025
00ff77e
test merged/range failure outputs
bluegenes Jan 9, 2025
66364d2
add documentation
bluegenes Jan 9, 2025
2fef892
move FailedDownload,FailedChecksum to utils, add unit tests for these
bluegenes Jan 9, 2025
599487d
Merge branch 'main' into merged-sigs
bluegenes Jan 9, 2025
4 changes: 2 additions & 2 deletions .github/workflows/build-test.yml
@@ -53,8 +53,8 @@ jobs:
 - name: Run cargo fmt
 run: cargo fmt --all -- --check --verbose

-# - name: rust tests
-# run: cargo test --verbose --no-fail-fast
+# - name: rust tests
+# run: cargo test --verbose --no-fail-fast

 - name: build
 shell: bash -l {0}
23 changes: 14 additions & 9 deletions README.md
@@ -153,29 +153,34 @@ options:
 ```

 ## `urlsketch`
-download and sketch directly from a url
+download and sketch directly from URL(s)

 ### Create an input file

 First, create a file, e.g. `acc-url.csv` with identifiers, sketch names, and other required info.
 ```
-accession,name,moltype,md5sum,download_filename,url
-GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz
-GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz
-GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz
+accession,name,moltype,md5sum,download_filename,url,range
+GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,dna,47b9fb20c51f0552b87db5d44d5d4566,GCA_000961135.2_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_genomic.fna.gz,
+GCA_000961135.2,GCA_000961135.2 Candidatus Aramenus sulfurataquae isolate AZ1-454,protein,fb7920fb8f3cf5d6ab9b6b754a5976a4,GCA_000961135.2_protein.urlsketch.faa.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/961/135/GCA_000961135.2_ASM96113v2/GCA_000961135.2_ASM96113v2_protein.faa.gz,
+GCA_000175535.1,GCA_000175535.1 Chlamydia muridarum MopnTet14 (agent of mouse pneumonitis) strain=MopnTet14,dna,a1a8f1c6dc56999c73fe298871c963d1,GCA_000175535.1_genomic.urlsketch.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/175/535/GCA_000175535.1_ASM17553v1/GCA_000175535.1_ASM17553v1_genomic.fna.gz,
 ```
 > Six columns must be present:
 > - `accession` - an accession or unique identifier. Ideally no spaces.
 > - `name` - full name for the sketch.
 > - `moltype` - is the file 'dna' or 'protein'?
-> - `md5sum` - expected md5sum (optional, will be checked after download if provided)
+> - `md5sum` - expected md5sum(s). Optional, will be checked after download if provided.
 > - `download_filename` - filename for FASTA download. Required if `--keep-fastas`, but useful for signatures, too (saved in sig data).
-> - `url` - direct link for the file
+> - `url` - direct link(s) for the file(s)
+> - `range` - if desired, include base pair range(s), e.g 500-10000. This range will be selected from the record(s) and sketched (and/or saved to the download_filename). If there are multiple records in a FASTA file, the range will be applied to each record.
+
+#### Note: Merging Files into the same signature
+As of v0.5.0, `urlsketch` allows specification of multiple URLs to be downloaded and sketched into a single signature. If providing multiple URLs for a single accession/name, you must either provide no `md5sum` or `range`, or the number of entries in these columns must match the number of URLs. In each case, separate the entries with ';' -- e.g. "abc;def" for two md5sums.

 ### Run:

-To run the test accession file at `tests/test-data/acc-url.csv`, run:
+To run after creating file above:
 ```
-sourmash scripts urlsketch tests/test-data/acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
+sourmash scripts urlsketch acc-url.csv -o test-urlsketch.zip -f out_fastas -k --failed test.failed.csv -p dna,k=21,k=31,scaled=1000,abund -p protein,k=10,scaled=100,abund -r 1
 ```

Full Usage:
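
As a rough illustration of the merging rule in the README note above (entries separated by ';', with `md5sum` and `range` either empty or matching the number of URLs), the following Python sketch validates a single input row. It is not the plugin's implementation, which is not shown in this excerpt. The names `split_field`, `parse_range`, and `check_row` are hypothetical, the `example.org` URLs are placeholders, and treating an empty range entry as "no range" is an assumption based on the "allow empty range vals" commit.

```python
from typing import List, Optional, Tuple

Range = Optional[Tuple[int, int]]


def split_field(value: str) -> List[str]:
    """Split a ';'-separated CSV cell; an empty cell yields an empty list."""
    value = value.strip()
    return [part.strip() for part in value.split(";")] if value else []


def parse_range(text: str) -> Range:
    """Parse 'start-end' into two ints. Empty text means no range (assumption)."""
    if not text:
        return None
    start, sep, end = text.partition("-")
    if not sep or not start.isdigit() or not end.isdigit():
        raise ValueError(f"range must look like 'start-end', got {text!r}")
    return int(start), int(end)


def check_row(url: str, md5sum: str = "", range_: str = ""):
    """Return (urls, md5sums, ranges); raise ValueError if the counts disagree."""
    urls = split_field(url)
    if not urls:
        raise ValueError("at least one URL is required")
    md5sums = split_field(md5sum)
    ranges = split_field(range_)
    # Each optional column is either empty or has one entry per URL.
    for column, entries in (("md5sum", md5sums), ("range", ranges)):
        if entries and len(entries) != len(urls):
            raise ValueError(
                f"{column} has {len(entries)} entries but there are {len(urls)} URLs"
            )
    return urls, md5sums, [parse_range(r) for r in ranges]


# Two URLs merged into one signature, with matching md5sum and range entries.
urls, md5sums, ranges = check_row(
    "https://example.org/a.fna.gz;https://example.org/b.fna.gz",
    "abc;def",
    "500-10000;1-200",
)
```

Whether the range coordinates are 0-based or 1-based, and whether the end is inclusive, is not stated in the README text, so the sketch only parses the two numbers and does not slice any sequence.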