Commit

Merge branch 'main' into try-fastagather
bluegenes authored Jan 17, 2025
2 parents 0c5a3d7 + cc09db8 commit 7cf8698
Showing 59 changed files with 5,145 additions and 416 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -80,3 +80,7 @@ docs/_build/

# pixi
.pixi/

*.csv
*.zip
*.rocksdb
51 changes: 26 additions & 25 deletions Cargo.lock

18 changes: 9 additions & 9 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "sourmash_plugin_branchwater"
version = "0.9.12"
version = "0.9.14-dev"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
@@ -9,22 +9,22 @@ name = "sourmash_plugin_branchwater"
crate-type = ["cdylib"]

[dependencies]
pyo3 = { version = "0.23.3", features = ["extension-module", "anyhow"] }
pyo3 = { version = "0.23.4", features = ["extension-module", "anyhow"] }
rayon = "1.10.0"
serde = { version = "1.0.216", features = ["derive"] }
serde = { version = "1.0.217", features = ["derive"] }
sourmash = { version = "0.18.0", features = ["branchwater"] }
serde_json = "1.0.134"
serde_json = "1.0.135"
niffler = "2.4.0"
log = "0.4.22"
env_logger = { version = "0.11.5" }
env_logger = { version = "0.11.6" }
simple-error = "0.3.1"
anyhow = "1.0.94"
anyhow = "1.0.95"
zip = { version = "2.0", default-features = false }
tempfile = "3.14"
tempfile = "3.15"
needletail = "0.5.1"
csv = "1.3.1"
camino = "1.1.9"
glob = "0.3.1"
glob = "0.3.2"
rustworkx-core = "0.15.1"
streaming-stats = "0.2.3"
rust_decimal = { version = "1.36.0", features = ["maths"] }
@@ -35,7 +35,7 @@ getset = "0.1"
assert_cmd = "2.0.16"
assert_matches = "1.5.0"
predicates = "3.1.3"
tempfile = "3.14.0"
tempfile = "3.15.0"

[profile.release]
#target-cpu=native
2 changes: 2 additions & 0 deletions README.md
@@ -5,6 +5,8 @@

tl;dr Do faster and lower-memory sourmash functions via this plugin.

<p align="center"><img src="https://raw.githubusercontent.com/sourmash-bio/sourmash_plugin_branchwater/main/doc/_static/logo.png" height="256" /></p>

## Details

[sourmash](https://sourmash.readthedocs.io/en/latest/) is a
44 changes: 21 additions & 23 deletions doc/README.md
@@ -1,9 +1,11 @@
# The branchwater plugin for sourmash

<p align="center"><img src="https://raw.githubusercontent.com/sourmash-bio/sourmash_plugin_branchwater/main/doc/_static/logo.png" height="256" /></p>

| command | functionality | docs |
| -------- | -------- | -------- |
| `manysketch` | Rapidly build sketches for many input files | [link](#Running-manysketch) |
| `singlesketch` | Sketch a single sequence file | [link](#Running-singlesketch)
| `singlesketch` | Sketch a single sample | [link](#Running-singlesketch)
| `fastgather` | Multithreaded `gather` of **one** metagenome against a database| [link](#Running-fastgather)
| `fastmultigather` | Multithreaded `gather` of **multiple** metagenomes against a database | [link](#Running-fastmultigather)
| `manysearch` | Multithreaded containment search for many queries in many large metagenomes | [link](#Running-manysearch)
@@ -257,19 +259,21 @@ In this case, three sketches of `protein`, `dayhoff`, and `hp` moltypes were mad

## Running `singlesketch`

The `singlesketch` command generates a sketch for a single sequence file.
The `singlesketch` command generates a sketch for a single sample, with one or more input FASTA/FASTQ files.

### Basic Usage

```bash
sourmash scripts singlesketch input.fa -p k=21,scaled=1000,dna -o output.sig --name signature_name
```

### Using `stdin/stdout`

You can use `-` for `stdin` and output the result to `stdout`:
```bash
cat input.fa | sourmash scripts singlesketch - -o -
```


### Running `multisearch` and `pairwise`

The `multisearch` command compares one or more query genomes, and one or more subject genomes. It differs from `manysearch` because it loads everything into memory.
@@ -310,9 +314,9 @@ sourmash scripts fastgather query.sig.gz database.zip -o results.csv --cores 4

### Running `fastmultigather`

`fastmultigather` takes a collection of query metagenomes and a collection of sketches as a database, and outputs many CSVs:
`fastmultigather` takes a collection of query metagenomes and a collection of sketches as a database, and outputs a CSV file containing all the matches.
```
sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4 --save-matches
sourmash scripts fastmultigather queries.manifest.csv database.zip --cores 4 --save-matches -o results.csv
```

We suggest using standalone manifest CSVs wherever possible, especially if
@@ -325,32 +329,26 @@ this can be a significant time savings for large databases.

#### Output files for `fastmultigather`

On a database of sketches (but not on RocksDB indexes)
`fastmultigather` will output two CSV files for each query, a
`prefetch` file containing all overlapping matches between that query
and the database, and a `gather` file containing the minimum
metagenome cover for that query in the database.
`fastmultigather` will output a gather file containing all results in
one file, specified with `-o/--output`. `fastmultigather` gather CSVs
provide the same columns as `fastgather`, above.

The prefetch CSV will be named `{signame}.prefetch.csv`, and the
gather CSV will be named `{signame}.gather.csv`. Here, `{signame}` is
the name of your sourmash signature.
In addition, on a database of sketches (but not on RocksDB indexes)
`fastmultigather` will output a `prefetch` file containing all
overlapping matches between that query and the database. The prefetch
CSV will be named `{signame}.prefetch.csv`, where `{signame}` is the
name of your sourmash signature.

`--save-matches` is an optional flag that will save the matched hashes
for each query in a separate sourmash signature
`{signame}.matches.sig`. This can be useful for debugging or for
further analysis.

When searching against a RocksDB index, `fastmultigather` will output
a single file containing all gather results, specified with
`-o/--output`. No prefetch results will be output.

`fastmultigather` gather CSVs provide the same columns as `fastgather`, above.

**Warning:** At the moment, if two different queries have the same
`{signame}`, the CSVs for one of the queries will be overwritten by
the other query. The behavior here is undefined in practice, because
of multithreading: we don't know what queries will be executed when
or files will be written first.
`{signame}`, the output files for one query will be overwritten by
the results from the other query. The behavior here is undefined in
practice, because of multithreading: we don't know what queries will
be executed when or files will be written first.
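
One way to avoid the collision described above is to check the query names before running `fastmultigather`. The following is a minimal sketch, not part of the plugin: the helper names `output_files` and `duplicate_signames` are hypothetical, and the file-name patterns are taken from the text above (`{signame}.prefetch.csv`, plus `{signame}.matches.sig` when `--save-matches` is given).

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: the per-query output file names, following the
/// `{signame}` patterns described in the documentation above.
/// `with_matches` mirrors the optional `--save-matches` flag.
fn output_files(signame: &str, with_matches: bool) -> Vec<String> {
    let mut files = vec![format!("{}.prefetch.csv", signame)];
    if with_matches {
        files.push(format!("{}.matches.sig", signame));
    }
    files
}

/// Hypothetical helper: return every signame that occurs more than once,
/// together with how often it occurs, in sorted order.
fn duplicate_signames(names: &[&str]) -> Vec<(String, usize)> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for name in names {
        *counts.entry(*name).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .map(|(name, n)| (name.to_string(), n))
        .collect()
}
```

Any name reported by `duplicate_signames` maps to the same paths from `output_files`, so those queries would need to be renamed before the run.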

### Running `manysearch`

Binary file added doc/_static/logo.png
2 changes: 1 addition & 1 deletion environment.yml
@@ -4,7 +4,7 @@ channels:
- bioconda
- defaults
dependencies:
- sourmash>=4.8.3,<5
- sourmash-minimal>=4.8.14,<5
- pip
- rust
- maturin>=1,<2
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -2,14 +2,14 @@
name = "sourmash_plugin_branchwater"
description = "fast command-line extensions for sourmash"
readme = "README.md"
version = "0.9.12"
version = "0.9.14-dev"
requires-python = ">=3.10"
classifiers = [
"Programming Language :: Rust",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = ["sourmash>=4.8.10,<5"]
dependencies = ["sourmash>=4.8.14,<5"]

authors = [
{ name="N. Tessa Pierce-Ward", orcid="0000-0002-2942-5331" },
@@ -60,7 +60,7 @@ libclang = ">=16.0.6,<16.1"
python = "3.10.*"
rust = ">=1.80.0,<1.81"
maturin = ">=1.7.4,<2"
sourmash-minimal = ">=4.8.10,<5"
sourmash-minimal = ">=4.8.14,<5"

[tool.pixi.feature.build.target.linux-64.dependencies]
patchelf = ">=0.17.2,<0.18"
15 changes: 12 additions & 3 deletions src/check.rs
@@ -1,14 +1,23 @@
use crate::utils::is_revindex_database;
use anyhow::Result;

use sourmash::index::revindex::{RevIndex, RevIndexOps};

pub fn check(index: camino::Utf8PathBuf, quick: bool) -> Result<(), Box<dyn std::error::Error>> {
pub fn check(index: camino::Utf8PathBuf, quick: bool, rw: bool) -> Result<()> {
if !is_revindex_database(&index) {
bail!("'{}' is not a valid RevIndex database", index);
}

println!("Opening DB");
let db = RevIndex::open(index, true, None)?;
println!("Opening DB (rw mode? {})", rw);
let db = match RevIndex::open(index, !rw, None) {
Ok(db) => db,
Err(e) => {
return Err(anyhow::anyhow!(
"cannot open RocksDB database. Error is: {}",
e
))
}
};

println!("Starting check");
db.check(quick);
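
Note on the `check.rs` hunk above (the rest of the function is not shown in this view). The change does two things: it passes `!rw` where `true` was hard-coded, which suggests that argument is a read-only flag, and it wraps the open error in a more descriptive message. The sketch below illustrates the same pattern with the standard library only. `FakeDb`, `OpenError`, `open_db`, and `open_for_check` are hypothetical stand-ins, not sourmash or plugin APIs, and `map_err` is used here in place of the explicit `match` in the diff.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical stand-in for a database handle; not the sourmash `RevIndex`.
#[derive(Debug)]
struct FakeDb {
    read_only: bool,
}

/// Hypothetical low-level error returned by the opener.
#[derive(Debug)]
struct OpenError(String);

impl fmt::Display for OpenError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for OpenError {}

/// Hypothetical opener: fails on an empty path, otherwise records the mode.
fn open_db(path: &str, read_only: bool) -> Result<FakeDb, OpenError> {
    if path.is_empty() {
        return Err(OpenError("no such directory".to_string()));
    }
    Ok(FakeDb { read_only })
}

/// Open read-only unless `rw` is set, and attach context to any failure.
fn open_for_check(path: &str, rw: bool) -> Result<FakeDb, Box<dyn Error>> {
    let db = open_db(path, !rw)
        .map_err(|e| format!("cannot open RocksDB database. Error is: {}", e))?;
    Ok(db)
}
```

With this shape, `open_for_check("index.rocksdb", false)` opens read-only, and a failed open surfaces as a single message that names the database type along with the underlying cause.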