[MRG] add --include-db-pattern and --exclude-db-pattern to search/gather (#1871)

* upgrade 'manifest' documentation, cli help

* alias fileinfo to summarize

* flakes cleanup

* rescue shadowed tests

* rescue shadowed tests

* rescue shadowed tests

* add 'sig grep' command

* add some basic tests

* fix get manifest stuff

* fail on no manifest

* check manifest req't

* test various combinations of zip, -v, -i

* update with CSV output/manifest

* added -c/--count

* adjust output

* test fail extract

* comment tests better

* add test for count

* update docs

* remove warnings

* cleanup; create CollectionManifest.filter_rows

* create CollectionManifest.filter_on_columns

* minor cleanup

* add --include and --exclude to search

* add --include and --exclude to search and gather

* add --include and --exclude to prefetch

* add args to most set of commands

* update docs

* more doc

* add --include fn to sig cat

* add pattern tests for search and gather

* add pattern include/exclude to prefetch

* implement sig extract w/patterns

* add pattern search to sig rename

* add --include/--exclude to sourmash compare

* update docs

* refactor picklist/pattern selection

* finish refactoring out picklist foo

* much refactoring wow

* check for various argument incompatibility

* test what happens when no manifest

* fix grep

* cleanup and simplify

* change to load_include_exclude_db_patterns

* adjust error message

* remove -f comment
ctb authored Mar 10, 2022
1 parent dff5309 commit f3ae570
Showing 19 changed files with 551 additions and 118 deletions.
116 changes: 72 additions & 44 deletions doc/command-line.md
@@ -63,6 +63,13 @@ species, while the third is from a completely different genus.

To get a list of subcommands, run `sourmash` without any arguments.

Please use the command line option `--help` to get more detailed usage
information for each command.

All signature saving commands can save to a variety of formats (we
suggest `.zip` files) and all signature loading commands can load
signatures from any of these formats.

There are seven main subcommands: `sketch`, `compare`, `plot`,
`search`, `gather`, `index`, and `prefetch`. See
[the tutorial](tutorials.md) for a walkthrough of these commands.
@@ -1409,12 +1416,33 @@ signatures with multiple ksizes or moltypes at the same time; you need
to pick the ksize and moltype to use for your search. Where possible,
scaled values will be made compatible.

### Selecting signatures

(sourmash v4.3.0 and later)

sourmash is built to work with very large collections of signatures,
and you may want to select (or exclude) specific signatures from
search or other operations, based on their name. This can be done
without modifying the collections themselves via the
`--include-db-pattern` and `--exclude-db-pattern` arguments to many
sourmash commands, including `search`, `gather`, `compare`, `prefetch`,
and `sig extract`.

In brief, `sourmash search ... --include-db-pattern <pattern>` will search
only those database signatures that match `<pattern>` in their `name`,
`filename`, or `md5` strings. Here, `<pattern>` can be either a
substring or a regular expression. Likewise, `sourmash search
... --exclude-db-pattern <pattern>` will search only those database
signatures that _don't_ match `<pattern>` in their `name`, `filename`,
or `md5` strings.
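
As a rough illustration of the matching rule described above, the following Python sketch filters a list of manifest-like rows with `re.search`. The `select_rows` helper, the row dictionaries, and the sample values are all hypothetical and exist only for this example; this is not sourmash's own implementation, and details such as case sensitivity may differ.

```python
import re


def select_rows(rows, pattern, exclude=False):
    """Keep rows whose name, filename, or md5 matches `pattern`.

    With exclude=True, keep only the rows that do NOT match.
    """
    regex = re.compile(pattern)

    def matches(row):
        # A row matches if the pattern is found anywhere in one of the fields.
        return any(regex.search(row.get(key) or "")
                   for key in ("name", "filename", "md5"))

    return [row for row in rows if matches(row) != exclude]


# Hypothetical rows, for illustration only.
rows = [
    {"name": "GCF_000005845.2 Escherichia coli", "filename": "ecoli.fa", "md5": "aaa111"},
    {"name": "GCF_000006945.2 Salmonella enterica", "filename": "salm.fa", "md5": "bbb222"},
]

included = select_rows(rows, "coli")                # substring match
excluded = select_rows(rows, "coli", exclude=True)  # everything else
print([row["name"] for row in included])
print([row["name"] for row in excluded])
```

Because a plain substring is also a valid regular expression, one code path can serve both kinds of pattern in this sketch.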

### Using picklists to subset large collections of signatures

As of sourmash 4.2.0, many commands support *picklists*, a feature by
which you can select or "pick out" signatures based on values in a CSV
file. This is typically used to index, extract, or search a subset of
a large collection where modifying the collection itself isn't desired.
(sourmash v4.2.0 and later)

Many commands support *picklists*, a feature by which you can select
or "pick out" signatures based on values in a CSV file. This is
typically used to index, extract, or search a subset of a large
collection where modifying the collection itself isn't desired.

For example,
```
@@ -1452,11 +1480,16 @@ The following `coltype`s are currently supported by `sourmash sig extract`:
Identifiers are constructed by using the first space delimited word in
the signature name.

One way to build a picklist is to use `sourmash sig describe --csv
out.csv <signatures>` or `sourmash sig manifest -o out.csv
<filename_or_db>` to construct an initial CSV file that you can then
edit further; after editing, these can be passed in via the picklist
argument `--picklist out.csv::manifest`.
One way to build a picklist is to use `sourmash sig grep <pattern>
<collection> --csv out.csv` to construct a CSV file containing a list
of all sketches that match the pattern (which can be a string or
regexp). The `out.csv` file can be used as a picklist via the picklist
manifest format with `--picklist out.csv::manifest`.

You can also use `sourmash sig describe --csv out.csv <signatures>` or
`sourmash sig manifest -o out.csv <filename_or_db>` to construct an
initial CSV file that you can then edit further and use as a picklist
as above.
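
The idea behind a picklist can be sketched in a few lines of Python: read one column of a CSV file into a set, then keep only the signatures whose corresponding value is in that set. The helper names (`load_pick_values`, `apply_picklist`), the `exclude` switch, and the sample data below are assumptions made for illustration; sourmash's real picklist handling supports several column types and is more involved.

```python
import csv
import io


def load_pick_values(csv_text, column):
    """Collect the distinct non-empty values found in `column` of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row[column]}


def apply_picklist(signatures, pick_values, key, exclude=False):
    """Keep signatures whose `key` is in the picklist (or is not, if exclude=True)."""
    return [sig for sig in signatures if (sig[key] in pick_values) != exclude]


# Hypothetical CSV contents and signature records, for illustration only.
csv_text = "md5,name\naaa111,genome A\nccc333,genome C\n"
signatures = [
    {"md5": "aaa111", "name": "genome A"},
    {"md5": "bbb222", "name": "genome B"},
    {"md5": "ccc333", "name": "genome C"},
]

picks = load_pick_values(csv_text, "md5")
picked = apply_picklist(signatures, picks, "md5")
print([sig["name"] for sig in picked])
```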

The picklist functionality also supports excluding (rather than
including) signatures matching the picklist arguments. To specify a
@@ -1497,32 +1530,38 @@ signatures using `zip -r collection.zip *.sig` and then specify

### Saving signatures, more generally

As of sourmash 4.1, most signature saving arguments (`--save-matches`
for `search` and `gather`, `-o` for `sourmash sketch`, and most of the
`sourmash signature` commands) support flexible saving of collections of
(sourmash v4.1 and later)

All signature saving arguments (`--save-matches` for `search` and
`gather`, `-o` for `sourmash sketch`, and `-o` for the `sourmash
signature` commands) support flexible saving of collections of
signatures into JSON text, Zip files, and/or directories.

This behavior is triggered by the requested output filename --

* to save to JSON signature files, use `.sig`; `-` will send JSON to stdout.
* to save to JSON signature files, use `.sig`; using the filename `-`
will send JSON to stdout.
* to save to gzipped JSON signature files, use `.sig.gz`;
* to save to a Zip file collection, use `.zip`;
* to save signature files to a directory, use a name ending in `/`; the directory will be created if it doesn't exist;

If none of these file extensions is detected, output will be written in the JSON `.sig` format, either to the provided output filename or to stdout.
If none of these file extensions is detected, output will be written
in the JSON `.sig` format, either to the provided output filename or
to stdout.

All of these save formats can be loaded by sourmash commands, too.
All of these save formats can be loaded by sourmash commands.
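
The filename rules listed above amount to a small dispatch on the output name. The following Python sketch shows one way to express them; `guess_save_format` and the returned labels are hypothetical names chosen for this example, and sourmash's own logic may check things in a different order or recognize additional formats.

```python
def guess_save_format(filename):
    """Map an output filename to a save format, following the rules listed above."""
    if filename == "-":
        return "JSON to stdout"
    if filename.endswith("/"):
        return "directory"
    if filename.endswith(".zip"):
        return "Zip collection"
    if filename.endswith(".sig.gz"):
        return "gzipped JSON"
    # '.sig' and any unrecognized name both fall back to plain JSON.
    return "JSON"


for name in ["out.sig", "out.sig.gz", "collection.zip", "sigs/", "-", "results.txt"]:
    print(f"{name!r}: {guess_save_format(name)}")
```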

### Loading many signatures

### Loading all signatures under a directory
#### Loading signatures within a directory hierarchy

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

#### Passing in lists of files

Most sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
Most sourmash commands will also take a `--from-file` or
`--query-from-file` argument, which takes the path to a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.
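
A minimal sketch of how such a list file could be read, assuming one path per line with blank lines ignored (the exact file format is an assumption here, as are the `read_path_list` helper and the sample paths):

```python
import tempfile
from pathlib import Path


def read_path_list(list_file):
    """Return the non-empty lines of `list_file`, stripped of surrounding whitespace."""
    with open(list_file) as fp:
        return [line.strip() for line in fp if line.strip()]


# Write a throwaway list file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmpdir:
    list_file = Path(tmpdir) / "queries.txt"
    list_file.write_text("sigs/a.sig\nsigs/b.sig\n\nsigs/c.zip\n")
    paths = read_path_list(list_file)

print(paths)
```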
@@ -1534,36 +1573,30 @@ databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.

(LCA databases also permit taxonomic searches using `sourmash lca` functions.)
(LCA databases also directly permit taxonomic searches using `sourmash lca`
functions.)

Commands that take multiple signatures or collections of signatures
will also work with databases.

The main point is that since all of these databases contain signatures,
as of sourmash 3.4, any command that takes more than one signature will
also automatically load all of the signatures in the database.
One limitation of indexed databases is that both SBT and LCA databases
can only contain one "type" of signature (one ksize/one moltype at one
scaled value). If the database signature type is incompatible with the
other signatures, sourmash will complain appropriately.

Note that, for now, both SBT and LCA database can only contain one
"type" of signature (one ksize, one moltype, etc.) If the database
signature type is incompatible with the other signatures, sourmash
will complain. In contrast, signature files can
contain many different types of signatures, and compatible ones will
be discovered automatically.
In contrast, signature files, zip collections, and directory
hierarchies can contain many different types of signatures, and
compatible ones will be selected automatically.

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
combine multiple databases and signatures on the command line and get
the same answer as if you built a single large database from all of
them. The only caveat to this rule is that if you have multiple
identical matches, the first one to be found will differ depending on
the order that the files are passed in on the command line.

This can actually be pretty convenient for speeding up searches - for
example, if you're using `sourmash gather` and you want to find any
new results after a database update, you can provide a file containing
the previously found matches on the command line before the updated
database. Then `gather` will automatically "find" the previously found
matches before anything else, but only if there are no better matches to
be found in the updated database. (OK, it's a bit of a niche case, but it's
been useful. :)
identical matches present across the databases, the order in which
they are found will differ depending on the order that the files are
passed in on the command line.
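
The ordering caveat can be seen in a toy model. The sketch below scores every signature in every database with an exact Jaccard similarity over small hash sets and sorts the hits with Python's stable sort, so equal-scoring matches keep the order in which their databases were supplied. Everything here (the `search_all` helper, the data layout, the use of exact sets) is a simplification assumed for illustration and not how sourmash searches its indexes.

```python
def jaccard(a, b):
    """Exact Jaccard similarity between two sets of hashes."""
    return len(a & b) / len(a | b)


def search_all(query, databases):
    """Score every signature in every database; best matches first.

    sorted() is stable, so hits with equal scores stay in the order
    in which their databases were passed in.
    """
    hits = []
    for db_name, sigs in databases:
        for sig in sigs:
            hits.append((jaccard(query, sig["hashes"]), db_name, sig["name"]))
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


# Hypothetical query and databases; 'x' and 'x-copy' are identical matches.
query = {1, 2, 3, 4}
db1 = ("db1", [{"name": "x", "hashes": {1, 2, 3, 4}}])
db2 = ("db2", [{"name": "x-copy", "hashes": {1, 2, 3, 4}},
               {"name": "y", "hashes": {1, 2}}])

forward = search_all(query, [db1, db2])
backward = search_all(query, [db2, db1])
print(forward)
print(backward)
```

Both calls return the same set of hits; only the relative order of the two identical top matches changes with the order of the databases.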

### Using stdin

@@ -1573,8 +1606,3 @@ sig` commands will output to stdout. So, for example,

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.

(This is a relatively new feature as of 3.4 and our testing may need
some work, so please
[let us know](https://github.com/sourmash-bio/sourmash/issues) if there's
something that doesn't work and we will fix it :).
3 changes: 2 additions & 1 deletion src/sourmash/cli/compare.py
@@ -27,7 +27,7 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
@@ -75,6 +75,7 @@ def subparser(subparsers):
'-p', '--processes', metavar='N', type=int, default=None,
help='Number of processes to use to calculate similarity')
add_picklist_args(subparser)
add_pattern_args(subparser)


def main(args):
5 changes: 3 additions & 2 deletions src/sourmash/cli/gather.py
@@ -52,7 +52,8 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
@@ -130,10 +131,10 @@ def subparser(subparsers):
'--prefetch', dest="prefetch", action='store_true',
help="use prefetch before gather; see documentation",
)

add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


4 changes: 3 additions & 1 deletion src/sourmash/cli/prefetch.py
@@ -1,7 +1,8 @@
"""search a signature against dbs, find all overlaps"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
@@ -61,6 +62,7 @@ def subparser(subparsers):
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


4 changes: 3 additions & 1 deletion src/sourmash/cli/search.py
@@ -39,7 +39,8 @@
"""

from sourmash.cli.utils import (add_ksize_arg, add_moltype_args,
add_picklist_args, add_scaled_arg)
add_picklist_args, add_scaled_arg,
add_pattern_args)


def subparser(subparsers):
@@ -95,6 +96,7 @@ def subparser(subparsers):
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_picklist_args(subparser)
add_pattern_args(subparser)
add_scaled_arg(subparser, 0)


3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/cat.py
@@ -1,7 +1,7 @@
"""concatenate signature files"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
@@ -29,6 +29,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/extract.py
@@ -1,7 +1,7 @@
"""extract one or more signatures"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
@@ -34,6 +34,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


4 changes: 2 additions & 2 deletions src/sourmash/cli/sig/grep.py
@@ -14,8 +14,8 @@
regexp searching. See https://docs.python.org/3/howto/regex.html and
https://docs.python.org/3/library/re.html for details.
The '-v' (exclude), '-i' (case-insensitive), and `-c` (count) options of 'grep' are
supported.
The '-v' (exclude), '-i' (case-insensitive), and `-c` (count) options
of 'grep' are supported.
'-o/--output' can be used to output matching signatures to a specific
location.
3 changes: 2 additions & 1 deletion src/sourmash/cli/sig/rename.py
@@ -1,7 +1,7 @@
"""rename signature"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
add_picklist_args)
add_picklist_args, add_pattern_args)


def subparser(subparsers):
@@ -31,6 +31,7 @@ def subparser(subparsers):
)
add_ksize_arg(subparser, 31)
add_moltype_args(subparser)
add_pattern_args(subparser)
add_picklist_args(subparser)


15 changes: 14 additions & 1 deletion src/sourmash/cli/utils.py
@@ -85,6 +85,19 @@ def add_picklist_args(parser):
)


def add_pattern_args(parser):
    parser.add_argument(
        '--include-db-pattern',
        default=None,
        help='search only signatures that match this pattern in name, filename, or md5'
    )
    parser.add_argument(
        '--exclude-db-pattern',
        default=None,
        help='search only signatures that do not match this pattern in name, filename, or md5'
    )


def opfilter(path):
    return not path.startswith('__') and path not in ['utils']
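
For readers who want to see what the two options added by `add_pattern_args` look like from the caller's side, here is a small standalone sketch using a plain `argparse` parser. The function body repeats the one in this hunk; the `demo` parser and the sample pattern are made up for illustration, and the real sourmash subparsers carry many more arguments.

```python
import argparse


def add_pattern_args(parser):
    parser.add_argument(
        '--include-db-pattern',
        default=None,
        help='search only signatures that match this pattern in name, filename, or md5'
    )
    parser.add_argument(
        '--exclude-db-pattern',
        default=None,
        help='search only signatures that do not match this pattern in name, filename, or md5'
    )


# Hypothetical stand-in for a sourmash subparser.
parser = argparse.ArgumentParser(prog='demo')
add_pattern_args(parser)

args = parser.parse_args(['--include-db-pattern', 'coli'])

# argparse turns the dashes in option names into underscores;
# an option that was not supplied keeps its default of None.
print(args.include_db_pattern, args.exclude_db_pattern)
```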

@@ -108,4 +121,4 @@ def add_num_arg(parser, default=0):
parser.add_argument(
'-n', '--num-hashes', '--num', metavar='N', type=check_num_bounds, default=default,
help='num value should be between 50 and 50000'
)
)