Skip to content

Commit

Permalink
MRG: support proper manifest creation with --relpath for `sig check…
Browse files Browse the repository at this point in the history
…` and `sig collect` (#3054)

This PR updates `sig collect` and `sig check` so that they can produce
standalone manifests that work properly with default sourmash loading
behavior. The default behavior produces broken manifests in some
situations and is not changed, but will be deprecated in v5.

## Details

Currently, `sig collect` and `sig check` default to producing standalone
manifests with internal path locations relative to the current working
directory. This conflicts with the default `StandaloneManifest` behavior
implemented in `save_load.py` that loads path locations relative to the
manifest location. As a result, whenever the manifest was in a
subdirectory, the standalone manifests output by `sig check` and `sig
collect` were broken. The only way to make good manifests in this
situation was to use `sig collect --abspath`, but `sig check` didn't
support `--abspath`, and using absolute paths is brittle in situations
where you want to distribute manifests.

This PR adds `--relpath` to both `sig check` and `sig collect`, and adds
`--abspath` to `sig check`. It also demonstrates the bad behavior in
tests and annotates the tests appropriately.

See
#3008 (comment)
for more detailed discussion of why I think `--relpath` is the right
behavior for the future.

- [x] adds `--abspath` and `--relpath` to `sig check`, to properly
support relative paths;
- [x] adds `--relpath` to `sig collect`, to properly support relative
paths;
- [x] documents this behavior properly for creating standalone
manifests;
- [ ] create issue to change default `sig check` and `sig collect`
behavior for v4, and disable cwd behavior.

Techie TODO:
- [x] explicitly test `relpath` and `abspath` behavior in `sig check`;
- [x] explicitly test `relpath` behavior in `sig collect`
- [x] write some tests for `sig check` and `sig collect` to explore the
relative path loading issue, with all three combinations of relpath: mf
in cwd, sigs in subdir; mf in subdir, sigs in cwd; mf in subdir, sigs in
subdir.

Related issues:
* Addresses #3008
* Addresses issues in
#3048 by updating `sig
check` to support `--relpath`;
* Fixes #3053 -
`--relpath` again

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
ctb and pre-commit-ci[bot] authored Mar 8, 2024
1 parent d578157 commit a90ab01
Show file tree
Hide file tree
Showing 9 changed files with 771 additions and 43 deletions.
21 changes: 20 additions & 1 deletion doc/command-line.md
Original file line number Diff line number Diff line change
Expand Up @@ -551,7 +551,7 @@ sourmash multigather --query <queries ...> --db <collections>
```

Note that multigather is single threaded, so it offers no substantial
efficiency gains over just running gather multiple times! Nontheless, it
efficiency gains over just running gather multiple times! Nonetheless, it
is useful for situations where you have many sketches organized in a
combined file, e.g. sketches built with `sourmash sketch
... --singleton`).
Expand Down Expand Up @@ -1959,6 +1959,16 @@ sourmash database.
`sourmash sig check` is particularly useful when working with large
collections of signatures and identifiers.

With `-m/--save-manifest-matching`, `sig check` creates a standalone
manifest. In these manifests, sourmash v4 will by default write paths
to the matched elements that are relative to the current working
directory. In some cases - when the output manifest is in different
directory - this will create manifests that do not work properly
with sourmash. The `--relpath` argument will rewrite the paths to be
relative to the manifest, while the `--abspath` argument will rewrite
paths to be absolute. The `--relpath` behavior will be the default in
sourmash v5.

### `sourmash signature collect` - collect manifests across databases

Collect manifests from across (many) files and merge into a single
Expand All @@ -1977,6 +1987,15 @@ This manifest file can be loaded directly from the command line by sourmash.
particularly useful when working with large collections of signatures and
identifiers, and has command line options for merging and updating manifests.

As with `sig check`, the standalone manifests created by `sig collect`
in sourmash v4 will by default write paths to the matched elements
relative to the current working directory. When the output manifest
is in a different directory, this will create manifests that do not work
properly with sourmash. The `--relpath` argument will rewrite the
paths to be relative to the manifest, while the `--abspath` argument
will rewrite paths to be absolute. The `--relpath` behavior will be
the default in sourmash v5.

## Advanced command-line usage

### Loading signatures and databases
Expand Down
6 changes: 5 additions & 1 deletion doc/sourmash-internals.md
Original file line number Diff line number Diff line change
Expand Up @@ -368,7 +368,11 @@ Thus, while standalone manifests can point at any kind of container,
including JSON files or LCA databases, they are most efficient when
`internal_location` points at a file with either a single sketch in
it, or a manifest that supports direct loading of sketches. Therefore,
we suggest using standalone manifest indices.
we suggest using standalone manifest indices. Note that sourmash
interprets paths to locations in standalone manifests relative to the
manifest filename; see the `--relpath` behavior in `sig check` and
`sig collect` to output manifests that deal with relative filenames
properly.

Note that searching a standalone manifest is currently done through a
linear iteration, and does not use any features of indexed containers
Expand Down
24 changes: 24 additions & 0 deletions src/sourmash/cli/sig/check.py
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,30 @@ def subparser(subparsers):
default="csv",
choices=["csv", "sql"],
)
subparser.add_argument(
"--abspath",
"--use-absolute-paths",
help="convert all locations to absolute paths",
action="store_true",
)
subparser.add_argument(
"--no-abspath",
help="do not convert all locations to absolute paths",
action="store_false",
dest="abspath",
)
subparser.add_argument(
"--relpath",
"--use-relative-paths",
help="convert all locations to paths relative to the output manifest",
action="store_true",
)
subparser.add_argument(
"--no-relpath",
help="do not convert all locations to paths relative to the output manifest",
action="store_false",
dest="relpath",
)

add_ksize_arg(subparser)
add_moltype_args(subparser)
Expand Down
23 changes: 22 additions & 1 deletion src/sourmash/cli/sig/collect.py
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,28 @@ def subparser(subparsers):
help="merge new manifests into existing",
)
subparser.add_argument(
"--abspath", help="convert all locations to absolute paths", action="store_true"
"--abspath",
"--use-absolute-paths",
help="convert all locations to absolute paths",
action="store_true",
)
subparser.add_argument(
"--no-abspath",
help="do not convert all locations to absolute paths",
action="store_false",
dest="abspath",
)
subparser.add_argument(
"--relpath",
"--use-relative-paths",
help="convert all locations to paths relative to the output manifest",
action="store_true",
)
subparser.add_argument(
"--no-relpath",
help="do not convert all locations to paths relative to the output manifest",
action="store_false",
dest="relpath",
)

add_ksize_arg(subparser)
Expand Down
18 changes: 11 additions & 7 deletions src/sourmash/index/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1113,23 +1113,26 @@ class StandaloneManifestIndex(Index):
with many signature collections underneath it, and you don't want to load
every collection each time you run sourmash.
Instead, you can run 'sourmash sig manifest <directory> -o mf.csv' to
output a manifest and then use this class to load 'mf.csv' directly.
Instead, you can run 'sourmash sig collect <directory> -o <manifest>' to
output a manifest and then use this class to load <manifest> directly.
Sketch type selection, picklists, and pattern matching will all work
directly on the manifest and will load signatures only upon demand.
One feature of this class is that absolute paths to sketches in
the 'internal_location' field of the manifests will be loaded properly.
This permits manifests to be constructed for various collections of
signatures that reside elsewhere, and not just below a single directory
prefix.
One feature of this class is that external paths to sketches in
the 'internal_location' field of the manifests will be loaded
properly. This permits manifests to be constructed for various
collections of signatures that reside elsewhere, and not just
below a single directory prefix. By default paths are interpreted
relative to the location of the manifest, unless an absolute path
is provided in the 'internal_location' field.
StandaloneManifestIndex does _not_ store signatures in memory.
This class also overlaps in concept with MultiIndex when
MultiIndex.load_from_pathlist is used to load other Index
objects. However, this class does not store any signatures in
memory, unlike MultiIndex.
"""

is_database = True
Expand All @@ -1155,6 +1158,7 @@ def load(cls, location, *, prefix=None):
m = CollectionManifest.load_from_filename(location)

if prefix is None:
# by default, calculate paths relative to manifest location.
prefix = os.path.dirname(location)

return cls(m, location, prefix=prefix)
Expand Down
63 changes: 57 additions & 6 deletions src/sourmash/sig/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -1435,11 +1435,42 @@ def check(args):
else:
debug("sig check: manifest required")

# abspath/relpath checks
if args.abspath and args.relpath:
error("** Cannot specify both --abspath and --relpath; pick one!")
sys.exit(-1)

if args.relpath or args.abspath and not args.save_manifest_matching:
notify(
"** WARNING: --abspath and --relpath only have effects when saving a manifest"
)

relpath = "."
if args.relpath and args.save_manifest_matching:
output_manifest_dir = os.path.dirname(args.save_manifest_matching)
relpath = os.path.relpath(os.curdir, output_manifest_dir)

total_manifest_rows = CollectionManifest([])

# start loading!
total_rows_examined = 0
for filename in args.signatures:
# if saving a manifest, think about how to rewrite locations.
if args.abspath:
# convert to abspath
new_iloc = os.path.abspath(filename)
elif args.relpath:
# interpret paths relative to manifest directory.
if filename.startswith("/"):
notify(
f"** WARNING: cannot convert abspath {filename} into relative path."
)
new_iloc = os.path.join(relpath, filename)
else:
# default: paths are relative to cwd. This breaks when sketches
# are in subdirectories; will be deprecated for v5.
new_iloc = filename

idx = sourmash_args.load_file_as_index(filename, yield_all_files=args.force)

idx = idx.select(ksize=args.ksize, moltype=moltype)
Expand All @@ -1458,7 +1489,7 @@ def check(args):
# rewrite locations so that each signature can be found by filename
# of its container; this follows `sig collect` logic.
for row in sub_manifest.rows:
row["internal_location"] = filename
row["internal_location"] = new_iloc
total_manifest_rows.add_row(row)

# the len(sub_manifest) here should only be run when needed :)
Expand Down Expand Up @@ -1535,6 +1566,11 @@ def collect(args):
f"WARNING: --merge-previous specified, but output file '{args.output}' does not already exist?"
)

# abspath/relpath checks
if args.abspath and args.relpath:
error("** Cannot specify both --abspath and --relpath; pick one!")
sys.exit(-1)

# load previous manifest for --merge-previous. This gets tricky with
# mismatched manifest types, which we forbid.
try:
Expand Down Expand Up @@ -1579,15 +1615,16 @@ def collect(args):
# load from_file
_extend_signatures_with_from_file(args, target_attr="locations")

# convert to abspath
if args.abspath:
args.locations = [os.path.abspath(iloc) for iloc in args.locations]
relpath = None
if args.relpath:
output_manifest_dir = os.path.dirname(args.output)
relpath = os.path.relpath(os.curdir, output_manifest_dir)

# iterate through, loading all the manifests from all the locations.
for n_files, loc in enumerate(args.locations):
notify(f"Loading signature information from {loc}.")

if n_files % 100 == 0:
if n_files and n_files % 100 == 0:
notify(f"... loaded {len(collected_mf)} sigs from {n_files} files")
idx = sourmash.load_file_as_index(loc)
if idx.manifest is None and require_manifest:
Expand All @@ -1600,8 +1637,22 @@ def collect(args):

mf = sourmash_args.get_manifest(idx)

# decide how to rewrite locations to container:
if args.abspath:
# convert to abspath
new_iloc = os.path.abspath(loc)
elif args.relpath:
# interpret paths relative to manifest directory
if loc.startswith("/"):
notify(f"** WARNING: cannot convert abspath {loc} into relative path.")
new_iloc = os.path.join(relpath, loc)
else:
# default: paths are relative to cwd. This breaks when sketches
# are in subdirectories; will be deprecated for v5.
new_iloc = loc

for row in mf.rows:
row["internal_location"] = loc
row["internal_location"] = new_iloc
collected_mf.add_row(row)

if args.manifest_format == "csv":
Expand Down
15 changes: 15 additions & 0 deletions tests/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -89,6 +89,21 @@ def sig_save_extension_abund(request):
return request.param


# these should both always succeed for 'sig check' and 'sig collect' output
# manifests.
@pytest.fixture(params=["--abspath", "--relpath"])
def abspath_or_relpath(request):
return request.param


# this will fail if subdirs used; see #3008. but this ensures v4 behavior of
# sig collect/sig check works, where manifest paths are interpreted relative
# to cwd.
@pytest.fixture(params=["--no-abspath", "--abspath", "--relpath"])
def abspath_relpath_v4(request):
return request.param


# --- BEGIN - Only run tests using a particular fixture --- #
# Cribbed from: http://pythontesting.net/framework/pytest/pytest-run-tests-using-particular-fixture/
def pytest_collection_modifyitems(items, config):
Expand Down
Loading

0 comments on commit a90ab01

Please sign in to comment.