[MRG] add signature cat and signature split commands to combine/split signature files #1044

Merged 18 commits on Jun 27, 2020
45 changes: 44 additions & 1 deletion doc/command-line.md
@@ -491,7 +491,7 @@ for an example use case.
These commands manipulate signatures from the command line. Currently
supported subcommands are `merge`, `rename`, `intersect`,
`extract`, `downsample`, `subtract`, `import`, `export`, `info`,
- `flatten`, and `filter`.
+ `flatten`, `filter`, `cat`, and `split`.

The signature commands that combine or otherwise have multiple
signatures interacting (`merge`, `intersect`, `subtract`) work only on
@@ -508,6 +508,49 @@ such as `search`, `gather`, and `compare`.

Note, you can use `sourmash sig` as shorthand for all of these commands.

### `sourmash signature cat`

Concatenate signature files.

For example,
```
sourmash signature cat file1.sig file2.sig -o all.sig
```
will combine all signatures in `file1.sig` and `file2.sig` and put them
in the file `all.sig`.

### `sourmash signature split`

Split each signature in the input file(s) into individual files, with
standardized names. **Note:** unlike the rest of the sourmash sig
commands, `split` can load signatures from LCA and SBT databases as
well.

For example,
```
sourmash signature split tests/test-data/2.fa.sig
```
will create 3 files,

`f372e478.k=21.scaled=1000.DNA.dup=0.2.fa.sig`,
`f3a90d4e.k=31.scaled=1000.DNA.dup=0.2.fa.sig`, and
`43f3b48e.k=51.scaled=1000.DNA.dup=0.2.fa.sig`, representing the three
different DNA signatures at different ksizes created from the input file
`2.fa`.

The format of the names of the output files is standardized and stable
for major versions of sourmash: currently, they are period-separated
with fields:

* `md5sum` - the first 8 characters of the signature's md5sum, a hash value based on the contents of the signature.
* `k=<ksize>` - k-mer size.
* `scaled=<scaled>` or `num=<num>` - scaled or num value for MinHash.
* `<moltype>` - the molecule type (DNA, protein, dayhoff, or hp).
* `dup=<n>` - a non-negative integer that prevents duplicate signatures from colliding.
* `basename` - basename of the first input file used to create the signature; if no file was provided, or the input came from stdin, this is `none`.

If `--outdir` is specified, all of the signatures are written to that directory, which is created if it does not already exist.
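
As an illustration of the naming scheme above, here is a minimal sketch that builds such a name and splits it back into fields. The helpers `build_split_name` and `parse_split_name` are hypothetical and are not part of sourmash; they only mirror the documented field order.

```python
def build_split_name(md5sum, ksize, moltype, basename, dup=0, scaled=0, num=0):
    """Build an output name (hypothetical helper mirroring the documented fields)."""
    # use scaled=<scaled> when a scaled value is set, otherwise num=<num>
    size_field = 'scaled={}'.format(scaled) if scaled else 'num={}'.format(num)
    return '{}.k={}.{}.{}.dup={}.{}.sig'.format(
        md5sum[:8], ksize, size_field, moltype, dup, basename)


def parse_split_name(name):
    """Split a name back into its fields.

    The basename may itself contain periods (e.g. '2.fa'), so only the
    first five periods are treated as field separators.
    """
    md5sum, k, size, moltype, dup, rest = name.split('.', 5)
    if rest.endswith('.sig'):
        rest = rest[:-len('.sig')]
    size_key, size_value = size.split('=')
    return {
        'md5sum': md5sum,
        'ksize': int(k.split('=')[1]),
        size_key: int(size_value),
        'moltype': moltype,
        'dup': int(dup.split('=')[1]),
        'basename': rest,
    }


name = build_split_name('f372e478', 21, 'DNA', '2.fa', scaled=1000)
fields = parse_split_name(name)
```

Because the basename comes last, a script that consumes these names can recover the other fields without knowing how many periods the original file name contained.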

### `sourmash signature merge`

Merge two (or more) signatures.
2 changes: 2 additions & 0 deletions sourmash/cli/sig/__init__.py
@@ -4,6 +4,8 @@
`sourmash sig` operations.
"""

from . import cat
from . import split
from . import describe
from . import downsample
from . import extract
24 changes: 24 additions & 0 deletions sourmash/cli/sig/cat.py
@@ -0,0 +1,24 @@
"""concatenate signature files"""

import csv

import sourmash
from sourmash.logging import notify, print_results, error


def subparser(subparsers):
    subparser = subparsers.add_parser('cat')
    subparser.add_argument('signatures', nargs='+')
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-o', '--output', metavar='FILE',
        help='output signature to this file (default stdout)'
    )


def main(args):
    import sourmash
    return sourmash.sig.__main__.cat(args)
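
To show how a subparser declared this way handles the documented `cat` invocation, here is a minimal standalone sketch using only `argparse`. The top-level parser, its `prog` string, and the `dest='cmd'` name are assumptions made for illustration; sourmash wires its subparsers up elsewhere.

```python
import argparse

# stand-in for the top-level parser (an assumption, not sourmash's own setup)
parser = argparse.ArgumentParser(prog='sourmash sig')
subparsers = parser.add_subparsers(dest='cmd')

# same arguments as the `cat` subparser in this file
cat_parser = subparsers.add_parser('cat')
cat_parser.add_argument('signatures', nargs='+')
cat_parser.add_argument('-q', '--quiet', action='store_true',
                        help='suppress non-error output')
cat_parser.add_argument('-o', '--output', metavar='FILE',
                        help='output signature to this file (default stdout)')

# equivalent to: sourmash sig cat file1.sig file2.sig -o all.sig
args = parser.parse_args(['cat', 'file1.sig', 'file2.sig', '-o', 'all.sig'])
```

With `nargs='+'`, at least one signature file is required and all of them are collected into the `args.signatures` list; `args.output` stays `None` when `-o` is omitted.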
21 changes: 21 additions & 0 deletions sourmash/cli/sig/split.py
@@ -0,0 +1,21 @@
"""split signature files"""

import csv
import sourmash


def subparser(subparsers):
    subparser = subparsers.add_parser('split')
    subparser.add_argument('signatures', nargs='+')
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '--outdir', help='output signatures to this directory'
    )


def main(args):
    import sourmash
    return sourmash.sig.__main__.split(args)
2 changes: 2 additions & 0 deletions sourmash/cli/signature/__init__.py
@@ -6,6 +6,8 @@
'aliases' argument to add_subparser in ../sig/__init__.py.
"""

from ..sig import cat
from ..sig import split
from ..sig import describe
from ..sig import downsample
from ..sig import extract
108 changes: 107 additions & 1 deletion sourmash/sig/__main__.py
@@ -5,6 +5,7 @@
import sys
import csv
import json
import os

import sourmash
import copy
@@ -19,6 +20,7 @@

** Commands can be:

cat <signature> [<signature> ... ] - concatenate all signatures
describe <signature> [<signature> ... ] - show details of signature
downsample <signature> [<signature> ... ] - downsample one or more signatures
extract <signature> [<signature> ... ] - extract one or more signatures
@@ -27,6 +29,7 @@
intersect <signature> [<signature> ...] - intersect one or more signatures
merge <signature> [<signature> ...] - merge one or more signatures
rename <signature> <name> - rename signature
split <signature> [<signature> ...] - split signatures into single files
subtract <signature> <other_sig> [...] - subtract one or more signatures
import [ ... ] - import a mash or other signature
export <signature> - export a signature, e.g. to mash
@@ -58,6 +61,108 @@ def _set_num_scaled(mh, num, scaled):
##### actual command line functions


def cat(args):
    """
    concatenate all signatures into one file.
    """
    set_quiet(args.quiet)

    siglist = []
    for sigfile in args.signatures:
        this_siglist = []
        try:
            this_siglist = sourmash.load_signatures(sigfile, quiet=True, do_raise=True)
            # materialize inside the try block so that errors raised while
            # iterating over the loaded signatures are also caught here
            this_siglist = list(this_siglist)
        except Exception as exc:
            error('\nError while reading signatures from {}:'.format(sigfile))
            error(str(exc))
            error('(continuing)')
            this_siglist = []

        notify('loaded {} signatures from {}...', len(this_siglist), sigfile,
               end='\r')
        siglist.extend(this_siglist)

    notify('loaded {} signatures total.', len(siglist))

    with FileOutput(args.output, 'wt') as fp:
        sourmash.save_signatures(siglist, fp=fp)

    notify('saved {} signatures', len(siglist))


def split(args):
    """
    split all signatures into individual files
    """
    set_quiet(args.quiet)

    output_names = set()
    output_scaled_template = '{md5sum}.k={ksize}.scaled={scaled}.{moltype}.dup={dup}.{basename}.sig'
    output_num_template = '{md5sum}.k={ksize}.num={num}.{moltype}.dup={dup}.{basename}.sig'

    if args.outdir:
        if not os.path.exists(args.outdir):
            notify('Creating --outdir {}', args.outdir)
            os.mkdir(args.outdir)

    total = 0
    for sigfile in args.signatures:
        # load signatures from input file:
        this_siglist = sourmash_args.load_file_as_signatures(sigfile)

        # save each file individually --
        n_signatures = 0
        for sig in this_siglist:
            n_signatures += 1
            md5sum = sig.md5sum()[:8]
            minhash = sig.minhash
            basename = os.path.basename(sig.filename)
            if not basename or basename == '-':
                basename = 'none'

            params = dict(basename=basename,
                          md5sum=md5sum,
                          scaled=minhash.scaled,
                          ksize=minhash.ksize,
                          num=minhash.num,
                          moltype=minhash.moltype)

            if minhash.scaled:
                output_template = output_scaled_template
            else: # num
                assert minhash.num
                output_template = output_num_template

            # figure out if this is duplicate, build unique filename
            # by incrementing 'dup' until the name has not been used yet
            n = 0
            params['dup'] = n
            output_name = output_template.format(**params)
            while output_name in output_names:
                n += 1
                params['dup'] = n
                output_name = output_template.format(**params)

            output_names.add(output_name)

            if args.outdir:
                output_name = os.path.join(args.outdir, output_name)

            if os.path.exists(output_name):
                notify("** overwriting existing file {}".format(output_name))

            # save!
            with open(output_name, 'wt') as outfp:
                sourmash.save_signatures([sig], outfp)
                notify('writing sig to {}', output_name)

        notify('loaded {} signatures from {}...', n_signatures, sigfile,
               end='\r')
        total += n_signatures

    notify('loaded and split {} signatures total.', total)
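
The duplicate-avoidance step in `split` can be sketched in isolation. The helper name `unique_output_name` is hypothetical and does not exist in sourmash; the sketch only reproduces the idea of bumping the `dup` field until the formatted name has not been seen before.

```python
def unique_output_name(template, params, seen):
    """Return a filename not yet in `seen`, bumping `dup` until it is unique.

    Hypothetical helper for illustration; `seen` is updated in place.
    """
    params = dict(params, dup=0)
    name = template.format(**params)
    while name in seen:
        params['dup'] += 1
        name = template.format(**params)
    seen.add(name)
    return name


template = '{md5sum}.k={ksize}.scaled={scaled}.{moltype}.dup={dup}.{basename}.sig'
params = dict(md5sum='f372e478', ksize=21, scaled=1000, moltype='DNA',
              basename='2.fa')

seen = set()
first = unique_output_name(template, params, seen)
second = unique_output_name(template, params, seen)
```

Two identical signatures therefore end up in files that differ only in their `dup=` field, and neither overwrites the other within a single run.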


def describe(args):
"""
provide basic info on signatures
@@ -76,7 +181,8 @@ def describe(args):
error(str(exc))
error('(continuing)')

- notify('loaded {} signatures from {}...', len(siglist), sigfile,
+ this_siglist = list(this_siglist)
+ notify('loaded {} signatures from {}...', len(this_siglist), sigfile,
end='\r')

notify('loaded {} signatures total.', len(siglist))
73 changes: 73 additions & 0 deletions sourmash/sourmash_args.py
@@ -4,6 +4,12 @@
import sys
import os
import argparse
import itertools
from enum import Enum

from sourmash import load_sbt_index
from sourmash.lca.lca_db import load_single_database

from . import signature
from .logging import notify, error

@@ -308,6 +314,73 @@ def load_dbs_and_sigs(filenames, query, is_similarity_query, traverse=False):
return databases


class DatabaseType(Enum):
    SIGLIST = 1
    SBT = 2
    LCA = 3


def load_database(filename):
    """Load file as a database - list of signatures, LCA, SBT, etc.

    Return a (database, DatabaseType) tuple.

    This will (eventually) supersede load_dbs_and_sigs.

    TODO:
    - add traversal behavior + force load for directories.
    - add stdin for reading signatures?
    - maybe add file lists?
    """
    loaded = False
    dbtype = None
    try:
        # CTB: could make this a generator, with some trickery; but for
        # now, just force into list.
        with open(filename, 'rt') as fp:
            db = sourmash.load_signatures(fp, quiet=True, do_raise=True)
            db = list(db)

        loaded = True
        dbtype = DatabaseType.SIGLIST
    except Exception as exc:
        pass

    if not loaded:                    # try load as SBT
        try:
            db = load_sbt_index(filename)
            loaded = True
            dbtype = DatabaseType.SBT
        except:
            pass

    if not loaded:                    # try load as LCA
        try:
            db, _, _ = load_single_database(filename)
            loaded = True
            dbtype = DatabaseType.LCA
        except:
            pass

    if not loaded:
        error('\nError while reading signatures from {}.'.format(filename))
        sys.exit(-1)

    return db, dbtype


def load_file_as_signatures(filename):
    """Load 'filename' as a collection of signatures. Return an iterable.

    If it's an LCA or SBT, call the .signatures() method on it.
    """
    db, dbtype = load_database(filename)
    if dbtype in (DatabaseType.LCA, DatabaseType.SBT):
        return db.signatures()
    elif dbtype == DatabaseType.SIGLIST:
        return db
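
The "try each format in turn" approach used by `load_database` can be sketched generically. Everything below is a hypothetical stand-in: `load_first_that_works` and the two fake loaders are not sourmash functions, and the file-extension checks exist only to make the example self-contained. Unlike `load_database`, this sketch raises an exception instead of calling `sys.exit`.

```python
from enum import Enum


class DatabaseType(Enum):
    SIGLIST = 1
    SBT = 2
    LCA = 3


def load_first_that_works(filename, loaders):
    """Try each (dbtype, loader) pair in order; return (db, dbtype) on first success."""
    failures = []
    for dbtype, loader in loaders:
        try:
            return loader(filename), dbtype
        except Exception as exc:
            # remember why this loader failed, then fall through to the next
            failures.append('{}: {}'.format(dbtype.name, exc))
    raise ValueError('could not load {} ({})'.format(filename, '; '.join(failures)))


# stand-in loaders (assumptions for illustration only)
def fake_siglist_loader(filename):
    if not filename.endswith('.sig'):
        raise ValueError('not a signature file')
    return ['signature from ' + filename]


def fake_sbt_loader(filename):
    if not filename.endswith('.sbt.json'):
        raise ValueError('not an SBT')
    return {'sbt': filename}


LOADERS = [(DatabaseType.SIGLIST, fake_siglist_loader),
           (DatabaseType.SBT, fake_sbt_loader)]

db, dbtype = load_first_that_works('index.sbt.json', LOADERS)
```

Keeping the loaders in an ordered list makes the precedence explicit and records why each attempt failed, which the chained `try`/`except` blocks above silently discard.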


class FileOutput(object):
"""A context manager for file outputs that handles sys.stdout gracefully.
