Skip to content
/ ness Public

Heterogeneous graph aggregation and analysis using large-scale, multi-species, biological datasets.

Notifications You must be signed in to change notification settings

treynr/ness

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Network Enhanced Similarity Search (NESS)

https://img.shields.io/circleci/build/github/treynr/ness/master?style=flat-square&token=3e277067ea5de25755905e093e40d0e70db4c3cf

NESS aggregates and harmonizes heterogeneous graph types including ontologies and their annotations, biological networks, and bipartite representations of experimental study results across species. The tool employs diffusion metrics, specifically a random walk with restart (RWR), to estimate the relations among entities in the graph and to make data-driven comparisons.

Usage

Usage: ness [OPTIONS] [OUTPUT]

  Network Enhanced Similarity Search (NESS).
  Integrate heterogeneous functional genomics datasets and calculate
  diffusion metrics over the heterogeneous network using a random
  walk with restart.

Options:
  -a, --annotations PATH      ontology annotation files
  -e, --edges PATH            edge list files
  -g, --genesets PATH         gene set files
  -h, --homology PATH         homology mapping files
  -o, --ontologies PATH       ontology relation files
  -s, --seeds TEXT            seed node
  --seed-file PATH            file containing list of seed nodes
  -m, --multiple              if using multiple seed nodes, start from all
                              seeds at once instead of one walk per seed
  -d, --distributed           parallelize the random walk by distributing
                              across available cores
  -c, --cores INTEGER         distribute computations among N cores (default =
                              all available cores)
  -n, --permutations INTEGER  permutations to run
  -r, --restart FLOAT         restart probability
  --verbose                   clutter your screen with output
  --version                   Show the version and exit.
  --help                      Show this message and exit.

For example, NESS can be used calculate all pairwise gene affinities for genes in biological network, with a restart probability of 0.15, and save the results to a file:

$ ness -e network.txt -r 0.15 results.tsv

Or, calculate the affinity between a single gene (e.g., MOBP) and all other genes in the network:

$ ness -e network.txt -s MOBP results.tsv

For large networks, the random walk can be distributed across cores (eight in this case):

$ ness -e network.txt -d -c 8 results.tsv

The significance of any results can be assessed via permutations of the graph:

$ ness -e network.txt -p 500 results.tsv

Inputs

NESS accepts five different input types for contstructing the heterogeneous graph:

edge lists
-e/--edges. Undirected edges from biological networks (e.g., BioGRID).
gene sets
-g/--genesets. Bipartite set-gene associations from gene set resources such as GeneWeaver.
homology mappings
-h/--homology. Bipartite associations between homology clusters and genes. Direct associations between homologous genes in separate species can also be used.
ontologies
-o/--ontologies. Directed acyclic graphs (DAG) used for knowledge representation in ontologies such as the Gene Ontology (GO).
ontology annotations
-a/--annotations. Bipartite term-annotation associations from resources such as the GO or Mammalian Phenotype Ontology.

All input formats must be tab delimited. Header rows are optional--NESS will attempt to detect if one is present or not. Rows that begin with '#' are treated as comments.

Edge lists

Edge list inputs contain four columns, the last two are optional. Each row lists a separate edge. The first two columns specify identifiers for the two nodes that comprise the edge. The last two (optional) columns describe the biotypes for the two nodes.

Example edge list input
node1 node2 biotype1 biotype2
ENSG00000168314 ENSP00000312293 gene protein
ENSG00000168314 ENSG00000184221 gene gene
ENSG00000184221 ENSG00000205927 gene gene

Gene sets

Gene set inputs contain four columns, the last two are optional. Each row lists a single gene contained in a specific set. The first two columns specify identifiers for the sets and genes respectively. The last two (optional) columns describe the biotypes for the sets and genes.

Example gene set input
geneset gene set_biotype gene_biotype
Upregulated in opioid dependence DRD2 geneset gene
Upregulated in opioid dependence OPRM1 geneset gene
GS1337 HDAC1 geneset gene
GS1337 HDAC2 geneset gene

Alternatively, gene sets can also be specified per row by separating genes using the pipe '|' character. The input above can be reformatted as:

Example gene set input
geneset gene set_biotype gene_biotype
Upregulated in opioid dependence DRD2|OPRM1 geneset gene
GS1337 HDAC1|HDAC2 geneset gene

Homology mappings

Homology inputs contain four columns, the last two are optional. Each row lists a cluster of homologous genes and a gene belonging to that cluster. The first two columns specify identifiers for the cluster and gene respectively. The last two (optional) columns describe the biotypes for the cluster and gene.

Example homology input
cluster gene cluster_biotype gene_biotype
1 ENSG00000149295 ortholog gene
1 ENSMUSG00000032259 ortholog gene
1 ENSRNOG00000008428 ortholog gene

Ontologies

Ontology inputs contain four columns, the last two are optional. Each row lists a single term-term edge present in a DAG. The first two columns specify identifiers for the ontology terms comprising the edge. The last two (optional) columns describe the biotypes for the terms.

Example ontology input
term1 term2 biotype1 biotype2
GO:0048149 GO:0045471 go_bp go_bp
GO:0045471 GO:0009636 go_bp go_bp
GO:0030534 GO:0009636 go_bp go_bp

Ontology annotations

Ontology annotation inputs contain four columns, the last two are optional. Each row lists a single term annotation. The first two columns specify identifiers for the ontology term and its annotation. The last two (optional) columns describe the biotypes for the term and annotation.

Example annotation input
term gene term_biotype gene_biotype
GO:0048149 DRD2 go_bp gene
GO:0048149 OPRM1 go_bp gene
GO:0048149 HDAC2 go_bp gene

Installation

The current release is v1.1.0. Install via pip:

$ pip install https://github.com/treynr/ness/releases/download/v1.1.0/ness-1.1.0.tar.gz

Or clone this repo and install using poetry:

$ git clone https://github.com/treynr/ness.git
$ cd ness
$ poetry install
$ poetry run ness

Testing

Run unit and functional tests:

$ PYTHONPATH=. pytest tests -v

..type checks:

$ mypy --show-error-codes --ignore-missing-imports --no-strict-optional ness

..and style checks:

$ flake8 ness

Requirements

See pyproject.toml for a complete list of required Python packages. The major requirements are:

Funding

Part of the GeneWeaver data repository and analysis platform. For a detailed description of GeneWeaver, see this article.

This work has been supported by joint funding from the NIAAA and NIDA, NIH [R01 AA18776]; and The Jackson Laboratory (JAX) Center for Precision Genetics of the NIH [U54 OD020351].

About

Heterogeneous graph aggregation and analysis using large-scale, multi-species, biological datasets.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages