diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml index 65f84f73..df9abd0d 100644 --- a/.github/workflows/checks.yml +++ b/.github/workflows/checks.yml @@ -30,3 +30,21 @@ jobs: - name: ruff uses: chartboost/ruff-action@v1 + docs: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: 3.11 + + - name: Install dependencies + run: | + python3 -m pip install --upgrade pip + python3 -m pip install '.[docs]' + + - name: Attempt docs build + working-directory: ./docs + run: make html diff --git a/.gitignore b/.gitignore index cb44c63d..b676a667 100644 --- a/.gitignore +++ b/.gitignore @@ -69,7 +69,7 @@ instance/ .scrapy # Sphinx documentation -docs/_build/ +docs/build/ # PyBuilder target/ @@ -130,9 +130,14 @@ dmypy.json Pipfile.lock +.DS_Store + # Data files cool_seq_tool/data/seqrepo/ cool_seq_tool/data/*.txt cool_seq_tool/data/LRG_RefSeqGene* cool_seq_tool/data/MANE* cool_seq_tool/data/notebooks/ + +# Autogenerated docs +docs/source/reference/api diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 00000000..24bdd851 --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,16 @@ +version: 2 + +build: + os: "ubuntu-20.04" + tools: + python: "3.11" + +python: + install: + - method: pip + path: . + extra_requirements: + - docs + +sphinx: + configuration: docs/source/conf.py diff --git a/LICENSE b/LICENSE index 3b9aae3f..a718fa73 100644 --- a/LICENSE +++ b/LICENSE @@ -1,6 +1,6 @@ MIT License -Copyright (c) 2021 VICC +Copyright (c) 2021-2023 Wagner Lab Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/README.md b/README.md index b5ddf524..1b49d0eb 100644 --- a/README.md +++ b/README.md @@ -1,153 +1,52 @@ -# **C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool +

+CoolSeqTool +

-The **cool-seq-tool** provides: +**[Documentation](#)** · [Installation](#) · [Usage](#) · [API reference](#) - - Transcript alignment data from the [UTA](https://github.com/biocommons/uta) database - - Fast access to sequence data using [SeqRepo](https://github.com/biocommons/biocommons.seqrepo) - - Liftover between assemblies (GRCh38 <--> GRCh37) from [PyLiftover](https://github.com/konstantint/pyliftover) - - Lifting over to preferred [MANE](https://www.ncbi.nlm.nih.gov/refseq/MANE/) compatible transcript. See [here](docs/TranscriptSelectionPriority.md) for more information. +## Overview -## Installation + +The **CoolSeqTool** provides: -### pip + - A Pythonic API on top of sequence data of interest to tertiary analysis tools, including mappings between gene names and transcripts, [MANE transcript](https://www.ncbi.nlm.nih.gov/refseq/MANE/) descriptions, and the [Universal Transcript Archive](https://github.com/biocommons/uta) + - Augmented access to the [SeqRepo](https://github.com/biocommons/biocommons.seqrepo) database, including multiple additional methods and tools + - Mapping tools that combine the above to support translation between references sequences, annotation layers, and MANE transcripts + -```commandline -pip install cool-seq-tool[dev,tests] -``` - -### Development - -Clone the repo: - -```commandline -git clone https://github.com/GenomicMedLab/cool-seq-tool -cd cool_seq_tool -``` - -[Install Pipenv](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today) if necessary. - -Install backend dependencies and enter Pipenv environment: - -```commandline -pipenv shell -pipenv update -pipenv install --dev -``` - -### UTA Database Installation - -`cool-seq-tool` uses intalls local UTA database. For other ways to install, visit [biocommons.uta](https://github.com/biocommons/uta). - -#### Local Installation - -_The following commands will likely need modification appropriate for the installation environment._ -1. Install [PostgreSQL](https://www.postgresql.org/) -2. Create user and database. - - ``` - $ createuser -U postgres uta_admin - $ createuser -U postgres anonymous - $ createdb -U postgres -O uta_admin uta - ``` - -3. To install locally, from the _cool_seq_tool/data_ directory: -``` -export UTA_VERSION=uta_20210129b.pgd.gz -curl -O https://dl.biocommons.org/uta/$UTA_VERSION -gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433 -``` - -##### UTA Installation Issues -If you have trouble installing UTA, you can visit [these two READMEs](https://github.com/ga4gh/vrs-python/tree/main/docs/setup_help). - -#### Connecting to the database - -To connect to the UTA database, you can use the default url (`postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b`). - -If you do not wish to use the default, you must set the environment variable `UTA_DB_URL` which has the format of `driver://user:password@host:port/database/schema`. - -### Data Downloads +--- -#### SeqRepo -`cool-seq-tool` relies on [seqrepo](https://github.com/biocommons/biocommons.seqrepo), which you must download yourself. +## Install -Use the `SEQREPO_ROOT_DIR` environment variable to set the path of an already existing SeqRepo directory. The default is `/usr/local/share/seqrepo/latest`. +CoolSeqTool is available on [PyPI](https://pypi.org/project/cool-seq-tool) -From the _root_ directory: +```shell +python3 -m pip install cool-seq-tool ``` -pip install seqrepo -sudo mkdir /usr/local/share/seqrepo -sudo chown $USER /usr/local/share/seqrepo -seqrepo pull -i 2021-01-29 # Replace with latest version using `seqrepo list-remote-instances` if outdated -``` - -If you get an error similar to the one below: -``` -PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29' -``` - -You will want to do the following:\ -(*Might not be ._fkuefgd, so replace with your error message path*) -```console -sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29 -exit -``` - -#### LRG_RefSeqGene - -`cool-seq-tool` fetches the latest version of `LRG_RefSeqGene` if the environment variable `LRG_REFSEQGENE_PATH` is not set. When `LRG_REFSEQGENE_PATH` is set, `cool-seq-tool` will look at this path and expect the LRG_RefSeqGene file. This file is found can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene). - -#### MANE Summary Data - -`cool-seq-tool` fetches the latest version of `MANE.GRCh38.*.summary.txt.gz` if the environment variable `MANE_SUMMARY_PATH` is not set. When `MANE_SUMMARY_PATH` is set, `cool-seq-tool` will look at this path and expect the MANE Summary Data file. This file is found can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/). - -#### transcript_mapping.tsv -`cool-seq-tool` is packaged with transcript mapping data acquired from [Ensembl BioMart](http://www.ensembl.org/biomart/martview). If the environment variable `TRANSCRIPT_MAPPINGS_PATH` is not set, `cool-seq-tool` will use the built-in file. When `TRANSCRIPT_MAPPINGS_PATH` is set, `cool_seq_tool` will look at this path and expect to find the transcript mapping TSV file. -To acquire this data manually from the [BioMart](https://www.ensembl.org/biomart/martview), select the `Human Genes (GRCh38.p13)` dataset and choose the following attributes: +See the [installation instructions](#) in the documentation for a description of dependency setup requirements. -* Gene stable ID -* Gene stable ID version -* Transcript stable ID -* Transcript stable ID version -* Protein stable ID -* Protein stable ID version -* RefSeq match transcript (MANE Select) -* Gene name +--- -![image](biomart.png) +## Usage -## Starting the UTA Tools Service Locally +All CoolSeqTool resources can be initialized by way of a top-level class instance: -To start the service, run the following: - -```commandline -uvicorn cool_seq_tool.api:app --reload +```pycon +>>> from cool_seq_tool.app import CoolSeqTool +>>> cst = CoolSeqTool() +>>> result = await cst.mane_transcript.get_mane_transcript( +... "NP_004324.2", +... 599, +... AnnotationLayer.PROTEIN, +... residue_mode=ResidueMode.INTER_RESIDUE, +... ) +>>> result.gene, result.refseq, result.status +('EGFR', 'NM_005228.5', ) ``` -Next, view the FastAPI on your local machine: http://127.0.0.1:8000/cool_seq_tool - -## Init coding style tests - -Code style is managed by [Ruff](https://github.com/astral-sh/ruff) and [Black](https://github.com/psf/black), and should be checked prior to commit. - -We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests. - -This ensures: - -* Check code style -* Check for added large files -* Detect AWS Credentials -* Detect Private Key +--- -Before first commit run: +## Feedback and contributing -``` -pre-commit install -``` - -## Testing -From the _root_ directory of the repository: -``` -pytest -``` +We welcome bug reports, feature requests, and code contributions from users and interested collaborators. The [documentation](#) contains guidance for submitting feedback and contributing new code. diff --git a/cool_seq_tool/api.py b/cool_seq_tool/api.py index 89c7efdc..b36604e7 100644 --- a/cool_seq_tool/api.py +++ b/cool_seq_tool/api.py @@ -24,9 +24,9 @@ def custom_openapi() -> Dict: if app.openapi_schema: return app.openapi_schema openapi_schema = get_openapi( - title="The GenomicMedLab Cool Seq Tool", + title="The GenomicMedLab Cool-Seq-Tool", version=__version__, - description="Common Operations On Lots-of Sequences Tool.", + description="Common Operations On Lots of Sequences Tool.", routes=app.routes, ) diff --git a/cool_seq_tool/app.py b/cool_seq_tool/app.py index 23c38584..f039b0e6 100644 --- a/cool_seq_tool/app.py +++ b/cool_seq_tool/app.py @@ -1,4 +1,6 @@ -"""Module for initializing data sources.""" +"""Provides core CoolSeqTool class, which non-redundantly initializes all Cool-Seq-Tool +data handler and mapping resources for straightforward access. +""" import logging from pathlib import Path from typing import Optional @@ -25,7 +27,26 @@ class CoolSeqTool: - """Class to initialize data sources.""" + """Non-redundantly initialize all Cool-Seq-Tool data resources, available under the + following attribute names: + + * ``self.seqrepo_access``: :py:class:`SeqRepoAccess ` + * ``self.transcript_mappings``: :py:class:`TranscriptMappings ` + * ``self.mane_transcript_mappings``: :py:class:`MANETranscriptMappings ` + * ``self.uta_db``: :py:class:`UTADatabase ` + * ``self.alignment_mapper``: :py:class:`AlignmentMapper ` + * ``self.mane_transcript``: :py:class:`MANETranscript ` + * ``self.ex_g_coords_mapper``: :py:class:`ExonGenomicCoordsMapper ` + + Initialization with default resource locations is straightforward: + + .. code-block:: pycon + + >>> from cool_seq_tool.app import CoolSeqTool + >>> cst = CoolSeqTool() + + See the :ref:`configuration ` section for more information. + """ def __init__( self, @@ -37,11 +58,11 @@ def __init__( ) -> None: """Initialize CoolSeqTool class - :param transcript_file_path: The path to transcript_mapping.tsv - :param lrg_refseqgene_path: The path to LRG_RefSeqGene + :param transcript_file_path: The path to ``transcript_mapping.tsv`` + :param lrg_refseqgene_path: The path to the LRG_RefSeqGene file :param mane_data_path: Path to RefSeq MANE summary data :param db_url: PostgreSQL connection URL - Format: `driver://user:password@host/database/schema` + Format: ``driver://user:password@host/database/schema`` :param sr: SeqRepo instance. If this is not provided, will create a new instance """ if not sr: diff --git a/cool_seq_tool/data/data_downloads.py b/cool_seq_tool/data/data_downloads.py index 373636ee..dd42e74e 100644 --- a/cool_seq_tool/data/data_downloads.py +++ b/cool_seq_tool/data/data_downloads.py @@ -1,4 +1,4 @@ -"""Module for handling downloadable data files.""" +"""Handle acquisition of external data.""" import datetime import gzip import logging @@ -15,8 +15,11 @@ class DataDownload: - """Class for managing downloadable data files. Responsible for checking if files - are available under default locations, and fetching them if not. + """Manage downloadable data files. Responsible for checking if files are available + under expected locations, and fetching them if not. + + Relevant methods are called automatically by data classes; users should not have + to interact with this class under normal circumstances. """ def __init__(self) -> None: @@ -25,7 +28,7 @@ def __init__(self) -> None: def get_mane_summary(self) -> Path: """Identify latest MANE summary data. If unavailable locally, download from - source. + `NCBI FTP server `_. :return: path to MANE summary file """ @@ -52,7 +55,7 @@ def get_mane_summary(self) -> Path: def get_lrg_refseq_gene_data(self) -> Path: """Identify latest LRG RefSeq Gene file. If unavailable locally, download from - source. + `NCBI FTP server `_. :return: path to acquired LRG RefSeq Gene data file """ diff --git a/cool_seq_tool/handlers/seqrepo_access.py b/cool_seq_tool/handlers/seqrepo_access.py index d226739e..8f34e3a3 100644 --- a/cool_seq_tool/handlers/seqrepo_access.py +++ b/cool_seq_tool/handlers/seqrepo_access.py @@ -1,4 +1,6 @@ -"""A module for accessing SeqRepo.""" +"""Wrap SeqRepo to provide additional lookup and identification methods on top of basic +dereferencing functions. +""" import logging from os import environ from pathlib import Path @@ -13,7 +15,9 @@ class SeqRepoAccess(SeqRepoDataProxy): - """The SeqRepoAccess class.""" + """Provide a wrapper around the base SeqRepoDataProxy class from ``VRS-Python`` to + provide additional lookup and identification methods. + """ environ["SEQREPO_LRU_CACHE_MAXSIZE"] = "none" @@ -24,14 +28,22 @@ def get_reference_sequence( end: Optional[int] = None, residue_mode: ResidueMode = ResidueMode.RESIDUE, ) -> Tuple[str, Optional[str]]: - """Get reference sequence for an accession given a start and end position. - If `start` and `end` are not given, it will return the entire reference sequence + """Get reference sequence for an accession given a start and end position. If + ``start`` and ``end`` are not given, returns the entire reference sequence. + + >>> from cool_seq_tool.handlers import SeqRepoAccess + >>> from biocommons.seqrepo import SeqRepo + >>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest")) + >>> sr.get_reference_sequence("NM_002529.3", 1, 10)[0] + 'TGCAGCTGG' + >>> sr.get_reference_sequence("NP_001341538.1", 1, 10)[0] + 'MAALSGGGG' :param ac: Accession :param start: Start pos change - :param end: End pos change. If `None` assumes both `start` and `end` have same - values, if `start` exists. - :param residue_mode: Residue mode for `start` and `end` + :param end: End pos change. If ``None`` assumes both ``start`` and ``end`` have + same values, if ``start`` exists. + :param residue_mode: Residue mode for ``start`` and ``end`` :return: Sequence at position (if accession and positions actually exist, else return empty string), warning if any """ @@ -88,6 +100,21 @@ def translate_identifier( ) -> Tuple[List[str], Optional[str]]: """Return list of identifiers for accession. + >>> from cool_seq_tool.handlers import SeqRepoAccess + >>> from biocommons.seqrepo import SeqRepo + >>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest")) + >>> sr.translate_identifier("NM_002529.3")[0] + ['MD5:18f0a6e3af9e1bbd8fef1948c7156012', + 'NCBI:NM_002529.3', + 'refseq:NM_002529.3', + 'SEGUID:dEJQBkga9d9VeBHTyTbg6JEtTGQ', + 'SHA1:74425006481af5df557811d3c936e0e8912d4c64', + 'VMC:GS_RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA', + 'sha512t24u:RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA', + 'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA'] + >>> sr.translate_identifier("NM_002529.3", "ga4gh")[0] + ['ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA'] + :param ac: Identifier accession :param target_namespace: The namespace(s) of identifier to return :return: List of identifiers, warning @@ -123,7 +150,7 @@ def chromosome_to_acs( ) -> Tuple[Optional[List[str]], Optional[str]]: """Get accessions for a chromosome - :param str chromosome: Chromosome number. Must be either 1-22, X, or Y + :param chromosome: Chromosome number. Must be either 1-22, X, or Y :return: Accessions for chromosome (ordered by latest assembly) """ acs = [] @@ -160,9 +187,20 @@ def ac_to_chromosome(self, ac: str) -> Tuple[Optional[str], Optional[str]]: def get_fasta_file(self, sequence_id: str, outfile_path: Path) -> None: """Retrieve FASTA file containing sequence for requested sequence ID. - :param sequence_id: accession ID, sans namespace, eg `NM_152263.3` + + >>> from pathlib import Path + >>> from cool_seq_tool.handlers import SeqRepoAccess + >>> from biocommons.seqrepo import SeqRepo + >>> sr = SeqRepoAccess(SeqRepo("/usr/local/share/seqrepo/latest")) + >>> # write to local file tpm3.fasta: + >>> sr.get_fasta_file("NM_002529.3", Path("tpm3.fasta")) + + FASTA file headers will include GA4GH sequence digest, Ensembl accession ID, + and RefSeq accession ID. + + :param sequence_id: accession ID, sans namespace, eg ``NM_152263.3`` :param outfile_path: path to save file to - :return: None, but saves sequence data to `outfile_path` if successful + :return: None, but saves sequence data to ``outfile_path`` if successful :raise: KeyError if SeqRepo doesn't have sequence data for the given ID """ sequence = self.get_reference_sequence(sequence_id)[0] diff --git a/cool_seq_tool/mappers/__init__.py b/cool_seq_tool/mappers/__init__.py index ecad925b..9fbb3306 100644 --- a/cool_seq_tool/mappers/__init__.py +++ b/cool_seq_tool/mappers/__init__.py @@ -2,3 +2,6 @@ from .alignment import AlignmentMapper # noqa: I001 from .mane_transcript import MANETranscript from .exon_genomic_coords import ExonGenomicCoordsMapper + + +__all__ = ["AlignmentMapper", "MANETranscript", "ExonGenomicCoordsMapper"] diff --git a/cool_seq_tool/mappers/alignment.py b/cool_seq_tool/mappers/alignment.py index 5a94e711..627cddad 100644 --- a/cool_seq_tool/mappers/alignment.py +++ b/cool_seq_tool/mappers/alignment.py @@ -19,11 +19,10 @@ def __init__( ) -> None: """Initialize the AlignmentMapper class. - :param SeqRepoAccess seqrepo_access: Access to seqrepo queries - :param TranscriptMappings transcript_mappings: Access to transcript - accession mappings and conversions - :param UTADatabase uta_db: UTADatabase instance to give access to query - UTA database + :param seqrepo_access: Access to seqrepo queries + :param transcript_mappings: Access to transcript accession mappings and + conversions + :param uta_db: UTADatabase instance to give access to query UTA database """ self.seqrepo_access = seqrepo_access self.transcript_mappings = transcript_mappings @@ -38,15 +37,16 @@ async def p_to_c( ) -> Tuple[Optional[Dict], Optional[str]]: """Translate protein representation to cDNA representation. - :param str p_ac: Protein RefSeq accession - :param int p_start_pos: Protein start position - :param int p_end_pos: Protein end position - :param ResidueMode residue_mode: Residue mode for `p_start_pos` and `p_end_pos` + :param p_ac: Protein RefSeq accession + :param p_start_pos: Protein start position + :param p_end_pos: Protein end position + :param residue_mode: Residue mode for ``p_start_pos`` and ``p_end_pos`` :return: Tuple containing: - - cDNA representation (accession, codon range positions for corresponding - change, cds start site) if able to translate. Will return positions as - inter-residue coordinates. If unable to translate, returns `None`. - - Warning, if unable to translate to cDNA representation. Else `None` + + * cDNA representation (accession, codon range positions for corresponding + change, cds start site) if able to translate. Will return positions as + inter-residue coordinates. If unable to translate, returns ``None``. + * Warning, if unable to translate to cDNA representation. Else ``None`` """ # Get cDNA accession temp_c_ac = await self.uta_db.p_to_c_ac(p_ac) @@ -86,10 +86,10 @@ async def p_to_c( async def _get_cds_start(self, c_ac: str) -> Tuple[Optional[int], Optional[str]]: """Get CDS start for a given cDNA RefSeq accession - :param str c_ac: cDNA RefSeq accession + :param c_ac: cDNA RefSeq accession :return: Tuple containing: - - CDS start site if found. Else `None` - - Warning, if unable to get CDS start. Else `None` + - CDS start site if found. Else ``None`` + - Warning, if unable to get CDS start. Else ``None`` """ cds_start_end = await self.uta_db.get_cds_start_end(c_ac) if not cds_start_end: @@ -111,16 +111,17 @@ async def c_to_g( ) -> Tuple[Optional[Dict], Optional[str]]: """Translate cDNA representation to genomic representation - :param str c_ac: cDNA RefSeq accession - :param int c_start_pos: cDNA start position for codon - :param int c_end_pos: cDNA end position for codon - :param Optional[int] coding_start_site: Coding start site. If not provided, - this will be computed. - :param Assembly target_genome_assembly: Genome assembly to get genomic data for + :param c_ac: cDNA RefSeq accession + :param c_start_pos: cDNA start position for codon + :param c_end_pos: cDNA end position for codon + :param coding_start_site: Coding start site. If not provided, this will be + computed. + :param target_genome_assembly: Genome assembly to get genomic data for :return: Tuple containing: - - Genomic representation (ac, positions) if able to translate. Will return - positions as inter-residue coordinates. Else `None`. - - Warning, if unable to translate to genomic representation. Else `None` + + * Genomic representation (ac, positions) if able to translate. Will return + positions as inter-residue coordinates. Else ``None``. + * Warning, if unable to translate to genomic representation. Else ``None`` """ if any( ( @@ -212,17 +213,19 @@ async def p_to_g( residue_mode: ResidueMode = ResidueMode.INTER_RESIDUE, target_genome_assembly: Assembly = Assembly.GRCH38, ) -> Tuple[Optional[Dict], Optional[str]]: - """Translate protein representation to genomic representation - - :param str p_ac: Protein RefSeq accession - :param int p_start_pos: Protein start position - :param int p_end_pos: Protein end position - :param ResidueMode residue_mode: Residue mode for `p_start_pos` and `p_end_pos`. - :param Assembly target_genome_assembly: Genome assembly to get genomic data for + """Translate protein representation to genomic representation, by way of + intermediary conversion into cDNA coordinates. + + :param p_ac: Protein RefSeq accession + :param p_start_pos: Protein start position + :param p_end_pos: Protein end position + :param residue_mode: Residue mode for ``p_start_pos`` and ``p_end_pos``. + :param target_genome_assembly: Genome assembly to get genomic data for :return: Tuple containing: - - Genomic representation (ac, positions) if able to translate. Will return - positions as inter-residue coordinates. Else `None`. - and warnings. The genomic data will always return inter-residue coordinates + + * Genomic representation (ac, positions) if able to translate. Will return + positions as inter-residue coordinates. Else ``None``. + * Warnings, if conversion to cDNA or genomic coordinates fails. """ c_data, warning = await self.p_to_c( p_ac, p_start_pos, p_end_pos, residue_mode=residue_mode diff --git a/cool_seq_tool/mappers/exon_genomic_coords.py b/cool_seq_tool/mappers/exon_genomic_coords.py index de8bcd3b..468d6c9a 100644 --- a/cool_seq_tool/mappers/exon_genomic_coords.py +++ b/cool_seq_tool/mappers/exon_genomic_coords.py @@ -1,4 +1,4 @@ -"""Module for mapping transcript exon to and from genomic coordinates""" +"""Provide mapping capabilities between transcript exon and genomic coordinates.""" import logging from typing import Dict, List, Optional, Tuple, TypeVar, Union @@ -24,12 +24,28 @@ class ExonGenomicCoordsMapper: - """Class for mapping transcript exon representation to/from genomic coordinate - representation + """Provide capabilties for mapping transcript exon representation to/from genomic + coordinate representation. """ def __init__(self, uta_db: UTADatabase, mane_transcript: MANETranscript) -> None: - """Initialize ExonGenomicCoordsMapper class + """Initialize ExonGenomicCoordsMapper class. + + A lot of resources are required for initialization, so when defaults are enough, + it's easiest to let the core CoolSeqTool class handle it for you: + + >>> from cool_seq_tool.app import CoolSeqTool + >>> egc = CoolSeqTool().ex_g_coords_mapper + + Note that this class's methods are all defined as ``async``, so they will + need to be called with ``await``. + + >>> result = egc.transcript_to_genomic_coordinates( + ... transcript="NM_002529.3", + ... exon_end=17 + ... ) + >>> result.genomic_data + AttributeError: 'coroutine' object has no attribute 'genomic_data' :param uta_db: UTADatabase instance to give access to query UTA database :param mane_transcript: Instance to align to MANE or compatible representation @@ -44,8 +60,8 @@ def _return_warnings( """Add warnings to response object :param resp: Response object - :param warning_msg: Warning message on why `transcript_exon_data` or - `genomic_data` field is None + :param warning_msg: Warning message on why ``transcript_exon_data`` or + ``genomic_data`` field is ``None`` :return: Response object with warning message """ logger.warning(warning_msg) @@ -63,7 +79,25 @@ async def transcript_to_genomic_coordinates( **kwargs, ) -> GenomicDataResponse: """Get genomic data given transcript data. - Will use GRCh38 coordinates if possible + + By default, inputs are assumed to be in GRCh38 where they aren't otherwise + precisely defined. + + >>> from cool_seq_tool.app import CoolSeqTool + >>> egc = CoolSeqTool().ex_g_coords_mapper + >>> tpm3 = await egc.transcript_to_genomic_coordinates( + ... gene="TPM3", chr="NC_000001.11", + ... exon_start=1, exon_end=8, + ... transcript="NM_152263.3" + ... ) + >>> (tpm3.genomic_data.chr, tpm3.genomic_data.start, tpm3.genomic_data.end) + ('NC_000001.11', 164192135, 154170399) + >>> ntrk1 = await egc.transcript_to_genomic_coordinates( + ... transcript="NM_002529.3", + ... exon_end=17 + ... ) + >>> (ntrk1.genomic_data.chr, ntrk1.genomic_data.end) + ('NC_000001.11', 156881456) :param gene: Gene symbol :param transcript: Transcript accession @@ -168,22 +202,35 @@ async def genomic_to_transcript_exon_coordinates( residue_mode: ResidueMode = ResidueMode.RESIDUE, **kwargs, ) -> GenomicDataResponse: - """Get transcript data for genomic data. - MANE Transcript data will be returned iff `transcript` is not supplied. - `gene` must be supplied in order to retrieve MANE Transcript data. - Liftovers genomic coordinates to GRCh38 - - :param chromosome: Chromosome. Must either give chromosome number (i.e. `1`) or - accession (i.e. `NC_000001.11`). + """Get transcript data for genomic data, lifted over to GRCh38. + + MANE Transcript data will be returned if and only if ``transcript`` is not + supplied. ``gene`` must be given in order to retrieve MANE Transcript data. + + >>> from cool_seq_tool.app import CoolSeqTool + >>> egc = CoolSeqTool().ex_g_coords_mapper + >>> result = await egc.genomic_to_transcript_exon_coordinates( + ... chromosome="NC_000001.11", + ... start=154192136, + ... end=154170400, + ... strand=-1, + ... transcript="NM_152263.3" + ... ) + >>> (result.genomic_data.exon_start, result.genomic_data.exon_end) + (1, 8) + + :param chromosome: Chromosome. Must either give chromosome number (i.e. ``1``) + or accession (i.e. ``NC_000001.11``). :param start: Start genomic position :param end: End genomic position - :param strand: Strand. Must be either `-1` or `1`. + :param strand: Strand. Must be either ``-1`` or ``1``. :param transcript: The transcript to use. If this is not given, we will try the following transcripts: MANE Select, MANE Clinical Plus, Longest Remaining - Compatible Transcript + Compatible Transcript. See the :ref:`Transcript Selection policy ` + page. :param gene: Gene symbol - :param residue_mode: Default is `resiude` (1-based). Must be either `residue` or - `inter-residue` (0-based). + :param residue_mode: Default is ``residue`` (1-based). Must be either + ``residue`` or ``inter-residue`` (0-based). :return: Genomic data (inter-residue coordinates) """ resp = GenomicDataResponse( @@ -266,26 +313,26 @@ async def _genomic_to_transcript_exon_coordinate( self, chromosome: Union[str, int], pos: int, - strand: int = None, - transcript: str = None, - gene: str = None, + strand: Optional[int] = None, + transcript: Optional[str] = None, + gene: Optional[str] = None, is_start: bool = True, residue_mode: ResidueMode = ResidueMode.RESIDUE, ) -> TranscriptExonDataResponse: """Convert individual genomic data to transcript data - :param chromosome: Chromosome. Must either give chromosome number (i.e. `1`) or - accession (i.e. `NC_000001.11`). + :param chromosome: Chromosome. Must either give chromosome number (i.e. ``1``) + or accession (i.e. ``NC_000001.11``). :param pos: Genomic position - :param strand: Strand. Must be either `-1` or `1`. + :param strand: Strand. Must be either ``-1`` or ``1``. :param transcript: The transcript to use. If this is not given, we will try the following transcripts: MANE Select, MANE Clinical Plus, Longest Remaining Compatible Transcript :param gene: Gene symbol - :param is_start: `True` if `pos` is start position. `False` if `pos` is end - position. - :param residue_mode: Default is `resiude` (1-based). Must be either `residue` - or `inter-residue` (0-based). + :param is_start: ``True`` if ``pos`` is start position. ``False`` if ``pos`` is + end position. + :param residue_mode: Default is ``residue`` (1-based). Must be either ``residue`` + or ``inter-residue`` (0-based). :return: Transcript data (inter-residue coordinates) """ resp = TranscriptExonDataResponse( @@ -406,9 +453,9 @@ async def _set_mane_genomic_data( :param alt_ac: Genomic accession :param pos: Genomic position :param strand: Strand - :param is_start: `True` if `pos` is start position. `False` if `pos` is end - position. - :param residue_mode: Residue mode for `pos` + :param is_start: ``True`` if ``pos`` is start position. ``False`` if ``pos`` is + end position. + :param residue_mode: Residue mode for ``pos`` :return: Warnings if found """ mane_data: Optional[ @@ -488,12 +535,12 @@ async def _set_mane_genomic_data( async def _set_genomic_data( self, params: Dict, strand: int, is_start: bool ) -> Optional[str]: - """Set genomic data in `params` + """Set genomic data in ``params`` :param params: Parameters for response :param strand: Strand - :param is_start: `True` if `pos` is start position. `False` if `pos` is end - position. + :param is_start: ``True`` if ``pos`` is start position. ``False`` if ``pos`` is + end position. :return: Warnings if found """ # We should always try to liftover @@ -568,14 +615,14 @@ async def _set_genomic_data( def _set_exon_offset( params: Dict, start: int, end: int, pos: int, is_start: bool, strand: int ) -> None: - """Set `exon_offset` in params. + """Set ``exon_offset`` in params. :param params: Parameters for response :param start: Start exon coord (can be transcript or genomic) :param end: End exon coord (can be transcript or genomic) :param pos: Position change (can be transcript or genomic) - :param is_start: `True` if `pos` is start position. `False` if `pos` is end - position + :param is_start: ``True`` if ``pos`` is start position. ``False`` if ``pos`` is + end position :param int strand: Strand """ if is_start: diff --git a/cool_seq_tool/mappers/mane_transcript.py b/cool_seq_tool/mappers/mane_transcript.py index ab5fd44a..5c8fda09 100644 --- a/cool_seq_tool/mappers/mane_transcript.py +++ b/cool_seq_tool/mappers/mane_transcript.py @@ -1,11 +1,15 @@ -"""Module for retrieving MANE Transcript from variation on p/c/g coordinate. +"""Retrieve MANE transcript from a location on p./c./g. coordinates. + Steps: -1. Map annotation layer to genome -2. Liftover to preferred genome - We want to liftover to GRCh38. We do not support getting MANE transcripts - for GRCh36 and earlier assemblies. -3. Select preferred compatible annotation -4. Map back to correct annotation layer + +#. Map annotation layer to genome +#. Liftover to preferred genome (GRCh38). GRCh36 and earlier assemblies are not supported + for fetching MANE transcripts. +#. Select preferred compatible annotation (see :ref:`transcript compatibility `) +#. Map back to correct annotation layer + +In addition to a mapper utility class, this module also defines several vocabulary +constraints and data models for coordinate representation. """ import logging import math @@ -44,7 +48,7 @@ class EndAnnotationLayer(StrEnum): class DataRepresentation(BaseModel): - """Create model for final output representation""" + """Define object model for final output representation""" gene: Optional[str] = None refseq: str @@ -55,7 +59,7 @@ class DataRepresentation(BaseModel): class CdnaRepresentation(DataRepresentation): - """Create model for coding dna representation""" + """Define object model for coding DNA representation""" coding_start_site: int coding_end_site: int @@ -63,7 +67,7 @@ class CdnaRepresentation(DataRepresentation): class GenomicRepresentation(BaseModel): - """Create model for genomic representation""" + """Define object model for genomic representation""" refseq: str pos: Tuple[int, int] @@ -72,7 +76,7 @@ class GenomicRepresentation(BaseModel): class ProteinAndCdnaRepresentation(BaseModel): - """Create model for protein and coding dna representation""" + """Define object model for protein and cDNA representation""" protein: DataRepresentation cdna: CdnaRepresentation @@ -90,6 +94,19 @@ def __init__( ) -> None: """Initialize the MANETranscript class. + A handful of resources are required for initialization, so when defaults are + enough, it's easiest to let the core CoolSeqTool class handle it for you: + + >>> from cool_seq_tool.app import CoolSeqTool + >>> mane_mapper = CoolSeqTool().mane_transcript + + Note that most methods are defined as Python coroutines, so they must be called + with ``await``: + + >>> result = mane_mapper.g_to_grch38("NC_000001.11", 100, 200) + >>> result['ac'] + TypeError: 'coroutine' object is not subscriptable + :param seqrepo_access: Access to seqrepo queries :param transcript_mappings: Access to transcript accession mappings and conversions @@ -104,10 +121,9 @@ def __init__( @staticmethod def _get_reading_frame(pos: int) -> int: - """Return reading frame number. - Only used on c. coordinate + """Return reading frame number. Only used on c. coordinate. - :param int pos: cDNA position + :param pos: cDNA position :return: Reading frame """ pos_mod_3 = pos % 3 @@ -119,8 +135,8 @@ def _get_reading_frame(pos: int) -> int: def _p_to_c_pos(start: int, end: int) -> Tuple[int, int]: """Return cDNA position given a protein position. - :param int start: Start protein position - :param int end: End protein position + :param start: Start protein position + :param end: End protein position :return: cDNA position start, cDNA position end """ start_pos = start * 3 - 1 @@ -136,9 +152,9 @@ async def _p_to_c( ) -> Optional[Tuple[str, Tuple[int, int]]]: """Convert protein (p.) annotation to cDNA (c.) annotation. - :param str ac: Protein accession - :param int start_pos: Protein start position - :param int end_pos: Protein end position + :param ac: Protein accession + :param start_pos: Protein start position + :param end_pos: Protein end position :return: [cDNA transcript accession, [cDNA pos start, cDNA pos end]] """ # TODO: Check version mappings 1 to 1 relationship @@ -164,8 +180,8 @@ async def _p_to_c( async def _c_to_g(self, ac: str, pos: Tuple[int, int]) -> Optional[Dict]: """Get g. annotation from c. annotation. - :param str ac: cDNA accession - :param Tuple[int, int] pos: [cDNA pos start, cDNA pos end] + :param ac: cDNA accession + :param pos: [cDNA pos start, cDNA pos end] :return: Gene, Transcript accession and position change, Altered transcript accession and position change, Strand """ @@ -208,12 +224,11 @@ async def _get_and_validate_genomic_tx_data( ) -> Optional[Dict]: """Get and validate genomic_tx_data - :param str tx_ac: Accession on c. coordinate - :param Tuple[int, int] pos: (start pos, end pos) - :param Union[AnnotationLayer.CDNA, AnnotationLayer.GENOMIC] annotation_layer: - Annotation layer for `ac` and `pos` - :param Optional[int] coding_start_site: Coding start site - :param Optional[str] alt_ac: Accession on g. coordinate + :param tx_ac: Accession on c. coordinate + :param pos: (start pos, end pos) + :param annotation_layer: Annotation layer for ``ac`` and ``pos`` + :param coding_start_site: Coding start site + :param alt_ac: Accession on g. coordinate :return: genomic_tx_data if found and validated, else None """ genomic_tx_data = await self.uta_db.get_genomic_tx_data( @@ -258,15 +273,15 @@ def _get_c_data( ) -> CdnaRepresentation: """Return transcript data on c. coordinate. + :param gene: Gene symbol :param cds_start_end: Coding start and end site for transcript :param c_pos_change: Start and end positions for change on c. coordinate :param strand: Strand :param status: Status of transcript :param refseq_c_ac: Refseq transcript - :param gene: HGNC gene symbol :param ensembl_c_ac: Ensembl transcript :param alt_ac: Genomic accession - :return: Coding dna representation + :return: Transcript data on c. coord """ cds_start = cds_start_end[0] cds_end = cds_start_end[1] @@ -307,10 +322,9 @@ def _get_mane_p( ) -> DataRepresentation: """Translate MANE Transcript c. annotation to p. annotation - :param Dict mane_data: MANE Transcript data - :param Tuple[int, int] mane_c_pos_range: Position change range - on MANE Transcript c. coordinate - :return: Protein representation + :param mane_data: MANE Transcript data + :param mane_c_pos_range: Position change range on MANE Transcript c. coordinate + :return: MANE transcripts accessions and position change on p. coordinate """ return DataRepresentation( gene=mane_data["symbol"], @@ -334,14 +348,14 @@ async def _g_to_c( ) -> Optional[CdnaRepresentation]: """Get transcript c. annotation data from g. annotation. - :param Dict g: Genomic data - :param str refseq_c_ac: Refseq transcript accession - :param TranscriptPriority status: Status of transcript - :param Optional[str] ensembl_c_ac: Ensembl transcript accession - :param Optional[str] alt_ac: Genomic accession - :param bool found_result: `True` if found result, so do not need to query + :param g: Genomic data + :param refseq_c_ac: Refseq transcript accession + :param status: Status of transcript + :param ensembl_c_ac: Ensembl transcript accession + :param alt_ac: Genomic accession + :param found_result: ``True`` if found result, so do not need to query tx_exon_aln_v table. This is because the user did not need to liftover. - `False` if need to get result from tx_exon_aln_v table. + ``False`` if need to get result from tx_exon_aln_v table. :return: Transcript data """ if found_result: @@ -406,8 +420,8 @@ def _validate_reading_frames( :param end_pos: Original end cDNA position change :param transcript_data: Ensembl and RefSeq transcripts with corresponding position change - :return: `True` if reading frames are the same after translation. - `False` otherwise + :return: ``True`` if reading frames are the same after translation. + ``False`` otherwise """ for pos, pos_index in [(start_pos, 0), (end_pos, 1)]: if pos is not None: @@ -450,8 +464,8 @@ def _validate_references( position change :param expected_ref: Reference at position given during input :param anno: Annotation layer we are starting from - :param residue_mode: Residue mode for `start_pos` and `end_pos` - :return: `True` if reference check passes. `False` otherwise. + :param residue_mode: Residue mode for ``start_pos`` and ``end_pos`` + :return: ``True`` if reference check passes. ``False`` otherwise. """ if anno == AnnotationLayer.CDNA: start_pos += coding_start_site @@ -498,10 +512,10 @@ def _validate_index( ) -> bool: """Validate that positions actually exist on accession - :param str ac: Accession - :param Tuple[int, int] pos: Start position change, End position change - :param int coding_start_site: coding start site for accession - :return: `True` if positions exist on accession. `False` otherwise + :param ac: Accession + :param pos: Start position change, End position change + :param coding_start_site: coding start site for accession + :return: ``True`` if positions exist on accession. ``False`` otherwise """ start_pos = pos[0] + coding_start_site end_pos = pos[1] + coding_start_site @@ -574,23 +588,48 @@ async def get_longest_compatible_transcript( ) -> Optional[ Union[DataRepresentation, CdnaRepresentation, ProteinAndCdnaRepresentation] ]: - """Get longest compatible transcript from a gene. - Transcript is compatible if it passes validation checks. + """Get longest compatible transcript from a gene. See the documentation for + the :ref:`transcript compatibility policy ` for more + information. + + >>> from cool_seq_tool.app import CoolSeqTool + >>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode + >>> mane_mapper = CoolSeqTool().mane_transcript + >>> mane_transcripts = { + ... "ENST00000646891.2", + ... "NM_001374258.1", + ... "NM_004333.6", + ... "ENST00000644969.2", + ... } + >>> result = await mane_mapper.get_longest_compatible_transcript( + ... 599, + ... 599, + ... gene="BRAF", + ... start_annotation_layer=AnnotationLayer.PROTEIN, + ... residue_mode=ResidueMode.INTER_RESIDUE, + ... mane_transcripts=mane_transcripts, + ... ) + >>> result.refseq + 'NP_001365396.1' + + If unable to find a match on GRCh38, this method will then attempt to drop down + to GRCh37. + + # TODO example for inputs that demonstrate this? :param start_pos: Start position change :param end_pos: End position change :param start_annotation_layer: Starting annotation layer :param gene: HGNC gene symbol :param ref: Reference at position given during input - :param residue_mode: Residue mode for `start_pos` and `end_pos` + :param residue_mode: Residue mode for ``start_pos`` and ``end_pos`` :param mane_transcripts: Attempted mane transcripts that were not compatible :param alt_ac: Genomic accession :param end_annotation_layer: The end annotation layer. If not provided, will be - set to the following - `EndAnnotationLayer.PROTEIN` if - `start_annotation_layer == AnnotationLayer.PROTEIN` - `EndAnnotationLayer.CDNA` otherwise - :return: Data for longest compatible transcript if successful. Else, None + set to ``EndAnnotationLayer.PROTEIN`` if + ``start_annotation_layer == AnnotationLayer.PROTEIN``, + ``EndAnnotationLayer.CDNA`` otherwise + :return: Data for longest compatible transcript """ def _get_protein_rep( @@ -814,22 +853,33 @@ async def get_mane_transcript( try_longest_compatible: bool = False, residue_mode: ResidueMode = ResidueMode.RESIDUE, ) -> Optional[Union[DataRepresentation, CdnaRepresentation]]: - """Return mane transcript. + """Return MANE transcript. + + >>> from cool_seq_tool.app import CoolSeqTool + >>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode + >>> mane_mapper = CoolSeqTool().mane_transcript + >>> result = await mane_mapper.get_mane_transcript( + ... "NP_004324.2", + ... 599, + ... AnnotationLayer.PROTEIN, + ... residue_mode=ResidueMode.INTER_RESIDUE, + ... ) + >>> (result.gene, result.refseq, result.status) + ('BRAF', 'NP_004324.2', ) :param ac: Accession :param start_pos: Start position change :param start_annotation_layer: Starting annotation layer. - :param end_pos: End position change. If `None` assumes both `start_pos` and - `end_pos` have same values. + :param end_pos: End position change. If ``None`` assumes both ``start_pos`` and + ``end_pos`` have same values. :param gene: HGNC gene symbol :param ref: Reference at position given during input - :param try_longest_compatible: `True` if should try longest compatible remaining - if mane transcript was not compatible. `False` otherwise. - :param ResidueMode residue_mode: Starting residue mode for `start_pos` - and `end_pos`. Will always return coordinates in inter-residue + :param try_longest_compatible: ``True`` if should try longest compatible remaining + if mane transcript was not compatible. ``False`` otherwise. + :param ResidueMode residue_mode: Starting residue mode for ``start_pos`` and + ``end_pos``. Will always return coordinates in inter-residue :return: MANE data or longest transcript compatible data if validation - checks are correct. Will return inter-residue coordinates. - Else, `None` + checks are correct. Will return inter-residue coordinates. Else, ``None``. """ inter_residue_pos, warning = get_inter_residue_pos( start_pos, residue_mode, end_pos=end_pos @@ -949,9 +999,9 @@ async def g_to_grch38( ) -> Optional[Dict]: """Return genomic coordinate on GRCh38 when not given gene context. - :param str ac: Genomic accession - :param int start_pos: Genomic start position change - :param int end_pos: Genomic end position change + :param ac: Genomic accession + :param start_pos: Genomic start position + :param end_pos: Genomic end position :return: NC accession, start and end pos on GRCh38 assembly """ if end_pos is None: @@ -1006,8 +1056,8 @@ def get_mane_c_pos_change( ) -> Tuple[int, int]: """Get mane c position change - :param Dict mane_tx_genomic_data: MANE transcript and genomic data - :param int coding_start_site: Coding start site + :param mane_tx_genomic_data: MANE transcript and genomic data + :param coding_start_site: Coding start site :return: cDNA pos start, cDNA pos end """ tx_pos_range = mane_tx_genomic_data["tx_pos_range"] @@ -1031,17 +1081,41 @@ async def g_to_mane_c( residue_mode: ResidueMode = ResidueMode.RESIDUE, ) -> Optional[Union[GenomicRepresentation, CdnaRepresentation]]: """Return MANE Transcript on the c. coordinate. - If gene is provided, g->GRCh38->MANE c. - If MANE c. cannot be found, we return the genomic coordinate on - GRCh38 - If gene is not provided, g -> GRCh38 - - :param str ac: Transcript accession on g. coordinate - :param int start_pos: genomic change start position - :param int end_pos: genomic change end position - :param str gene: Gene symbol - :param ResidueMode residue_mode: Starting residue mode for `start_pos` - and `end_pos`. Will always return coordinates in inter-residue + + >>> from cool_seq_tool.app import CoolSeqTool + >>> mane_mapper = CoolSeqTool().mane_transcript + + If an arg for ``gene`` is provided, lifts to GRCh38, then gets MANE cDNA + representation. + + >>> result = await mane_mapper.g_to_mane_c( + ... "NC_000007.13", + ... 55259515, + ... None, + ... gene="EGFR" + ... ) + >>> type(result) + cool_seq_tool.mappers.mane_transcript.CdnaRepresentation + >>> result.status + + + Locating a MANE transcript requires a ``gene`` symbol argument -- if none is + given, this method will only lift over to genomic coordinates on GRCh38. + + >>> result = await mane_mapper.g_to_mane_c( + ... "NC_000007.13", 55259515, None + ... ) + >>> type(result) + cool_seq_tool.mappers.mane_transcript.GenomicRepresentation + >>> result.refseq, result.pos, result.status + ('NC_000007.14', (55191821, 55191821), ) + + :param ac: Transcript accession on g. coordinate + :param start_pos: genomic start position + :param end_pos: genomic end position + :param gene: HGNC gene symbol + :param residue_mode: Starting residue mode for ``start_pos`` and ``end_pos``. + Will always return coordinates in inter-residue. :return: MANE Transcripts with cDNA change on c. coordinate if gene is provided. Else, GRCh38 data """ @@ -1133,22 +1207,23 @@ async def grch38_to_mane_c_p( gene: Optional[str] = None, residue_mode: ResidueMode = ResidueMode.RESIDUE, try_longest_compatible: bool = False, - ) -> Optional[ProteinAndCdnaRepresentation]: - """Given GRCh38 genomic representation, return cdna and protein representation. + ) -> Optional[Dict]: + """Given GRCh38 genomic representation, return protein representation. + Will try MANE Select and then MANE Plus Clinical. If neither is found and - `try_longest_compatible` is set to `true`, will also try to find the longest + ``try_longest_compatible`` is set to ``true``, will also try to find the longest compatible remaining representation. :param alt_ac: Genomic RefSeq accession on GRCh38 :param start_pos: Start position :param end_pos: End position :param gene: HGNC gene symbol - :param residue_mode: Starting residue mode for `start_pos` and `end_pos`. Will + :param residue_mode: Starting residue mode for ``start_pos`` and ``end_pos``. Will always return coordinates as inter-residue. - :param try_longest_compatible: `True` if should try longest compatible remaining - if mane transcript(s) not compatible. `False` otherwise. + :param try_longest_compatible: ``True`` if should try longest compatible remaining + if mane transcript(s) not compatible. ``False`` otherwise. :return: If successful, return MANE data or longest compatible remaining (if - `try_longest_compatible` set to `True`) cdna and protein representation. + ``try_longest_compatible`` set to ``True``) cDNA and protein representation. Will return inter-residue coordinates. """ # Step 1: Get MANE data to map to diff --git a/cool_seq_tool/schemas.py b/cool_seq_tool/schemas.py index c06edb85..6a0472f5 100644 --- a/cool_seq_tool/schemas.py +++ b/cool_seq_tool/schemas.py @@ -1,4 +1,4 @@ -"""Module for data models.""" +"""Defines attribute constants, useful object structures, and API response schemas.""" import re from datetime import datetime from enum import Enum @@ -71,7 +71,7 @@ class GenomicRequestBody(BaseModelForbidExtra): @model_validator(mode="after") def check_start_and_end(cls, values): - """Check that at least one of {`start`, `end`} is set""" + """Check that at least one of {``start``, ``end``} is set""" msg = "Must provide either `start` or `end`" start, end = values.start, values.end assert start or end, msg @@ -104,7 +104,7 @@ class TranscriptRequestBody(BaseModelForbidExtra): @model_validator(mode="after") def check_exon_start_and_exon_end(cls, values): - """Check that at least one of {`exon_start`, `exon_end`} is set""" + """Check that at least one of {``exon_start``, ``exon_end``} is set""" msg = "Must provide either `exon_start` or `exon_end`" exon_start, exon_end = values.exon_start, values.exon_end assert exon_start or exon_end, msg @@ -166,9 +166,9 @@ class GenomicData(BaseModelForbidExtra): @model_validator(mode="after") def check_start_end(cls, values): - """Check that at least one of {`start`, `end`} is set. - Check that at least one of {`exon_start`, `exon_end`} is set. - If not set, set corresponding offset to `None` + """Check that at least one of {``start``, ``end``} is set. + Check that at least one of {``exon_start``, ``exon_end``} is set. + If not set, set corresponding offset to ``None`` """ msg = "Missing values for `start` or `end`" start = values.start diff --git a/cool_seq_tool/sources/__init__.py b/cool_seq_tool/sources/__init__.py index c753f9a3..2a0a9e1e 100644 --- a/cool_seq_tool/sources/__init__.py +++ b/cool_seq_tool/sources/__init__.py @@ -2,3 +2,5 @@ from .mane_transcript_mappings import MANETranscriptMappings from .transcript_mappings import TranscriptMappings from .uta_database import UTADatabase + +__all__ = ["MANETranscriptMappings", "TranscriptMappings", "UTADatabase"] diff --git a/cool_seq_tool/sources/mane_transcript_mappings.py b/cool_seq_tool/sources/mane_transcript_mappings.py index 5b429506..7617e682 100644 --- a/cool_seq_tool/sources/mane_transcript_mappings.py +++ b/cool_seq_tool/sources/mane_transcript_mappings.py @@ -1,4 +1,6 @@ -"""The module for loading MANE Transcript mappings to genes.""" +"""Provide fast tabular access to MANE summary file. Enables retrieval of associated +MANE transcripts for gene symbols, genomic positions, or transcript accessions. +""" import logging from pathlib import Path from typing import Dict, List @@ -11,10 +13,18 @@ class MANETranscriptMappings: - """The MANE Transcript mappings class.""" + """Provide fast tabular access to MANE summary file. + + By default, acquires data from `NCBI FTP server `_ + if unavailable locally. The local data location can be passed as an argument or + given under the environment variable ``MANE_SUMMARY_PATH``. + + See the `NCBI MANE page `_ for more information. + """ def __init__(self, mane_data_path: Path = MANE_SUMMARY_PATH) -> None: """Initialize the MANE Transcript mappings class. + :param Path mane_data_path: Path to RefSeq MANE summary data """ self.mane_data_path = mane_data_path @@ -22,16 +32,26 @@ def __init__(self, mane_data_path: Path = MANE_SUMMARY_PATH) -> None: def _load_mane_transcript_data(self) -> pl.DataFrame: """Load RefSeq MANE data file into DataFrame. + :return: DataFrame containing RefSeq MANE Transcript data """ return pl.read_csv(self.mane_data_path, separator="\t") def get_gene_mane_data(self, gene_symbol: str) -> List[Dict]: """Return MANE Transcript data for a gene. + + >>> from cool_seq_tool.sources import MANETranscriptMappings + >>> m = MANETranscriptMappings() + >>> braf_mane = m.get_gene_mane_data("BRAF") + >>> (braf_mane[0]["RefSeq_nuc"], braf_mane[0]["MANE_status"]) + ('NM_04333.6', 'MANE Select') + >>> (braf_mane[1]["RefSeq_nuc"], braf_mane[1]["MANE_status"]) + ('NM_001374258.1', 'MANE Plus Clinical') + :param str gene_symbol: HGNC Gene Symbol - :return: List of MANE Transcript data (Transcript accessions, - gene, and location information). Sorted list: MANE Select and then MANE Plus - Clinical + :return: List of MANE Transcript data (Transcript accessions, gene, and + location information). The list is sorted so that a MANE Select entry comes + first, followed by a MANE Plus Clinical entry, if available. """ data = self.df.filter(pl.col("symbol") == gene_symbol.upper()) @@ -58,7 +78,8 @@ def get_mane_from_transcripts(self, transcripts: List[str]) -> List[Dict]: def get_mane_data_from_chr_pos( self, alt_ac: str, start: int, end: int ) -> List[Dict]: - """Get MANE data given chromosome, start pos, end end pos. Assumes GRCh38. + """Get MANE data given a GRCh38 genomic position. + :param str alt_ac: NC Accession :param int start: Start genomic position. Assumes residue coordinates. :param int end: End genomic position. Assumes residue coordinates. diff --git a/cool_seq_tool/sources/transcript_mappings.py b/cool_seq_tool/sources/transcript_mappings.py index d39f1097..f3d4b67a 100644 --- a/cool_seq_tool/sources/transcript_mappings.py +++ b/cool_seq_tool/sources/transcript_mappings.py @@ -1,4 +1,4 @@ -"""The module for Transcript Mappings.""" +"""Provide mappings between gene symbols and RefSeq + Ensembl transcript accessions.""" import csv from pathlib import Path from typing import Dict, List, Optional @@ -7,7 +7,17 @@ class TranscriptMappings: - """The transcript mappings class.""" + """Provide mappings between gene symbols and RefSeq + Ensembl transcript accessions. + + Uses ``LRG_RefSeqGene`` and ``transcript_mappings.csv``, which will automatically + be acquired if they aren't already available. See the + :ref:`configuration ` section in the documentation for information + about manual acquisition of data. + + In general, this class's methods expect to receive NCBI gene symbols, so users + should be careful about the sourcing of their input in cases where terms are + conflicted or ambiguous (which, to be fair, should be relatively rare). + """ def __init__( self, @@ -16,8 +26,8 @@ def __init__( ) -> None: """Initialize the transcript mappings class. - :param Path transcript_file_path: Path to transcript mappings file - :param Path lrg_refseqgene_path: Path to LRG RefSeqGene file + :param transcript_file_path: Path to transcript mappings file + :param lrg_refseqgene_path: Path to LRG RefSeqGene file """ # ENSP <-> Gene Symbol self.ensembl_protein_version_for_gene_symbol: Dict[str, List[str]] = {} @@ -53,7 +63,7 @@ def __init__( def _load_transcript_mappings_data(self, transcript_file_path: Path) -> None: """Load transcript mappings file to dictionaries. - :param Path transcript_file_path: Path to transcript mappings file + :param transcript_file_path: Path to transcript mappings file """ with open(transcript_file_path) as file: reader = csv.DictReader(file, delimiter="\t") @@ -127,7 +137,13 @@ def _load_refseq_gene_symbol_data(self, lrg_refseqgene_path: Path) -> None: def protein_transcripts(self, identifier: str) -> List[str]: """Return a list of protein transcripts for a gene symbol. - :param str identifier: Gene identifier to get protein transcripts for + >>> from cool_seq_tool.sources import TranscriptMappings + >>> TranscriptMappings().protein_transcripts("BRAF")[:3] + ['ENSP00000420119.2', + 'ENSP00000288602', + 'NP_004324.2'] + + :param identifier: Gene identifier to get protein transcripts for :return: Protein transcripts for a gene symbol """ protein_transcripts = list() @@ -141,7 +157,7 @@ def protein_transcripts(self, identifier: str) -> List[str]: def coding_dna_transcripts(self, identifier: str) -> List[str]: """Return transcripts from a coding dna refseq for a gene symbol. - :param str identifier: Gene identifier to find transcripts for + :param identifier: Gene identifier to find transcripts for :return: cDNA transcripts for a gene symbol """ genomic_transcripts = list() @@ -159,7 +175,7 @@ def coding_dna_transcripts(self, identifier: str) -> List[str]: def get_gene_symbol_from_ensembl_protein(self, q: str) -> Optional[str]: """Return the gene symbol for a Ensembl Protein. - :param str q: ensembl protein accession + :param q: ensembl protein accession :return: Gene symbol """ gene_symbol = self.ensembl_protein_version_to_gene_symbol.get(q) @@ -172,7 +188,7 @@ def get_gene_symbol_from_ensembl_protein(self, q: str) -> Optional[str]: def get_gene_symbol_from_refeq_protein(self, q: str) -> Optional[str]: """Return the gene symbol for a Refseq Protein. - :param str q: RefSeq protein accession + :param q: RefSeq protein accession :return: Gene symbol """ return self.refseq_protein_to_gene_symbol.get(q) @@ -180,7 +196,7 @@ def get_gene_symbol_from_refeq_protein(self, q: str) -> Optional[str]: def get_gene_symbol_from_refseq_rna(self, q: str) -> Optional[str]: """Return gene symbol for a Refseq RNA Transcript. - :param str q: RefSeq RNA transcript accession + :param q: RefSeq RNA transcript accession :return: Gene symbol """ gene_symbol = self.refseq_rna_version_to_gene_symbol.get(q) @@ -193,7 +209,7 @@ def get_gene_symbol_from_refseq_rna(self, q: str) -> Optional[str]: def get_gene_symbol_from_ensembl_transcript(self, q: str) -> Optional[str]: """Return gene symbol for an Ensembl Transcript. - :param str q: Ensembl transcript accession + :param q: Ensembl transcript accession :return: Gene symbol """ gene_symbol = self.ensembl_transcript_version_to_gene_symbol.get(q) diff --git a/cool_seq_tool/sources/uta_database.py b/cool_seq_tool/sources/uta_database.py index 1ba7a552..2435840e 100644 --- a/cool_seq_tool/sources/uta_database.py +++ b/cool_seq_tool/sources/uta_database.py @@ -1,4 +1,4 @@ -"""Module for UTA queries.""" +"""Provide transcript lookup and metadata tools via the UTA database.""" import ast import base64 import logging @@ -31,7 +31,20 @@ class UTADatabase: - """Class for connecting and querying UTA database.""" + """Provide transcript lookup and metadata tools via the Universal Transcript Archive + (UTA) database. + + Users should use the create() method to construct a new instance: + + >>> from cool_seq_tool.sources.uta_database import UTADatabase + >>> uta_db = await UTADatabase.create() + + This class uses the `asyncpg `_ + library to connect to PostgreSQL. This means that most methods are defined as + coroutines and must be called with ``await``. See the + `Python stdlib docs `_ + for more information. + """ def __init__( self, @@ -39,19 +52,19 @@ def __init__( chain_file_37_to_38: Optional[str] = None, chain_file_38_to_37: Optional[str] = None, ) -> None: - """Initialize DB class. Downstream libraries should use the create() - method to construct a new instance: await UTADatabase.create() + """Initialize DB class. Should only be used by ``create()`` method, and not + be called directly by a user. :param db_url: PostgreSQL connection URL - Format: `driver://user:password@host/database/schema` + Format: ``driver://user:password@host/database/schema`` :param chain_file_37_to_38: Optional path to chain file for 37 to 38 assembly. - This is used for pyliftover. If this is not provided, will check to see if - LIFTOVER_CHAIN_37_TO_38 env var is set. If neither is provided, will allow - pyliftover to download a chain file from UCSC + This is used for ``pyliftover``. If this is not provided, will check to see + if ``LIFTOVER_CHAIN_37_TO_38`` env var is set. If neither is provided, will + allow ``pyliftover`` to download a chain file from UCSC :param chain_file_38_to_37: Optional path to chain file for 38 to 37 assembly. - This is used for pyliftover. If this is not provided, will check to see if - LIFTOVER_CHAIN_38_TO_37 env var is set. If neither is provided, will allow - pyliftover to download a chain file from UCSC + This is used for ``pyliftover``. If this is not provided, will check to see + if ``LIFTOVER_CHAIN_38_TO_37`` env var is set. If neither is provided, will + allow ``pyliftover`` to download a chain file from UCSC """ self.schema = None self._connection_pool = None @@ -137,10 +150,16 @@ async def create_pool(self) -> None: async def create( cls: Type[UTADatabaseType], db_url: str = UTA_DB_URL ) -> UTADatabaseType: - """Provide fully-initialized class instance (a la factory pattern) - :param UTADatabaseType cls: supplied implicitly - :param str db_url: PostgreSQL connection URL - Format: `driver://user:password@host/database/schema` + """Manufacture a fully-initialized class instance (a la factory pattern). This + method should be used instead of calling the class directly to create a new + instance. + + >>> from cool_seq_tool.sources.uta_database import UTADatabase + >>> uta_db = await UTADatabase.create() + + :param cls: supplied implicitly + :param db_url: PostgreSQL connection URL + Format: ``driver://user:password@host/database/schema`` :return: UTA DB access class instance """ self = cls(db_url) @@ -151,7 +170,7 @@ async def create( async def execute_query(self, query: str) -> Any: # noqa: ANN401 """Execute a query and return its result. - :param str query: Query to make on database + :param query: Query to make on database :return: Query's result """ @@ -223,7 +242,7 @@ async def _create_genomic_table(self) -> None: def _transform_list(li: List) -> List[List[Any]]: """Transform list to only contain field values - :param List li: List of asyncpg.Record objects + :param li: List of asyncpg.Record objects :return: List of list of objects """ results = list() @@ -241,13 +260,12 @@ async def chr_to_gene_and_accessions( ) -> Tuple[Optional[Dict], Optional[str]]: """Return genes and genomic accessions related to a position on a chr. - :param int chromosome: Chromosome number - :param int pos: Genomic position - :param Optional[int] strand: Strand. Must be either `-1` or `1` - :param Optional[str] alt_ac: Genomic accession - :param Optional[str] gene: Gene symbol - :return: Dictionary containing genes and genomic accessions and - warnings if found + :param chromosome: Chromosome number + :param pos: Genomic position + :param strand: Strand. Must be either ``-1`` or ``1`` + :param alt_ac: Genomic accession + :param gene: Gene symbol + :return: Dictionary containing genes and genomic accessions and warnings if found """ alt_ac_cond = ( f"WHERE alt_ac = '{alt_ac}'" @@ -293,8 +311,8 @@ async def get_tx_exons( ) -> Tuple[Optional[List[Tuple[int, int]]], Optional[str]]: """Get list of transcript exons start/end coordinates. - :param str tx_ac: Transcript accession - :param Optional[str] alt_ac: Genomic accession + :param tx_ac: Transcript accession + :param alt_ac: Genomic accession :return: List of a transcript's accessions and warnings if found """ if alt_ac: @@ -335,9 +353,9 @@ def _validate_exon( ) -> Tuple[Optional[Tuple[int, int]], Optional[str]]: """Validate that exon number is valid - :param str transcript: Transcript accession - :param List tx_exons: List of transcript's exons - :param Optional[int] exon_number: Exon number to validate + :param transcript: Transcript accession + :param tx_exons: List of transcript's exons + :param exon_number: Exon number to validate :return: Transcript coordinates and warnings if found """ msg = f"Exon {exon_number} does not exist on {transcript}" @@ -394,12 +412,10 @@ async def get_alt_ac_start_and_end( ) -> Tuple[Optional[Tuple[Tuple, Tuple]], Optional[str]]: """Get genomic coordinates for related transcript exon start and end. - :param str tx_ac: Transcript accession - :param Optional[List[str]] tx_exon_start: Transcript's exon start - coordinates - :param Optional[List[str]] tx_exon_end: Transcript's exon end - coordinates - :param str gene: Gene symbol + :param tx_ac: Transcript accession + :param tx_exon_start: Transcript's exon start coordinates + :param tx_exon_end: Transcript's exon end coordinates + :param gene: Gene symbol :return: Alt ac start and end data, and warnings if found """ if tx_exon_start: @@ -443,10 +459,10 @@ async def get_alt_ac_start_or_end( ) -> Tuple[Optional[Tuple[str, str, int, int, int]], Optional[str]]: """Get genomic data for related transcript exon start or end. - :param str tx_ac: Transcript accession - :param int tx_exon_start: Transcript's exon start coordinate - :param int tx_exon_end: Transcript's exon end coordinate - :param Optional[str] gene: Gene symbol + :param tx_ac: Transcript accession + :param tx_exon_start: Transcript's exon start coordinate + :param tx_exon_end: Transcript's exon end coordinate + :param gene: Gene symbol :return: [hgnc symbol, genomic accession for chromosome, start exon's end coordinate, end exon's start coordinate, strand], and warnings if found @@ -487,7 +503,7 @@ async def get_alt_ac_start_or_end( async def get_cds_start_end(self, tx_ac: str) -> Optional[Tuple[int, int]]: """Get coding start and end site - :param str tx_ac: Transcript accession + :param tx_ac: Transcript accession :return: [Coding start site, Coding end site] """ if tx_ac.startswith("ENS"): @@ -511,7 +527,7 @@ async def get_cds_start_end(self, tx_ac: str) -> Optional[Tuple[int, int]]: async def get_newest_assembly_ac(self, ac: str) -> List[str]: """Find accession associated to latest genomic assembly - :param str ac: Accession + :param ac: Accession :return: List of accessions associated to latest genomic assembly. Order by desc """ @@ -540,8 +556,8 @@ async def get_newest_assembly_ac(self, ac: str) -> List[str]: async def validate_genomic_ac(self, ac: str) -> bool: """Return whether or not genomic accession exists. - :param str ac: Genomic accession - :return: `True` if genomic accession exists. `False` otherwise. + :param ac: Genomic accession + :return: ``True`` if genomic accession exists. ``False`` otherwise. """ query = f""" SELECT EXISTS( @@ -554,10 +570,15 @@ async def validate_genomic_ac(self, ac: str) -> bool: return result[0][0] async def get_ac_descr(self, ac: str) -> Optional[str]: - """Return accession description. - Typically description exists if not GRCh38 assembly. + """Return accession description. This is typically available only for accessions + from older (pre-GRCh38) builds. - :param str ac: Accession + >>> from cool_seq_tool.sources.uta_database import UTADatabase + >>> uta_db = await UTADatabase.create() + >>> await uta_db.get_ac_descr("NC_000001.10") + 'Homo sapiens chromosome 1, GRCh37.p13 Primary Assembly' + + :param ac: chromosome accession, e.g. ``"NC_000001.10"`` :return: Description containing assembly and chromosome """ query = f""" @@ -580,23 +601,23 @@ async def get_tx_exon_aln_v_data( tx_ac: str, start_pos: int, end_pos: int, - alt_ac: str = None, + alt_ac: Optional[str] = None, use_tx_pos: bool = True, like_tx_ac: bool = False, ) -> List: """Return queried data from tx_exon_aln_v table. - :param str tx_ac: accession on c. coordinate - :param int start_pos: Start position change - :param int end_pos: End position change - :param str alt_ac: accession on g. coordinate - :param bool use_tx_pos: `True` if querying on transcript position. This means - `start_pos` and `end_pos` are on the c. coordinate - `False` if querying on genomic position. This means `start_pos` and - `end_pos` are on the g. coordinate - :param bool like_tx_ac: `True` if tx_ac condition should be a like statement. + :param tx_ac: accession on c. coordinate + :param start_pos: Start position change + :param end_pos: End position change + :param alt_ac: accession on g. coordinate + :param use_tx_pos: ``True`` if querying on transcript position. This means + ``start_pos`` and ``end_pos`` are on the c. coordinate + ``False`` if querying on genomic position. This means ``start_pos`` and + ``end_pos`` are on the g. coordinate + :param like_tx_ac: ``True`` if tx_ac condition should be a like statement. This is used when you want to query an accession regardless of its version - `False` if tx_condition will be exact match + ``False`` if tx_condition will be exact match :return: List of tx_exon_aln_v data """ if end_pos is None: @@ -659,7 +680,7 @@ async def get_tx_exon_aln_v_data( def data_from_result(result: List) -> Optional[Dict]: """Return data found from result. - :param List result: Data from tx_exon_aln_v table + :param result: Data from tx_exon_aln_v table :return: Gene, strand, and position ranges for tx and alt_ac """ gene = result[0] @@ -694,13 +715,28 @@ def data_from_result(result: List) -> Optional[Dict]: async def get_mane_c_genomic_data( self, ac: str, alt_ac: Optional[str], start_pos: int, end_pos: int ) -> Optional[Dict]: - """Get MANE Transcript and genomic data. - - Used when going from g -> MANE c - :param str ac: MANE Transcript accession - :param str alt_ac: NC Accession - :param int start_pos: Genomic start position change - :param int end_pos: Genomic end position change + """Get MANE transcript and genomic data. Used when going from g. to MANE c. + representation. + + >>> from cool_seq_tool.sources import UTADatabase + >>> uta_db = await UTADatabase.create() + >>> result = await uta_db.get_mane_c_genomic_data( + ... "NM_004333.6", + ... None, + ... 140753335, + ... 140753335, + ... ) + >>> result["gene"] + 'BRAF' + >>> result["alt_ac"] + 'NC_000007.14' + + :param ac: MANE transcript accession + :param alt_ac: NC accession. Used to triangulate on correct genomic data. Can + be set to ``None`` if unavailable. + :param start_pos: Genomic start position + :param end_pos: Genomic end position change + :return: MANE transcript results if successful """ results = await self.get_tx_exon_aln_v_data( ac, start_pos, end_pos, alt_ac=alt_ac, use_tx_pos=False @@ -752,13 +788,12 @@ async def get_genomic_tx_data( ) -> Optional[Dict]: """Get transcript mapping to genomic data. - :param str tx_ac: Accession on c. coordinate - :param Tuple pos: (start pos, end pos) - :param Union[AnnotationLayer.CDNA, AnnotationLayer.GENOMIC] annotation_layer: - Annotation layer for `ac` and `pos` - :param Optional[str] alt_ac: Accession on g. coordinate - :param Assembly target_genome_assembly: Genome assembly to get genomic data for. - If `alt_ac` is provided, it will return the associated assembly. + :param tx_ac: Accession on c. coordinate + :param pos: (start pos, end pos) + :param annotation_layer: Annotation layer for ``ac`` and ``pos`` + :param alt_ac: Accession on g. coordinate + :param target_genome_assembly: Genome assembly to get genomic data for. + If ``alt_ac`` is provided, it will return the associated assembly. :return: Gene, Transcript accession and position change, Altered transcript accession and position change, Strand """ @@ -810,7 +845,7 @@ async def get_genomic_tx_data( async def get_ac_from_gene(self, gene: str) -> List[str]: """Return genomic accession(s) associated to a gene. - :param str gene: Gene symbol + :param gene: Gene symbol :return: List of genomic accessions, sorted in desc order """ query = f""" @@ -834,9 +869,14 @@ async def get_gene_from_ac( ) -> Optional[List[str]]: """Get transcripts from NC accession and positions. - :param str ac: NC Accession - :param int start_pos: Start position change - :param int end_pos: End position change + >>> from cool_seq_tool.sources import UTADatabase + >>> uta_db = await UTADatabase.create() + >>> await uta_db.get_gene_from_ac("NC_000017.11", 43044296, 43045802) + ['BRCA1'] + + :param ac: NC accession, e.g. ``"NC_000001.11"`` + :param start_pos: Start position change + :param end_pos: End position change :return: List of HGNC gene symbols """ if end_pos is None: @@ -871,20 +911,20 @@ async def get_transcripts( use_tx_pos: bool = True, alt_ac: Optional[str] = None, ) -> pl.DataFrame: - """Get transcripts for a given `gene` or `alt_ac` related to optional positions. + """Get transcripts for a given ``gene`` or ``alt_ac`` related to optional positions. :param start_pos: Start position change - If not provided and `end_pos` not provided, all transcripts associated with + If not provided and ``end_pos`` not provided, all transcripts associated with the gene and/or accession will be returned :param end_pos: End position change - If not provided and `start_pos` not provided, all transcripts associated + If not provided and ``start_pos`` not provided, all transcripts associated with the gene and/or accession will be returned :param gene: HGNC gene symbol - :param use_tx_pos: `True` if querying on transcript position. This means - `start_pos` and `end_pos` are c. coordinate positions. `False` if querying - on genomic position. This means `start_pos` and `end_pos` are g. coordinate + :param use_tx_pos: ``True`` if querying on transcript position. This means + ``start_pos`` and ``end_pos`` are c. coordinate positions. ``False`` if querying + on genomic position. This means ``start_pos`` and ``end_pos`` are g. coordinate positions - :param alt_ac: Genomic accession. If not provided, must provide `gene` + :param alt_ac: Genomic accession. If not provided, must provide ``gene`` :return: Data Frame containing transcripts associated with a gene. Transcripts are ordered by most recent NC accession, then by descending transcript length @@ -943,7 +983,7 @@ async def get_transcripts( async def get_chr_assembly(self, ac: str) -> Optional[Tuple[str, str]]: """Get chromosome and assembly for NC accession if not in GRCh38. - :param str ac: NC accession + :param ac: NC accession :return: Chromosome and Assembly accession is on """ descr = await self.get_ac_descr(ac) @@ -966,8 +1006,8 @@ async def get_chr_assembly(self, ac: str) -> Optional[Tuple[str, str]]: async def liftover_to_38(self, genomic_tx_data: Dict) -> None: """Liftover genomic_tx_data to hg38 assembly. - :param Dict genomic_tx_data: Dictionary containing gene, nc_accession, - alt_pos, and strand + :param genomic_tx_data: Dictionary containing gene, nc_accession, alt_pos, and + strand """ descr = await self.get_chr_assembly(genomic_tx_data["alt_ac"]) if descr is None: @@ -1022,9 +1062,9 @@ def get_liftover( ) -> Optional[Tuple]: """Get new genome assembly data for a position on a chromosome. - :param str chromosome: The chromosome number. Must be prefixed with `chr` - :param int pos: Position on the chromosome - :param Assembly liftover_to_assembly: Assembly to liftover to + :param chromosome: The chromosome number. Must be prefixed with ``chr`` + :param pos: Position on the chromosome + :param liftover_to_assembly: Assembly to liftover to :return: [Target chromosome, target position, target strand, conversion_chain_score] for assembly """ @@ -1055,11 +1095,11 @@ def _set_liftover( ) -> None: """Update genomic_tx_data to have coordinates for given assembly. - :param Dict genomic_tx_data: Dictionary containing gene, nc_accession, - alt_pos, and strand - :param str key: Key to access coordinate positions - :param str chromosome: Chromosome, must be prefixed with `chr` - :param Assembly liftover_to_assembly: Assembly to liftover to + :param genomic_tx_data: Dictionary containing gene, nc_accession, alt_pos, and + strand + :param key: Key to access coordinate positions + :param chromosome: Chromosome, must be prefixed with ``chr`` + :param liftover_to_assembly: Assembly to liftover to """ liftover_start_i = self.get_liftover( chromosome, genomic_tx_data[key][0], liftover_to_assembly @@ -1084,11 +1124,12 @@ def _set_liftover( genomic_tx_data[key] = liftover_start_i[1], liftover_end_i[1] async def p_to_c_ac(self, p_ac: str) -> List[str]: - """Return c. accession from p. accession. + """Return cDNA reference sequence accession from protein reference sequence + accession (i.e. ``p.`` to ``c.`` in HGVS syntax) - :param str p_ac: Protein accession - :return: List of rows containing c. accessions that are associated - with the given p. accession. In ascending order. + :param p_ac: Protein accession + :return: List of rows containing c. accessions that are associated with the + given p. accession. In ascending order. """ # Ensembl accessions do not have versions if p_ac.startswith("EN"): @@ -1115,8 +1156,8 @@ async def get_transcripts_from_genomic_pos( ) -> List[str]: """Get transcripts associated to a genomic ac and position. - :param str alt_ac: Genomic accession - :param int g_pos: Genomic position + :param alt_ac: Genomic accession + :param g_pos: Genomic position :return: RefSeq transcripts on c. coordinate """ query = f""" @@ -1133,7 +1174,7 @@ async def get_transcripts_from_genomic_pos( @staticmethod def get_secret() -> str: - """Get secrets for UTA DB instances.""" + """Get secrets for UTA DB instances. Used for deployment on AWS.""" secret_name = environ["UTA_DB_SECRET"] region_name = "us-east-2" diff --git a/cool_seq_tool/utils.py b/cool_seq_tool/utils.py index 15ea546a..bf1e2222 100644 --- a/cool_seq_tool/utils.py +++ b/cool_seq_tool/utils.py @@ -1,4 +1,4 @@ -"""Module for common utilities used throughout the app""" +"""Provide a small set of general helper functions.""" import logging from datetime import datetime from typing import Optional, Tuple @@ -12,12 +12,22 @@ def get_inter_residue_pos( start_pos: int, residue_mode: ResidueMode, end_pos: Optional[int] = None ) -> Tuple[Optional[Tuple[int, int]], Optional[str]]: - """Return inter-residue position + """Return equivalent inter-residue position. + + Generally, we prefer to work with inter-residue coordinates where possible. Our + rationale is detailed in an appendix to the + `VRS docs `_. + This function is used internally to shift user-provided coordinates accordingly. + + >>> from cool_seq_tool.utils import get_inter_residue_pos + >>> from cool_seq_tool.schemas import ResidueMode + >>> get_inter_residue_pos(10, ResidueMode.RESIDUE) + ((9, 9), None) :param start_pos: Start position - :param residue_mode: Residue mode for `start_pos` and `end_pos` - :param end_pos: End position. If `None` assumes both `start` and `end` have same - values. + :param residue_mode: Residue mode for ``start_pos`` and ``end_pos`` + :param end_pos: End position. If ``None`` assumes both ``start`` and ``end`` have + same values. :return: Inter-residue coordinates, warning """ if residue_mode == ResidueMode.RESIDUE: @@ -41,7 +51,8 @@ def get_inter_residue_pos( @staticmethod def service_meta() -> ServiceMeta: - """Return ServiceMeta for cool_seq_tool + """Return description of request and service, including parameters like software + version for reproducibility. :return: ServiceMeta object """ diff --git a/cool_seq_tool/version.py b/cool_seq_tool/version.py index 9fa4000e..061f84a6 100644 --- a/cool_seq_tool/version.py +++ b/cool_seq_tool/version.py @@ -1,2 +1,2 @@ """Define package version.""" -__version__ = "0.4.0-dev0" +__version__ = "0.4.0-dev1" diff --git a/docs/Makefile b/docs/Makefile new file mode 100644 index 00000000..d0c3cbf1 --- /dev/null +++ b/docs/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = source +BUILDDIR = build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/TranscriptSelectionPriority.md b/docs/TranscriptSelectionPriority.md deleted file mode 100644 index 72ef9944..00000000 --- a/docs/TranscriptSelectionPriority.md +++ /dev/null @@ -1,22 +0,0 @@ -# Transcript Selection Policy - -This document contains information on the Transcript Selection Policy. We use this policy for selecting a representative transcript using sequence attributes and MANE annotations from the `mane_transcript` module. - -More information on MANE can be found [here](https://www.ncbi.nlm.nih.gov/refseq/MANE/). - -## Representative transcript priority -We evaluate all compatible transcripts against each of the below criteria, and select the transcript which meets the earliest criterion as representative. - -1. Transcript is annotated as a MANE Select transcript -2. Transcript is annotated as a MANE Plus Clinical transcript -3. Longest Compatible Remaining\ - a. If there is a tie, choose the first-published transcript (lowest-numbered accession for RefSeq/Ensembl) among those transcripts meeting this criterion\ - _Note: We want the most recent version of a transcript associated with an assembly_ - -## Compatible Transcripts - -Compatible transcripts are those that pass validation checks. The checks that we make are: - - - Validating the position exists on an accession - - Validating reference sequences - - Validating exon structure diff --git a/docs/make.bat b/docs/make.bat new file mode 100644 index 00000000..7d65db30 --- /dev/null +++ b/docs/make.bat @@ -0,0 +1,35 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=source +set BUILDDIR=build + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.https://www.sphinx-doc.org/ + exit /b 1 +) + +if "%1" == "" goto help + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/biomart.png b/docs/source/_static/img/biomart.png similarity index 100% rename from biomart.png rename to docs/source/_static/img/biomart.png diff --git a/docs/source/_templates/module_summary.rst b/docs/source/_templates/module_summary.rst new file mode 100644 index 00000000..3a40b575 --- /dev/null +++ b/docs/source/_templates/module_summary.rst @@ -0,0 +1,7 @@ +{{ fullname | underline }} + +.. automodule:: {{ fullname }} + :members: + :undoc-members: + :special-members: __init__ + :exclude-members: model_fields, model_config diff --git a/docs/source/conf.py b/docs/source/conf.py new file mode 100644 index 00000000..119bea7c --- /dev/null +++ b/docs/source/conf.py @@ -0,0 +1,80 @@ +# Configuration file for the Sphinx documentation builder. +# +# For the full list of built-in configuration values, see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Project information ----------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information + +project = "Cool-Seq-Tool" +copyright = "2023, Wagner Lab" +author = "Wagner Lab" +html_title = "Cool-Seq-Tool" + +# -- General configuration --------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration + +extensions = [ + "sphinx.ext.autodoc", + "sphinx_autodoc_typehints", + "sphinx.ext.linkcode", + "sphinx.ext.autosummary", + "sphinx_copybutton", +] + +templates_path = ["_templates"] +exclude_patterns = [] + +# -- Options for HTML output ------------------------------------------------- +# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output + +html_theme = "furo" +html_static_path = [] +html_css_files = [ + "https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/fontawesome.min.css", + "https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/solid.min.css", + "https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/brands.min.css", +] +html_theme_options = { + "footer_icons": [ + { + "name": "GitHub", + "url": "https://github.com/genomicmedlab/cool-seq-tool", + "html": "", + "class": "fa-brands fa-solid fa-github", + }, + { + "name": "Wagner Lab", + "url": "https://www.nationwidechildrens.org/specialties/institute-for-genomic-medicine/research-labs/wagner-lab", + "html": "", + "class": "fa-solid fa-house", + }, + ], +} +# -- autodoc things ---------------------------------------------------------- +import os # noqa: E402 +import sys # noqa: E402 + +sys.path.insert(0, os.path.abspath("../../cool_seq_tool")) +autodoc_preserve_defaults = True + +# -- get version ------------------------------------------------------------- +from cool_seq_tool.version import __version__ # noqa: E402 + +version = __version__ +release = version + + +# -- linkcode ---------------------------------------------------------------- +def linkcode_resolve(domain, info): + if domain != "py": + return None + if not info["module"]: + return None + filename = info["module"].replace(".", "/") + return f"https://github.com/genomicmedlab/cool-seq-tool/blob/main/{filename}.py" # noqa: E501 + + +# -- code block style -------------------------------------------------------- +pygments_style = "default" +pygements_dark_style = "monokai" diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst new file mode 100644 index 00000000..d0feaef9 --- /dev/null +++ b/docs/source/contributing.rst @@ -0,0 +1,83 @@ +Contributing to Cool-Seq-Tool +============================= + +Bug reports and feature requests +-------------------------------- + +Bugs and new feature requests can be submitted to the Cool-Seq-Tool `issue tracker on GitHub `_. See `this StackOverflow post `_ for tips on how to craft a helpful bug report. + +Development prerequisites +------------------------- +For a development install, we recommend using Pipenv. See the `Pipenv docs `_ for direction on installing Pipenv in your environment. + +Setup +----- +Clone the repository: :: + + git clone https://github.com/genomicmedlab/cool-seq-tool + cd cool-seq-tool + +Then initialize the Pipenv environment: :: + + pipenv update + pipenv install --dev + pipenv shell + +Alternatively, use a virtual environment and install :ref:`all dependency groups`: :: + + python3 -m venv venv + source venv/bin/activate + python3 -m pip install -e ".[dev,tests,docs]" + +We use `pre-commit `_ to run conformance tests before commits. This provides checks for: + +* Code format and style +* Added large files +* AWS credentials +* Private keys + +Before your first commit, run: :: + + pre-commit install + +Style +----- + +Code style is managed by `Ruff `_ and `Black `_, and should be checked via pre-commit hook before commits. Final QC is applied with GitHub Actions to every pull request. + +Tests +----- + +Tests are executed with `pytest `_: :: + + pytest + +Documentation +------------- + +The documentation is built with Sphinx, which is included as part of the Pipenv developer dependencies, or in the ``docs`` dependency group. Navigate to the ``docs/`` subdirectory and use ``make`` to build the HTML version: :: + + pipenv shell + cd docs + make html + +See the `Sphinx documentation `_ for more information. + +.. _build_transcript_mappings_tsv: + +transcript_mappings.tsv +----------------------- + +Cool-Seq-Tool uses a static copy of transcript mapping data acquired from `Ensembl BioMart `_. To regenerate this file from the BioMart, select the ``Human Genes (GRCh38.p13)`` dataset and choose the following attributes: + +* Gene stable ID +* Gene stable ID version +* Transcript stable ID +* Transcript stable ID version +* Protein stable ID +* Protein stable ID version +* RefSeq match transcript (MANE Select) +* Gene name + +.. image:: _static/img/biomart.png + :alt: BioMart screenshot diff --git a/docs/source/index.rst b/docs/source/index.rst new file mode 100644 index 00000000..2011e494 --- /dev/null +++ b/docs/source/index.rst @@ -0,0 +1,23 @@ +Cool-Seq-Tool |version| +======================= + +The **CoolSeqTool** provides: + +* A Pythonic API on top of sequence data of interest to tertiary analysis tools, including mappings between gene names and transcripts, `MANE transcript `_ descriptions, and the `Universal Transcript Archive `_ +* Augmented access to the `SeqRepo `_ database, including multiple additional methods and tools +* Mapping tools that combine the above to support translation between various references sequences and annotation layers, and to MANE-designated transcripts + +See the :ref:`Installation ` and :ref:`Usage ` pages for information on getting started. Individual classes and methods are documented within the :ref:`API reference `. + +CoolSeqTool was created to support the `Knowledgebase Integration Project `_ of the `Variant Interpretation for Cancer Consortium (VICC) `_. It is developed primarily by the `Wagner Lab `_. Full source code is available on `GitHub `_. + +.. toctree:: + :hidden: + :maxdepth: 2 + + Installation + Usage + Transcript Selection + API Reference + Contributing + License diff --git a/docs/source/install.rst b/docs/source/install.rst new file mode 100644 index 00000000..c6c66977 --- /dev/null +++ b/docs/source/install.rst @@ -0,0 +1,76 @@ +.. _installation: + +Installation +============ + +Prerequisites +------------- + +* Python 3.8+ +* A UNIX-like environment (e.g. MacOS, WSL, Ubuntu) +* A recent version of PostgreSQL (ideally at least 11+) + +Install Cool-Seq-Tool +--------------------- + +Install Cool-Seq-Tool from `PyPI `_: + + pip install cool-seq-tool + +.. _dependency-groups: + +.. note:: + + Cool-Seq-Tool provides extra dependency groups for development and testing purposes. Most users won't need to install them. + + * ``dev`` includes packages for linting and performing other kinds of quality checks + * ``tests`` includes packages for running tests + * ``docs`` includes packages for writing and building documentation + +Set up UTA +---------- + +Cool-Seq-Tool requires an available instance of the Universal Transcript Archive (UTA) database. Complete installation instructions (via Docker or a local server) are available at the `UTA GitHub repository `_. For local usage, we recommend the following: + +.. long-term, it would be best to move this over to the UTA repo to avoid duplication + +.. code-block:: + + createuser -U postgres uta_admin + createuser -U postgres anonymous + createdb -U postgres -O uta_admin uta + + export UTA_VERSION=uta_20210129b.pgd.gz # most recent as of 2023/12/05 + curl -O https://dl.biocommons.org/uta/$UTA_VERSION + gzip -cdq ${UTA_VERSION} | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433 + +By default, Cool-Seq-Tool expects to connect to the UTA database via a PostgreSQL connection served local on port 5433, under the PostgreSQL username ``uta_admin`` and the schema ``uta_20210129b``. + +Set up SeqRepo +-------------- + +Cool-Seq-Tool requires access to `SeqRepo `_ data. In general, we recommend the following for local setup: + +.. long-term, it would be best to move this over to seqrepo to avoid duplication + +.. code-block:: + + pip install seqrepo + export SEQREPO_VERSION=2021-01-29 # or newer if available -- check `seqrepo list-remote-instances` + sudo mkdir /usr/local/share/seqrepo + sudo chown $USER /usr/local/share/seqrepo + seqrepo pull -i $SEQREPO_VERSION + +If you encounter a permission error similar to the one below: + +.. code-block:: + + PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29' + +Try moving data manually with ``sudo``: + +.. code-block:: + + sudo mv /usr/local/share/seqrepo/$SEQREPO_VERSION.* /usr/local/share/seqrepo/$SEQREPO_VERSION + +See `mirroring documentation `_ on the SeqRepo GitHub repo for instructions and additional troubleshooting. diff --git a/docs/source/license.rst b/docs/source/license.rst new file mode 100644 index 00000000..0a3b8703 --- /dev/null +++ b/docs/source/license.rst @@ -0,0 +1,4 @@ +License +======= + +.. include:: ../../LICENSE diff --git a/docs/source/reference/index.rst b/docs/source/reference/index.rst new file mode 100644 index 00000000..490b7d67 --- /dev/null +++ b/docs/source/reference/index.rst @@ -0,0 +1,59 @@ +.. _api_reference: + +API Reference +============= + +.. _core_modules_api_index: + +Core Modules +------------ + +.. autosummary:: + :nosignatures: + :toctree: api/ + :template: module_summary.rst + + cool_seq_tool.app + cool_seq_tool.schemas + cool_seq_tool.utils + cool_seq_tool.data.data_downloads + +.. _sources_modules_api_index: + +Data Sources +------------ + +.. autosummary:: + :nosignatures: + :toctree: api/sources/ + :template: module_summary.rst + + cool_seq_tool.sources.mane_transcript_mappings + cool_seq_tool.sources.transcript_mappings + cool_seq_tool.sources.uta_database + +.. _handlers_modules_api_index: + +Data Handlers +------------- + +.. autosummary:: + :nosignatures: + :toctree: api/handlers/ + :template: module_summary.rst + + cool_seq_tool.handlers.seqrepo_access + +.. _mappers_modules_api_index: + +Data Mappers +------------ + +.. autosummary:: + :nosignatures: + :toctree: api/mappers/ + :template: module_summary.rst + + cool_seq_tool.mappers.alignment + cool_seq_tool.mappers.exon_genomic_coords + cool_seq_tool.mappers.mane_transcript diff --git a/docs/source/transcript_selection.rst b/docs/source/transcript_selection.rst new file mode 100644 index 00000000..1d0ea0a2 --- /dev/null +++ b/docs/source/transcript_selection.rst @@ -0,0 +1,35 @@ +.. _transcript_selection_policy: + +Transcript Selection +==================== + +One of the core uses of Cool-Seq-Tool is to acquire and use consensus-based, representative transcripts in performing genomic analysis. Here, we describe the selection processes, programmed in the :py:class:`MANETranscript ` class, for choosing the best available transcripts that are compatible with requested data. + +We rely heavily on transcripts annotated under the `Matched Annotation from NCBI and EMBL-EBI (MANE)` Transcripts project. For more information on the MANE project, see the `NCBI MANE page `_. + +.. _transcript_compatibility: + +Transcript compatibility +------------------------ + +The following validation checks are performed to determine compatibility of a transcript position: + +* The position exists on an accession +* The sequence matches the expected reference sequence for a given accession or position +* Exon numbering matches known exon structure + +A transcript that fails to pass any of these checks is discarded as incompatible. + +Representative transcript priority +---------------------------------- + +All compatible transcripts are evaluated and ordered against the below criteria. The candidate transcript which meets the earliest criterion is chosen as representative. + +#. Transcript is annotated as a `MANE Select` transcript +#. Transcript is annotated as a `MANE Plus Clinical` transcript +#. Transcript is the longest-compatible remaining transcript +#. Transcript is the first-published (lowest-numbered RefSeq/Ensembl accession) remaining transcript + +.. note:: + + We always prefer the most recent version of a transcript associated with an assembly. diff --git a/docs/source/usage.rst b/docs/source/usage.rst new file mode 100644 index 00000000..6b859b53 --- /dev/null +++ b/docs/source/usage.rst @@ -0,0 +1,81 @@ +.. _usage: + +Usage +===== + +Cool-Seq-Tool provides easy access to, and useful operations on, a selection of important genomic resources. Modules are divided into three groups: + +* :ref:`Data sources `, for basic acquisition and setup for a data source via Python + +* :ref:`Data handlers `, for additional operations on top of existing sources + +* :ref:`Data mappers `, for functions that incorporate multiple sources/handlers to produce output + +The core :py:class:`CoolSeqTool ` class encapsulates all of their functions and can be used for easy initialization and access: + +.. code-block:: pycon + + >>> from cool_seq_tool.app import CoolSeqTool + >>> cst = CoolSeqTool() + >>> cst.seqrepo_access.translate_alias("NM_002529.3")[0][-1] + 'ga4gh:SQ.RSkww1aYmsMiWbNdNnOTnVDAM3ZWp1uA' + >>> cst.transcript_mappings.ensembl_protein_for_gene_symbol["BRAF"][0] + 'ENSP00000419060' + >>> await cst.uta_db.get_ac_from_gene("BRAF") + ['NC_000007.14', 'NC_000007.13'] + +Descriptions and examples of functions can be found in the :ref:`API Reference ` section. + +REST server +----------- + +Core Cool-Seq-Tool functions can also be performed via a REST HTTP interface, provided via `FastAPI `_. Use the ``uvicorn`` shell command to start a server instance: + +.. code-block:: shell + + uvicorn cool_seq_tool.api:app + +By default, ``uvicorn`` serves to port 8000. Once initialized, go to ``_ in a web browser for OpenAPI docs describing available endpoints. + +REST routes are defined using the FastAPI ``APIRouter`` class, meaning that they can also be mounted to other FastAPI applications: + +.. code-block:: python + + from fastapi import FastAPI + from cool_seq_tool.routers import mane + + app = FastAPI() + app.include_router(mane.router) + +.. _configuration: + +Environment configuration +------------------------- + +Individual classes will accept arguments upon initialization to set parameters regarding data sources. In general, these parameters are also configurable via environment variables, e.g. in a cloud deployment. + +.. list-table:: + :widths: 25 100 + :header-rows: 1 + + * - Variable + - Description + * - ``LRG_REFSEQGENE_PATH`` + - Path to LRG_RefSeqGene file. Used in :py:class:`TranscriptMappings ` to provide mappings between gene symbols and RefSeq/Ensembl transcript accessions. If not defined, defaults to the most recent version (formatted as ``data/LRG_RefSeqGene_YYYYMMDD``) within the Cool-Seq-Tool library directory. Cool-Seq-Tool will acquire this data manually if no configuration is provided. + * - ``TRANSCRIPT_MAPPINGS_PATH`` + - Path to transcript mapping file generated from `Ensembl BioMart `_. Used in :py:class:`TranscriptMappings `. If not defined, uses a copy of the file that is bundled within the Cool-Seq-Tool installation. See the :ref:`contributor instructions ` for information on manually rebuilding it. Cool-Seq-Tool will acquire this data manually if no configuration is provided. + * - ``MANE_SUMMARY_PATH`` + - Path to MANE Summary file. Used in :py:class:`MANETranscriptMappings ` to provide MANE transcript annotations. If not defined, defaults to the most recent version (formatted as ``data/MANE.GRCh38vX.X.summary.txt``) within the Cool-Seq-Tool library directory. + * - ``SEQREPO_ROOT_DIR`` + - Path to SeqRepo directory (i.e. contains ``aliases.sqlite3`` database file, and ``sequences`` directory). Used by :py:class:`SeqRepoAccess `_, i.e. of the form ``postgresql://:@://``, used by the :py:class:`cool_seq_tool.sources.uta_database.UTADatabase` class. By default, it is set to ``postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b``. + * - ``LIFTOVER_CHAIN_37_TO_38`` + - A path to a `chainfile `_ for lifting from GRCh37 to GRCh38. Used by :py:class:`cool_seq_tool.sources.uta_database.UTADatabase` as input to `pyliftover `_. If not provided, pyliftover will fetch it automatically from UCSC. + * - ``LIFTOVER_CHAIN_38_TO_37`` + - A path to a `chainfile `_ for lifting from GRCh38 to GRCh37. Used by :py:class:`cool_seq_tool.sources.uta_database.UTADatabase` as input to `pyliftover `_. If not provided, pyliftover will fetch it automatically from UCSC. + +Schema support +-------------- + +Many genomic data objects produced by Cool-Seq-Tool are structured in conformance with the `Variation Representation Specification `_, courtesy of the `VRS-Python `_ library. diff --git a/pyproject.toml b/pyproject.toml index 685e11d0..c50fa99c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,6 +6,8 @@ build-backend = "setuptools.build_meta:__legacy__" line-length = 88 [tool.ruff] +exclude = ["docs/source/conf.py"] + # pycodestyle (E, W) # Pyflakes (F) # flake8-annotations (ANN) diff --git a/setup.cfg b/setup.cfg index e94da141..14ba77ef 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,6 +1,6 @@ [metadata] name = cool_seq_tool -description = Common Operations On Lots-of Sequences Tool. +description = Common Operations On Lots of Sequences Tool long_description = file:README.md long_description_content_type = text/markdown author = Wagner Lab, Nationwide Childrens Hospital @@ -44,5 +44,13 @@ tests = pytest-asyncio == 0.18.3 mock +docs = + sphinx==6.1.3 + sphinx-autodoc-typehints==1.22.0 + sphinx-autobuild==2021.3.14 + sphinx-copybutton==0.5.2 + sphinxext-opengraph==0.8.2 + furo==2023.3.27 + [tool:pytest] addopts = --ignore setup.py --ignore codebuild/ --doctest-modules --cov-report term-missing --disable-warnings --cov .