docs: add documentation (#222)

GenomicMedLab · Dec 26, 2023 · 28e99a9
1 parent 239833c
commit 28e99a9

Showing 35 changed files with 1,183 additions and 473 deletions.
diff --git a/.github/workflows/checks.yml b/.github/workflows/checks.yml
@@ -30,3 +30,21 @@ jobs:
 
       - name: ruff
         uses: chartboost/ruff-action@v1
+  docs:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: 3.11
+
+      - name: Install dependencies
+        run: |
+          python3 -m pip install --upgrade pip
+          python3 -m pip install '.[docs]'
+
+      - name: Attempt docs build
+        working-directory: ./docs
+        run: make html
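
Note on the docs job: to reproduce the build locally without `make`, a minimal Python sketch follows. It assumes `docs/Makefile` wraps a plain `sphinx-build` call, with `docs/source` as the source directory (where `.readthedocs.yaml` locates `conf.py`) and `docs/build` as the output directory (per `.gitignore`). The helper name `sphinx_html_command` is hypothetical, and the Makefile's actual flags are not shown in this diff.

```python
import sys
from pathlib import Path
from typing import List


def sphinx_html_command(docs_dir: Path) -> List[str]:
    """Build the argument list for an HTML docs build (hypothetical helper).

    Assumes the layout suggested by this commit: ``source/`` holds conf.py
    and ``build/`` receives the output. The real Makefile may differ.
    """
    return [
        sys.executable,  # run Sphinx with the current interpreter
        "-m",
        "sphinx",
        "-b",
        "html",
        str(docs_dir / "source"),
        str(docs_dir / "build" / "html"),
    ]
```

After installing the `docs` extra, the returned list can be passed to `subprocess.run(cmd, check=True)`, so a failed build raises an error much as the CI step would fail.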
diff --git a/.gitignore b/.gitignore
@@ -69,7 +69,7 @@ instance/
 .scrapy
 
 # Sphinx documentation
-docs/_build/
+docs/build/
 
 # PyBuilder
 target/
@@ -130,9 +130,14 @@ dmypy.json
 
 Pipfile.lock
 
+.DS_Store
+
 # Data files
 cool_seq_tool/data/seqrepo/
 cool_seq_tool/data/*.txt
 cool_seq_tool/data/LRG_RefSeqGene*
 cool_seq_tool/data/MANE*
 cool_seq_tool/data/notebooks/
+
+# Autogenerated docs
+docs/source/reference/api
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,16 @@
+version: 2
+
+build:
+  os: "ubuntu-20.04"
+  tools:
+    python: "3.11"
+
+python:
+  install:
+    - method: pip
+      path: .
+      extra_requirements:
+        - docs
+
+sphinx:
+  configuration: docs/source/conf.py
diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 VICC
+Copyright (c) 2021-2023 Wagner Lab
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -1,153 +1,52 @@
-# **C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool
+<h1 align="center">
+CoolSeqTool
+</h1>
 
-The **cool-seq-tool** provides:
+**[Documentation](#)** · [Installation](#) · [Usage](#) · [API reference](#)
 
-  - Transcript alignment data from the [UTA](https://github.com/biocommons/uta) database
-  - Fast access to sequence data using [SeqRepo](https://github.com/biocommons/biocommons.seqrepo)
-  - Liftover between assemblies (GRCh38 <--> GRCh37) from [PyLiftover](https://github.com/konstantint/pyliftover)
-  - Lifting over to preferred [MANE](https://www.ncbi.nlm.nih.gov/refseq/MANE/) compatible transcript. See [here](docs/TranscriptSelectionPriority.md) for more information.
+## Overview
 
-## Installation
+<!-- description -->
+The **CoolSeqTool** provides:
 
-### pip
+ - A Pythonic API on top of sequence data of interest to tertiary analysis tools, including mappings between gene names and transcripts, [MANE transcript](https://www.ncbi.nlm.nih.gov/refseq/MANE/) descriptions, and the [Universal Transcript Archive](https://github.com/biocommons/uta)
+ - Augmented access to the [SeqRepo](https://github.com/biocommons/biocommons.seqrepo) database, including multiple additional methods and tools
+ - Mapping tools that combine the above to support translation between reference sequences, annotation layers, and MANE transcripts
+<!-- /description -->
 
-```commandline
-pip install cool-seq-tool[dev,tests]
-```
-
-### Development
-
-Clone the repo:
-
-```commandline
-git clone https://github.com/GenomicMedLab/cool-seq-tool
-cd cool_seq_tool
-```
-
-[Install Pipenv](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today) if necessary.
-
-Install backend dependencies and enter Pipenv environment:
-
-```commandline
-pipenv shell
-pipenv update
-pipenv install --dev
-```
-
-### UTA Database Installation
-
-`cool-seq-tool` uses intalls local UTA database. For other ways to install, visit [biocommons.uta](https://github.com/biocommons/uta).
-
-#### Local Installation
-
-_The following commands will likely need modification appropriate for the installation environment._
-1. Install [PostgreSQL](https://www.postgresql.org/)
-2. Create user and database.
-
-    ```
-    $ createuser -U postgres uta_admin
-    $ createuser -U postgres anonymous
-    $ createdb -U postgres -O uta_admin uta
-    ```
-
-3. To install locally, from the _cool_seq_tool/data_ directory:
-```
-export UTA_VERSION=uta_20210129b.pgd.gz
-curl -O https://dl.biocommons.org/uta/$UTA_VERSION
-gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433
-```
-
-##### UTA Installation Issues
-If you have trouble installing UTA, you can visit [these two READMEs](https://github.com/ga4gh/vrs-python/tree/main/docs/setup_help).
-
-#### Connecting to the database
-
-To connect to the UTA database, you can use the default url (`postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b`).
-
-If you do not wish to use the default, you must set the environment variable `UTA_DB_URL` which has the format of `driver://user:password@host:port/database/schema`.
-
-### Data Downloads
+---
 
-#### SeqRepo
-`cool-seq-tool` relies on [seqrepo](https://github.com/biocommons/biocommons.seqrepo), which you must download yourself.
+## Install
 
-Use the `SEQREPO_ROOT_DIR` environment variable to set the path of an already existing SeqRepo directory. The default is `/usr/local/share/seqrepo/latest`.
+CoolSeqTool is available on [PyPI](https://pypi.org/project/cool-seq-tool):
 
-From the _root_ directory:
+```shell
+python3 -m pip install cool-seq-tool
 ```
-pip install seqrepo
-sudo mkdir /usr/local/share/seqrepo
-sudo chown $USER /usr/local/share/seqrepo
-seqrepo pull -i 2021-01-29  # Replace with latest version using `seqrepo list-remote-instances` if outdated
-```
-
-If you get an error similar to the one below:
-```
-PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'
-```
-
-You will want to do the following:\
-(*Might not be ._fkuefgd, so replace with your error message path*)
-```console
-sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
-exit
-```
-
-#### LRG_RefSeqGene
-
-`cool-seq-tool` fetches the latest version of `LRG_RefSeqGene` if the environment variable `LRG_REFSEQGENE_PATH` is not set. When `LRG_REFSEQGENE_PATH` is set, `cool-seq-tool` will look at this path and expect the LRG_RefSeqGene file. This file is found can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene).
-
-#### MANE Summary Data
-
-`cool-seq-tool` fetches the latest version of `MANE.GRCh38.*.summary.txt.gz` if the environment variable `MANE_SUMMARY_PATH` is not set. When `MANE_SUMMARY_PATH` is set, `cool-seq-tool` will look at this path and expect the MANE Summary Data file. This file is found can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/).
-
-#### transcript_mapping.tsv
-`cool-seq-tool` is packaged with transcript mapping data acquired from [Ensembl BioMart](http://www.ensembl.org/biomart/martview). If the environment variable `TRANSCRIPT_MAPPINGS_PATH` is not set, `cool-seq-tool` will use the built-in file. When `TRANSCRIPT_MAPPINGS_PATH` is set, `cool_seq_tool` will look at this path and expect to find the transcript mapping TSV file.
 
-To acquire this data manually from the [BioMart](https://www.ensembl.org/biomart/martview), select the `Human Genes (GRCh38.p13)` dataset and choose the following attributes:
+See the [installation instructions](#) in the documentation for a description of dependency setup requirements.
 
-* Gene stable ID
-* Gene stable ID version
-* Transcript stable ID
-* Transcript stable ID version
-* Protein stable ID
-* Protein stable ID version
-* RefSeq match transcript (MANE Select)
-* Gene name
+---
 
-![image](biomart.png)
+## Usage
 
-## Starting the UTA Tools Service Locally
+All CoolSeqTool resources can be initialized by way of a top-level class instance:
 
-To start the service, run the following:
-
-```commandline
-uvicorn cool_seq_tool.api:app --reload
+```pycon
+>>> from cool_seq_tool.app import CoolSeqTool
+>>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode
+>>> cst = CoolSeqTool()
+>>> result = await cst.mane_transcript.get_mane_transcript(
+...     "NP_004324.2",
+...     599,
+...     AnnotationLayer.PROTEIN,
+...     residue_mode=ResidueMode.INTER_RESIDUE,
+... )
+>>> result.gene, result.refseq, result.status
+('EGFR', 'NM_005228.5', <TranscriptPriority.MANE_SELECT: 'mane_select'>)
 ```
 
-Next, view the FastAPI on your local machine: http://127.0.0.1:8000/cool_seq_tool
-
-## Init coding style tests
-
-Code style is managed by [Ruff](https://github.com/astral-sh/ruff) and [Black](https://github.com/psf/black), and should be checked prior to commit.
-
-We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.
-
-This ensures:
-
-* Check code style
-* Check for added large files
-* Detect AWS Credentials
-* Detect Private Key
+---
 
-Before first commit run:
+## Feedback and contributing
 
-```
-pre-commit install
-```
-
-## Testing
-From the _root_ directory of the repository:
-```
-pytest
-```
+We welcome bug reports, feature requests, and code contributions from users and interested collaborators. The [documentation](#) contains guidance for submitting feedback and contributing new code.
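
Note on the README usage example: it relies on top-level `await`, which works in an async-aware REPL (for example `python -m asyncio` or IPython) but not in a plain script. A minimal sketch of the same call as a script follows. The function names `fetch_mane_transcript` and `main` are hypothetical, and the `cool_seq_tool.schemas` import location for `AnnotationLayer` and `ResidueMode` is an assumption that this commit does not show.

```python
import asyncio
from typing import Any


async def fetch_mane_transcript(
    cst: Any, accession: str, position: int, layer: Any, residue_mode: Any
) -> Any:
    """Await the same call the README makes, on any CoolSeqTool-like object."""
    return await cst.mane_transcript.get_mane_transcript(
        accession, position, layer, residue_mode=residue_mode
    )


async def main() -> None:
    # Imported lazily so this module loads without the package installed.
    # The schemas import path is an assumption (not shown in this commit).
    from cool_seq_tool.app import CoolSeqTool
    from cool_seq_tool.schemas import AnnotationLayer, ResidueMode

    cst = CoolSeqTool()
    result = await fetch_mane_transcript(
        cst, "NP_004324.2", 599, AnnotationLayer.PROTEIN, ResidueMode.INTER_RESIDUE
    )
    print(result.gene, result.refseq, result.status)
```

Calling `asyncio.run(main())` runs the lookup. As with the README example, it needs the UTA and SeqRepo dependencies described in the installation instructions.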
diff --git a/cool_seq_tool/api.py b/cool_seq_tool/api.py
@@ -24,9 +24,9 @@ def custom_openapi() -> Dict:
     if app.openapi_schema:
         return app.openapi_schema
     openapi_schema = get_openapi(
-        title="The GenomicMedLab Cool Seq Tool",
+        title="The GenomicMedLab Cool-Seq-Tool",
         version=__version__,
-        description="Common Operations On Lots-of Sequences Tool.",
+        description="Common Operations On Lots of Sequences Tool.",
         routes=app.routes,
     )
 

diff --git a/cool_seq_tool/app.py b/cool_seq_tool/app.py
@@ -1,4 +1,6 @@
-"""Module for initializing data sources."""
+"""Provides core CoolSeqTool class, which non-redundantly initializes all Cool-Seq-Tool
+data handler and mapping resources for straightforward access.
+"""
 import logging
 from pathlib import Path
 from typing import Optional
@@ -25,7 +27,26 @@
 
 
 class CoolSeqTool:
-    """Class to initialize data sources."""
+    """Non-redundantly initialize all Cool-Seq-Tool data resources, available under the
+    following attribute names:
+
+    * ``self.seqrepo_access``: :py:class:`SeqRepoAccess <cool_seq_tool.handlers.seqrepo_access.SeqRepoAccess>`
+    * ``self.transcript_mappings``: :py:class:`TranscriptMappings <cool_seq_tool.sources.transcript_mappings.TranscriptMappings>`
+    * ``self.mane_transcript_mappings``: :py:class:`MANETranscriptMappings <cool_seq_tool.sources.mane_transcript_mappings.MANETranscriptMappings>`
+    * ``self.uta_db``: :py:class:`UTADatabase <cool_seq_tool.sources.uta_database.UTADatabase>`
+    * ``self.alignment_mapper``: :py:class:`AlignmentMapper <cool_seq_tool.mappers.alignment.AlignmentMapper>`
+    * ``self.mane_transcript``: :py:class:`MANETranscript <cool_seq_tool.mappers.mane_transcript.MANETranscript>`
+    * ``self.ex_g_coords_mapper``: :py:class:`ExonGenomicCoordsMapper <cool_seq_tool.mappers.exon_genomic_coords.ExonGenomicCoordsMapper>`
+
+    Initialization with default resource locations is straightforward:
+
+    .. code-block:: pycon
+
+       >>> from cool_seq_tool.app import CoolSeqTool
+       >>> cst = CoolSeqTool()
+
+    See the :ref:`configuration <configuration>` section for more information.
+    """
 
     def __init__(
         self,
@@ -37,11 +58,11 @@ def __init__(
     ) -> None:
         """Initialize CoolSeqTool class
 
-        :param transcript_file_path: The path to transcript_mapping.tsv
-        :param lrg_refseqgene_path: The path to LRG_RefSeqGene
+        :param transcript_file_path: The path to ``transcript_mapping.tsv``
+        :param lrg_refseqgene_path: The path to the LRG_RefSeqGene file
         :param mane_data_path: Path to RefSeq MANE summary data
         :param db_url: PostgreSQL connection URL
-            Format: `driver://user:password@host/database/schema`
+            Format: ``driver://user:password@host/database/schema``
         :param sr: SeqRepo instance. If this is not provided, will create a new instance
         """
         if not sr:
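
Note on the `db_url` format: the docstring gives `driver://user:password@host/database/schema`, while the old README default (`postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b`) also carries a port. The sketch below shows how such a URL breaks into its parts using only the standard library. `DbUrlParts` and `split_db_url` are hypothetical names for illustration and are not how `UTADatabase` necessarily parses the value.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit


class DbUrlParts(NamedTuple):
    """Pieces of a ``driver://user:password@host[:port]/database/schema`` URL."""

    driver: str
    user: Optional[str]
    password: Optional[str]
    host: Optional[str]
    port: Optional[int]
    database: str
    schema: str


def split_db_url(db_url: str) -> DbUrlParts:
    """Split a UTA-style connection URL (hypothetical helper)."""
    parts = urlsplit(db_url)
    # The path must hold exactly two segments: /database/schema
    segments = [seg for seg in parts.path.split("/") if seg]
    if len(segments) != 2:
        raise ValueError("expected a path of the form /database/schema")
    database, schema = segments
    return DbUrlParts(
        driver=parts.scheme,
        user=unquote(parts.username) if parts.username else None,
        password=unquote(parts.password) if parts.password else None,
        host=parts.hostname,
        port=parts.port,  # None when the URL omits the port
        database=database,
        schema=schema,
    )
```

With the old README default, this yields driver `postgresql`, port `5433`, database `uta`, and schema `uta_20210129b`. A URL without `:port` yields `port=None`.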

diff --git a/cool_seq_tool/data/data_downloads.py b/cool_seq_tool/data/data_downloads.py
@@ -1,4 +1,4 @@
-"""Module for handling downloadable data files."""
+"""Handle acquisition of external data."""
 import datetime
 import gzip
 import logging
@@ -15,8 +15,11 @@
 
 
 class DataDownload:
-    """Class for managing downloadable data files. Responsible for checking if files
-    are available under default locations, and fetching them if not.
+    """Manage downloadable data files. Responsible for checking if files are available
+    under expected locations, and fetching them if not.
+
+    Relevant methods are called automatically by data classes; users should not have
+    to interact with this class under normal circumstances.
     """
 
     def __init__(self) -> None:
@@ -25,7 +28,7 @@ def __init__(self) -> None:
 
     def get_mane_summary(self) -> Path:
         """Identify latest MANE summary data. If unavailable locally, download from
-        source.
+        `NCBI FTP server <https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/>`_.
 
         :return: path to MANE summary file
         """
@@ -52,7 +55,7 @@ def get_mane_summary(self) -> Path:
 
     def get_lrg_refseq_gene_data(self) -> Path:
         """Identify latest LRG RefSeq Gene file. If unavailable locally, download from
-        source.
+        `NCBI FTP server <https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/>`_.
 
         :return: path to acquired LRG RefSeq Gene data file
         """
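
Note on the `DataDownload` docstrings: they describe a "use the local copy if present, otherwise fetch it" pattern. A simplified sketch of that pattern follows. `ensure_local_file` and its `fetch` callback are hypothetical names, and the real class also resolves the latest versioned filename on the NCBI FTP server, which is omitted here.

```python
from pathlib import Path
from typing import Callable


def ensure_local_file(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return ``path``, calling ``fetch(path)`` first only if the file is missing.

    ``fetch`` is a caller-supplied download function (an assumption of this
    sketch); it is expected to write the file to the given destination.
    """
    if not path.exists():
        # Create the data directory on first use, then download into it.
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path
```

Because the existence check comes first, repeated calls trigger at most one download, which matches the docstrings' statement that fetching happens only when the file is unavailable locally.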