docs: add documentation (#222)
jsstevenson authored Dec 26, 2023
1 parent 239833c commit 28e99a9
Showing 35 changed files with 1,183 additions and 473 deletions.
18 changes: 18 additions & 0 deletions .github/workflows/checks.yml
Original file line number Diff line number Diff line change
@@ -30,3 +30,21 @@ jobs:

      - name: ruff
        uses: chartboost/ruff-action@v1
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: 3.11

      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install '.[docs]'
      - name: Attempt docs build
        working-directory: ./docs
        run: make html
7 changes: 6 additions & 1 deletion .gitignore
@@ -69,7 +69,7 @@ instance/
.scrapy

# Sphinx documentation
docs/_build/
docs/build/

# PyBuilder
target/
@@ -130,9 +130,14 @@ dmypy.json

Pipfile.lock

.DS_Store

# Data files
cool_seq_tool/data/seqrepo/
cool_seq_tool/data/*.txt
cool_seq_tool/data/LRG_RefSeqGene*
cool_seq_tool/data/MANE*
cool_seq_tool/data/notebooks/

# Autogenerated docs
docs/source/reference/api
16 changes: 16 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,16 @@
version: 2

build:
  os: "ubuntu-20.04"
  tools:
    python: "3.11"

python:
  install:
    - method: pip
      path: .
      extra_requirements:
        - docs

sphinx:
  configuration: docs/source/conf.py
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2021 VICC
Copyright (c) 2021-2023 Wagner Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
169 changes: 34 additions & 135 deletions README.md
@@ -1,153 +1,52 @@
# **C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool
<h1 align="center">
CoolSeqTool
</h1>

The **cool-seq-tool** provides:
**[Documentation](#)** · [Installation](#) · [Usage](#) · [API reference](#)

- Transcript alignment data from the [UTA](https://github.com/biocommons/uta) database
- Fast access to sequence data using [SeqRepo](https://github.com/biocommons/biocommons.seqrepo)
- Liftover between assemblies (GRCh38 <--> GRCh37) from [PyLiftover](https://github.com/konstantint/pyliftover)
- Lifting over to preferred [MANE](https://www.ncbi.nlm.nih.gov/refseq/MANE/) compatible transcript. See [here](docs/TranscriptSelectionPriority.md) for more information.
## Overview

## Installation
<!-- description -->
The **CoolSeqTool** provides:

### pip
- A Pythonic API on top of sequence data of interest to tertiary analysis tools, including mappings between gene names and transcripts, [MANE transcript](https://www.ncbi.nlm.nih.gov/refseq/MANE/) descriptions, and the [Universal Transcript Archive](https://github.com/biocommons/uta)
- Augmented access to the [SeqRepo](https://github.com/biocommons/biocommons.seqrepo) database, including multiple additional methods and tools
- Mapping tools that combine the above to support translation between reference sequences, annotation layers, and MANE transcripts
<!-- /description -->

```commandline
pip install cool-seq-tool[dev,tests]
```

### Development

Clone the repo:

```commandline
git clone https://github.com/GenomicMedLab/cool-seq-tool
cd cool-seq-tool
```

[Install Pipenv](https://pipenv-fork.readthedocs.io/en/latest/#install-pipenv-today) if necessary.

Install backend dependencies and enter Pipenv environment:

```commandline
pipenv shell
pipenv update
pipenv install --dev
```

### UTA Database Installation

`cool-seq-tool` uses a local installation of the UTA database. For other ways to install, visit [biocommons.uta](https://github.com/biocommons/uta).

#### Local Installation

_The following commands will likely need modification appropriate for the installation environment._
1. Install [PostgreSQL](https://www.postgresql.org/)
2. Create user and database.

```
$ createuser -U postgres uta_admin
$ createuser -U postgres anonymous
$ createdb -U postgres -O uta_admin uta
```
3. To install locally, from the _cool_seq_tool/data_ directory:
```
export UTA_VERSION=uta_20210129b.pgd.gz
curl -O https://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433
```
##### UTA Installation Issues
If you have trouble installing UTA, you can visit [these two READMEs](https://github.com/ga4gh/vrs-python/tree/main/docs/setup_help).
#### Connecting to the database
To connect to the UTA database, you can use the default url (`postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b`).
If you do not wish to use the default, you must set the environment variable `UTA_DB_URL` which has the format of `driver://user:password@host:port/database/schema`.
### Data Downloads
---

#### SeqRepo
`cool-seq-tool` relies on [seqrepo](https://github.com/biocommons/biocommons.seqrepo), which you must download yourself.
## Install

Use the `SEQREPO_ROOT_DIR` environment variable to set the path of an already existing SeqRepo directory. The default is `/usr/local/share/seqrepo/latest`.
CoolSeqTool is available on [PyPI](https://pypi.org/project/cool-seq-tool)

From the _root_ directory:
```shell
python3 -m pip install cool-seq-tool
```
pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29 # Replace with latest version using `seqrepo list-remote-instances` if outdated
```
If you get an error similar to the one below:
```
PermissionError: [Errno 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'
```
You will want to do the following:\
(*Might not be ._fkuefgd, so replace with your error message path*)
```console
sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
exit
```

#### LRG_RefSeqGene

`cool-seq-tool` fetches the latest version of `LRG_RefSeqGene` if the environment variable `LRG_REFSEQGENE_PATH` is not set. When `LRG_REFSEQGENE_PATH` is set, `cool-seq-tool` will look at this path and expect the LRG_RefSeqGene file. This file can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene).

#### MANE Summary Data

`cool-seq-tool` fetches the latest version of `MANE.GRCh38.*.summary.txt.gz` if the environment variable `MANE_SUMMARY_PATH` is not set. When `MANE_SUMMARY_PATH` is set, `cool-seq-tool` will look at this path and expect the MANE Summary Data file. This file can be found [here](https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/).

#### transcript_mapping.tsv
`cool-seq-tool` is packaged with transcript mapping data acquired from [Ensembl BioMart](http://www.ensembl.org/biomart/martview). If the environment variable `TRANSCRIPT_MAPPINGS_PATH` is not set, `cool-seq-tool` will use the built-in file. When `TRANSCRIPT_MAPPINGS_PATH` is set, `cool_seq_tool` will look at this path and expect to find the transcript mapping TSV file.

To acquire this data manually from the [BioMart](https://www.ensembl.org/biomart/martview), select the `Human Genes (GRCh38.p13)` dataset and choose the following attributes:
See the [installation instructions](#) in the documentation for a description of dependency setup requirements.

* Gene stable ID
* Gene stable ID version
* Transcript stable ID
* Transcript stable ID version
* Protein stable ID
* Protein stable ID version
* RefSeq match transcript (MANE Select)
* Gene name
---

![image](biomart.png)
## Usage

## Starting the UTA Tools Service Locally
All CoolSeqTool resources can be initialized by way of a top-level class instance:

To start the service, run the following:

```commandline
uvicorn cool_seq_tool.api:app --reload
```pycon
>>> from cool_seq_tool.app import CoolSeqTool
>>> from cool_seq_tool.schemas import AnnotationLayer, ResidueMode
>>> cst = CoolSeqTool()
>>> result = await cst.mane_transcript.get_mane_transcript(
... "NP_004324.2",
... 599,
... AnnotationLayer.PROTEIN,
... residue_mode=ResidueMode.INTER_RESIDUE,
... )
>>> result.gene, result.refseq, result.status
('EGFR', 'NM_005228.5', <TranscriptPriority.MANE_SELECT: 'mane_select'>)
```

Next, view the FastAPI on your local machine: http://127.0.0.1:8000/cool_seq_tool

## Init coding style tests

Code style is managed by [Ruff](https://github.com/astral-sh/ruff) and [Black](https://github.com/psf/black), and should be checked prior to commit.

We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.

This ensures:

* Check code style
* Check for added large files
* Detect AWS Credentials
* Detect Private Key
---

Before first commit run:
## Feedback and contributing

```
pre-commit install
```

## Testing
From the _root_ directory of the repository:
```
pytest
```
We welcome bug reports, feature requests, and code contributions from users and interested collaborators. The [documentation](#) contains guidance for submitting feedback and contributing new code.
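
The usage example in the README awaits `get_mane_transcript` directly at the prompt. That only works in a REPL with top-level `await` support, such as `python3 -m asyncio` or IPython; in a regular script the coroutine has to be driven by an event loop. A minimal sketch of that pattern follows. `fake_get_mane_transcript` is a hypothetical stand-in (it is not part of CoolSeqTool) so the snippet runs without UTA or SeqRepo, and the value it returns is placeholder data, not real output.

```python
import asyncio


async def fake_get_mane_transcript(accession: str, position: int) -> dict:
    """Hypothetical stand-in for an async lookup such as
    ``cst.mane_transcript.get_mane_transcript``; returns placeholder data.
    """
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"query": accession, "position": position}


async def main() -> dict:
    # In real code this would be something like:
    #   cst = CoolSeqTool()
    #   return await cst.mane_transcript.get_mane_transcript(...)
    return await fake_get_mane_transcript("NP_004324.2", 599)


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
result = asyncio.run(main())
print(result)
```

Replacing the stand-in with the real call should be the only change needed, assuming the data sources are configured.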
4 changes: 2 additions & 2 deletions cool_seq_tool/api.py
@@ -24,9 +24,9 @@ def custom_openapi() -> Dict:
if app.openapi_schema:
return app.openapi_schema
openapi_schema = get_openapi(
title="The GenomicMedLab Cool Seq Tool",
title="The GenomicMedLab Cool-Seq-Tool",
version=__version__,
description="Common Operations On Lots-of Sequences Tool.",
description="Common Operations On Lots of Sequences Tool.",
routes=app.routes,
)

31 changes: 26 additions & 5 deletions cool_seq_tool/app.py
@@ -1,4 +1,6 @@
"""Module for initializing data sources."""
"""Provides core CoolSeqTool class, which non-redundantly initializes all Cool-Seq-Tool
data handler and mapping resources for straightforward access.
"""
import logging
from pathlib import Path
from typing import Optional
@@ -25,7 +27,26 @@


class CoolSeqTool:
"""Class to initialize data sources."""
"""Non-redundantly initialize all Cool-Seq-Tool data resources, available under the
following attribute names:

* ``self.seqrepo_access``: :py:class:`SeqRepoAccess <cool_seq_tool.handlers.seqrepo_access.SeqRepoAccess>`
* ``self.transcript_mappings``: :py:class:`TranscriptMappings <cool_seq_tool.sources.transcript_mappings.TranscriptMappings>`
* ``self.mane_transcript_mappings``: :py:class:`MANETranscriptMappings <cool_seq_tool.sources.mane_transcript_mappings.MANETranscriptMappings>`
* ``self.uta_db``: :py:class:`UTADatabase <cool_seq_tool.sources.uta_database.UTADatabase>`
* ``self.alignment_mapper``: :py:class:`AlignmentMapper <cool_seq_tool.mappers.alignment.AlignmentMapper>`
* ``self.mane_transcript``: :py:class:`MANETranscript <cool_seq_tool.mappers.mane_transcript.MANETranscript>`
* ``self.ex_g_coords_mapper``: :py:class:`ExonGenomicCoordsMapper <cool_seq_tool.mappers.exon_genomic_coords.ExonGenomicCoordsMapper>`

Initialization with default resource locations is straightforward:

.. code-block:: pycon

   >>> from cool_seq_tool.app import CoolSeqTool
   >>> cst = CoolSeqTool()

See the :ref:`configuration <configuration>` section for more information.
"""

def __init__(
self,
@@ -37,11 +58,11 @@ def __init__(
) -> None:
"""Initialize CoolSeqTool class
:param transcript_file_path: The path to transcript_mapping.tsv
:param lrg_refseqgene_path: The path to LRG_RefSeqGene
:param transcript_file_path: The path to ``transcript_mapping.tsv``
:param lrg_refseqgene_path: The path to the LRG_RefSeqGene file
:param mane_data_path: Path to RefSeq MANE summary data
:param db_url: PostgreSQL connection URL
Format: `driver://user:password@host/database/schema`
Format: ``driver://user:password@host/database/schema``
:param sr: SeqRepo instance. If this is not provided, will create a new instance
"""
if not sr:
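
The `db_url` parameter packs several values into one string (`driver://user:password@host/database/schema`, optionally with a port, as in the default URL quoted in the old README). As a rough illustration of how such a URL decomposes, here is a sketch using only the standard library. `ParsedDbUrl` and `parse_db_url` are hypothetical names introduced for this example; this is not how CoolSeqTool itself parses the URL.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class ParsedDbUrl(NamedTuple):
    """Hypothetical container for the pieces of a UTA connection URL."""

    driver: str
    user: Optional[str]
    password: Optional[str]
    host: Optional[str]
    port: Optional[int]
    database: str
    schema: str


def parse_db_url(db_url: str) -> ParsedDbUrl:
    """Split ``driver://user:password@host[:port]/database/schema`` into parts."""
    parsed = urlparse(db_url)
    # The path component carries both the database and the schema name.
    path_parts = parsed.path.strip("/").split("/")
    if len(path_parts) != 2 or not all(path_parts):
        raise ValueError("expected a path of the form /database/schema")
    database, schema = path_parts
    return ParsedDbUrl(
        parsed.scheme,
        parsed.username,
        parsed.password,
        parsed.hostname,
        parsed.port,  # None when the URL omits the port
        database,
        schema,
    )


parts = parse_db_url("postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b")
print(parts.database, parts.schema, parts.port)
```

A URL with no `/database/schema` path raises `ValueError` in this sketch, which makes a misconfigured value fail early.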
13 changes: 8 additions & 5 deletions cool_seq_tool/data/data_downloads.py
@@ -1,4 +1,4 @@
"""Module for handling downloadable data files."""
"""Handle acquisition of external data."""
import datetime
import gzip
import logging
@@ -15,8 +15,11 @@


class DataDownload:
"""Class for managing downloadable data files. Responsible for checking if files
are available under default locations, and fetching them if not.
"""Manage downloadable data files. Responsible for checking if files are available
under expected locations, and fetching them if not.
Relevant methods are called automatically by data classes; users should not have
to interact with this class under normal circumstances.
"""

def __init__(self) -> None:
@@ -25,7 +28,7 @@ def __init__(self) -> None:

def get_mane_summary(self) -> Path:
"""Identify latest MANE summary data. If unavailable locally, download from
source.
`NCBI FTP server <https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/current/>`_.
:return: path to MANE summary file
"""
@@ -52,7 +55,7 @@ def get_mane_summary(self) -> Path:

def get_lrg_refseq_gene_data(self) -> Path:
"""Identify latest LRG RefSeq Gene file. If unavailable locally, download from
source.
`NCBI FTP server <https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/>`_.
:return: path to acquired LRG RefSeq Gene data file
"""
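
The `DataDownload` docstrings describe a "use the local file if present, otherwise fetch it" behavior. The general shape of that pattern can be sketched as below. This is an illustration under stated assumptions, not the class's actual code: `get_or_fetch` and `fake_fetch` are hypothetical names, the fetch step writes placeholder text instead of contacting the NCBI FTP server, and the sketch does not attempt the "identify latest version" step the real methods perform.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def get_or_fetch(path: Path, fetch: Callable[[Path], None]) -> Path:
    """Return ``path``, calling ``fetch(path)`` first only if it is missing."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        fetch(path)
    return path


calls: List[Path] = []


def fake_fetch(destination: Path) -> None:
    """Hypothetical downloader: writes placeholder text instead of using FTP."""
    calls.append(destination)
    destination.write_text("placeholder\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "example_summary.txt"
    first = get_or_fetch(target, fake_fetch)
    second = get_or_fetch(target, fake_fetch)  # file already present, no fetch
    fetched_once = len(calls) == 1 and first == second

print(fetched_once)
```

Passing the downloader in as a callable keeps the existence check separate from the network code, which also makes the check easy to exercise without network access.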