SPDI normalization with refseq files #76

Open
wants to merge 29 commits into
base: main
e1247a9
Add SPDI normalization with refseq files
mihaitodor Sep 21, 2023
4b6fd02
WIP
mihaitodor Sep 27, 2023
c758d11
Test SPDI 3 bit packing RefSeq scheme
mihaitodor Oct 5, 2023
be468fb
Disable some hgvs validations and rename some variables
mihaitodor Oct 18, 2023
addb39e
Fix bit mask bug
mihaitodor Oct 19, 2023
08eacee
Add historical RNA sequences
mihaitodor Nov 16, 2023
96e4b87
Set the HGVS_SEQREPO_URL env var for the hgvs library
mihaitodor Nov 16, 2023
98ac143
Disable the HGVS library LRU cache to avoid blowing up memory
mihaitodor Nov 16, 2023
15ec580
Add Normalize Variant utility
mihaitodor Nov 16, 2023
657b34b
Output the projected HGVS variants via Normalize Variant
mihaitodor Nov 16, 2023
67bf94e
Update to the latest version of UTA
mihaitodor Nov 16, 2023
137f86b
Re-enable variant validation
mihaitodor Dec 7, 2023
2b8bc62
Fix CI/CD pipeline
mihaitodor Dec 7, 2023
66ac5da
Emit a 422 status when unable to normalize a variant
mihaitodor Dec 7, 2023
0f9c51a
Use Heroku Prod UTA database
mihaitodor Dec 12, 2023
24ad91f
Add manual deployment instructions
mihaitodor Dec 14, 2023
19b8c8d
Use Biocommons bioutils package for SPDI normalization
mihaitodor Dec 14, 2023
b06fb4d
Emit a 404 status when the seqfetcher endpoint fails
mihaitodor Dec 14, 2023
e1c137b
Fix bug in get_variant and fix tests
mihaitodor Dec 18, 2023
39b8b0e
Don't trim alleles when ref == alt for SPDI normalization
mihaitodor Dec 21, 2023
bf892ba
Disable HGVS strict bounds checks
mihaitodor Jan 6, 2024
d75368e
Fix broken test which now works again as expected
mihaitodor Jan 6, 2024
b754775
Re-enable test
mihaitodor Jan 6, 2024
4251ff3
Add HLA normalization via py-ard
mihaitodor Jan 25, 2024
68943d1
Refactor utilities data fetching
mihaitodor Jan 25, 2024
13179c7
Add utility for fetching and packing RefSeq data
mihaitodor Jan 25, 2024
6f440f7
Use pyliftover as fallback for hgvs liftover failures
mihaitodor Apr 10, 2024
6b70281
WIP...?
mihaitodor Sep 5, 2024
e5a1102
WIP2...
mihaitodor Jan 21, 2025
2 changes: 2 additions & 0 deletions .env
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
UTA_DATABASE_SCHEMA=uta_20240523b
UTILITIES_DATA_VERSION=113c119
2 changes: 1 addition & 1 deletion .github/workflows/cicd.yml
@@ -28,7 +28,7 @@ jobs:
args: --extend-ignore E501,E741

- name: Run Tests
run: python -m pytest
run: ./fetch_utilities_data.sh && python -m pytest

deploy:
name: Deploy to dev
4 changes: 4 additions & 0 deletions .gitignore
@@ -4,3 +4,7 @@
.pytest_cache
__pycache__
.venv
/seqrepo
/refseq
/data
/tmp
3 changes: 0 additions & 3 deletions .vscode/settings.json
@@ -10,9 +10,6 @@
"python.testing.pytestArgs": [
"."
],
"[python]": {
"editor.defaultFormatter": "ms-python.autopep8",
},
"autopep8.args": [
"--max-line-length=200"
],
2 changes: 1 addition & 1 deletion Procfile
@@ -1 +1 @@
web: gunicorn run:app
web: ./fetch_utilities_data.sh && gunicorn run:app
127 changes: 125 additions & 2 deletions README.md
@@ -42,9 +42,132 @@ The operations return the following status codes:

## Testing

To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code Testing functionality which should discover them automatically. You can also
run `python3 -m pytest` from the terminal to execute them all.
For local development, you will have to create a `secrets.env` file in the root of the repo and add to it the MongoDB
password and, optionally, the UTA Postgres database connection string (see the UTA section below for details):

```
MONGODB_READONLY_PASSWORD=...
UTA_DATABASE_URL=...
```
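
As an illustration of how such a file can be consumed, here is a minimal sketch that parses simple `KEY=VALUE` lines. The `load_env_file` helper is hypothetical and not taken from this repo, which may well rely on a library such as `python-dotenv` instead.

```python
def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments.

    Hypothetical helper for illustration only; it does not handle multi-line
    values, `export` prefixes, or variable interpolation.
    """
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            # Split on the first '=' only, so values may contain '=' themselves.
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

The returned dictionary could then be copied into `os.environ` before the app or the tests start.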

To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
Testing functionality which should discover them automatically. You can also run `python3 -m pytest` from the terminal
to execute them all.

Additionally, since the tests run against the MongoDB database, if you need to update the test data in this repo, you
can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
changes.
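
The overwrite switch described above follows a common "golden file" pattern. A minimal sketch of that pattern, assuming the expected data is stored as JSON; the `compare_or_overwrite` helper is hypothetical and is not the repo's actual test code:

```python
import json
import os
from pathlib import Path


def compare_or_overwrite(actual, expected_path):
    """Compare `actual` with the JSON stored at `expected_path`.

    When OVERWRITE_TEST_EXPECTED_DATA=true, the file is rewritten with
    `actual` instead of being compared. Hypothetical helper for illustration.
    """
    expected_path = Path(expected_path)
    if os.environ.get("OVERWRITE_TEST_EXPECTED_DATA", "").lower() == "true":
        expected_path.write_text(json.dumps(actual, indent=2, sort_keys=True) + "\n")
        return True
    expected = json.loads(expected_path.read_text())
    return actual == expected
```

A test would then call `assert compare_or_overwrite(response_json, "tests/expected/some_case.json")`, where the path is only an example.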

## Heroku Deployment

Currently, there are two environments running in Heroku:
- Dev: <https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/>
- Prod: <https://fhir-gen-ops.herokuapp.com/>

Pull requests will trigger a deployment to the dev environment automatically after being merged.

The ["Manual Deployment"](https://github.com/FHIR/genomics-operations/actions/workflows/manual_deployment.yml) workflow
can be used to deploy code to either the `dev` or `prod` environments. To do so, please select "Run workflow", ignore
the "Use workflow from" dropdown which lists the branches in the current repo (I can't disable / remove it) and then
select the environment, the branch and the repository. By default, the `https://github.com/FHIR/genomics-operations`
repo is specified, but you can replace it with any fork.

Deployments to the prod environment can only be triggered manually from the `main` branch of the repo using the Manual
Deployment workflow.

### Heroku Stack

Make sure that the Python version under [`runtime.txt`](./runtime.txt) is
[supported](https://devcenter.heroku.com/articles/python-support#supported-runtimes) by the
[Heroku stack](https://devcenter.heroku.com/articles/stack) that is currently running in each environment.

### UTA Database

The Biocommons [hgvs](https://github.com/biocommons/hgvs) library which is used for variant parsing, validation and
normalisation requires access to a copy of the [UTA](https://github.com/biocommons/uta) Postgres database.

We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).

We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
currently imported database schema.

#### Database import procedure (it will take about 30 minutes):

- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
open a new window)
- Go to the Settings tab
- Click "View Credentials"
- Use the fields from this window to fill in the variables below

```shell
$ POSTGRES_HOST="<Heroku Postgres Host>"
$ POSTGRES_DATABASE="<Heroku Postgres Database>"
$ POSTGRES_USER="<Heroku Postgres User>"
$ export PGPASSWORD="<Heroku Postgres Password>" # Exported so that psql can read it from its environment
$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
```

Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.
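
For readers less familiar with `grep -v`, the filtering step can be sketched in Python as follows. The two patterns are copied from the command above; the `filter_dump_lines` name is hypothetical, and the sample lines in the usage note are made up for illustration.

```python
import re

# Same patterns as the two `grep -v` invocations in the import command.
SKIP_PATTERNS = [
    re.compile(r"^GRANT USAGE ON SCHEMA .* TO anonymous;$"),
    re.compile(r"^ALTER .* OWNER TO uta_admin;$"),
]


def filter_dump_lines(lines):
    """Yield only the dump lines that match none of the skip patterns."""
    for line in lines:
        text = line.rstrip("\n")
        if any(pattern.search(text) for pattern in SKIP_PATTERNS):
            continue
        yield line
```

For example, a line such as `ALTER TABLE gene OWNER TO uta_admin;` is dropped, while `CREATE TABLE gene (...);` passes through unchanged.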

Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
it.

#### Connection string

The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.

For local development, if you'd like to use this Postgres instance instead of the HGVS public one
(`postgresql://anonymous:anonymous@uta.biocommons.org/uta`), please add `UTA_DATABASE_URL` with the Heroku Postgres
connection string in the `secrets.env` file.
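
A minimal sketch of how the connection string could be resolved at startup, falling back to the public instance when `UTA_DATABASE_URL` is not set. The `resolve_uta_url` helper is hypothetical. Appending the schema as the last path segment is an assumption based on the URL format the hgvs library documents for UTA; the actual wiring in this repo may differ.

```python
import os

# Public UTA instance provided by the hgvs maintainers.
PUBLIC_UTA_URL = "postgresql://anonymous:anonymous@uta.biocommons.org/uta"


def resolve_uta_url(env=None):
    """Return the UTA connection URL, with the schema appended if configured.

    Hypothetical helper: uses UTA_DATABASE_URL when present, otherwise the
    public instance, and appends UTA_DATABASE_SCHEMA as the final path segment.
    """
    env = os.environ if env is None else env
    base = env.get("UTA_DATABASE_URL") or PUBLIC_UTA_URL
    schema = env.get("UTA_DATABASE_SCHEMA")
    return f"{base.rstrip('/')}/{schema}" if schema else base
```

Passing a plain dictionary as `env` makes the function easy to exercise in tests without touching the real environment.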

#### Testing the database

```shell
$ pgcli "${UTA_DATABASE_URL}"
> set schema '<UTA Schema>'; # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
> select count(*) from alembic_version
union select count(*) from associated_accessions
union select count(*) from exon
union select count(*) from exon_aln
union select count(*) from exon_set
union select count(*) from gene
union select count(*) from meta
union select count(*) from origin
union select count(*) from seq
union select count(*) from seq_anno
union select count(*) from transcript
union select count(*) from translation_exception;
```

### RefSeq data

The RefSeq metadata from the UTA database needs to be in sync with the RefSeq data which is available for the *Seqfetcher
Utility* endpoint. Currently, this is stored in GitHub as release artifacts.

To update the RefSeq data, you will have to install `seqrepo` locally and run `./utilities/pack_seqrepo_data.py`. Here
is a step-by-step guide on how to do this:

```shell
$ mkdir seqrepo
$ cd seqrepo
$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install setuptools==75.7.0
$ pip install biocommons.seqrepo==0.6.9
$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug that's causing issues with the builtin
$ # rsync on OSX.
$ brew install rsync # OSX-specific. On Linux, the distribution's package manager should provide rsync.
$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
$ # If you get a "Permission denied" error, you can run the following command (using the temp directory which got created):
$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
$
$ # cd to genomics-operations repo
$ python ./utilities/pack_seqrepo_data.py --seqrepo_dir /path/to/seqrepo/dir/latest
$ # Upload tar archives from ./tmp/ to a new GitHub release and then update `UTILITIES_DATA_VERSION` in the `.env` file
$ # such that it contains the short SHA of the new release which contains the updated data.
```
7 changes: 7 additions & 0 deletions app/__init__.py
@@ -3,6 +3,13 @@
from flask_cors import CORS
import os

import hgvs
# Disable the hgvs LRU cache to avoid blowing up memory
# TODO: Revisit this, since this caching might not use a ton of memory.
hgvs.global_config.lru_cache.maxsize = 0
# Disable HGVS strict bounds checks as a workaround for liftover failures: https://github.com/biocommons/hgvs/issues/717
hgvs.global_config.mapping.strict_bounds = False


def create_app():
# App and API
80 changes: 80 additions & 0 deletions app/api_spec.yml
@@ -1248,6 +1248,86 @@ paths:
pattern: '^\s*[Nn][Cc]_\d{4,10}(\.)(\d{1,2}):\d{1,10}-\d{1,10}\s*$'
example: "NC_000001.11:11794399-11794400"

/utilities/seqfetcher/1/sequence/{acc}:
get:
summary: "Seqfetcher"
operationId: "app.utilities_endpoints.seqfetcher"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns RefSeq subsequence"
content:
text/plain:
schema:
type: string
parameters:
- name: acc
in: path
required: true
description: Accession
schema:
type: string
example: "NC_000001.10"
- name: start
in: query
required: true
description: Subsequence start index
schema:
type: integer
example: 1
- name: end
in: query
required: true
description: Subsequence end index
schema:
type: integer
example: 2

/utilities/normalize-variant:
get:
summary: "Normalize Variant"
operationId: "app.utilities_endpoints.normalize_variant"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns a normalized variant in both GRCh37 and GRCh38."
content:
application/json:
schema:
type: object
parameters:
- name: variant
in: query
required: true
description: "Variant."
schema:
type: string
example: "NM_021960.4:c.740C>T"

/utilities/normalize-hla:
get:
summary: "Normalize HLA"
operationId: "app.utilities_endpoints.normalize_hla"
tags:
- "Operations Utilities (not part of balloted HL7 Operations)"
responses:
"200":
description: "Returns a normalized HLA ARD allele."
content:
application/json:
schema:
type: object
parameters:
- name: allele
in: query
required: true
description: "Allele."
schema:
type: string
example: "B14"

tags:
- name: Subject Genotype Operations
- name: Subject Phenotype Operations
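
To illustrate how a client might address the Seqfetcher endpoint defined in this spec, here is a minimal sketch that only builds the request URL. The `seqfetcher_url` helper and the base URL in the usage note are assumptions for illustration; the path and the `start`/`end` query parameters come from the spec above.

```python
from urllib.parse import quote, urlencode


def seqfetcher_url(base_url, acc, start, end):
    """Build the Seqfetcher request URL for an accession and index range.

    Hypothetical helper: it performs no request and no validation of the
    accession or of the start/end values.
    """
    path = f"/utilities/seqfetcher/1/sequence/{quote(acc, safe='')}"
    query = urlencode({"start": start, "end": end})
    return f"{base_url.rstrip('/')}{path}?{query}"
```

For instance, `seqfetcher_url("https://example.org", "NC_000001.10", 1, 2)` yields a URL that could be passed to any HTTP client; per the spec, a successful response is the subsequence as `text/plain`.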