FHIR · mihaitodor · Sep 21, 2023 · Sep 27, 2023 · Oct 5, 2023 · Oct 18, 2023
diff --git a/.env b/.env
@@ -0,0 +1,2 @@
+UTA_DATABASE_SCHEMA=uta_20240523b
+UTILITIES_DATA_VERSION=113c119
diff --git a/.github/workflows/cicd.yml b/.github/workflows/cicd.yml
@@ -28,7 +28,7 @@ jobs:
           args: --extend-ignore E501,E741
 
       - name: Run Tests
-        run: python -m pytest
+        run: ./fetch_utilities_data.sh && python -m pytest
 
   deploy:
     name: Deploy to dev

diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,7 @@
 .pytest_cache
 __pycache__
 .venv
+/seqrepo
+/refseq
+/data
+/tmp
diff --git a/.vscode/settings.json b/.vscode/settings.json
@@ -10,9 +10,6 @@
     "python.testing.pytestArgs": [
         "."
     ],
-    "[python]": {
-        "editor.defaultFormatter": "ms-python.autopep8",
-    },
     "autopep8.args": [
         "--max-line-length=200"
     ],

diff --git a/Procfile b/Procfile
@@ -1 +1 @@
-web: gunicorn run:app
+web: ./fetch_utilities_data.sh && gunicorn run:app
diff --git a/README.md b/README.md
@@ -42,9 +42,132 @@ The operations return the following status codes:
 
 ## Testing
 
-To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code Testing functionality which should discover them automatically. You can also
-run `python3 -m pytest` from the terminal to execute them all.
+For local development, create a `secrets.env` file in the root of the repo and add the MongoDB password and, optionally,
+the UTA Postgres database connection string to it (see the UTA Database section below for details):
+
+```
+MONGODB_READONLY_PASSWORD=...
+UTA_DATABASE_URL=...
+```
+
+To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
+Testing functionality which should discover them automatically. You can also run `python3 -m pytest` from the terminal
+to execute them all.
 
 Additionally, since the tests run against the Mongo DB database, if you need to update the test data in this repo, you
 can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
 changes.
+
+## Heroku Deployment
+
+Currently, there are two environments running in Heroku:
+- Dev: <https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/>
+- Prod: <https://fhir-gen-ops.herokuapp.com/>
+
+Pull requests will trigger a deployment to the dev environment automatically after being merged.
+
+The ["Manual Deployment"](https://github.com/FHIR/genomics-operations/actions/workflows/manual_deployment.yml) workflow
+can be used to deploy code to either the `dev` or `prod` environments. To do so, please select "Run workflow", ignore
+the "Use workflow from" dropdown which lists the branches in the current repo (I can't disable / remove it) and then
+select the environment, the branch and the repository. By default, the `https://github.com/FHIR/genomics-operations`
+repo is specified, but you can replace it with any fork.
+
+Deployments to the prod environment can only be triggered manually from the `main` branch of the repo using the
+"Manual Deployment" workflow.
+
+### Heroku Stack
+
+Make sure that the Python version under [`runtime.txt`](./runtime.txt) is
+[supported](https://devcenter.heroku.com/articles/python-support#supported-runtimes) by the
+[Heroku stack](https://devcenter.heroku.com/articles/stack) that is currently running in each environment.
+
+### UTA Database
+
+The Biocommons [hgvs](https://github.com/biocommons/hgvs) library which is used for variant parsing, validation and
+normalisation requires access to a copy of the [UTA](https://github.com/biocommons/uta) Postgres database.
+
+We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
+dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).
+
+We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
+currently imported database schema.
+
+#### Database import procedure (it will take about 30 minutes):
+
+- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
+- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
+open a new window)
+- Go to the Settings tab
+- Click "View Credentials"
+- Use the fields from this window to fill in the variables below
+
+```shell
+$ POSTGRES_HOST="<Heroku Postgres Host>"
+$ POSTGRES_DATABASE="<Heroku Postgres Database>"
+$ POSTGRES_USER="<Heroku Postgres User>"
+$ export PGPASSWORD="<Heroku Postgres Password>" # Exported so that psql can read it from the environment
+$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
+$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
+```
+
+Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.
+
+Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
+it.
+
+#### Connection string
+
+The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
+It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
+same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.
+
+For local development, if you'd like to use this Postgres instance instead of the HGVS public one
+(`postgresql://anonymous:[email protected]/uta`), please add `UTA_DATABASE_URL` with the Heroku Postgres
+connection string in the `secrets.env` file.
+
+#### Testing the database
+
+```shell
+$ pgcli "${UTA_DATABASE_URL}"
+> set schema '<UTA Schema>'; -- Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
+> select count(*) from alembic_version
+    union all select count(*) from associated_accessions
+    union all select count(*) from exon
+    union all select count(*) from exon_aln
+    union all select count(*) from exon_set
+    union all select count(*) from gene
+    union all select count(*) from meta
+    union all select count(*) from origin
+    union all select count(*) from seq
+    union all select count(*) from seq_anno
+    union all select count(*) from transcript
+    union all select count(*) from translation_exception;
+```
+
+### RefSeq data
+
+The RefSeq metadata from the UTA database needs to be in sync with the RefSeq data which is available for the *Seqfetcher
+Utility* endpoint. Currently, this is stored in GitHub as release artifacts.
+
+To update the RefSeq data, you will have to install `seqrepo` locally and run `./utilities/pack_seqrepo_data.py`. Here
+is a step-by-step guide on how to do this:
+
+```shell
+$ mkdir seqrepo
+$ cd seqrepo
+$ python3 -m venv .venv
+$ . .venv/bin/activate
+$ pip install setuptools==75.7.0
+$ pip install biocommons.seqrepo==0.6.9
+$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug that's causing issues with the builtin
+$ # rsync on OSX.
+$ brew install rsync # macOS-specific. On Linux, rsync should be available from the distribution's package manager
+$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
+$ # If you get a "Permission denied" error, you can run the following command (using the temp directory which got created):
+$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
+$
+$ # cd to genomics-operations repo
+$ python ./utilities/pack_seqrepo_data.py --seqrepo_dir /path/to/seqrepo/dir/latest
+$ # Upload tar archives from ./tmp/ to a new GitHub release and then update `UTILITIES_DATA_VERSION` in the `.env` file
+$ # such that it contains the short SHA of the new release which contains the updated data.
+```
diff --git a/app/__init__.py b/app/__init__.py
@@ -3,6 +3,13 @@
 from flask_cors import CORS
 import os
 
+import hgvs
+# Disable the hgvs LRU cache to avoid blowing up memory
+# TODO: Revisit this, since this caching might not use a ton of memory.
+hgvs.global_config.lru_cache.maxsize = 0
+# Disable HGVS strict bounds checks as a workaround for liftover failures: https://github.com/biocommons/hgvs/issues/717
+hgvs.global_config.mapping.strict_bounds = False
+
 
 def create_app():
     # App and API

diff --git a/app/api_spec.yml b/app/api_spec.yml
@@ -1248,6 +1248,86 @@ paths:
             pattern: '^\s*[Nn][Cc]_\d{4,10}(\.)(\d{1,2}):\d{1,10}-\d{1,10}\s*$'
             example: "NC_000001.11:11794399-11794400"
 
+  /utilities/seqfetcher/1/sequence/{acc}:
+    get:
+      summary: "Seqfetcher"
+      operationId: "app.utilities_endpoints.seqfetcher"
+      tags:
+        - "Operations Utilities (not part of balloted HL7 Operations)"
+      responses:
+        "200":
+          description: "Returns RefSeq subsequence"
+          content:
+            text/plain:
+              schema:
+                type: string
+      parameters:
+        - name: acc
+          in: path
+          required: true
+          description: Accession
+          schema:
+            type: string
+            example: "NC_000001.10"
+        - name: start
+          in: query
+          required: true
+          description: Subsequence start index
+          schema:
+            type: integer
+            example: 1
+        - name: end
+          in: query
+          required: true
+          description: Subsequence end index
+          schema:
+            type: integer
+            example: 2
+
+  /utilities/normalize-variant:
+    get:
+      summary: "Normalize Variant"
+      operationId: "app.utilities_endpoints.normalize_variant"
+      tags:
+        - "Operations Utilities (not part of balloted HL7 Operations)"
+      responses:
+        "200":
+          description: "Returns a normalized variant in both GRCh37 and GRCh38."
+          content:
+            application/json:
+              schema:
+                type: object
+      parameters:
+        - name: variant
+          in: query
+          required: true
+          description: "Variant."
+          schema:
+            type: string
+            example: "NM_021960.4:c.740C>T"
+
+  /utilities/normalize-hla:
+    get:
+      summary: "Normalize HLA"
+      operationId: "app.utilities_endpoints.normalize_hla"
+      tags:
+        - "Operations Utilities (not part of balloted HL7 Operations)"
+      responses:
+        "200":
+          description: "Returns a normalized HLA ARD allele."
+          content:
+            application/json:
+              schema:
+                type: object
+      parameters:
+        - name: allele
+          in: query
+          required: true
+          description: "Allele."
+          schema:
+            type: string
+            example: "B14"
+
 tags:
   - name: Subject Genotype Operations
   - name: Subject Phenotype Operations
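
For reviewers who want to try the new utility endpoints, here is a minimal sketch of how request URLs could be built from the paths and parameters declared in `app/api_spec.yml` above. The helper names `seqfetcher_url` and `normalize_variant_url` are hypothetical and not part of this PR, and the sketch assumes the API is mounted at the root of the base URL.

```python
from urllib.parse import quote, urlencode


def seqfetcher_url(base_url: str, acc: str, start: int, end: int) -> str:
    """Build a URL for the Seqfetcher endpoint (hypothetical helper).

    Path and query parameter names are taken from the api_spec.yml diff;
    the assumption is that the API is served from the root of base_url.
    """
    path = f"/utilities/seqfetcher/1/sequence/{quote(acc, safe='')}"
    query = urlencode({"start": start, "end": end})
    return f"{base_url.rstrip('/')}{path}?{query}"


def normalize_variant_url(base_url: str, variant: str) -> str:
    """Build a URL for the Normalize Variant endpoint (hypothetical helper)."""
    query = urlencode({"variant": variant})
    return f"{base_url.rstrip('/')}/utilities/normalize-variant?{query}"


if __name__ == "__main__":
    # Dev environment URL as listed in the README section of this diff.
    base = "https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/"
    print(seqfetcher_url(base, "NC_000001.10", 1, 2))
    print(normalize_variant_url(base, "NM_021960.4:c.740C>T"))
```

The resulting URLs can be passed to any HTTP client. URL-encoding the `variant` value matters because HGVS expressions contain characters such as `:` and `>`.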