February 2022 Update:
The Open Targets COVID-19 Target Prioritisation Tool has been deprecated and archived. For more information, please see opentargets/platform#1964 and this Community thread.
Centralise publicly available datasets in order to build a Virus – Host Target Knowledgebase for Drug Target Selection.
A full description of the project vision is here. Some of the questions the project is trying to answer are here.
This project is now closed to contributors.
- Potential data sources are fetched based on URLs provided in the `Makefile`. If such a URL is not available, the data tables can be added directly to the `/data` folder.
- Relevant pieces of information are extracted from the raw source data as part of a parsing step.
- Data might be further processed if necessary, e.g. mapping cross-references, integration with other sources etc.
- Pre-processed tables are then picked up by the integrator script(s) and compiled into presentable tables.
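
The parsing step described above can be sketched roughly as follows. This is a minimal illustration, not one of the project's parsers: the function name `parse_raw_table`, the raw column names and the sample values are all hypothetical, and it assumes the raw source is a simple delimited text table.

```python
import csv
import io


def parse_raw_table(raw_text, columns, delimiter="\t"):
    """Keep only the requested columns from a delimited raw table.

    `columns` maps raw column names to the names used in the parsed table.
    Returns the parsed table as TSV text.
    """
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns.values())  # header of the parsed table
    for row in reader:
        writer.writerow(row[raw_name] for raw_name in columns)
    return out.getvalue()


# Hypothetical raw data: column names and values are made up for illustration.
raw = "Entry,Gene names,Length\nACC1,S,100\nACC2,N,200\n"
parsed = parse_raw_table(raw, {"Entry": "id", "Gene names": "gene_name"}, delimiter=",")
print(parsed)
```

Renaming the identifier column to `id` at this stage is a design choice of the sketch: it means the table already has the shape the integrator expects.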
The pipeline integrates information about human and viral targets from the following sources:
Name | Source | Description |
---|---|---|
Human gene information | Ensembl | Information about human genes |
COVID-19 UniProtKB | UniProt | UniProt site with information about SARS and SARS-CoV-2 proteins |
SARS-CoV-2 Complexes | IntAct | Information about SARS-CoV-2 protein complexes |
Human - virus interactome | IntAct | Human - SARS-CoV-2 interactome from Gordon et al. 2020 plus interactions of human proteins with proteins of other viruses based on IntAct data |
Human interactome | IntAct | Human protein-protein interactions from IntAct database |
Baseline expression per anatomical systems | Open Targets | Baseline gene expression per anatomical systems provided by Expression Atlas group used in the Open Targets Platform |
Baseline expression distribution and specificity | Human Protein Atlas | Information about subcellular location of proteins, tissue distribution and tissue specificity as provided by HPA |
Protein expression during SARS-CoV-2 infection | Bojkova et al. 2020 | Information about proteins whose abundance is regulated during viral infection from Bojkova et al. 2020 paper |
Target Tractability | Open Targets | Target tractability assessment for small molecules, antibodies and other modalities provided by ChEMBL and used in the Open Targets Platform |
Target Safety | Open Targets | Manually curated target safety data used in the Open Targets Platform |
Target Drugs | Open Targets | Information about drugs extracted from the ChEMBL evidence file used in the Open Targets platform |
Drugs in COVID-19 clinical trials | ChEMBL | Drugs in clinical trials against COVID-19 |
Active compounds in COVID-19 in vitro assays | ChEMBL | Compounds shown to be active in COVID-19 in vitro assays provided by ChEMBL |
Mendelian randomization | Open Targets |
Other files not listed in the table are also used for supporting purposes such as gene id mappings.
The following programs have to be installed and available in order to run the pipeline:
- Python 3.7
- Pipenv: v2018.11.26 or newer is recommended. If using a package manager, check the installed version, since v11.9.0, which is available in Ubuntu, does not work.
- jq
- R 4.0.0
- pandoc 1.12.3 or higher
The pipeline has been run successfully with those dependencies on macOS (Catalina) and Ubuntu (20.04 LTS).
git clone https://github.com/opentargets/ot_covid19
cd ot_covid19
make all
`make all` downloads all data, builds the Python environment, and runs the parsers and the integrator script(s).
- `make setup-environment` - builds the Python environment
- `make downloads` - downloads files only
- `make parsers` - runs parsers only
- `make integrator` - runs integrator(s) only
- `make clean-all` - removes temporary files
Data folders:
- `/data` - containing version controlled data files that cannot be directly accessed from the web
- `/temp` - created by `make`, will be populated when run locally. Data under this folder is not versioned
- `/temp/raw_files` - raw data files fetched from the web
- `/temp/parsed_tables` - parsed tables
- `/temp/preformated_tables` - pre-processed data ready to be integrated
Script folders:
- `/src/parsers` - contains parser scripts
- `/src/query` - contains SQL queries
- `/src/integrators` - contains integrator scripts
For the integration step, the idea is to keep logic out of the integrator scripts altogether. All data processing should happen earlier, in the parsing and pre-processing steps. Integrator scripts take tables whose columns are ready to be added to the final tables. This leads to a few requirements:
- Tables are read from the `/temp/preformated_tables` directory only
- The tables must be in an uncompressed tsv format
- Tables must have a column called `id` containing unique identifiers (e.g. Ensembl gene id or UniProt primary accession)
- Tables to be integrated must be added to the integration config files describing how the integration should happen
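
A quick check of the `id` requirement could look like the sketch below. It is not part of the pipeline: the function name `check_preformatted_table` and the sample identifiers are hypothetical, and it assumes the table has already been read as plain (uncompressed) text.

```python
import csv
import io


def check_preformatted_table(tsv_text):
    """Return a list of problems found in a pre-formatted table (empty if none)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if "id" not in (reader.fieldnames or []):
        return ["missing 'id' column"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        identifier = row["id"]
        if not identifier:
            problems.append(f"line {line_no}: empty id")
        elif identifier in seen:
            problems.append(f"line {line_no}: duplicate id {identifier}")
        seen.add(identifier)
    return problems


# Made-up identifiers, for illustration only.
good = "id\tmax_phase\nG1\t4\nG2\t2\n"
bad = "id\tmax_phase\nG1\t4\nG1\t2\n"
print(check_preformatted_table(good))
print(check_preformatted_table(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a table at once.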
This recipe shows how to integrate a dataset where we are expecting new targets (e.g. viral proteins) that are not included in the complete human gene set:
"uniprot_covid19_parsed.tsv": {
"columns": [],
"flag": true,
"flag_label": "COVID-19 UniprotKB",
"how": "outer",
"columns_to_map": {
"taxon_id": "taxon_id",
"uniprot_ids": "uniprot_accessions"
}
}
Where:
- `columns` - contains the list of columns to be added. It can be empty if only the flag is used.
- `flag` - boolean, indicating if the table is used as a flag, e.g. marking genes that are in the COVID-19 UniProt dataset
- `flag_label` - title of the flag column
- `how` - describes how the join should happen. By default it is `left`. Use `outer` if you expect new targets in the integrated dataset
- `columns_to_map` - column mapping to populate existing fields for the new targets.
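
To make these options concrete, here is a minimal sketch of how a config entry might be applied. It is not the project's integrator: the function `integrate`, the dict-of-rows representation and the sample identifiers are hypothetical. It also assumes that each `columns_to_map` key names the existing field in the integrated table and each value names the source column in the new table; the actual direction may differ.

```python
def integrate(base, new_table, config):
    """Join `new_table` onto `base`; both are dicts of rows keyed by `id`."""
    how = config.get("how", "left")
    mapping = config.get("columns_to_map", {})
    result = {row_id: dict(row) for row_id, row in base.items()}
    for row_id, new_row in new_table.items():
        if row_id not in result:
            if how != "outer":
                continue  # left join: skip targets missing from the base table
            # New target: populate existing fields from the mapped columns
            # (assumed direction: existing field <- source column).
            result[row_id] = {field: new_row.get(source) for field, source in mapping.items()}
        for column in config.get("columns", []):
            result[row_id][column] = new_row.get(column)
    if config.get("flag"):
        label = config["flag_label"]
        for row_id, row in result.items():
            row[label] = row_id in new_table  # mark targets present in the new table
    return result


# Made-up rows, for illustration only.
base = {
    "G1": {"taxon_id": "9606", "uniprot_ids": "U1"},
    "G2": {"taxon_id": "9606", "uniprot_ids": "U2"},
}
new_table = {
    "G1": {"taxon_id": "9606", "uniprot_accessions": "U1"},
    "V1": {"taxon_id": "2697049", "uniprot_accessions": "U9"},
}
config = {
    "columns": [],
    "flag": True,
    "flag_label": "COVID-19 UniprotKB",
    "how": "outer",
    "columns_to_map": {"taxon_id": "taxon_id", "uniprot_ids": "uniprot_accessions"},
}
result = integrate(base, new_table, config)
print(result["V1"])
```

With `how` omitted or set to `left`, the row for `V1` would be dropped and only the flag column would be added to the existing targets.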
This recipe shows how to integrate a dataset where we are not expecting new targets, and two new columns are added to the final table:
"ot_drugs_processed.tsv": {
"columns": ["drug_names", "max_phase"],
"flag": false
}
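
Because this recipe omits `how`, `flag_label` and `columns_to_map`, an integrator has to fall back on defaults for them. The sketch below shows one way that could be done. The function `normalise_config`, the `DEFAULTS` values other than `how` defaulting to `left`, and the validation rules are assumptions made for illustration, not the project's documented behaviour. The snippet above is wrapped in braces here to make it a complete JSON document.

```python
import copy
import json

# Assumed defaults; only "how": "left" is stated in the text above.
DEFAULTS = {"columns": [], "flag": False, "flag_label": None, "how": "left", "columns_to_map": {}}


def normalise_config(config_text):
    """Parse an integration config and fill in defaults for omitted keys."""
    config = json.loads(config_text)
    normalised = {}
    for table_name, entry in config.items():
        merged = copy.deepcopy(DEFAULTS)
        merged.update(entry)
        if merged["how"] not in ("left", "outer"):
            raise ValueError(f"{table_name}: unsupported join type {merged['how']!r}")
        if merged["flag"] and not merged["flag_label"]:
            raise ValueError(f"{table_name}: flag is set but flag_label is missing")
        if not merged["flag"] and not merged["columns"]:
            raise ValueError(f"{table_name}: nothing to integrate")
        normalised[table_name] = merged
    return normalised


config_text = '{"ot_drugs_processed.tsv": {"columns": ["drug_names", "max_phase"], "flag": false}}'
entry = normalise_config(config_text)["ot_drugs_processed.tsv"]
print(entry["how"], entry["columns"])
```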