OMIM stands for "Online Mendelian Inheritance in Man", and is an online catalog of human genes and genetic disorders. The official site is: https://omim.org/
The purpose of this repository is to transform OMIM data for ingest into Mondo. Mainly, it generates omim.ttl and other release artefacts.
Disclaimer: This repository and its created data artefacts are unofficial. For official, up-to-date OMIM data, please visit omim.org.
- Run:
cp .env.example .env
- Change the value of API_KEY to your own. If you don't have one, you can request one at https://omim.org/downloads. This will probably be sufficient for downloading the necessary text files, but if not, you can also request access to the REST API: https://omim.org/api.
- RealPython blog install guide: My preferred guide for installing on Windows or Mac
- Python documentation for installing on Windows
- Python documentation for installing on Mac
- Run:
make install
- There is a known possible issue with the dependency psutil on some systems. If you get an error related to it when installing, ignore it, as it does not seem to be needed to run any of the tools. If, however, you do get a psutil error when running anything, please let us know by creating an issue.
Run: sh run.sh make all
Running this will create new release artefacts in the root directory.
You can also run make build or python -m omim2obo. These are all the same command. This will download files from omim.org and run the build.
Offline/cache option: python -m omim2obo --use-cache
If there's an issue downloading the files, or you are offline, or you just want to use the cache anyway, you can pass the --use-cache flag.
Details
Command: sh run.sh make get-pmids
Currently, the only feature is get_codes_by_yyyy_mm, which returns a list of OMIM codes and their prefixes from https://omim.org/statistics/update.
make scrape y=<YEAR> m=<MONTH>
make scrape y=<YEAR> m=<MONTH> > <path/to/outputFile>
- Get codes for May 2021, printed to terminal:
make scrape y=2021 m=5
- Get codes for May 2021 and output to a file "myfile.txt":
make scrape y=2021 m=5 > myfile.txt
Command:
make scrape y=2021 m=5
Response:
[('#', '619340'),
('#', '619355'),
('*', '619357'),
('*', '619358'),
('*', '619359'),
('#', '619325'),
('#', '619328'),
('*', '100850'),
...
('#', '613102')]
Calling get_codes_by_yyyy_mm() returns a list of tuples.
from omim2obo.omim_code_scraper import get_codes_by_yyyy_mm
code_tuples = get_codes_by_yyyy_mm('2021/05')
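Once the tuples are in hand, a common next step is to bucket the MIM numbers by their prefix symbol. The following is a minimal sketch of that, not part of omim2obo: `group_codes_by_prefix` is a hypothetical helper, and the sample list is copied from the example response above so the snippet runs without network access.

```python
from collections import defaultdict

# Sample data taken from the example response above. In real use this list
# would come from get_codes_by_yyyy_mm('2021/05').
code_tuples = [
    ('#', '619340'),
    ('#', '619355'),
    ('*', '619357'),
    ('*', '619358'),
    ('#', '619325'),
]


def group_codes_by_prefix(tuples):
    """Map each OMIM prefix symbol to the MIM numbers that carry it.

    Hypothetical helper for illustration; it is not provided by omim2obo.
    """
    grouped = defaultdict(list)
    for prefix, code in tuples:
        grouped[prefix].append(code)
    return dict(grouped)


grouped = group_codes_by_prefix(code_tuples)
print(grouped)
```

Order within each bucket follows the order of the input list.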
- omim.ttl: OMIM ontologized
- omim.sssom.tsv: SSSOM mapping file
- mondo-omim-genes.robot.tsv: ROBOT template for adding OMIM genes to Mondo
- review.tsv: Special cases to consider for manual review
Notice: These are generated from the latest downloadable data files from omim.org, which are updated daily, rather than from what is seen on the omim.org/entry/MIM# pages. Note that the data files and the entry pages aren't always in sync, and that one or the other may be slightly more up to date for a period of time.
Columns:
- classCode: integer
- classLabel: string
- value: any: Some form of data to review
- comment: string (optional)
This review case involves what would otherwise be considered a valid, disease-defining disease-gene (D2G) relationship, except that it unusually includes 'digenic' in the label even though it has only 1 association. OMIM doesn't guarantee the data quality of its disease-gene associations marked 'digenic', so for any of these entries, one of two things could be the case: (a) it is not 'digenic', in which case OMIM should remove that from the label, and Mondo can either make an explicit exception to add the relationship or wait until OMIM fixes the issue, at which point it will be added automatically; or (b) it is in fact 'digenic', and OMIM should add the missing 2nd gene association.
The unique characteristics of cases of this class are as follows:
- Each case has 2 rows in morbidmap.txt, which together form a pattern.
  - Row 1: One row is a typical, valid, disease-defining entry. For the given phenotype MIM in that row, there are no other rows in morbidmap.txt where it appears as a phenotype having an association with another gene.
    - In all such cases seen thus far as of 2024/11/18, all of these are cancer cases, and the label ends with "somatic".
    - This entry appears in the Phenotype-Gene Relationships table on the MIM's omim.org/entry page.
  - Row 2: There is a second row where the phenotype in the first row appears as a gene.
    - For this row, there is no MIM in the phenotype field.
    - This row does not appear in the Gene-Phenotype Relationships table on the MIM's omim.org/entry page.
    - This row is self-referential. The label in the Phenotype field is one of the titles of the MIM in the Gene field.
Example case:
Phenotype | Gene/Locus And Other Related Symbols | MIM Number | Cyto Location |
---|---|---|---|
Small cell cancer of the lung, somatic, 182280 (3) | RB1 | 614041 | 13q14.2 |
Small-cell cancer of lung (2) | SCLC1 | 182280 | 3p23-p21 |
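The two-row pattern can be checked mechanically. The sketch below is an illustration only and is not the pipeline's actual implementation: `PHENOTYPE_RE`, `parse_phenotype`, and `find_self_referential` are hypothetical names, and the regular expression assumes the Phenotype field always has the shape "label, MIM (key)" or "label (key)". It checks only part of the pattern described above (a row with no phenotype MIM whose gene MIM is the phenotype MIM of exactly one other row) and does not verify that the label is one of the MIM's titles.

```python
import re
from collections import Counter

# Assumed shape of the Phenotype field: "<label>, <6-digit MIM> (<key>)",
# or "<label> (<key>)" when the row carries no phenotype MIM.
PHENOTYPE_RE = re.compile(r'^(?P<label>.*?)(?:, (?P<mim>\d{6}))? \((?P<key>\d)\)$')


def parse_phenotype(field):
    """Split a Phenotype field into (label, phenotype MIM or None, mapping key)."""
    match = PHENOTYPE_RE.match(field.strip())
    if match is None:
        raise ValueError(f'Unrecognized Phenotype field: {field!r}')
    return match['label'], match['mim'], int(match['key'])


def find_self_referential(rows):
    """Return gene MIMs that look like "row 2" of a self-referential case.

    rows: iterable of (phenotype_field, gene_mim) pairs.
    """
    parsed = [(parse_phenotype(field), gene_mim) for field, gene_mim in rows]
    # How many rows name each MIM as a phenotype.
    phenotype_counts = Counter(mim for (_, mim, _), _ in parsed if mim is not None)
    return sorted(
        gene_mim
        for (_, mim, _), gene_mim in parsed
        if mim is None and phenotype_counts[gene_mim] == 1
    )


# The two rows from the example case above.
rows = [
    ('Small cell cancer of the lung, somatic, 182280 (3)', '614041'),
    ('Small-cell cancer of lung (2)', '182280'),
]
flagged = find_self_referential(rows)
print(flagged)
```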
All known cases:
There is a spreadsheet which collates all known cases as of 2024/11/18: google sheet. The MIMs of the known cases are: 159595, 182280, 607107, and 615830.
Additional notes:
Note that unlike the other cases, a single case of "D2G: self-referential" spans multiple rows in review.tsv. The cases are enumerated in the TSV, with individual cases identifiable via a leading integer in the value column, e.g. "1: " for the first case, "2: " for the second, and so on.
Also, see note in section "3. D2G: somatic" about intersection between these two cases.
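Because one self-referential case spans several rows, a reviewer may want to regroup the rows by their leading case number. The sketch below shows one way to do that under stated assumptions: `group_rows_by_case` is a hypothetical helper, and the sample TSV content (including the classCode values and the text in the value column) is invented for illustration. Only the column names and the "N: " prefix convention come from the description above.

```python
import csv
import io
import re
from collections import defaultdict

# Invented excerpt of review.tsv. The classCode values and the value text
# are placeholders, not real output of the pipeline.
SAMPLE_TSV = (
    "classCode\tclassLabel\tvalue\tcomment\n"
    "2\tD2G: self-referential\t1: first row of case one\t\n"
    "2\tD2G: self-referential\t1: second row of case one\t\n"
    "2\tD2G: self-referential\t2: first row of case two\t\n"
    "3\tD2G: somatic\tunrelated value\t\n"
)

# A leading integer, a colon, and a space, followed by the rest of the value.
CASE_PREFIX_RE = re.compile(r'^(\d+): (.*)$')


def group_rows_by_case(tsv_text, class_label='D2G: self-referential'):
    """Group the value column by leading case number for one review class."""
    cases = defaultdict(list)
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE
    )
    for row in reader:
        if row['classLabel'] != class_label:
            continue
        match = CASE_PREFIX_RE.match(row['value'])
        if match:
            cases[int(match.group(1))].append(match.group(2))
    return dict(cases)


cases = group_rows_by_case(SAMPLE_TSV)
print(cases)
```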
Happens when all conditions were met for this association to be considered disease-defining, but the mutation is a somatic cell mutation, rather than a germline mutation. This is indicated by the appearance of the word 'somatic' in the label of the phenotype MIM in the association. These cases should be reviewed because currently any association meeting the criteria to be considered disease-defining is also considered a germline mutation and the association is represented in omim.owl using the property 'is causal germline mutation in' (RO:0004013).
Note that there is an intersection between this case and case 2, "D2G: self-referential". Sometimes the somatic cases will also be self-referential, but not always. However, all cases of "D2G: self-referential" have historically included a row where the phenotype includes the word 'somatic'.
Happens when all conditions were met for this association to be considered disease-defining. However, the phenotype in the association unexpectedly has the type of "gene" rather than "phenotype". This is unexpected and considered a data quality issue on the OMIM side. As of 2024/10, we flagged this to the OMIM team and they corrected all such cases.
Happens when all conditions were met for this association to be considered disease-defining. However, the phenotype in the association has an unexpected type of either 'OBSOLETE', 'SUSPECTED', or 'HAS_AFFECTED_FEATURE'. As of 2024/12, we have not seen such cases appear, but we have set this review case up to watch for them should they occur.
This pipeline involves the processing of morbidmap.txt to create ontological representations of Gene --> Disease and Disease --> Gene associations.
Phenotype | Gene/Locus And Other Related Symbols | MIM Number | Cyto Location |
---|---|---|---|
Prune belly syndrome, 100100 (3) | CHRM3, PBS, EGBRS | 118494 | 1q43 |
OMIM:100100 (Prune belly syndrome) is the Phenotype ("Disease"), and OMIM:118494 (CHRM3) is the associated Gene. They are related via mapping key (3) (explained below).
OMIM:100100 a owl:Class ;
    rdfs:label "prune belly syndrome" ;
    rdfs:subClassOf _:N2fd22c9bb2f04630b81414cff9514660 ;
    biolink:category biolink:Disease .
_:N2fd22c9bb2f04630b81414cff9514660 a owl:Restriction ;
    owl:onProperty RO:0004003 ;
    owl:someValuesFrom OMIM:118494 .
The association is represented as an rdfs:subClassOf owl:Restriction, where mapping key (3) is represented as RO:0004003.
In order to add these associations to an OWL ontology, we must use an appropriate predicate. Below are the 4 OMIM morbidmap.txt mapping keys and their definitions, alongside the RO predicates we've chosen to represent them.
Note that the directionality of these associations / predicates is in the Gene->Disease direction: (Gene MIM) --(Mapping key / RO predicate)--> (Disease MIM)
1: The disorder is placed on the map based on its association with a gene, but the underlying defect is not known.
Not ontologized. These types are ignored due to the uncertainty of the nature of the association.
2: The disorder has been placed on the map by linkage or other statistical method; no mutation has been found.
RO:0003303 (causes condition):
A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a
condition (a phenotype or disease), where the entity has some causal role for the condition.
3: The molecular basis for the disorder is known; a mutation has been found in the gene.
RO:0004013 (is causal germline mutation in):
Relates a gene to condition, such that a mutation in this gene is sufficient to produce the condition and that can be
passed on to offspring[modified from orphanet].
Note: For these "mapping key (3)" cases, there also exists an inverse predicate which we ontologize in the inverse direction: (Disease MIM) --(Mapping key 3 / RO:0004003)--> (Gene MIM): RO:0004003 (has material basis in germline mutation in)
4: A contiguous gene deletion or duplication syndrome, multiple genes are deleted or duplicated causing the phenotype.
RO:0003304 (contributes to condition):
A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a
condition (a phenotype or disease), where the entity has some contributing role that influences the condition.
Important caveat: Singular vs multiple associations
The RO predicates above are only used if there is only 1 gene associated with a given disease, i.e. in morbidmap.txt there is only 1 row where the MIM appears in the Phenotype field.
In cases where there is >1 association, the following RO predicate is used instead, regardless of whether the mapping key is (2), (3), or (4): RO:0003302 (causes or contributes to condition): A relationship between an entity (e.g. a genotype, genetic variation, chemical, or environmental exposure) and a condition (a phenotype or disease), where the entity has some causal or contributing role that influences the condition.
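The mapping-key rules, including the single-versus-multiple caveat, can be summarized as a small lookup. This is a sketch of the rules as described in this section, not the pipeline's actual code: `gene_to_disease_predicate` and `n_associations` are hypothetical names, and it assumes `n_associations` is the number of morbidmap.txt rows in which the phenotype MIM appears in the Phenotype field.

```python
# Predicates used when a disease has exactly 1 associated gene.
SINGLE_ASSOCIATION_PREDICATES = {
    2: 'RO:0003303',  # causes condition
    3: 'RO:0004013',  # is causal germline mutation in
    4: 'RO:0003304',  # contributes to condition
}
# Predicate used when a disease has more than 1 associated gene.
MULTIPLE_ASSOCIATION_PREDICATE = 'RO:0003302'  # causes or contributes to condition


def gene_to_disease_predicate(mapping_key, n_associations):
    """Return the Gene->Disease RO predicate, or None if not ontologized.

    Hypothetical helper. n_associations is assumed to be the number of rows
    in which the phenotype MIM appears in the Phenotype field.
    """
    if mapping_key == 1:
        # Mapping key (1) is ignored: the nature of the association is uncertain.
        return None
    if mapping_key not in SINGLE_ASSOCIATION_PREDICATES:
        raise ValueError(f'Unknown mapping key: {mapping_key}')
    if n_associations > 1:
        return MULTIPLE_ASSOCIATION_PREDICATE
    return SINGLE_ASSOCIATION_PREDICATES[mapping_key]


print(gene_to_disease_predicate(3, 1))
print(gene_to_disease_predicate(3, 2))
```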
Of the above 3 Gene->Disease association predicates (those with mapping keys (2), (3), and (4)), the one which we consider "disease defining" is (3) (RO:0004013). For these cases, as mentioned above, we also declare an association in the Disease->Gene direction, RO:0004003. However, we only declare these associations if several other conditions are also met. The Phenotype must (i) not be marked as a non-disease (represented by the label being wrapped in []), (ii) not be marked as a mutation that contributes to susceptibility to multifactorial disorders (e.g., diabetes, asthma) or to susceptibility to infection (e.g., malaria) (represented by the label being wrapped in {}), and (iii) not be marked provisional (represented by the label beginning with ?). These 3 special markers are further explained in the OMIM FAQ. Additionally, as mentioned above, we only declare the association in omim.ttl if there is 1 and only 1 association shown in morbidmap.txt.
So, all of the conditions together are:
- Mapping key is (3)
- Only 1 association
- Phenotype not marked as non-disease ([])
- Phenotype not marked as susceptibility to multifactorial disorders or infection ({})
- Phenotype not marked provisional (?)
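Taken together, the conditions listed above amount to a single boolean check. The following is a minimal sketch under stated assumptions, not the pipeline's implementation: `is_disease_defining` is a hypothetical helper, the label is assumed to be the phenotype label with the MIM and mapping key already stripped, and the bracket checks only look at the opening character (after any leading ?). Apart from "Prune belly syndrome", the labels in the usage lines are invented placeholders. The review cases described earlier (for example 'somatic' or 'digenic' labels) are not handled here.

```python
def is_disease_defining(label, mapping_key, n_associations):
    """Check the five conditions for declaring a disease-defining association.

    Hypothetical helper. label is assumed to be the phenotype label without
    its trailing MIM number and mapping key.
    """
    label = label.strip()
    provisional = label.startswith('?')
    # Look past any leading '?' so the bracket markers are still detected.
    core = label.lstrip('?')
    non_disease = core.startswith('[')
    susceptibility = core.startswith('{')
    return (
        mapping_key == 3
        and n_associations == 1
        and not (provisional or non_disease or susceptibility)
    )


print(is_disease_defining('Prune belly syndrome', 3, 1))
print(is_disease_defining('[Placeholder non-disease trait]', 3, 1))
print(is_disease_defining('{Placeholder susceptibility}', 3, 1))
print(is_disease_defining('?Placeholder provisional disorder', 3, 1))
```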