Using A Custom Knowledge Base Example
This example shows how to add a custom knowledge base that is not in TTL format, so that the provided script for extracting custom mappings (scripts/extract_custom_mappings.py) cannot be used out of the box.
Please follow the steps in Using A Custom Knowledge Base, but replace step 2 (running the scripts/extract_custom_mappings.py script) with the steps described here.
In this example, we use the CTD Diseases Knowledge Base. The knowledge base can be downloaded as a gzip-compressed TSV file from https://ctdbase.org/reports/CTD_diseases.tsv.gz and then decompressed:
wget https://ctdbase.org/reports/CTD_diseases.tsv.gz
gzip -d CTD_diseases.tsv.gz
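Before writing the extraction script, it can help to check how many tab-separated columns the data rows actually have, since the script below unpacks exactly nine columns per row. The following is a minimal sketch of such a check; the helper name `peek_rows` is hypothetical and not part of ELEVANT.

```python
from itertools import islice


def peek_rows(lines, n=3):
    """Return the first n non-comment rows, each split into its tab-separated columns.

    `lines` can be any iterable of strings, e.g. an open file object.
    Lines starting with '#' are treated as comments and skipped.
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    return [line.rstrip("\n").split("\t") for line in islice(data_lines, n)]


# Example usage on the decompressed file (assumes it is in the current directory):
#
#     with open("CTD_diseases.tsv", "r", encoding="utf8") as file:
#         for row in peek_rows(file):
#             print(len(row), row[:2])
```

If the printed column count differs from nine, the unpacking in the extraction script needs to be adjusted accordingly.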
Now we need to extract the custom mappings from the TSV file. Let's put the following in a script called extract_custom_mappings_from_tsv.py:
import argparse
import log
import sys

from src import settings


def main(args):
    logger.info(f"Creating mappings from {args.input_file} ...")

    entity_to_name = {}
    entity_to_types = {}
    all_entity_types = {}
    with open(args.input_file, "r", encoding="utf8") as file:
        for line in file:
            if line.startswith("#"):
                # Ignore comment lines starting with '#'
                continue
            # Extract relevant entity information from the TSV line
            entity_name, entity_id, _, _, types, _, _, _, _ = line.strip("\n").split("\t")
            entity_to_name[entity_id] = entity_name
            entity_to_types[entity_id] = types.split("|")
            # Collect in a dictionary all distinct entity types
            for typ in types.split("|"):
                if typ not in all_entity_types:
                    # Initialize the entity type name with "OTHER". This will
                    # be replaced later by the corresponding entity name if it
                    # exists in the KB.
                    all_entity_types[typ] = "OTHER"

    # Add the entity type name as value to each entity type ID in the type dictionary
    for t in all_entity_types.keys():
        if t in entity_to_name:
            all_entity_types[t] = entity_to_name[t]

    # Write the entity ID to entity name dictionary to file
    with open(settings.CUSTOM_ENTITY_TO_NAME_FILE, "w", encoding="utf8") as file:
        for entity_id, name in entity_to_name.items():
            file.write(f"{entity_id}\t{name}\n")

    # Write the entity ID to entity type IDs dictionary to file.
    with open(settings.CUSTOM_ENTITY_TO_TYPES_FILE, "w", encoding="utf8") as file:
        for entity_id, types in entity_to_types.items():
            file.write(f"{entity_id}")
            for typ in types:
                file.write(f"\t{typ}")
            file.write("\n")

    # Write distinct entity types with their names to file
    with open(settings.CUSTOM_WHITELIST_TYPES_FILE, "w", encoding="utf8") as file:
        for type_id, name in all_entity_types.items():
            file.write(f"{type_id}\t{name}\n")

    logger.info(f"Wrote {len(entity_to_name)} entity to name mappings to {settings.CUSTOM_ENTITY_TO_NAME_FILE} "
                f"and {len(entity_to_types)} entity to types mappings to {settings.CUSTOM_ENTITY_TO_TYPES_FILE} "
                f"and {len(all_entity_types)} entity types to {settings.CUSTOM_WHITELIST_TYPES_FILE}")


if __name__ == "__main__":
    # Create the argument parser before registering arguments on it
    parser = argparse.ArgumentParser()
    parser.add_argument("input_file", type=str,
                        help="Input KB file that contains the custom knowledge base or ontology.")

    logger = log.setup_logger(sys.argv[0])
    logger.debug(' '.join(sys.argv))

    main(parser.parse_args())
The script can be called with:
python3 extract_custom_mappings_from_tsv.py CTD_diseases.tsv
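To sanity-check the result, the written mapping files can be read back into dictionaries. The sketch below assumes only the line formats produced by the script above (entity ID and name separated by a tab; entity ID followed by tab-separated type IDs). The function names are hypothetical and not part of ELEVANT.

```python
def parse_entity_to_name(lines):
    """Parse lines of the form '<entity_id>\\t<name>' into a dictionary."""
    mapping = {}
    for line in lines:
        # Split only on the first tab so that the name is kept intact
        entity_id, name = line.rstrip("\n").split("\t", 1)
        mapping[entity_id] = name
    return mapping


def parse_entity_to_types(lines):
    """Parse lines of the form '<entity_id>\\t<type_id>\\t<type_id>...' into a dictionary."""
    mapping = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        # The first column is the entity ID, all remaining columns are type IDs
        mapping[parts[0]] = parts[1:]
    return mapping


# Example usage (the file paths are whatever the settings module points to):
#
#     with open(path_to_entity_to_name_file, "r", encoding="utf8") as file:
#         entity_to_name = parse_entity_to_name(file)
#     print(len(entity_to_name))
```

The number of parsed entries should match the counts reported in the script's final log message.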
Performing the steps above is essentially enough to use the custom KB in ELEVANT. However, with this script, all entity types are
considered whitelist types and will be displayed in the web app as selectable categories. For the CTD Diseases
KB, this would be 2943 entity types, far too many to display in the web app, so we should reduce
the number of whitelist types. An easy choice is to select the second-level types of the type hierarchy. That is, when
viewing the entities in the KB as nodes and adding a directed edge from each entity to each of its types, we first
determine the root type as the node with incoming but no outgoing edges. This is typically a type like
"entity" or, in the case of the CTD Diseases KB, "disease". The whitelist types are then the nodes that are direct
neighbors of this root type, i.e. the entities that have the root type as one of their types.
Write the selected whitelist types to the file <data_directory>/custom_mappings/whitelist_types.tsv.
Each line should contain the type URI and the type name, separated by a tab. Finally, copy the whitelist_types.tsv file to the small-data-files directory.
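The second-level selection described above can be sketched as follows. This is only an illustration under the assumption that the entity-to-types mapping has the same shape as `entity_to_types` in the extraction script (entity ID mapped to a list of type IDs, where an entity without types may have an empty string as its only type). The function names are hypothetical and not part of ELEVANT.

```python
def find_root_types(entity_to_types):
    """Return the type IDs with incoming but no outgoing edges.

    A type has an incoming edge if some entity lists it as a type, and no
    outgoing edge if it has no (non-empty) types itself. Types that do not
    appear as entities in the KB are also treated as roots by this sketch.
    """
    all_types = {t for types in entity_to_types.values() for t in types if t}
    roots = set()
    for type_id in all_types:
        own_types = [t for t in entity_to_types.get(type_id, []) if t]
        if not own_types:
            roots.add(type_id)
    return roots


def second_level_types(entity_to_types):
    """Return the entities that have a root type as one of their types."""
    roots = find_root_types(entity_to_types)
    return {entity_id for entity_id, types in entity_to_types.items()
            if any(t in roots for t in types)}


def format_whitelist_lines(type_ids, entity_to_name):
    """Build '<type_id>\\t<name>' lines, falling back to "OTHER" for unknown names."""
    return [f"{t}\t{entity_to_name.get(t, 'OTHER')}\n" for t in sorted(type_ids)]


# Example usage (writes the file mentioned in the text; adjust the path as needed):
#
#     lines = format_whitelist_lines(second_level_types(entity_to_types), entity_to_name)
#     with open(whitelist_types_path, "w", encoding="utf8") as file:
#         file.writelines(lines)
```

If `find_root_types` returns more than one type, the KB has several roots, and it is worth checking by hand which of them should serve as the starting point before writing the whitelist file.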