Using A Custom Knowledge Base Example

Natalie Prange edited this page Sep 20, 2024 · 1 revision

This example shows how to add a custom knowledge base that is not in TTL format, which means the provided script for extracting custom mappings (scripts/extract_custom_mappings.py) cannot be used out of the box.

Please follow the steps in Using A Custom Knowledge Base and replace step 2 (running the scripts/extract_custom_mappings.py script) with the steps described here.

Get Example Data

In this example, we use the CTD Diseases Knowledge Base. The knowledge base can be downloaded as a gzipped TSV file from https://ctdbase.org/reports/CTD_diseases.tsv.gz and then decompressed:

wget https://ctdbase.org/reports/CTD_diseases.tsv.gz
gzip -d CTD_diseases.tsv.gz
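The extraction script further down unpacks every data line into exactly nine tab-separated fields, so it can be worth checking that the downloaded file matches this before running it. The following is a minimal sketch of such a check; the function name count_columns is hypothetical and not part of ELEVANT.

```python
from collections import Counter


def count_columns(lines):
    """Count how many data lines have each number of tab-separated columns."""
    counts = Counter()
    for line in lines:
        # Skip comment lines and blank lines; they carry no entity data
        if line.startswith("#") or not line.strip():
            continue
        # Strip only the newline so that empty trailing fields are preserved
        counts[len(line.rstrip("\n").split("\t"))] += 1
    return counts
```

Passing the open CTD_diseases.tsv file to count_columns should yield a single key if all data lines have the same number of columns. Any other key points to lines that the nine-field unpacking in the script would fail on.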

Extract Custom Mappings

Now we need to extract the custom mappings from the TSV file. Put the following in a script called extract_custom_mappings_from_tsv.py:

import argparse
import log
import sys

from src import settings


def main(args):
    logger.info(f"Creating mappings from {args.input_file} ...")

    entity_to_name = {}
    entity_to_types = {}
    all_entity_types = {}

    with open(args.input_file, "r", encoding="utf8") as file:
        for line in file:
            if line.startswith("#"):
                # Ignore comment lines starting with '#'
                continue
            # Extract relevant entity information from the TSV line
            entity_name, entity_id, _, _, types, _, _, _, _ = line.strip("\n").split("\t")
            entity_to_name[entity_id] = entity_name
            entity_to_types[entity_id] = types.split("|")

            # Collect in a dictionary all distinct entity types
            for typ in types.split("|"):
                if typ not in all_entity_types:
                    # Initialize the entity type name with "OTHER". This will
                    # be replaced later by the corresponding entity name if it
                    # exists in the KB.
                    all_entity_types[typ] = "OTHER"

    # Add the entity type name as value to each entity type ID in the type dictionary
    for t in all_entity_types.keys():
        if t in entity_to_name:
            all_entity_types[t] = entity_to_name[t]

    # Write the entity ID to entity name dictionary to file
    with open(settings.CUSTOM_ENTITY_TO_NAME_FILE, "w", encoding="utf8") as file:
        for entity_id, name in entity_to_name.items():
            file.write(f"{entity_id}\t{name}\n")

    # Write the entity ID to entity type IDs dictionary to file.
    with open(settings.CUSTOM_ENTITY_TO_TYPES_FILE, "w", encoding="utf8") as file:
        for entity_id, types in entity_to_types.items():
            file.write(f"{entity_id}")
            for typ in types:
                file.write(f"\t{typ}")
            file.write("\n")

    # Write distinct entity types with their names to file
    with open(settings.CUSTOM_WHITELIST_TYPES_FILE, "w", encoding="utf8") as file:
        for type_id, name in all_entity_types.items():
            file.write(f"{type_id}\t{name}\n")

    logger.info(f"Wrote {len(entity_to_name)} entity to name mappings to {settings.CUSTOM_ENTITY_TO_NAME_FILE} "
                f"and {len(entity_to_types)} entity to types mappings to {settings.CUSTOM_ENTITY_TO_TYPES_FILE} "
                f"and {len(all_entity_types)} entity types to {settings.CUSTOM_WHITELIST_TYPES_FILE}")


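if __name__ == "__main__":
    # Create the argument parser before adding arguments to it
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("input_file", type=str,
                        help="Input KB file that contains the custom knowledge base or ontology.")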

    logger = log.setup_logger(sys.argv[0])
    logger.debug(' '.join(sys.argv))

    main(parser.parse_args())

The script can be called with

python3 extract_custom_mappings_from_tsv.py CTD_diseases.tsv

Select Whitelist Types

Performing the steps above is essentially enough to use the custom KB in ELEVANT. However, with this script, all entity types are considered whitelist types and will be displayed in the web app as selectable categories. For the CTD Diseases KB, this would be 2943 entity types, which is far too many to display in the web app, so we should reduce the number of whitelist types.

An easy choice is to select the second-level types of the type hierarchy. That is, view the entities in the KB as nodes and add a directed edge from each entity to each of its types. First determine the root type, which is the node with incoming but no outgoing edges. This is typically a type like "entity" or, in the case of the CTD Diseases KB, "disease". The whitelist types are then the nodes with an edge pointing directly to this root type.

Write the selected whitelist types to the file <data_directory>/custom_mappings/whitelist_types.tsv. Each line should contain the type URI and the type name. Finally, copy the whitelist_types.tsv file to the small-data-files directory.
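The selection of second-level types can be sketched as follows. This is a minimal sketch and not part of ELEVANT: the function names find_root_type, select_second_level_types and whitelist_lines are hypothetical. It assumes the entity_to_types and entity_to_name dictionaries built by the script above, where an entity without a parent has a single empty type string, and it assumes that type ID and type name are separated by a tab, as in the files the script writes.

```python
def find_root_type(entity_to_types):
    """Return the single type that is used as a type but has no types itself."""
    # Every non-empty type ID that some entity points to (incoming edges)
    used_as_type = {t for types in entity_to_types.values() for t in types if t}
    # Keep those without any non-empty type of their own (no outgoing edges)
    roots = sorted(t for t in used_as_type if not any(entity_to_types.get(t, [])))
    if len(roots) != 1:
        raise ValueError(f"Expected exactly one root type, found {len(roots)}: {roots}")
    return roots[0]


def select_second_level_types(entity_to_types):
    """Return the sorted IDs of all entities that have the root as a direct type."""
    root = find_root_type(entity_to_types)
    return sorted(e for e, types in entity_to_types.items() if root in types)


def whitelist_lines(type_ids, entity_to_name):
    """Format one 'type ID <tab> type name' line per selected type."""
    # Fall back to "OTHER" for types without a name, as the script above does
    return [f"{t}\t{entity_to_name.get(t, 'OTHER')}\n" for t in type_ids]
```

The lines returned by whitelist_lines(select_second_level_types(entity_to_types), entity_to_name) can then be written to the whitelist types file with file.writelines(...). The sketch raises an error if it does not find exactly one root type, since in that case the choice of second-level types is ambiguous and should be made by hand.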