Text2Graph is a Python-based framework for the autonomous construction of domain-specific Knowledge Graphs (KG) from unstructured text data. The system transforms textual data into a labeled property graph-based representation using various NLP and semantic processing tools and integrates with Neo4j for graph storage and querying.
- Automated NLP pipeline for entity extraction, relation extraction, and semantic enrichment.
- Integration with Neo4j for graph-based storage and querying.
- Support for large-scale Knowledge Graphs with rich domain-specific entity typing, event tagging, and temporal relationships.
- Extensible and schema-free architecture.
- Ubuntu Linux (or Windows Subsystem for Linux (WSL))
- Neo4j 4.4 with APOC plugin
- Docker for external NLP services
- Python 3.8+
Ensure Python 3.8+ is installed. You can install the required dependencies using the following instructions:
- Clone the repository
git clone https://github.com/neostrange/text2graph.git
cd text2graph
- Install spaCy Models
pip install spacy
python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
- Additional SpaCy Dependency
pip install spacy-dbpedia-spotlight
- Install WordNet (nltk) 3.1
pip install nltk
python -c "import nltk; nltk.download('wordnet')"
- Other Python Dependencies
pip install requests GPUtil configparser neo4j
pip install py2neo==2021.2.3
Note: cgitb, distutils, json, tokenize, and textwrap are standard-library modules in Python 3.8 and are not installed with pip.
- SpaCy Transformers (Version Fix): If you encounter issues with spaCy's transformers package, you may need to downgrade:
pip install spacy-transformers==1.1.6
- Install Neo4j 4.4: Follow the Neo4j installation instructions for your system.
- Enable APOC Plugin
- Install the APOC plugin via the Neo4j plugin manager.
- Create an apoc.conf file in the Neo4j configuration folder (typically located at /var/lib/neo4j/conf/) and add the following:
apoc.import.file.enabled=true
- Copy Dataset Files
- Copy the files from the dataset folder in this repository to the Neo4j import folder (usually located at /var/lib/neo4j/import/).
- Restart Neo4j
- Restart the Neo4j service to apply the changes:
sudo systemctl restart neo4j
- Neo4j Configuration for WSL
- If running on WSL, enable the default listen address in the Neo4j configuration file (/etc/neo4j/neo4j.conf):
dbms.default_listen_address=0.0.0.0
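The two configuration edits above (apoc.conf and neo4j.conf) both amount to making sure a `key=value` line is present in a file. A minimal sketch of doing that idempotently from Python follows; the helper name `ensure_setting` is hypothetical and not part of Text2Graph, and the paths you pass in are the ones listed above, which may differ on your system.

```python
from pathlib import Path


def ensure_setting(conf_path, key, value):
    """Set `key=value` in a Neo4j-style .conf file, creating the file if needed.

    Returns True if the file was changed, False if the setting was already there.
    Commented-out lines are left untouched; a missing key is appended.
    """
    path = Path(conf_path)
    lines = path.read_text().splitlines() if path.exists() else []
    wanted = f"{key}={value}"
    found = False
    changed = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Only active (non-comment) assignment lines are considered.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        if stripped.split("=", 1)[0].strip() == key:
            found = True
            if stripped != wanted:
                lines[i] = wanted
                changed = True
    if not found:
        lines.append(wanted)
        changed = True
    if changed:
        path.write_text("\n".join(lines) + "\n")
    return changed
```

For example, `ensure_setting("/var/lib/neo4j/conf/apoc.conf", "apoc.import.file.enabled", "true")` would cover the APOC step. Writing to these locations normally requires root privileges, and Neo4j still has to be restarted afterwards.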
To extend the pipeline, you must ensure the following Docker containers are running:
- Coreference Resolution:
- Build from this repository: Spacy-Coref-Docker
- Expose on localhost:9999:
docker run -p 9999:9999 neostrange/spacy-experimental-coref
- Event Tagging:
- Build from this repository: TTK Docker
- Expose on localhost:5050:
docker run -p 5050:5050 neostrange/ttk
- Temporal Expression Tagging:
- Use the HeidelTime WebService: HeidelTime-WebService-Docker
- Expose on localhost:5000:
docker run -p 5000:5000 neostrange/heideltime
- Word Sense Disambiguation:
- Semantic Role Labeling:
- text2graph uses AllenNLP to perform semantic role labeling. You can find the instructions to set up a Docker container for AllenNLP SRL
- From AllenNLP Docker:
- Expose on localhost:8000:
docker run -p 8000:8000 allennlp/allennlp
Make sure all Docker services are running before starting the text2graph pipeline to ensure full functionality for entity enrichment, event tagging, and temporal expressions.
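As a quick pre-flight check, the ports published by the `docker run` commands above can be probed from Python. This is only a sketch: the function names and service labels are illustrative, and a successful TCP connection shows that something is listening on the port, not that the service behind it is healthy.

```python
import socket

# Ports taken from the `docker run -p` commands above; the keys are just labels.
SERVICES = {
    "coreference": 9999,
    "event tagging (TTK)": 5050,
    "temporal tagging (HeidelTime)": 5000,
    "semantic role labeling (AllenNLP)": 8000,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_services(services=SERVICES, host="localhost"):
    """Return the labels of services whose port does not accept connections."""
    return [name for name, port in services.items() if not port_is_open(host, port)]
```

Calling `unreachable_services()` before Phase 1 returns an empty list when all four ports accept connections; any label it returns points at a container that still needs to be started.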
If you're running the project on WSL (Windows Subsystem for Linux), you may need to configure the firewall:
- Add WSL to Windows Firewall:
- Run the following command in PowerShell (as Administrator):
New-NetFirewallRule -DisplayName "WSL" -Direction Inbound -InterfaceAlias "vEthernet (WSL)" -Action Allow
- Restart Your System:
- After applying the firewall rule, restart your computer.
Set up the Python Path (optional):
If you need to work with nested directories, you can add the current working directory to the Python path:
export PYTHONPATH="$(pwd):$PYTHONPATH"
The Text2Graph pipeline is a modular system designed to efficiently generate Knowledge Graphs from textual data. It consists of several distinct phases, each focusing on specific NLP tasks. Let's walk through how to run the pipeline and explore each phase:
Phase 1: Basic Linguistic Analysis (python3 GraphBasedNLP.py --input /path/to/text/documents)
- Function: This phase performs the foundational tasks of Natural Language Processing (NLP) on the input text documents.
- Input: You can specify the path to your text documents using the --input argument. If no argument is provided, the script loads text files by default from the data/dataset folder within the Text2Graph repository. Currently, this folder contains pre-loaded files from the MEANTIME corpus for your convenience.
Running Phase 1:
- Open a terminal window and navigate to the directory containing the GraphBasedNLP.py script within your Text2Graph installation.
- (Optional) If you have your own text documents, execute the script with the --input argument followed by the path to your data directory:
python3 GraphBasedNLP.py --input /path/to/your/text/documents
- If you'd like to use the pre-loaded MEANTIME corpus data, simply run the script without any arguments:
python3 GraphBasedNLP.py
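The `--input` behaviour described above (an optional directory that falls back to data/dataset) follows a common command-line pattern. The sketch below shows that pattern with `argparse`; it is not the code in GraphBasedNLP.py, and `parse_args` and `list_documents` are hypothetical names used only for illustration.

```python
import argparse
from pathlib import Path

# Default folder named in the text above, relative to the repository root.
DEFAULT_INPUT = Path("data/dataset")


def parse_args(argv=None):
    """Parse an optional --input directory, falling back to DEFAULT_INPUT."""
    parser = argparse.ArgumentParser(description="Phase 1 input selection (illustrative)")
    parser.add_argument("--input", type=Path, default=DEFAULT_INPUT,
                        help="directory containing the text documents")
    return parser.parse_args(argv)


def list_documents(directory):
    """Return the regular files in `directory`, sorted by path."""
    return sorted(p for p in Path(directory).iterdir() if p.is_file())
```

With this pattern, `parse_args([]).input` is the default folder, while `parse_args(["--input", "/path/to/your/text/documents"]).input` is the directory supplied by the user.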
Phase 2: Refinement Phase
- Function: This phase focuses on refining the extracted information from Phase 1. It establishes connections between different linguistic elements and ensures consistency within the data.
- Input: The output from Phase 1 (typically stored in a Neo4j database) serves as the input for this phase.
Running Phase 2:
- Ensure Phase 1 has completed successfully.
- Navigate to the directory containing the RefinementPhase.py script.
- Execute the script:
python3 RefinementPhase.py
Phase 3: Temporal Enrichment
- Function: This phase enriches the Knowledge Graph with temporal information. It involves identifying and tagging time expressions and event triggers within the text data.
- Input: The refined data from Phase 2 is used as input for this phase.
Running Phase 3:
- Ensure Phases 1 and 2 have completed successfully.
- Navigate to the directory containing the TemporalPhase.py script.
- Execute the script:
python3 TemporalPhase.py
Phase 4: Event Enrichment
- Function: This phase focuses on enriching event information within the Knowledge Graph. It establishes links between identified events and entities, as well as other events, based on the linguistic elements present in the graph.
- Input: The temporally enriched data from Phase 3 is used as input for this phase.
Running Phase 4:
- Ensure Phases 1, 2 and 3 have completed successfully.
- Navigate to the directory containing the EventEnrichmentPhase.py script.
- Execute the script:
python3 EventEnrichmentPhase.py
Phase 5: TLink Recognition
- Function: This phase identifies Temporal Links (TLinks) within the Knowledge Graph. TLinks describe temporal relationships between events, such as "before," "after," or "during".
- Input: The event-enriched data from Phase 4 serves as input for TLink recognition.
Running Phase 5:
- Ensure all the previous steps have been completed.
- Navigate to the directory containing the TlinksRecognizer.py script.
- Execute the script:
python3 TlinksRecognizer.py
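Because each phase depends on the previous one, the five scripts can also be chained from a small driver that stops at the first failure. The sketch below assumes all five scripts live in the same directory (passed as `cwd`), which the instructions above do not guarantee; `build_commands` and `run_pipeline` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

# Script names as given in the phase descriptions above, in execution order.
PHASE_SCRIPTS = [
    "GraphBasedNLP.py",
    "RefinementPhase.py",
    "TemporalPhase.py",
    "EventEnrichmentPhase.py",
    "TlinksRecognizer.py",
]


def build_commands(input_dir=None, python=sys.executable):
    """Return one argv list per phase; only Phase 1 takes the --input argument."""
    commands = []
    for index, script in enumerate(PHASE_SCRIPTS):
        cmd = [python, script]
        if index == 0 and input_dir is not None:
            cmd += ["--input", str(input_dir)]
        commands.append(cmd)
    return commands


def run_pipeline(input_dir=None, cwd=None, runner=subprocess.run):
    """Run the phases in order, stopping at the first failure.

    With the default runner, check=True turns a non-zero exit status into
    CalledProcessError, so later phases never run after a failed one.
    """
    for cmd in build_commands(input_dir):
        runner(cmd, check=True, cwd=cwd)
```

A call such as `run_pipeline(cwd="/path/to/text2graph")` would run all phases on the default dataset. The `runner` parameter exists so the sequencing can be exercised without launching the real scripts.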
Note: While the REST endpoints powered by FastAPI are not yet implemented, you can still interact with the generated Knowledge Graph directly through the Neo4j Browser or Neo4j Bloom.
These tools provide a user-friendly interface for exploring and querying the graph data. You can execute Cypher queries to retrieve specific information or visualize the graph structure.
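The same Cypher queries can be issued from Python. The sketch below uses a deliberately schema-agnostic query that counts nodes per label, since the labels Text2Graph creates are not listed here. `label_counts` is a hypothetical helper, and the connection URI and credentials in the commented usage are placeholders rather than project defaults.

```python
# Counts nodes per label without assuming any particular graph schema.
LABEL_COUNT_QUERY = (
    "MATCH (n) "
    "UNWIND labels(n) AS label "
    "RETURN label, count(*) AS nodes "
    "ORDER BY nodes DESC"
)


def label_counts(session):
    """Return {label: node count} using any object with a neo4j-style run()."""
    return {record["label"]: record["nodes"] for record in session.run(LABEL_COUNT_QUERY)}


# Usage with the official neo4j driver (needs a running Neo4j instance):
#
#   from neo4j import GraphDatabase
#   with GraphDatabase.driver("bolt://localhost:7687",
#                             auth=("neo4j", "<password>")) as driver:
#       with driver.session() as session:
#           print(label_counts(session))
```

Running this after each phase is a simple way to see which node types a phase has added to the graph.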
- A. Hur, N. Janjua, and M. Ahmed, "A Survey on State-of-the-art Techniques for Knowledge Graphs Construction and Challenges ahead," 2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), Laguna Hills, CA, USA, 2021, pp. 99-103, doi: 10.1109/AIKE52691.2021.00021.
- A. Hur, N. Janjua, and M. Ahmed, "Unifying context with labeled property graph: A pipeline-based system for comprehensive text representation in NLP," Expert Systems with Applications, vol. 239, 2024, 122269, doi: 10.1016/j.eswa.2023.122269.
- A. Hur and N. Janjua, "Constructing Domain-Specific Knowledge Graphs From Text: A Case Study on Subprime Mortgage Crisis," Special Issue on Knowledge Graph Construction, Semantic Web Journal (IOS Press), 2024 (under review).
We welcome contributions to Text2Graph! If you encounter any bugs, have feature requests, or wish to contribute new functionality, please submit a pull request or open an issue on our GitHub repository. Your contributions help us enhance Text2Graph and make it more valuable for the community.
Thank you for your support!
This project is licensed under the MIT License. See the LICENSE file for details.