An exploration of NLP methods for information extraction from biographies, using the Extended Taipei Gazetteers.
Proposed NLP Methods Overview
1. Named Entity Recognition
2. Relation Extraction
3. Weighted Cooccurrence Rank
4. Automatic Timeline Generation
Usage
See the GitHub Wiki for more details.
We propose and implement some new NLP methods for information extraction.
Increase recall by combining multiple NER tools with auxiliary information, then increase precision by applying filters and heuristic principles.
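As a rough illustration of this union-then-filter idea (not the repository's actual implementation), the sketch below merges candidate name sets from several NER tools to raise recall, then applies simple precision filters; the length check and blacklist are placeholder rules.

```python
def merge_ner_results(tool_outputs, blacklist=frozenset({"先生", "女士"})):
    """tool_outputs: one set of candidate names per NER tool."""
    # Union of all tools' outputs -> higher recall.
    candidates = set().union(*tool_outputs)
    # Simple precision filters: drop single-character names and blacklisted terms.
    return {name for name in candidates
            if len(name) >= 2 and name not in blacklist}

# Toy example: outputs from two hypothetical NER tools on the same passage.
corenlp_names = {"王世慶", "臺北", "先生"}
other_tool_names = {"王世慶", "張炎憲"}
print(merge_ner_results([corenlp_names, other_tool_names]))
# -> {'王世慶', '臺北', '張炎憲'}
```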
To support the main relation extraction method, we can also extract relations from grammatical structure, based on the observation that the biographee's name is usually omitted in the text.
Take the biography of "王世慶" as an example (under the assumption that the correct grammatical structure is detected).
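The sketch below shows only the core idea under a simplifying assumption: sentences have already been reduced to (subject, verb, object) triples (e.g. from a dependency parse), and any triple with an omitted subject is attributed to the biographee. The pipeline's actual grammar-structure detection is more involved.

```python
BIOGRAPHEE = "王世慶"  # the person the biography is about

def extract_relations(svo_triples, biographee=BIOGRAPHEE):
    relations = []
    for subj, verb, obj in svo_triples:
        # Biographical prose often omits the subject when it is the biographee,
        # so an empty subject slot is filled with the biographee's name.
        relations.append((subj or biographee, verb, obj))
    return relations

# Toy triples; the subject is omitted in the second sentence.
triples = [("王世慶", "生於", "臺北"),
           (None, "任職", "臺灣省文獻委員會")]
print(extract_relations(triples))
```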
Calculate and rank a cooccurrence score weighted by the distance between names, the delimiters between them, and how many times they cooccur, in order to identify the truly important cooccurrences and relations not found by the other methods.
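The exact weighting is documented in the Wiki; the sketch below only illustrates the general shape of such a score under assumed weights: each cooccurrence of two names contributes more when the names are close together and separated by fewer delimiters, and the contributions are summed and ranked.

```python
from collections import defaultdict

DELIMITERS = "。；，"

def cooccurrence_scores(text, names, window=30):
    # Character offsets of every occurrence of every name.
    positions = {n: [i for i in range(len(text)) if text.startswith(n, i)]
                 for n in names}
    scores = defaultdict(float)
    for a in names:
        for b in names:
            if a >= b:          # score each unordered pair once
                continue
            for i in positions[a]:
                for j in positions[b]:
                    dist = abs(i - j)
                    if dist == 0 or dist > window:
                        continue
                    lo, hi = min(i, j), max(i, j)
                    n_delims = sum(text.count(d, lo, hi) for d in DELIMITERS)
                    # Closer names and fewer intervening delimiters -> higher weight.
                    scores[(a, b)] += 1.0 / (dist * (1 + n_delims))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example text.
text = "王世慶與張炎憲合編史料。後來，王世慶任職臺灣省文獻委員會。"
print(cooccurrence_scores(text, ["王世慶", "張炎憲", "臺灣省文獻委員會"]))
```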
Generate a complete timeline using delimiters and some principles, or a simpler timeline using grammatical structure.
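A heavily simplified sketch of the delimiter-based variant, with made-up example text: split the text into clauses at common delimiters, attach each clause to the most recently seen explicit year, and sort. The real pipeline applies additional principles beyond this.

```python
import re

def simple_timeline(text):
    year_pattern = re.compile(r"(\d{4})年")
    timeline, current_year = [], None
    # Split into clauses at common Chinese delimiters.
    for clause in re.split(r"[。；，]", text):
        match = year_pattern.search(clause)
        if match:
            current_year = int(match.group(1))
        if current_year is not None and clause:
            timeline.append((current_year, clause))
    return sorted(timeline, key=lambda item: item[0])

# Made-up example text.
text = "1928年生於臺北，就讀公學校。1958年任職臺灣省文獻委員會。"
for year, event in simple_timeline(text):
    print(year, event)
```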
- Python 3 (we develop with Python 3.6)
pip install -r requirements.txt
to install all required Python packages.
- MongoDB
- Stanford CoreNLP
Download the main program and unzip it somewhere.
Download the Chinese model jar and move it into the Stanford CoreNLP directory you just unzipped.
- Start MongoDB daemon.
sudo service mongod start
(on Ubuntu)
- Start the CoreNLP server.
In the Stanford CoreNLP directory, execute the command
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000
- Execute the main pipeline process and wait for several minutes.
python3 main.py
- Results are in
./Database
Some results are also kept in MongoDB (see Wiki:Data).
Note that the graph result is stored in .graphml format; you can import it into Gephi, Cytoscape, or any other tool you prefer.
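Besides Gephi or Cytoscape, the .graphml file can also be loaded programmatically, for example with networkx; the file name below is a placeholder for whichever .graphml file the pipeline wrote under ./Database.

```python
import networkx as nx

# "relation_graph.graphml" is a hypothetical file name; substitute the actual
# .graphml file produced under ./Database.
graph = nx.read_graphml("./Database/relation_graph.graphml")
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
# Inspect one person's neighbourhood, if that node exists in the graph.
if "王世慶" in graph:
    print(sorted(graph.neighbors("王世慶")))
```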
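To peek at what was written to MongoDB without knowing the exact database and collection names in advance (see Wiki:Data for those), a small pymongo loop like the one below lists them and prints one sample document each; this assumes pymongo is installed and mongod is running on the default port.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
for db_name in client.list_database_names():
    if db_name in ("admin", "config", "local"):
        continue  # skip MongoDB's internal databases
    db = client[db_name]
    for coll_name in db.list_collection_names():
        # Print one sample document per collection to see its shape.
        print(db_name, coll_name, db[coll_name].find_one())
```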