An exploration of NLP methods for information extraction from biographies, using the Extended Taipei Gazetteers.
Proposed NLP Methods Overview
1. Named Entity Recognition
2. Relation Extraction
3. Weighted Cooccurrence Rank
4. Automatic Timeline Generation
Usage
See the GitHub Wiki for more details.
We propose and implement some new NLP methods for information extraction.
Increase recall by combining multiple NER tools with auxiliary information, then increase precision by applying filters and heuristic principles.
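As a rough illustration of this union-then-filter idea (not the repository's actual implementation), the sketch below merges candidate name sets from several NER tools to raise recall, then applies simple precision filters; the length check and blacklist are placeholder rules.

```python
def merge_ner_results(tool_outputs, blacklist=frozenset({"先生", "女士"})):
    """tool_outputs: one set of candidate names per NER tool."""
    # Union of all tools' outputs -> higher recall.
    candidates = set().union(*tool_outputs)
    # Simple precision filters: drop single-character names and blacklisted terms.
    return {name for name in candidates
            if len(name) >= 2 and name not in blacklist}

# Toy example: outputs from two hypothetical NER tools on the same passage.
corenlp_names = {"王世慶", "臺北", "先生"}
other_tool_names = {"王世慶", "張炎憲"}
print(merge_ner_results([corenlp_names, other_tool_names]))
# -> {'王世慶', '臺北', '張炎憲'}
```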
To support the main relation extraction method, we can also extract relations from grammatical structure, based on the observation that the biographee's name is usually omitted in the text.
Take the biography of "王世慶" as an example (under the assumption that the correct grammatical structure is detected).
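The sketch below shows only the core idea under a simplifying assumption: sentences have already been reduced to (subject, verb, object) triples (e.g. from a dependency parse), and any triple with an omitted subject is attributed to the biographee. The pipeline's actual grammar-structure detection is more involved.

```python
BIOGRAPHEE = "王世慶"  # the person the biography is about

def extract_relations(svo_triples, biographee=BIOGRAPHEE):
    relations = []
    for subj, verb, obj in svo_triples:
        # Biographical prose often omits the subject when it is the biographee,
        # so an empty subject slot is filled with the biographee's name.
        relations.append((subj or biographee, verb, obj))
    return relations

# Toy triples; the subject is omitted in the second sentence.
triples = [("王世慶", "生於", "臺北"),
           (None, "任職", "臺灣省文獻委員會")]
print(extract_relations(triples))
```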
Calculate and rank a cooccurrence score weighted by the distance between names, the delimiters between them, and how many times they cooccur, in order to identify the truly important cooccurrences and relations not found by the other methods.
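The exact weighting is documented in the Wiki; the sketch below only illustrates the general shape of such a score under assumed weights: each cooccurrence of two names contributes more when the names are close together and separated by fewer delimiters, and the contributions are summed and ranked.

```python
from collections import defaultdict

DELIMITERS = "。；，"

def cooccurrence_scores(text, names, window=30):
    # Character offsets of every occurrence of every name.
    positions = {n: [i for i in range(len(text)) if text.startswith(n, i)]
                 for n in names}
    scores = defaultdict(float)
    for a in names:
        for b in names:
            if a >= b:          # score each unordered pair once
                continue
            for i in positions[a]:
                for j in positions[b]:
                    dist = abs(i - j)
                    if dist == 0 or dist > window:
                        continue
                    lo, hi = min(i, j), max(i, j)
                    n_delims = sum(text.count(d, lo, hi) for d in DELIMITERS)
                    # Closer names and fewer intervening delimiters -> higher weight.
                    scores[(a, b)] += 1.0 / (dist * (1 + n_delims))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example text.
text = "王世慶與張炎憲合編史料。後來，王世慶任職臺灣省文獻委員會。"
print(cooccurrence_scores(text, ["王世慶", "張炎憲", "臺灣省文獻委員會"]))
```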
Generate a complete timeline using delimiters and some principles, or a simpler timeline using grammatical structure.
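A heavily simplified sketch of the delimiter-based variant, with made-up example text: split the text into clauses at common delimiters, attach each clause to the most recently seen explicit year, and sort. The real pipeline applies additional principles beyond this.

```python
import re

def simple_timeline(text):
    year_pattern = re.compile(r"(\d{4})年")
    timeline, current_year = [], None
    # Split into clauses at common Chinese delimiters.
    for clause in re.split(r"[。；，]", text):
        match = year_pattern.search(clause)
        if match:
            current_year = int(match.group(1))
        if current_year is not None and clause:
            timeline.append((current_year, clause))
    return sorted(timeline, key=lambda item: item[0])

# Made-up example text.
text = "1928年生於臺北，就讀公學校。1958年任職臺灣省文獻委員會。"
for year, event in simple_timeline(text):
    print(year, event)
```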
- Python 3 (we develop with Python 3.6)
pip install -r requirements.txt
to install all required Python packages.
- MongoDB
- Stanford CoreNLP
Download the main program and unzip it somewhere.
Download the Chinese model jar and move it into the Stanford CoreNLP directory you just unzipped.
- Start MongoDB daemon.
sudo service mongod start
(on Ubuntu)
- Start the CoreNLP server.
In the Stanford CoreNLP directory, execute the command
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000
- Execute the main pipeline process and wait for several minutes.
python3 main.py
- Results are in
./Database
Some results are also kept in MongoDB (see Wiki:Data).
Note that the graph result is stored in .graphml format; you can import it into Gephi, Cytoscape, or any other tool you prefer.
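Besides Gephi or Cytoscape, the .graphml file can also be loaded programmatically, for example with networkx; the file name below is a placeholder for whichever .graphml file the pipeline wrote under ./Database.

```python
import networkx as nx

# "relation_graph.graphml" is a hypothetical file name; substitute the actual
# .graphml file produced under ./Database.
graph = nx.read_graphml("./Database/relation_graph.graphml")
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
# Inspect one person's neighbourhood, if that node exists in the graph.
if "王世慶" in graph:
    print(sorted(graph.neighbors("王世慶")))
```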
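To peek at what was written to MongoDB without knowing the exact database and collection names in advance (see Wiki:Data for those), a small pymongo loop like the one below lists them and prints one sample document each; this assumes pymongo is installed and mongod is running on the default port.

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
for db_name in client.list_database_names():
    if db_name in ("admin", "config", "local"):
        continue  # skip MongoDB's internal databases
    db = client[db_name]
    for coll_name in db.list_collection_names():
        # Print one sample document per collection to see its shape.
        print(db_name, coll_name, db[coll_name].find_one())
```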