GitHub - PhaniJella/Phrase

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
scripts		scripts
PETRgraph.py		PETRgraph.py
PETRgraph.pyc		PETRgraph.pyc
README		README
arabic_phrase_extract.py		arabic_phrase_extract.py
generateParsedFile.py		generateParsedFile.py
preprocess.py		preprocess.py
preprocess_doc.py		preprocess_doc.py
run_document.sh		run_document.sh
run_sentence.sh		run_sentence.sh

Repository files navigation

=============================================================================
1. Prerequisites
=============================================================================
1a.TurboParser
You need download TurboParser v2.3 and compile it following the instructions in TurboParser's README file.
LINK:http://www.cs.cmu.edu/~ark/TurboParser/

1b. Stanford word segmenter
You need download Stanford segmenter v3.6.0 and unzip the file.
LINK:http://nlp.stanford.edu/software/segmenter.shtml#Download

1c. pre-trained Arabic models for pos tagging and dependency parser
-> Download two models from the following link: https://1drv.ms/u/s!An8tQToldhOkhO9O8p3lB1oR5MUl6A
-> Uncompress the models. 
-> create a folder "arabic" under directory models/ in the root folder of TurboParser and save the models in that folder (e.g. /users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/models/arabic/arabic_tagger.model)

1d. Stanford CoreNLP
If your input file is document and you need to use Stanford CoreNLP to do sentence splitting
LINK:http://stanfordnlp.github.io/CoreNLP/download.html

============================================================================
2. Usage
============================================================================
Download the entire repository and save it in the root folder of TurboParser. Shell script run_sentence.sh and run_document.sh are provided to do phrase extraction of an input arabic file.

usage: Based on the input format (sentence or document), you can run either run_sentence.sh or run_document.sh
run_sentence.sh Sample_arabic_sent.xml
run_document.sh Sample_arabic_doc.xml


2a. Input data format
Two types of input format is allowed:
-> sentence: the input file has one sentence per entry. A sample "Sample_arabic_sent.xml" of input data can be found under directory data/
-> document: the input file has one document per entry. This type of input will perform sentence splitting before other processes. A sample "Sample_arabic_doc.xml" of input data can be found under directory data/

2b. Running run_document.sh or run_sentence.sh
You need change the value of following parameters based on your situation

-> SCRIPT: the location where you put folder "scripts"
example:
SCRIPT=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/scripts

-> FILE: the location where you put the input file 
example:
FILE=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/data

-> STANFORD_SEG: the location where the stanford segmenter is saved
example:
STANFORD_SEG=/users/ljwinnie/Downloads/stanford-segmenter-2015-12-09

-> TurboParserPath: the location where TurboParser is saved
example:
TurboParserPath=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0

->STANFORD_CORENLP: the location where the Stanford CoreNLP is saved
example:
STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29