forked from JingL1014/Phrase_extractor
-
Notifications
You must be signed in to change notification settings - Fork 0
PhaniJella/Phrase_extractor
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
============================================================================= 1. Prerequisites ============================================================================= 1a.TurboParser You need download TurboParser v2.3 and compile it following the instructions in TurboParser's README file. LINK:http://www.cs.cmu.edu/~ark/TurboParser/ 1b. Stanford word segmenter You need download Stanford segmenter v3.6.0 and unzip the file. LINK:http://nlp.stanford.edu/software/segmenter.shtml#Download 1c. pre-trained Arabic models for pos tagging and dependency parser -> Download two models from the following link: https://1drv.ms/u/s!An8tQToldhOkhO9O8p3lB1oR5MUl6A -> Uncompress the models. -> create a folder "arabic" under directory models/ in the root folder of TurboParser and save the models in that folder (e.g. /users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/models/arabic/arabic_tagger.model) 1d. Stanford CoreNLP If your input file is document and you need to use Stanford CoreNLP to do sentence splitting LINK:http://stanfordnlp.github.io/CoreNLP/download.html ============================================================================ 2. Usage ============================================================================ Download the entire repository and save it in the root folder of TurboParser. Shell script run_sentence.sh and run_document.sh are provided to do phrase extraction of an input arabic file. usage: Based on the input format (sentence or document), you can run either run_sentence.sh or run_document.sh run_sentence.sh Sample_arabic_sent.xml run_document.sh Sample_arabic_doc.xml 2a. Input data format Two types of input format is allowed: -> sentence: the input file has one sentence per entry. A sample "Sample_arabic_sent.xml" of input data can be found under directory data/ -> document: the input file has one document per entry. This type of input will perform sentence splitting before other processes. A sample "Sample_arabic_doc.xml" of input data can be found under directory data/ 2b. Running run_document.sh or run_sentence.sh You need change the value of following parameters based on your situation -> SCRIPT: the location where you put folder "scripts" example: SCRIPT=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/scripts -> FILE: the location where you put the input file example: FILE=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/data -> STANFORD_SEG: the location where the stanford segmenter is saved example: STANFORD_SEG=/users/ljwinnie/Downloads/stanford-segmenter-2015-12-09 -> TurboParserPath: the location where TurboParser is saved example: TurboParserPath=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0 ->STANFORD_CORENLP: the location where the Stanford CoreNLP is saved example: STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Python 69.6%
- Shell 24.0%
- Perl 6.4%