JingL1014/Phrase_extractor
=============================================================================
1. Prerequisites
=============================================================================

1a. TurboParser

Download TurboParser v2.3 and compile it following the instructions in
TurboParser's README file.
LINK: http://www.cs.cmu.edu/~ark/TurboParser/

TurboParser requires pre-trained Arabic models for POS tagging and
dependency parsing:
-> Download the two models from the following link:
   https://1drv.ms/u/s!An8tQToldhOkhO9O8p3lB1oR5MUl6A
-> Uncompress the models.
-> Create a folder "arabic" under the models/ directory in the root folder
   of TurboParser and save the models in that folder (e.g.
   /users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/models/arabic/arabic_tagger.model).

UDpipe

An alternative parser is UDpipe; the pre-compiled binary package
(udpipe-1.0.0-bin.zip) can be downloaded at:
https://github.com/ufal/udpipe/releases/tag/v1.0.0
UDpipe requires a language model, which can be downloaded at:
https://1drv.ms/u/s!An8tQToldhOkh_JIep-cJhimMDiboQ

1b. Stanford word segmenter

Download Stanford segmenter v3.6.0 and unzip the file.
LINK: http://nlp.stanford.edu/software/segmenter.shtml#Download

1c. Stanford CoreNLP

If your input file contains documents, Stanford CoreNLP is needed to do
sentence splitting.
LINK: http://stanfordnlp.github.io/CoreNLP/download.html

============================================================================
2. Usage
============================================================================

Download the entire repository. Shell scripts run_sentence.sh and
run_document.sh are provided to extract phrases from an input Arabic file.

Usage: based on the input format (sentence or document), run either
run_sentence_X.sh or run_document_X.sh, where X is turboParser or udpipe
depending on your choice of parser. The first argument is the file name.
The second argument is the format of the extracted noun phrases: long,
short or both.
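The TurboParser model setup in step 1a can be sketched as a few shell commands. The paths below are illustrative; a temporary directory stands in for your real TurboParser install, and a `touch`-ed file stands in for the downloaded model (only arabic_tagger.model is named in this README, so any other model file names are assumptions):

```shell
# Sketch of the TurboParser Arabic model setup (step 1a).
# A temp directory stands in for your real TurboParser root here.
TURBO_ROOT=$(mktemp -d)/TurboParser-2.3.0

# Create the "arabic" folder under models/ in the TurboParser root ...
mkdir -p "$TURBO_ROOT/models/arabic"

# ... and move the uncompressed models there. The stand-in file below
# represents the real model downloaded from the OneDrive link above.
touch arabic_tagger.model
mv arabic_tagger.model "$TURBO_ROOT/models/arabic/"
```

After this, TurboParser can find the Arabic models under models/arabic/ relative to its root folder.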
"long" means that the extracted noun phrases include not only the head but
also additional modifiers (adjectives, prepositional phrases). "short"
means that the extracted noun phrases include just the head. "both" means
that both long and short noun phrases are extracted.

run_sentence_X.sh Sample_arabic_sent.xml [NounPhraseOption]
run_document_X.sh Sample_arabic_doc.xml [NounPhraseOption]

2a. Input data format

Two types of input format are allowed:
-> sentence: the input file has one sentence per entry. A sample input
   file "Sample_arabic_sent.xml" can be found under the data/ directory.
-> document: the input file has one document per entry. This type of input
   goes through sentence splitting before the other processing steps. A
   sample input file "Sample_arabic_doc.xml" can be found under the data/
   directory.

2b. Running run_document_X.sh or run_sentence_X.sh

Change the values of the following parameters to match your setup:
-> SCRIPT: the location of the "scripts" folder
   example: SCRIPT=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/scripts
-> FILE: the location of the input file
   example: FILE=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/data
-> STANFORD_SEG: the location where the Stanford segmenter is saved
   example: STANFORD_SEG=/users/ljwinnie/Downloads/stanford-segmenter-2015-12-09
-> TurboParserPath: the location where TurboParser is saved
   example: TurboParserPath=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0
-> udpipePath: the location where UDpipe is saved
   example: udpipePath=/users/ljwinnie/Desktop/petrarch2/phrase_extractor/udpipe-1.0.0-bin
-> languageModel: the location where the language model used by UDpipe is saved
   example: languageModel=/users/ljwinnie/Desktop/petrarch2/phrase_extractor/udpipe-1.0.0-bin/model/arabic-ud-1.2-160523.udpipe
-> STANFORD_CORENLP: the location where Stanford CoreNLP is saved
   example: STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29
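Putting the parameters from step 2b together, the configuration block at the top of a run_sentence_X.sh / run_document_X.sh script would look roughly like this. Every path is taken from the examples above and must be replaced with your own install locations:

```shell
# Example parameter block for run_sentence_X.sh / run_document_X.sh.
# All paths below are illustrative -- edit them to match your machine.
SCRIPT=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/scripts
FILE=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/data
STANFORD_SEG=/users/ljwinnie/Downloads/stanford-segmenter-2015-12-09
TurboParserPath=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0

# Only needed when using the udpipe variant of the scripts:
udpipePath=/users/ljwinnie/Desktop/petrarch2/phrase_extractor/udpipe-1.0.0-bin
languageModel=$udpipePath/model/arabic-ud-1.2-160523.udpipe

# Only needed for document input (sentence splitting):
STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29
```

With these set, the scripts are invoked as shown above, e.g. run_sentence_turboParser.sh Sample_arabic_sent.xml both.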