PhraseExtraction is a research library with variety keyphrase extraction techniques. The software has three components.
- Utility - all preprocessing modules
- Candidate phrase generation
- Ranking/Scoring Techniques
Currently implemented Techniques are:
- Candidate Phrase Generation :
- Stopwordbased split
- POS Grammar based selection
- Ngrams
- Ranking Techniques :
- Term Frequency Distribution
- RAKE Ranking (RAKE, Degree)
- Text Ranking (Embedding, Window)
Methods | Options |
---|---|
Text Cleaning | Stopwords Removal Punctuation/Digit removal Entities Removal Non-english word removal Non-ASCII filtering |
Candidate Generation | Stopword based Split POS Grammar based selection Ngrams |
Ranking | Mean TF Distribution RAKE/Degree Ranking TextRank |
- PhraseExtraction
- Table of contents
- Installation
- Usage
- Support and Contributions
- Acknowledgement
- License
Assuming that anaconda environment is already installed,
- PhraseExtraction can be installed from PyPI using
pip install PhraseExtraction
- with requirements.txt
pip install -r requirements.txt
- with yml file, create conda environment
conda env create -f environment.yml
source activate env
Example notebooks can be found in the sample_notebooks
directory.
Usage of each method/technique is described in sections below.
It contains text pre-processing methods. The sample code for usage is provided below.
# Load stopwords
from phraseextraction.utility import nltk_stopwords, spacy_stopwords, gensim_stopwords, smart_stopwords, all_stopwords
print(nltk_stopwords, spacy_stopwords, gensim_stopwords, smart_stopwords, all_stopwords)
# Remove Non-ASCII characters/symbols
from phraseextraction.utility import filter_nonascii
nonascii_text = filter_nonascii(text)
# Remove punctuation & digits
from phraseextraction.utility import remove_punct_num
text_with_punc_digit_removed = remove_punct_num(text, remove_num=True)
# Remove Non-english words (junks like website, url etc)
from phraseextraction.utility import remove_non_english
english_text = remove_non_english(text)
# Remove entities using list of entities to removes
from phraseextraction import remove_named_entities
ent_list=['DATE','GPE','PERSON','CARDINAL','ORDINAL','LAW','LOC','PERCENT','QUANTITY']
ents_removed_text = utility.remove_named_entities(text, ent_list)
# Check if a token is digit
from phraseextraction import is_number
num_bool = is_number(token)
This section describes usage for 3 techniques of keywords/phrase extraction.
- Grammar based phrase extraction requires user to define POS tags pattern for the kind of phrases one wants to pick. Rules are defined in rule.py.
from rule import grammar
from candidate_generation import Grammar_Keyphrase
grammar_model = Grammar_Keyphrase(grammar)
key_phrases = grammar_model.get_keyphrases(text)
- RAKE based phrase extraction required list of stopwords to split the sentences to get candidate phrases. By default we use, combined stopwords from nltk, gensim, spacy and smartstop list.
from candidate_generation import Rake_Keyphrase
# ngram_ : The lower and upper boundary of the range of n-values for different word n-grams (2,4) means bi, tri and quad grams only.
rake_model = Rake_Keyphrase(ngram_ = (2,4), custom_stop_words=custom_stop_words)
key_phrases = rake_model.get_keyphrases(text)
- Ngrams based extracts all possible overlapping N-grams. Preprocessing and cleaning text is important step here.
from candidate_generation import Ngram_Keyphrase
ngram_model = Ngram_Keyphrase(ngram_ = (3,3)) #only trigrams
key_phrases = ngram_model.get_keyphrases(text)
How do we know which are important keywords ? Importance can be based for various premises example counts, association or centrality. This section describes methods for ranking phrases extracted from candidate phrase generation techniques.
- RAKE/Degree Scoring: Method can take RAKE or Degree scoring. To understand scoring more, one can refer RAKE paper here.
from ranking import RakeRank
rakeRank = ranking.RakeRank(method='degree')
ranked_df = rakeRank.rank_phrases(key_phrases)
- TextRank: TextRank has two methods: Window based (WindowSize) & Embedding based (WordEmbeddings). Embedding based ranking are recommended. Currently, it uses glove embedding but we intend to extend the technique to wor2vec, BERT, custom embedding models as well.
from ranking import TextRank
TR_WordEmbedding= ranking.TextRank(method= "WordEmbeddings")
ranked_df = TR_WordEmbedding.rank_phrases(key_phrases)
- Mean Term Frequency Scoring: Uses token count frequencies to score keyphrases by taking mean probability distribution of each token in keyphrase.
from ranking import FrequencyDistRank
# takes original text/doc as input to calculate count stats
freqDistRank = ranking.FrequencyDistRank(text)
ranked_df = freqDistRank.rank_phrases(key_phrases)
Please submit bug reports and feature requests as Issues. Contributions are very welcome.
For additional questions and feedback, please contact us at [email protected]
PhraseExtraction is developed under a mentorship program at Fidelity Investments.
PhraseExtraction is licensed under the Apache License 2.0.