diff --git a/README.md b/README.md index 46132457..f2eb347d 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@

From zero to hero

-Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020 decade with little knowledge if any in linguistic. +Texthero is a python toolkit to work with text-based dataset quickly and effortlessly. Texthero is very simple to learn and designed to be used on top of Pandas. Texthero has the same expressiveness and power of Pandas and is extensively documented. Texthero is modern and conceived for programmers of the 2020 decade with little knowledge if any in linguistic. You can think of Texthero as a tool to help you _understand_ and work with text-based dataset. Given a tabular dataset, it's easy to _grasp the main concept_. Instead, given a text dataset, it's harder to have quick insights into the underline data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the obtained vector space takes just a couple of lines. @@ -55,7 +55,7 @@ Texthero include tools for: * Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation. * Text visualization: vector space visualization, place localization on maps (wip). -Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). +Texthero is free, open-source and [well documented](https://texthero.org/docs) (and that's what we love most by the way!). We hope you will find pleasure working with Texthero as we had during his development. @@ -63,7 +63,7 @@ We hope you will find pleasure working with Texthero as we had during his develo Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things. -Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned here, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP, contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer! +Now, the next main milestone is to provide *multilingual support* and for this big step, we need the help of all of you. ¿Hablas español? Sprechen Sie Deutsch? 你会说中文 日本語が話せるのか?Fala português? Parli Italiano? Вы говорите по-русски? If yes or you speak another language not mentioned here, then you might help us develop multilingual support! Even if you haven't contributed before or you just started with NLP, contact us or open a Github issue, there is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP-developer! For improving the python toolkit and provide an even better experience, your aid and feedback are crucial. If you have any problem or suggestion please open a Github [issue](https://github.com/jbesomi/texthero/issues), we will be glad to support you and help you. @@ -92,7 +92,7 @@ pip install texthero

Getting started

-The best way to learn Texthero is through the Getting Started docs. +The best way to learn Texthero is through the Getting Started docs. In case you are an advanced python user, then `help(texthero)` should do the work. @@ -102,20 +102,21 @@ In case you are an advanced python user, then `help(texthero)` should do the wor ```python -import texthero as hero -import pandas as pd - -df = pd.read_csv( - "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -) - -df['pca'] = ( - df['text'] - .pipe(hero.clean) - .pipe(hero.tfidf) - .pipe(hero.pca) -) -hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") +>>> import texthero as hero +>>> import pandas as pd +>>> +>>> df = pd.read_csv( +... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +... ) +>>> +>>> df['pca'] = ( +... df['text'] +... .pipe(hero.clean) +... .pipe(hero.tokenize) +... .pipe(hero.tfidf) +... .pipe(hero.pca) +... ) +>>> hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news") ```
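For readers who prefer explicit intermediate columns over a single `.pipe` chain, the same pipeline can also be written step by step. The sketch below is equivalent to the example above; the column names `clean_text` and `tokens` are arbitrary choices, everything else reuses the functions shown there.

```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

# Each step stores its result in its own column instead of chaining pipes.
df["clean_text"] = hero.clean(df["text"])       # default cleaning pipeline
df["tokens"] = hero.tokenize(df["clean_text"])  # tokenize before representing
df["tfidf"] = hero.tfidf(df["tokens"])          # TF-IDF document vectors
df["pca"] = hero.pca(df["tfidf"])               # project to 2 dimensions

hero.scatterplot(df, "pca", color="topic", title="PCA BBC Sport news")
```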

@@ -125,28 +126,29 @@ hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")

2. Text preprocessing, TF-IDF, K-means and Visualization

```python -import texthero as hero -import pandas as pd - -df = pd.read_csv( - "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" -) - -df['tfidf'] = ( - df['text'] - .pipe(hero.clean) - .pipe(hero.tfidf) -) - -df['kmeans_labels'] = ( - df['tfidf'] - .pipe(hero.kmeans, n_clusters=5) - .astype(str) -) - -df['pca'] = df['tfidf'].pipe(hero.pca) - -hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") +>>> import texthero as hero +>>> import pandas as pd +>>> +>>> df = pd.read_csv( +... "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv" +... ) + +>>> df['tfidf'] = ( +... df['text'] +... .pipe(hero.clean) +... .pipe(hero.tokenize) +... .pipe(hero.tfidf) +... ) +>>> +>>> df['kmeans_labels'] = ( +... df['tfidf'] +... .pipe(hero.kmeans, n_clusters=5) +... .astype(str) +... ) +>>> +>>> df['pca'] = df['tfidf'].pipe(hero.pca) +>>> +>>> hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news") ```
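As a variation on the example above, the same TF-IDF representation can also be projected with t-SNE instead of PCA before plotting. This is only a sketch, not part of the original examples; `hero.tsne` and its `random_state` parameter are the ones documented in the `texthero/representation.py` changes further down.

```python
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df["tfidf"] = (
    df["text"]
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
)

# t-SNE is an alternative to PCA for reducing the TF-IDF vectors to two dimensions.
df["tsne"] = df["tfidf"].pipe(hero.tsne, random_state=42)

hero.scatterplot(df, "tsne", color="topic", title="t-SNE BBC Sport news")
```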

@@ -180,7 +182,7 @@ Remove all types of brackets and their content. ```python >>> s = hero.remove_brackets(s) ->>> s +>>> s 0 This sèntencé needs to be cleaned! dtype: object ``` @@ -189,7 +191,7 @@ Remove diacritics. ```python >>> s = hero.remove_diacritics(s) ->>> s +>>> s 0 This sentence needs to be cleaned! dtype: object ``` @@ -198,7 +200,7 @@ Remove punctuation. ```python >>> s = hero.remove_punctuation(s) ->>> s +>>> s 0 This sentence needs to be cleaned dtype: object ``` @@ -207,7 +209,7 @@ Remove extra white-spaces. ```python >>> s = hero.remove_whitespace(s) ->>> s +>>> s 0 This sentence needs to be cleaned dtype: object ``` @@ -217,7 +219,16 @@ Sometimes we also want to get rid of stop-words. ```python >>> s = hero.remove_stopwords(s) >>> s -0 This sentence needs cleaned +0 This sentence needs cleaned +dtype: object +``` + +There is also the option to clean the text automatically by calling the "clean"-function instead of doing it step by step. +```python +>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! " +>>> s = pd.Series(text) +>>> hero.clean(s) +0 sentence needs cleaned dtype: object ``` @@ -243,9 +254,11 @@ Full documentation: [nlp](https://texthero.org/docs/api-nlp) **Scope:** map text data into vectors and do dimensionality reduction. Supported **representation** algorithms: -1. Term frequency (`count`) +1. Term frequency (`term_frequency`) 1. Term frequency-inverse document frequency (`tfidf`) +For the "representation" functions it is strongly recommended to tokenize the input series first with the `hero.tokenize(s)` function from the texthero library. + Supported **clustering** algorithms: 1. K-means (`kmeans`) 1. Density-Based Spatial Clustering of Applications with Noise (`dbscan`) @@ -295,7 +308,7 @@ The website will be soon moved from Docusaurus to Sphinx: read the [open issue t **Are you good at writing?** -Probably this is the most important piece missing now on Texthero: more tutorials and more "Getting Started" guide. +Probably this is the most important piece missing now on Texthero: more tutorials and more "Getting Started" guide. If you are good at writing you can help us! Why don't you start by [Adding a FAQ page to the website](https://github.com/jbesomi/texthero/issues/41) or explain how to [create a custom pipeline](https://github.com/jbesomi/texthero/issues/38)? Need help? We are there for you. @@ -314,6 +327,8 @@ If you have just other questions or inquiry drop me a line at jonathanbesomi__AT - [bobfang1992](https://github.com/bobfang1992) - [Ishan Arora](https://github.com/ishanarora04) - [Vidya P](https://github.com/vidyap-xgboost) +- [Henri Froese](https://github.com/henrifroese) +- [Maximilian Krahn](https://github.com/mk2510)

License

diff --git a/texthero/nlp.py b/texthero/nlp.py index 52956d5c..b390b608 100644 --- a/texthero/nlp.py +++ b/texthero/nlp.py @@ -1,12 +1,12 @@ """ -Common NLP tasks such as named_entities, noun_chunks, etc. +The texthero.nlp module supports common NLP tasks such as named_entities, noun_chunks, ... on Pandas Series and DataFrame. """ import spacy import pandas as pd -def named_entities(s, package="spacy"): +def named_entities(s: pd.Series, package="spacy") -> pd.Series: """ Return named-entities. @@ -14,7 +14,7 @@ def named_entities(s, package="spacy"): Tuple: (`entity'name`, `entity'label`, `starting character`, `ending character`) - Under the hood, `named_entities` make use of Spacy name entity recognition. + Under the hood, `named_entities` makes use of `Spacy name entity recognition `_ List of labels: - `PERSON`: People, including fictional. @@ -36,6 +36,14 @@ def named_entities(s, package="spacy"): - `ORDINAL`: “first”, “second”, etc. - `CARDINAL`: Numerals that do not fall under another type. + Parameters + ---------- + s : Pandas Series + + Returns + ------- + Pandas Series, where each rows contains a list of tuples containing information regarding the given named entities. + Examples -------- >>> import texthero as hero @@ -57,7 +65,7 @@ def named_entities(s, package="spacy"): return pd.Series(entities, index=s.index) -def noun_chunks(s): +def noun_chunks(s: pd.Series) -> pd.Series: """ Return noun chunks (noun phrases). @@ -73,8 +81,12 @@ def noun_chunks(s): Parameters ---------- - input : Pandas Series - + s : Pandas Series + + Returns + ------- + Pandas Series, where each row contains a tuple that has information regarding the noun chunk. + Examples -------- >>> import texthero as hero @@ -107,7 +119,15 @@ def count_sentences(s: pd.Series) -> pd.Series: Return a new Pandas Series with the number of sentences per cell. - This makes use of the SpaCy `sentencizer `. + This makes use of the SpaCy `sentencizer `_ + + Parameters + ---------- + s : Pandas Series + + Returns + ------- + Pandas Series, with the number of sentences per document in every cell. Examples -------- diff --git a/texthero/preprocessing.py b/texthero/preprocessing.py index e5ee0097..99a0c60e 100644 --- a/texthero/preprocessing.py +++ b/texthero/preprocessing.py @@ -24,31 +24,73 @@ warnings.filterwarnings(action="ignore", category=UserWarning, module="gensim") -def fillna(input: pd.Series) -> pd.Series: - """Replace not assigned values with empty spaces.""" - return input.fillna("").astype("str") +def fillna(s: pd.Series) -> pd.Series: + """ + Replaces not assigned values with empty spaces. + + Parameters + ---------- + s : Pandas Series + Returns + ------- + Pandas Series + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> s = pd.Series([np.NaN, "I'm", "You're"]) + >>> hero.fillna(s) + 0 + 1 I'm + 2 You're + dtype: object + """ + return s.fillna("").astype("str") -def lowercase(input: pd.Series) -> pd.Series: - """Lowercase all text.""" - return input.str.lower() +def lowercase(s: pd.Series) -> pd.Series: + """ + Lowercase all texts in a series. 
-def replace_digits(input: pd.Series, symbols: str = " ", only_blocks=True) -> pd.Series: + Parameters + ---------- + s : Pandas Series + + Returns + ------- + Pandas Series + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> s = pd.Series("This is NeW YoRk wIth upPer letters") + >>> hero.lowercase(s) + 0 this is new york with upper letters + dtype: object + """ + return s.str.lower() + + +def replace_digits(s: pd.Series, symbols: str = " ", only_blocks=True) -> pd.Series: """ Replace all digits with symbols. - By default, only replace "blocks" of digits, i.e tokens composed of only numbers. + By default, only replaces "blocks" of digits, i.e tokens composed of only numbers. - When `only_blocks` is set to ´False´, replace any digits. + When `only_blocks` is set to ´False´, replaces all digits. Parameters ---------- - input : Pandas Series + s : Pandas Series + symbols : str (default single empty space " ") Symbols to replace + only_blocks : bool - When set to False, remove any digits. + When set to False, replace all digits. Returns ------- @@ -69,26 +111,31 @@ def replace_digits(input: pd.Series, symbols: str = " ", only_blocks=True) -> pd if only_blocks: pattern = r"\b\d+\b" - return input.str.replace(pattern, symbols) + return s.str.replace(pattern, symbols) else: - return input.str.replace(r"\d+", symbols) + return s.str.replace(r"\d+", symbols) -def remove_digits(input: pd.Series, only_blocks=True) -> pd.Series: +def remove_digits(s: pd.Series, only_blocks=True) -> pd.Series: """ - Remove all digits and replace it with a single space. + Removes all digits and replaces them with a single space. By default, only removes "blocks" of digits. For instance, `1234 falcon9` becomes ` falcon9`. - When the arguments `only_blocks` is set to ´False´, remove any digits. + When the arguments `only_blocks` is set to ´False´, removes any digits. See also :meth:`replace_digits` to replace digits with another string. Parameters ---------- - input : Pandas Series + s : Pandas Series + only_blocks : bool Remove only blocks of digits. + + Returns + ------- + Pandas Series Examples -------- @@ -103,21 +150,26 @@ def remove_digits(input: pd.Series, only_blocks=True) -> pd.Series: dtype: object """ - return replace_digits(input, " ", only_blocks) + return replace_digits(s, " ", only_blocks) -def replace_punctuation(input: pd.Series, symbol: str = " ") -> pd.Series: +def replace_punctuation(s: pd.Series, symbol: str = " ") -> pd.Series: """ - Replace all punctuation with a given symbol. + Replaces all punctuation with a given symbol. - `replace_punctuation` replace all punctuation from the given Pandas Series and replace it with a custom symbol. It consider as punctuation characters all :data:`string.punctuation` symbols `!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~).` + `replace_punctuation` replace all punctuation from the given Pandas Series and replace it with a custom symbol. It considers as punctuation characters all :data:`string.punctuation` symbols `!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~).` Parameters ---------- - input : Pandas Series + s : Pandas Series + symbol : str (default single empty space) Symbol to use as replacement for all string punctuation. 
+ Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -128,17 +180,25 @@ def replace_punctuation(input: pd.Series, symbol: str = " ") -> pd.Series: dtype: object """ - return input.str.replace(rf"([{string.punctuation}])+", symbol) + return s.str.replace(rf"([{string.punctuation}])+", symbol) -def remove_punctuation(input: pd.Series) -> pd.Series: +def remove_punctuation(s: pd.Series) -> pd.Series: """ Replace all punctuation with a single space (" "). - `remove_punctuation` removes all punctuation from the given Pandas Series and replace it with a single space. It consider as punctuation characters all :data:`string.punctuation` symbols `!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~).` + `remove_punctuation` removes all punctuation from the given Pandas Series and replaces it with a single space. It considers as punctuation characters all :data:`string.punctuation` symbols `!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~).` See also :meth:`replace_punctuation` to replace punctuation with a custom symbol. + Parameters + ---------- + s : Pandas Series + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -148,32 +208,49 @@ def remove_punctuation(input: pd.Series) -> pd.Series: 0 Finnaly dtype: object """ - return replace_punctuation(input, " ") + return replace_punctuation(s, " ") def _remove_diacritics(text: str) -> str: """ - Remove diacritics and accents from one string. + Removes diacritics and accents from one string. + + Parameters + ---------- + text : String + + Returns + ------- + String Examples -------- - >>> import texthero as hero + >>> from texthero.preprocessing import _remove_diacritics >>> import pandas as pd >>> text = "Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس" >>> _remove_diacritics(text) 'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس' """ nfkd_form = unicodedata.normalize("NFKD", text) - # unicodedata.combinding(char) checks if the character is in + # unicodedata.combining(char) checks if the character is in # composed form (consisting of several unicode chars combined), i.e. a diacritic return "".join([char for char in nfkd_form if not unicodedata.combining(char)]) -def remove_diacritics(input: pd.Series) -> pd.Series: +def remove_diacritics(s: pd.Series) -> pd.Series: """ - Remove all diacritics and accents. + Removes all diacritics and accents. + + Removes all diacritics and accents from any word and characters from the given Pandas Series. + Returns a cleaned version of the Pandas Series. + + Parameters + ---------- + s : Pandas Series - Remove all diacritics and accents from any word and characters from the given Pandas Series. Return a cleaned version of the Pandas Series. + Returns + ------- + Pandas Series Examples -------- @@ -182,18 +259,27 @@ def remove_diacritics(input: pd.Series) -> pd.Series: >>> s = pd.Series("Montréal, über, 12.89, Mère, Françoise, noël, 889, اِس, اُس") >>> hero.remove_diacritics(s)[0] 'Montreal, uber, 12.89, Mere, Francoise, noel, 889, اس, اس' + """ - return input.astype("unicode").apply(_remove_diacritics) + return s.astype("unicode").apply(_remove_diacritics) -def remove_whitespace(input: pd.Series) -> pd.Series: +def remove_whitespace(s: pd.Series) -> pd.Series: r""" - Remove any extra white spaces. + Removes any extra white spaces. - Remove any extra whitespace in the given Pandas Series. Removes also newline, tabs and any form of space. + Removes any extra whitespace in the given Pandas Series. Removes also newline, tabs and any form of space. 
Useful when there is a need to visualize a Pandas Series and most cells have many newlines or other kind of space characters. + Parameters + ---------- + s : Pandas Series + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -204,23 +290,30 @@ def remove_whitespace(input: pd.Series) -> pd.Series: dtype: object """ - return input.str.replace("\xa0", " ").str.split().str.join(" ") + return s.str.replace("\xa0", " ").str.split().str.join(" ") def _replace_stopwords(text: str, words: Set[str], symbol: str = " ") -> str: """ - Remove words in a set from a string, replacing them with a symbol. + Removes words in a set from a string, replacing them with a symbol. Parameters ---------- text: str + stopwords : Set[str] Set of stopwords string to remove. + symbol: str, Optional Character(s) to replace words with; defaults to a space. + Returns + ---------- + String + Examples -------- + >>> from texthero.preprocessing import _replace_stopwords >>> s = "the book of the jungle" >>> symbol = "$" >>> stopwords = ["the", "of"] @@ -239,27 +332,34 @@ def _replace_stopwords(text: str, words: Set[str], symbol: str = " ") -> str: def replace_stopwords( - input: pd.Series, symbol: str, stopwords: Optional[Set[str]] = None + s: pd.Series, symbol: str, stopwords: Optional[Set[str]] = None ) -> pd.Series: """ - Replace all instances of `words` with symbol. + Replaces all instances of `words` with symbol. By default uses NLTK's english stopwords of 179 words. Parameters ---------- - input : Pandas Series + s : Pandas Series + symbol: str Character(s) to replace words with. + stopwords : Set[str], Optional Set of stopwords string to remove. If not passed, by default it used NLTK English stopwords. + Returns + ------- + Pandas Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("the book of the jungle") - >>> replace_stopwords(s, "X") + >>> hero.replace_stopwords(s, "X") 0 X book X X jungle dtype: object @@ -267,24 +367,29 @@ def replace_stopwords( if stopwords is None: stopwords = _stopwords.DEFAULT - return input.apply(_replace_stopwords, args=(stopwords, symbol)) + return s.apply(_replace_stopwords, args=(stopwords, symbol)) def remove_stopwords( - input: pd.Series, stopwords: Optional[Set[str]] = None, remove_str_numbers=False + s: pd.Series, stopwords: Optional[Set[str]] = None, remove_str_numbers=False ) -> pd.Series: """ - Remove all instances of `words`. + Removes all instances of `words`. By default uses NLTK's english stopwords of 179 words: Parameters ---------- - input : Pandas Series + s : Pandas Series + stopwords : Set[str], Optional Set of stopwords string to remove. If not passed, by default it used NLTK English stopwords. + Returns + ---------- + Pandas Series + Examples -------- @@ -311,10 +416,10 @@ def remove_stopwords( """ - return replace_stopwords(input, symbol="", stopwords=stopwords) + return replace_stopwords(s, symbol="", stopwords=stopwords) -def stem(input: pd.Series, stem="snowball", language="english") -> pd.Series: +def stem(s: pd.Series, stem="snowball", language="english") -> pd.Series: r""" Stem series using either `porter` or `snowball` NLTK stemmers. @@ -325,12 +430,18 @@ def stem(input: pd.Series, stem="snowball", language="english") -> pd.Series: Parameters ---------- - input : Pandas Series + s : Pandas Series + stem : str (snowball by default) Stemming algorithm. 
It can be either 'snowball' or 'porter' + language : str (english by default) Supported languages: `danish`, `dutch`, `english`, `finnish`, `french`, `german` , `hungarian`, `italian`, `norwegian`, `portuguese`, `romanian`, `russian`, `spanish` and `swedish`. + Returns + ------- + Pandas Series + Notes ----- By default NLTK stemming algorithms lowercase all text. @@ -341,7 +452,7 @@ def stem(input: pd.Series, stem="snowball", language="english") -> pd.Series: >>> import texthero as hero >>> import pandas as pd >>> s = pd.Series("I used to go \t\n running.") - >>> hero.preprocessing.stem(s) + >>> hero.stem(s) 0 i use to go running. dtype: object """ @@ -356,12 +467,16 @@ def stem(input: pd.Series, stem="snowball", language="english") -> pd.Series: def _stem(text): return " ".join([stemmer.stem(word) for word in text]) - return input.str.split().apply(_stem) + return s.str.split().apply(_stem) def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]: """ - Return a list contaning all the methods used in the default cleaning pipeline. + Returns a list contaning all the methods used in the default cleaning pipeline. + + Returns + ------- + List[Callable[[Pandas Series], Pandas Series]] Return a list with the following functions: 1. :meth:`texthero.preprocessing.fillna` @@ -385,10 +500,9 @@ def get_default_pipeline() -> List[Callable[[pd.Series], pd.Series]]: def clean(s: pd.Series, pipeline=None) -> pd.Series: """ - Pre-process a text-based Pandas Series. + Pre-process a text-based Pandas Series, by using the following default pipline. - - Default pipeline: + Default pipeline: 1. :meth:`texthero.preprocessing.fillna` 2. :meth:`texthero.preprocessing.lowercase` 3. :meth:`texthero.preprocessing.remove_digits` @@ -396,6 +510,28 @@ def clean(s: pd.Series, pipeline=None) -> pd.Series: 5. :meth:`texthero.preprocessing.remove_diacritics` 6. :meth:`texthero.preprocessing.remove_stopwords` 7. :meth:`texthero.preprocessing.remove_whitespace` + + Parameters + ---------- + s : Pandas Series + + pipeline :List[Callable[[Pandas Series], Pandas Series]] + inserting specific pipeline to clean a text + + Returns + ------- + Pandas Series + + Examples + -------- + For the default pipeline: + + >>> import texthero as hero + >>> import pandas as pd + >>> s = pd.Series("Uper 9dig. he her ÄÖÜ") + >>> hero.clean(s) + 0 uper 9dig aou + dtype: object """ if not pipeline: @@ -406,14 +542,24 @@ def clean(s: pd.Series, pipeline=None) -> pd.Series: return s -def has_content(s: pd.Series): +def has_content(s: pd.Series) -> pd.Series: r""" - Return a Boolean Pandas Series indicating if the rows has content. + Returns a Boolean Pandas Series indicating if the rows have content. + + Parameters + ---------- + s: Panda Series + + Returns + ------- + Panda Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series(["content", np.nan, "\t\n", " "]) - >>> has_content(s) + >>> hero.has_content(s) 0 True 1 False 2 False @@ -424,16 +570,24 @@ def has_content(s: pd.Series): return (s.pipe(remove_whitespace) != "") & (~s.isna()) -def drop_no_content(s: pd.Series): +def drop_no_content(s: pd.Series) -> pd.Series: r""" - Drop all rows without content. + Drops all rows without content. Every row from a given Pandas Series, where :meth:`has_content` is False, will be droped. - Drop all rows from the given Pandas Series where :meth:`has_content` is False. 
+ Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series(["content", np.nan, "\t\n", " "]) - >>> drop_no_content(s) + >>> hero.drop_no_content(s) 0 content dtype: object @@ -441,15 +595,25 @@ def drop_no_content(s: pd.Series): return s[has_content(s)] -def remove_round_brackets(s: pd.Series): +def remove_round_brackets(s: pd.Series) -> pd.Series: """ - Remove content within parentheses () and parentheses. + Removes content within parentheses '()' and the parentheses by themself. + + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("Texthero (is not a superhero!)") - >>> remove_round_brackets(s) + >>> hero.remove_round_brackets(s) 0 Texthero dtype: object @@ -464,14 +628,24 @@ def remove_round_brackets(s: pd.Series): return s.str.replace(r"\([^()]*\)", "") -def remove_curly_brackets(s: pd.Series): +def remove_curly_brackets(s: pd.Series) -> pd.Series: """ - Remove content within curly brackets {} and the curly brackets. + Removes content within curly brackets '{}' and the curly brackets by themself. + + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("Texthero {is not a superhero!}") - >>> remove_curly_brackets(s) + >>> hero.remove_curly_brackets(s) 0 Texthero dtype: object @@ -486,15 +660,24 @@ def remove_curly_brackets(s: pd.Series): return s.str.replace(r"\{[^{}]*\}", "") -def remove_square_brackets(s: pd.Series): +def remove_square_brackets(s: pd.Series) -> pd.Series: """ - Remove content within square brackets [] and the square brackets. + Removes content within square brackets '[]' and the square brackets by themself. + + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- - + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("Texthero [is not a superhero!]") - >>> remove_square_brackets(s) + >>> hero.remove_square_brackets(s) 0 Texthero dtype: object @@ -510,15 +693,25 @@ def remove_square_brackets(s: pd.Series): return s.str.replace(r"\[[^\[\]]*\]", "") -def remove_angle_brackets(s: pd.Series): +def remove_angle_brackets(s: pd.Series) -> pd.Series: """ - Remove content within angle brackets <> and the angle brackets. + Removes content within angle brackets '<>' and the angle brackets by themself. + + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series + Examples -------- - + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("Texthero ") - >>> remove_angle_brackets(s) + >>> hero.remove_angle_brackets(s) 0 Texthero dtype: object @@ -533,17 +726,26 @@ def remove_angle_brackets(s: pd.Series): return s.str.replace(r"<[^<>]*>", "") -def remove_brackets(s: pd.Series): +def remove_brackets(s: pd.Series) -> pd.Series: """ - Remove content within brackets and the brackets itself. + Removes content within brackets and the brackets itself. + + Removes content from any kind of brackets, (), [], {}, <>. - Remove content from any kind of brackets, (), [], {}, <>. 
+ Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- - + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("Texthero (round) [square] [curly] [angle]") - >>> remove_brackets(s) + >>> hero.remove_brackets(s) 0 Texthero dtype: object @@ -566,14 +768,24 @@ def remove_brackets(s: pd.Series): def remove_html_tags(s: pd.Series) -> pd.Series: """ - Remove html tags from the given Pandas Series. + Removes html tags from the given Pandas Series. - Remove all html tags of the type `<.*?>` such as ,

<p>, <div> and remove all html tags of type &nbsp; and return a cleaned Pandas Series. + Removes all html tags of the type `<.*?>` such as <html>, <p>, <div> and removes all html tags of type &nbsp; and returns a cleaned Pandas Series. + + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series Examples -------- + >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series("<html><h1>Title</h1></html>

") - >>> remove_html_tags(s) + >>> hero.remove_html_tags(s) 0 Title dtype: object @@ -589,14 +801,22 @@ def remove_html_tags(s: pd.Series) -> pd.Series: def tokenize(s: pd.Series) -> pd.Series: """ - Tokenize each row of the given Series. + Tokenizes each row of the given Series. - Tokenize each row of the given Pandas Series and return a Pandas Series where each row contains a list of tokens. + Tokenizes each row of the given Pandas Series and returns a Pandas Series where each row contains a list of tokens. Algorithm: add a space between any punctuation symbol at exception if the symbol is between two alphanumeric character and split. + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -615,10 +835,12 @@ def tokenize(s: pd.Series) -> pd.Series: return s.str.replace(pattern, r"\2 \3 \4 \5").str.split() -def tokenize_with_phrases(s: pd.Series, min_count: int = 5, threshold: int = 10): +def tokenize_with_phrases( + s: pd.Series, min_count: int = 5, threshold: int = 10 +) -> pd.Series: r"""Tokenize and group up collocations words - Tokenize the given pandas Series and group up bigrams where each tokens has at least min_count term frequrncy and where the threshold is larger than the underline formula. + Tokenizes the given pandas Series and group up bigrams where each tokens has at least min_count term frequrncy and where the threshold is larger than the underline formula. :math:`\frac{(bigram\_a\_b\_count - min\_count)* len\_vocab }{ (word\_a\_count * word\_b\_count)}`. @@ -626,15 +848,21 @@ def tokenize_with_phrases(s: pd.Series, min_count: int = 5, threshold: int = 10) Parameters ---------- s : Pandas Series + min_count : Int, optional. Default is 5. ignore tokens with frequency less than this + threshold : Int, optional. Default is 10. ignore tokens with a score under that threshold + Returns + ------- + Pandas Series + Examples -------- - >>> import pandas as pd >>> import texthero as hero + >>> import pandas as pd >>> s = pd.Series(["New York is a beautiful city", "Look: New York!"]) >>> hero.tokenize_with_phrases(s, min_count=1, threshold=1) 0 [New_York, is, a, beautiful, city] @@ -661,6 +889,17 @@ def replace_urls(s: pd.Series, symbol: str) -> pd.Series: `replace_urls` replace any urls from the given Pandas Series with the given symbol. + Parameters + ---------- + s: Pandas Series + + symbol: String + The symbol to which the URL should be changed to. + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -682,9 +921,17 @@ def replace_urls(s: pd.Series, symbol: str) -> pd.Series: def remove_urls(s: pd.Series) -> pd.Series: - r"""Remove all urls from a given Pandas Series. + r"""Removes all urls from a given Pandas Series. + + Removes all urls and replaces them with a single empty space. + + Parameters + ---------- + s: Pandas Series - `remove_urls` remove any urls and replace it with a single empty space. + Returns + ------- + Pandas Series Examples -------- @@ -712,9 +959,14 @@ def replace_tags(s: pd.Series, symbol: str) -> pd.Series: Parameters ---------- s : Pandas Series + symbols : str Symbols to replace + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -735,6 +987,14 @@ def remove_tags(s: pd.Series) -> pd.Series: A tag is a string formed by @ concatenated with a sequence of characters and digits. Example: @texthero123. Tags are replaceb by an empty space ` `. 
+ Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero @@ -759,8 +1019,13 @@ def replace_hashtags(s: pd.Series, symbol: str) -> pd.Series: Parameters ---------- s : Pandas Series + symbols : str Symbols to replace + + Returns + ------- + Panda Series Examples -------- @@ -781,6 +1046,14 @@ def remove_hashtags(s: pd.Series) -> pd.Series: A hashtag is a string formed by # concatenated with a sequence of characters, digits and underscores. Example: #texthero_123. + Parameters + ---------- + s: Pandas Series + + Returns + ------- + Pandas Series + Examples -------- >>> import texthero as hero diff --git a/texthero/representation.py b/texthero/representation.py index 9c27db97..14b64d50 100644 --- a/texthero/representation.py +++ b/texthero/representation.py @@ -43,14 +43,20 @@ def representation_series_to_flat_series( ---------- s : Sparse Pandas Series or Pandas Series The multiindexed Pandas Series to flatten. + index : Pandas Index, optional, default to None The index the flattened Series should have. + fill_missing_with : Any, default to np.nan Value to fill the NaNs (missing values) with. This _does not_ mean that existing values that are np.nan are replaced, but rather that features that are not present in one document but present in others are filled with fill_missing_with. See example below. + Returns + ------- + Panda Series + Examples -------- @@ -101,22 +107,49 @@ def representation_series_to_flat_series( def term_frequency( - s: pd.Series, max_features: Optional[int] = None, return_feature_names=False -): + s: pd.Series, + max_features: Optional[int] = None, + return_feature_names=False, + min_df=1, + max_df=1.0, + binary=False, +) -> pd.Series: """ - Represent a text-based Pandas Series using term_frequency. + Represents a text-based Pandas Series using term_frequency. The input Series should already be tokenized. If not, it will be tokenized before term_frequency is calculated. + Parameters ---------- s : Pandas Series - max_features : int, optional - Maximum number of features to keep. - return_features_names : Boolean, False by Default + + max_features : int, optional, default to None. + Maximum number of features to keep. Will keep all features if set to None. + + return_features_names : Boolean, default to False. If True, return a tuple (*term_frequency_series*, *features_names*) + max_df : float in range [0.0, 1.0] or int, default=1.0 + Ignore terms that have a document frequency (number of documents they appear in) + frequency strictly higher than the given threshold. + If float, the parameter represents a proportion of documents, integer + absolute counts. + + min_df : float in range [0.0, 1.0] or int, default=1 + When building the vocabulary ignore terms that have a document + frequency (number of documents they appear in) strictly + lower than the given threshold. + If float, the parameter represents a proportion of documents, integer + absolute counts. + + binary : bool, default=False + If True, all non zero counts are set to 1. 
+ + Returns + ------- + Pandas Series Examples -------- @@ -130,7 +163,7 @@ def term_frequency( dtype: object To return the features_names: - + >>> import texthero as hero >>> import pandas as pd >>> s = pd.Series(["Sentence one", "Sentence two"]) @@ -149,7 +182,12 @@ def term_frequency( s = preprocessing.tokenize(s) tf = CountVectorizer( - max_features=max_features, tokenizer=lambda x: x, preprocessor=lambda x: x, + max_features=max_features, + tokenizer=lambda x: x, + preprocessor=lambda x: x, + min_df=min_df, + max_df=max_df, + binary=binary, ) s = pd.Series(tf.fit_transform(s).toarray().tolist(), index=s.index) @@ -192,17 +230,26 @@ def tfidf( Parameters ---------- s : Pandas Series (tokenized) + max_features : int, optional, default to None. If not None, only the max_features most frequent tokens are used. + min_df : int, optional, default to 1. - When building the vocabulary, ignore terms that have a document + When building the vocabulary, ignore terms that have a document frequency (number of documents a term appears in) strictly lower than the given threshold. + max_df : int or double, optional, default to 1.0 When building the vocabulary, ignore terms that have a document - frequency (number of documents a term appears in) strictly higher than the given threshold. This arguments basically permits to remove corpus-specific stop words. When the argument is a float [0.0, 1.0], the parameter represents a proportion of documents. + frequency (number of documents a term appears in) strictly higher than the given threshold. + This arguments basically permits to remove corpus-specific stop words. + When the argument is a float [0.0, 1.0], the parameter represents a proportion of documents. + return_feature_names: Boolean, optional, default to False Whether to return the feature (i.e. word) names with the output. + Returns + ------- + Panda Series Examples -------- @@ -210,11 +257,16 @@ def tfidf( >>> import pandas as pd >>> s = pd.Series(["Hi Bye", "Test Bye Bye"]) >>> s = hero.tokenize(s) - >>> hero.tfidf(s, return_feature_names=True) + >>> hero.tfidf(s, return_feature_names=True) # doctest: +SKIP (document 0 [1.0, 1.4054651081081644, 0.0] 1 [2.0, 0.0, 1.4054651081081644] dtype: object, ['Bye', 'Hi', 'Test']) + + See Also + -------- + `TF-IDF on Wikipedia `_ + """ # Check if input is tokenized. Else, print warning and tokenize. @@ -262,34 +314,125 @@ def tfidf( """ -def pca(s, n_components=2): +def pca(s: pd.Series, n_components=2, random_state=None) -> pd.Series: """ Perform principal component analysis on the given Pandas Series. - In general, *pca* should be called after the text has already been represented. + Principal Component Analysis (PCA) is a statistical method that is used + to reveal where the variance in a dataset comes from. For textual data, + one could for example first represent a Series of documents using + :meth:`texthero.representation.tfidf` to get a vector representation + of each document. Then, PCA can generate new vectors from the tfidf representation + that showcase the differences among the documents most strongly in fewer dimensions. + + For example, the tfidf vectors will have length 100 if hero.tfidf was called + on a large corpus with max_features=100. Visualizing 100 dimensions is hard! + Using PCA with n_components=3, every document will now get a vector of + length 3, and the vectors will be chosen so that the document differences + are easily visible. The corpus can now be visualized in 3D and we can + get a good first view of the data! 
+ + In general, *pca* should be called after the text has already been represented to a matrix form. Parameters ---------- s : Pandas Series + n_components : Int. Default is 2. - Number of components to keep. If n_components is not set or None, all components are kept. + Number of components to keep (dimensionality of output vectors). + If n_components is not set or None, all components are kept. + + random_state : int, RandomState instance, default=None + Pass an int for reproducible results across multiple function calls. + + + Returns + ------- + Pandas Series with the vector calculated by PCA for the document in every cell. Examples -------- >>> import texthero as hero >>> import pandas as pd - >>> s = pd.Series(["Sentence one", "Sentence two"]) - + >>> s = pd.Series(["Football is great", "Hi, I'm Texthero, who are you? Tell me!"]) + >>> s = hero.clean(s) + >>> s = hero.tokenize(s) + >>> s = hero.tfidf(s) + >>> hero.pca(s, random_state=42) # doctest: +SKIP + document + 0 [1.5713577608669735, 1.1102230246251565e-16] + 1 [-1.5713577608669729, 1.1102230246251568e-16] + dtype: object + + See also + -------- + `PCA on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency + """ - pca = PCA(n_components=n_components) + pca = PCA(n_components=n_components, random_state=random_state) return pd.Series(pca.fit_transform(list(s)).tolist(), index=s.index) -def nmf(s, n_components=2): +def nmf(s, n_components=2) -> pd.Series: """ - Perform non-negative matrix factorization. + Performs non-negative matrix factorization. + + Non-Negative Matrix Factorization (NMF) is often used in + natural language processing to find clusters of similar + texts (e.g. some texts in a corpus might be about sports + and some about music, so they will differ in the usage + of technical terms; see the example below). + + Given a document-term matrix (so in + texthero usually a Series after applying :meth:`texthero.representation.tfidf` + or some other first representation function that assigns a scalar (a weight) + to each word), NMF will find n_components many topics (clusters) + and calculate a vector for each document that places it + correctly among the topics. + + + Parameters + ---------- + s : Pandas Series + + n_components : Int. Default is 2. + Number of components to keep (dimensionality of output vectors). + If n_components is not set or None, all components are kept. + + Returns + ------- + Pandas Series with the vector calculated by NMF for the document in every cell. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> doc1 = "Football, Sports, Soccer" + >>> doc2 = "Music, Violin, Orchestra" + >>> doc3 = "Football, Music" + >>> s = pd.Series([doc1, doc2, doc3]) + >>> s = hero.clean(s) + >>> s = hero.tokenize(s) + >>> s = hero.term_frequency(s) + >>> hero.nmf(s) # doctest: +SKIP + 0 [0.9080190347553924, 0.0] + 1 [0.0, 0.771931061231598] + 2 [0.3725409073202516, 0.31656880119331093] + dtype: object + >>> # As we can see, the third document, which + >>> # is a mix of sports and music, is placed + >>> # between the two axes (the topics) while + >>> # the other documents are placed right on + >>> # one topic axis each. 
+ + See also + -------- + `NMF on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency - """ nmf = NMF(n_components=n_components, init="random", random_state=0) return pd.Series(nmf.fit_transform(list(s)).tolist(), index=s.index) @@ -311,16 +454,145 @@ def tsne( method="barnes_hut", angle=0.5, n_jobs=-1, -): +) -> pd.Series: """ - Perform TSNE on the given pandas series. + Performs TSNE on the given pandas series. + + t-distributed Stochastic Neighbor Embedding (t-SNE) is + a machine learning algorithm used to visualize high-dimensional data in fewer + dimensions. In natural language processing, the high-dimensional + data is usually a document-term matrix + (so in texthero usually a Series after applying :meth:`texthero.representation.tfidf` + or some other first representation function that assigns a scalar (a weight) + to each word) that is hard to visualize as there + might be many terms. With t-SNE, every document + gets a new, low-dimensional (n_components entries) + vector in such a way that the differences / similarities between + documents are preserved. + Parameters ---------- s : Pandas Series + n_components : int, default is 2. - Number of components to keep. If n_components is not set or None, all components are kept. - perplexity : int, default is 30.0 + Number of components to keep (dimensionality of output vectors). + If n_components is not set or None, all components are kept. + + perplexity : float, optional (default: 30) + The perplexity is related to the number of nearest neighbors that + is used in other manifold learning algorithms. Larger datasets + usually require a larger perplexity. Consider selecting a value + between 5 and 50. Different values can result in significanlty + different results. + + early_exaggeration : float, optional (default: 12.0) + Controls how tight natural clusters in the original space are in + the embedded space and how much space will be between them. For + larger values, the space between natural clusters will be larger + in the embedded space. Again, the choice of this parameter is not + very critical. If the cost function increases during initial + optimization, the early exaggeration factor or the learning rate + might be too high. + + learning_rate : float, optional (default: 200.0) + The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If + the learning rate is too high, the data may look like a 'ball' with any + point approximately equidistant from its nearest neighbours. If the + learning rate is too low, most points may look compressed in a dense + cloud with few outliers. If the cost function gets stuck in a bad local + minimum increasing the learning rate may help. + + n_iter : int, optional (default: 1000) + Maximum number of iterations for the optimization. Should be at + least 250. + + n_iter_without_progress : int, optional (default: 300) + Maximum number of iterations without progress before we abort the + optimization, used after 250 initial iterations with early + exaggeration. Note that progress is only checked every 50 iterations so + this value is rounded to the next multiple of 50. + + min_grad_norm : float, optional (default: 1e-7) + If the gradient norm is below this threshold, the optimization will + be stopped. + + metric : string or callable, optional + The metric to use when calculating distance between instances in a + feature array. 
If metric is a string, it must be one of the options + allowed by scipy.spatial.distance.pdist for its metric parameter. + + Alternatively, if metric is a callable function, it is called on each + pair of instances (rows) and the resulting value recorded. The callable + should take two arrays from X as input and return a value indicating + the distance between them. The default is "euclidean" which is + interpreted as squared euclidean distance. + + init : string or numpy array, optional (default: "random") + Initialization of embedding. Possible options are 'random', 'pca', + and a numpy array of shape (n_samples, n_components). + PCA initialization cannot be used with precomputed distances and is + usually more globally stable than random initialization. + + verbose : int, optional (default: 0) + Verbosity level. + + random_state : int, RandomState instance, default=None + Determines the random number generator. Pass an int for reproducible + results across multiple function calls. Note that different + initializations might result in different local minima of the cost + function. + + method : string (default: 'barnes_hut') + By default the gradient calculation algorithm uses Barnes-Hut + approximation running in O(NlogN) time. method='exact' + will run on the slower, but exact, algorithm in O(N^2) time. The + exact algorithm should be used when nearest-neighbor errors need + to be better than 3%. However, the exact method cannot scale to + millions of examples. + + angle : float (default: 0.5) + Only used if method='barnes_hut' + This is the trade-off between speed and accuracy for Barnes-Hut T-SNE. + 'angle' is the angular size of a distant + node as measured from a point. If this size is below 'angle' then it is + used as a summary node of all points contained within it. + This method is not very sensitive to changes in this parameter + in the range of 0.2 - 0.8. Angle less than 0.2 has quickly increasing + computation time and angle greater 0.8 has quickly increasing error. + + n_jobs : int or None, optional (default=None) + The number of parallel jobs to run for neighbors search. This parameter + has no impact when ``metric="precomputed"`` or + (``metric="euclidean"`` and ``method="exact"``). + ``-1`` means using all processors. + + Returns + ------- + Pandas Series with the vector calculated by t-SNE for the document in every cell. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> doc1 = "Football, Sports, Soccer" + >>> doc2 = "Music, Violin, Orchestra" + >>> doc3 = "Football, Music" + >>> s = pd.Series([doc1, doc2, doc3]) + >>> s = hero.clean(s) + >>> s = hero.tokenize(s) + >>> s = hero.term_frequency(s) + >>> hero.tsne(s, random_state=42) # doctest: +SKIP + 0 [-18.833383560180664, -276.800537109375] + 1 [-210.60179138183594, 143.00535583496094] + 2 [-478.27984619140625, -232.97410583496094] + dtype: object + + See also + -------- + `t-SNE on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency """ tsne = TSNE( @@ -354,17 +626,112 @@ def kmeans( n_init=10, max_iter=300, tol=0.0001, - precompute_distances="auto", verbose=0, random_state=None, copy_x=True, n_jobs=-1, algorithm="auto", -): +) -> pd.Series: """ - Perform K-means clustering algorithm. + Performs K-means clustering algorithm. + + K-means clustering is used in natural language processing + to separate texts into k clusters (groups) + (e.g. 
some texts in a corpus might be about sports + and some about music, so they will differ in the usage + of technical terms; the K-means algorithm uses this + to separate them into two clusters). + + Given a document-term matrix (so in + texthero usually a Series after applying :meth:`texthero.representation.tfidf` + or some other first representation function that assigns a scalar (a weight) + to each word), K-means will find k topics (clusters) + and assign a topic to each document. + + Parameters + ---------- + s: Pandas Series + + n_clusters: Int, default to 5. + The number of clusters to separate the data into. + + init : {'k-means++', 'random', ndarray, callable}, default='k-means++' + Method for initialization: + + 'k-means++' : selects initial cluster centers for k-mean + clustering in a smart way to speed up convergence. See section + Notes in k_init for more details. + + 'random': choose `n_clusters` observations (rows) at random from data + for the initial centroids. + + If an ndarray is passed, it should be of shape (n_clusters, n_features) + and gives the initial centers. + + If a callable is passed, it should take arguments X, n_clusters and a + random state and return an initialization. + + n_init : int, default=10 + Number of time the k-means algorithm will be run with different + centroid seeds. The final results will be the best output of + n_init consecutive runs in terms of inertia. + + max_iter : int, default=300 + Maximum number of iterations of the k-means algorithm for a + single run. + + tol : float, default=1e-4 + Relative tolerance with regards to Frobenius norm of the difference + in the cluster centers of two consecutive iterations to declare + convergence. + It's not advised to set `tol=0` since convergence might never be + declared due to rounding errors. Use a very small number instead. + + verbose : int, default=0 + Verbosity mode. + + random_state : int, RandomState instance, default=None + Determines random number generation for centroid initialization. Use + an int to make the randomness deterministic. + + algorithm : {"auto", "full", "elkan"}, default="auto" + K-means algorithm to use. The classical EM-style algorithm is "full". + The "elkan" variation is more efficient on data with well-defined + clusters, by using the triangle inequality. However it's more memory + intensive. + + Returns + ------- + Pandas Series with the cluster the document was assigned to in each cell. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> doc1 = "Football, Sports, Soccer" + >>> doc2 = "music, violin, orchestra" + >>> doc3 = "football, fun, sports" + >>> doc4 = "music, fun, guitar" + >>> s = pd.Series([doc1, doc2, doc3, doc4]) + >>> s = hero.clean(s) + >>> s = hero.tokenize(s) + >>> s = hero.term_frequency(s) + >>> hero.kmeans(s, n_clusters=2, random_state=42) + 0 1 + 1 0 + 2 1 + 3 0 + dtype: category + Categories (2, int64): [0, 1] + >>> # As we can see, the documents are correctly + >>> # separated into topics / clusters by the algorithm. + + See also + -------- + `kmeans on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency - Return a "category" Pandas Series. """ vectors = list(s) kmeans = KMeans( @@ -373,13 +740,13 @@ def kmeans( n_init=n_init, max_iter=max_iter, tol=tol, - precompute_distances=precompute_distances, verbose=verbose, random_state=random_state, - copy_x=copy_x, - n_jobs=n_jobs, + # We are using list(s) anyway, so we can safely modify that without changing the input. 
+ copy_x=False, algorithm=algorithm, ).fit(vectors) + return pd.Series(kmeans.predict(vectors), index=s.index).astype("category") @@ -392,12 +759,103 @@ def dbscan( algorithm="auto", leaf_size=30, p=None, - n_jobs=None, + n_jobs=-1, ): """ Perform DBSCAN clustering. - Return a "category" Pandas Series. + Density-based spatial clustering of applications with noise (DBSCAN) + is used in natural language processing + to separate texts into clusters (groups) + (e.g. some texts in a corpus might be about sports + and some about music, so they will differ in the usage + of technical terms; the DBSCAN algorithm uses this + to separate them into clusters). It chooses the + number of clusters on its own. + + Given a document-term matrix (so in + texthero usually a Series after applying :meth:`texthero.representation.tfidf` + or some other first representation function that assigns a scalar (a weight) + to each word), DBSCAN will find topics (clusters) + and assign a topic to each document. + + Parameters + ---------- + s: Pandas Series + + eps : float, default=0.5 + The maximum distance between two samples for one to be considered + as in the neighborhood of the other. This is not a maximum bound + on the distances of points within a cluster. This is the most + important DBSCAN parameter to choose appropriately for your data set + and distance function. + + min_samples : int, default=5 + The number of samples (or total weight) in a neighborhood for a point + to be considered as a core point. This includes the point itself. + + metric : string, or callable, default='euclidean' + The metric to use when calculating distance between instances in a + feature array. If metric is a string or callable, it must be one of + the options allowed by :func:`sklearn.metrics.pairwise_distances` for + its metric parameter. + + metric_params : dict, default=None + Additional keyword arguments for the metric function. + + algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto' + The algorithm to be used by the NearestNeighbors module + to compute pointwise distances and find nearest neighbors. + See NearestNeighbors module documentation for details. + + leaf_size : int, default=30 + Leaf size passed to BallTree or cKDTree. This can affect the speed + of the construction and query, as well as the memory required + to store the tree. The optimal value depends + on the nature of the problem. + + p : float, default=None + The power of the Minkowski metric to be used to calculate distance + between points. + + n_jobs : int, default=None + The number of parallel jobs to run. + ``-1`` means using all processors. + + Returns + ------- + Pandas Series with the cluster the document was assigned to in each cell. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> doc1 = "Football, Sports, Soccer" + >>> doc2 = "music, violin, orchestra" + >>> doc3 = "football, fun, sports" + >>> doc4 = "music, enjoy, guitar" + >>> s = pd.Series([doc1, doc2, doc3, doc4]) + >>> s = hero.clean(s) + >>> s = hero.tokenize(s) + >>> s = hero.tfidf(s) + >>> hero.dbscan(s, min_samples=1, eps=4) + document + 0 0 + 1 1 + 2 0 + 3 1 + dtype: category + Categories (2, int64): [0, 1] + >>> # As we can see, the documents are correctly + >>> # separated into topics / clusters by the algorithm + >>> # and we didn't even have to say how many topics there are! 
+ + See also + -------- + `DBSCAN on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency + """ return pd.Series( @@ -428,9 +886,87 @@ def meanshift( """ Perform mean shift clustering. - Return a "category" Pandas Series. - """ + Mean shift clustering + is used in natural language processing + to separate texts into clusters (groups) + (e.g. some texts in a corpus might be about sports + and some about music, so they will differ in the usage + of technical terms; the mean shift algorithm uses this + to separate them into clusters). It chooses the + number of clusters on its own. + + Given a document-term matrix (so in + texthero usually a Series after applying :meth:`texthero.representation.tfidf` + or some other first representation function that assigns a scalar (a weight) + to each word), mean shift will find topics (clusters) + and assign a topic to each document. + + Parameters + ---------- + s: Pandas Series + bandwidth : float, default=None + Bandwidth used in the RBF kernel. + + If not given, the bandwidth is estimated using + sklearn.cluster.estimate_bandwidth; see the documentation for that + function for hints on scalability. + + seeds : array-like of shape (n_samples, n_features), default=None + Seeds used to initialize kernels. + + bin_seeding : bool, default=False + If true, initial kernel locations are not locations of all + points, but rather the location of the discretized version of + points, where points are binned onto a grid whose coarseness + corresponds to the bandwidth. Setting this option to True will speed + up the algorithm because fewer seeds will be initialized. + The default value is False. + Ignored if seeds argument is not None. + + min_bin_freq : int, default=1 + To speed up the algorithm, accept only those bins with at least + min_bin_freq points as seeds. + + cluster_all : bool, default=True + If true, then all points are clustered, even those orphans that are + not within any kernel. Orphans are assigned to the nearest kernel. + If false, then orphans are given cluster label -1. + + n_jobs : int, default=None + The number of jobs to use for the computation. + ``-1`` means using all processors + + max_iter : int, default=300 + Maximum number of iterations, per seed point before the clustering + operation terminates (for that seed point), if has not converged yet. + + Returns + ------- + Pandas Series with the cluster the document was assigned to in each cell. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> s = pd.Series([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]]) + >>> hero.meanshift(s, bandwidth=2) + 0 1 + 1 1 + 2 1 + 3 0 + 4 0 + 5 0 + dtype: category + Categories (2, int64): [0, 1] + + See also + -------- + `Mean-Shift on Wikipedia `_ + + :meth:`tfidf` to compute TF-IDF and :meth:`term_frequency` to compute term frequency + + """ return pd.Series( MeanShift( bandwidth=bandwidth, diff --git a/texthero/visualization.py b/texthero/visualization.py index 507b83e5..a72f7921 100644 --- a/texthero/visualization.py +++ b/texthero/visualization.py @@ -20,31 +20,86 @@ def scatterplot( df: pd.DataFrame, col: str, color: str = None, + hover_name: str = None, hover_data: [] = None, title="", return_figure=False, ): """ - Show scatterplot using python plotly scatter. + Show scatterplot of DataFrame column using python plotly scatter. + Parameters ---------- - df - col - The name of the column of the DataFrame used for x and y axis. 
+ df: DataFrame with a column to be visualized. + + col: str + The name of the column of the DataFrame to use for the x and y (and z) axes. + + color: str, defaults to None. + Name of the column to use for coloring (rows with same value get same color). + + title: str, defaults to "". + Title of the plot. + + return_figure: bool, defaults to False. + When set to True, return the figure instead of displaying it. + + hover_data: List[str], defaults to []. + List of column names to supply data when hovering over a point. + + hover_name: str, defaults to None. + Name of the column whose values appear as the title when hovering over a point. + + Examples + -------- + >>> import texthero as hero + >>> import pandas as pd + >>> doc1 = "Football, Sports, Soccer" + >>> doc2 = "music, violin, orchestra" + >>> doc3 = "football, fun, sports" + >>> doc4 = "music, fun, guitar" + >>> df = pd.DataFrame([doc1, doc2, doc3, doc4], columns=["texts"]) + >>> df["texts"] = hero.clean(df["texts"]) + >>> df["texts"] = hero.tokenize(df["texts"]) + >>> df["tfidf"] = hero.tfidf(df["texts"]) + >>> df["topics"] = hero.kmeans(df["tfidf"], n_clusters=2) + >>> df["pca"] = hero.pca(df["tfidf"], n_components=3) + >>> hero.scatterplot(df, col="pca", color="topics", hover_name="texts") # doctest: +SKIP """ - pca0 = df[col].apply(lambda x: x[0]) - pca1 = df[col].apply(lambda x: x[1]) + x = df[col].apply(lambda x: x[0]) + y = df[col].apply(lambda x: x[1]) + + if len(df[col].iloc[0]) == 3: + z = df[col].apply(lambda x: x[2]) + fig = px.scatter_3d( + df, + x=x, + y=y, + z=z, + color=color, + hover_data=hover_data, + title=title, + hover_name=hover_name, + ) + else: + fig = px.scatter( + df, + x=x, + y=y, + color=color, + hover_data=hover_data, + title=title, + hover_name=hover_name, + ) - fig = px.scatter( - df, x=pca0, y=pca1, color=color, hover_data=hover_data, title=title - ) # fig.show(config={'displayModeBar': False}) - fig.show() if return_figure: return fig + else: + fig.show() """ @@ -78,26 +133,42 @@ def wordcloud( Parameters ---------- s : pd.Series + font_path : str - Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path. + Font path to the font that will be used (OTF or TTF). + Defaults to DroidSansMono path on a Linux machine. + If you are on another OS or don't have this font, you need to adjust this path. + width : int Width of the canvas. + height : int Height of the canvas. + max_words : number (default=200) The maximum number of words. + mask : nd-array or None (default=None) - When set, gives a binary mask on where to draw words. When set, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considerd "masked out" while other entries will be free to draw on. + When set, gives a binary mask on where to draw words. + When set, width and height will be ignored and the shape of mask will be used instead. + All white (#FF or #FFFFFF) entries will be considered "masked out" while other + entries will be free to draw on. + contour_width: float (default=0) If mask is not None and contour_width > 0, draw the mask contour. + contour_color: color value (default="PAPAYAWHIP") Mask contour color. + min_font_size : int (default=4) Smallest font size to use. Will stop when there is no more room in this size. + background_color : color value (default="PAPAYAWHIP") Background color for the word cloud image.
+ max_font_size : int or None (default=None) Maximum font size for the largest word. If None, height of the image is used. + relative_scaling : float (default='auto') Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With @@ -106,8 +177,10 @@ def wordcloud( their rank, relative_scaling around .5 often looks good. If 'auto' it will be set to 0.5 unless repeat is true, in which case it will be set to 0. + colormap : string or matplotlib colormap, default="viridis" Matplotlib colormap to randomly draw colors from for each word. + """ text = s.str.cat(sep=" ") @@ -162,12 +235,23 @@ def top_words(s: pd.Series, normalize=False) -> pd.Series: Return a pandas series with index the top words and as value the count. Tokenization: split by space and remove all punctuations that are not between characters. - + Parameters ---------- - normalize : + normalize : optional, defaults to False. When set to True, return normalized values. + Examples + -------- + >>> import pandas as pd + >>> import texthero as hero + >>> s = pd.Series("one two two three three three") + >>> hero.top_words(s) + three 3 + two 2 + one 1 + dtype: int64 + """ # Replace all punctuation that are NOT in-between chacarters diff --git a/website/docs/getting-started.md b/website/docs/getting-started.md index e2b9419c..3f8dbc26 100644 --- a/website/docs/getting-started.md +++ b/website/docs/getting-started.md @@ -9,13 +9,13 @@ Texthero is a python package to let you work efficiently and quickly with text d ## Overview -Given a dataset with structured data, it's easy to have a quick understanding of the underline data. Oppositely, given a dataset composed of text-only, it's harder to have a quick undertanding of the data. Texthero help you there, providing utility functions to quickly **clean the text data**, **map it into a vector space** and gather from it **primary insights**. +Given a dataset with structured data, it's easy to have a quick understanding of the underlying data. Conversely, given a dataset composed only of text, it's harder to get a quick understanding of the data. Texthero helps you there, providing utility functions to quickly **clean the text data**, **tokenize it**, **map it into a vector space** and gather **primary insights** from it. ##### Pandas integration One of the main pillar of texthero is that is designed from the ground-up to work with **Pandas Dataframe** and **Series**. -Most of texthero methods, simply apply transformation to Pandas Series. As a rule of thumb, the first argument and the return ouputs of almost all texthero methods are either a Pandas Series or a Pandas DataFrame. +Most of texthero's methods simply apply a transformation to a Pandas Series. As a rule of thumb, the first argument and the output of almost all texthero methods are either a Pandas Series or a Pandas DataFrame. ##### Pipeline @@ -46,7 +46,7 @@ The five different areas are _athletics_, _cricket_, _football_, _rugby_ and _te The original dataset comes as a zip files with five different folder containing the article as text data for each topic. -For convenience, we createdThis script simply read all text data and store it into a Pandas Dataframe. +For convenience, we created this script, which simply reads all the text data and stores it in a Pandas DataFrame. Import texthero and pandas. @@ -87,7 +87,7 @@ Recently, Pandas has introduced the pipe function. You can achieve the same resu df['clean_text'] = df['text'].pipe(hero.clean) ``` -> Tips.
When we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: df['tsne_col'] = df['col'].pipe(hero.tsne). This keep the code simple to read and permit to construct complex pipeline. +> Tips. When we need to define a new column returned from a function, we prepend the name of the function to the column name. Example: df['tsne_col'] = df['col'].pipe(hero.tsne). This keeps the code simple to read and allows us to construct complex pipelines. The default pipeline for the `clean` method is the following: @@ -120,46 +120,66 @@ or alternatively df['clean_text'] = df['clean_text'].pipe(hero.clean, custom_pipeline) ``` +##### Tokenize + +Next, we usually want to tokenize the text (_tokenizing_ means splitting sentences/documents into separate words, the _tokens_). Of course, texthero provides an easy function for that! + +```python +df['tokenized_text'] = hero.tokenize(df['clean_text']) +``` + + ##### Preprocessing API -The complete preprocessing API can be found at the following address: [api preprocessing](/docs/api-preprocessing). +The complete preprocessing API can be found here: [api preprocessing](/docs/api-preprocessing). ### Representation -Once cleaned the data, the next natural is to map each document into a vector. +Once the data is cleaned and tokenized, the next natural step is to map each document to a vector so that we can compare documents with mathematical methods and derive insights. ##### TFIDF representation +TFIDF is a formula to calculate the _relative importance_ of the words in a document, taking +into account the words' occurrences in other documents. ```python -df['tfidf_clean_text'] = hero.tfidf(df['clean_text']) +df['tfidf'] = hero.tfidf(df['tokenized_text']) ``` +Now, we have calculated a vector for each document that tells us what words are characteristic of the document. +Usually, documents about similar topics use similar terms, so their tfidf-vectors will be similar too. ##### Dimensionality reduction with PCA -To visualize the data, we map each point to a two-dimensional representation with PCA. The principal component analysis algorithms returns the combination of attributes that better account the variance in the data. +We now want to visualize the data. However, the tfidf-vectors are very high-dimensional (e.g. every +document might have a tfidf-vector of length 100). Visualizing 100 dimensions is hard! + +Thus, we perform dimensionality reduction (generating vectors with fewer entries from vectors with +many entries). For that, we can use PCA. PCA generates new vectors from the tfidf representation +that showcase the differences among the documents most strongly in fewer dimensions, often 2 or 3. ```python -df['pca_tfidf_clean_text'] = hero.pca(df['tfidf_clean_text']) +df['pca'] = hero.pca(df['tfidf']) ``` ##### All in one step -We can achieve all the three steps show above, _cleaning_, _tf-idf representation_ and _dimensionality reduction_ in a single step. Isn't fabulous? +We can achieve all the steps shown above, _cleaning_, _tokenizing_, _tf-idf representation_ and _dimensionality reduction_, in a single step. Isn't that fabulous? ```python df['pca'] = ( - df['text'] - .pipe(hero.clean) - .pipe(hero.tfidf) - .pipe(hero.pca) - ) + df['text'] + .pipe(hero.clean) + .pipe(hero.tokenize) + .pipe(hero.tfidf) + .pipe(hero.pca) +) ``` ##### Representation API -The complete representation module API can be found at the following address: [api representation](/docs/api-representation).
+The complete representation module API can be found here: [api representation](/docs/api-representation). ### Visualization @@ -176,32 +196,43 @@ Also, we can "visualize" the most common words for each `topic` with `top_words` ```python NUM_TOP_WORDS = 5 -df.groupby('topic')['text'].apply(lambda x: hero.top_words(x)[:NUM_TOP_WORDS]) +df.groupby('topic')['clean_text'].apply(lambda x: hero.top_words(x, normalize=True)[:NUM_TOP_WORDS]) ``` ``` topic -athletics said 0.010068 - world 0.008900 - year 0.008844 -cricket test 0.008250 - england 0.008001 - first 0.007787 -football said 0.009515 - chelsea 0.006110 - game 0.005950 -rugby england 0.012602 - said 0.008359 - wales 0.007880 -tennis 6 0.021047 - said 0.013012 - open 0.009834 +athletics said 0.010330 + world 0.009132 + year 0.009075 + olympic 0.007819 + race 0.006392 +cricket test 0.008492 + england 0.008235 + first 0.008016 + cricket 0.007906 + one 0.007760 +football said 0.009709 + chelsea 0.006234 + game 0.006071 + would 0.005866 + club 0.005601 +rugby england 0.012833 + said 0.008512 + wales 0.008025 + ireland 0.007440 + rugby 0.007245 +tennis said 0.013993 + open 0.010575 + first 0.009608 + set 0.009028 + year 0.008447 +Name: clean_text, dtype: float64 ``` ##### Visualization API -The complete visualization module API can be found at the following address: [api visualization](/docs/api-visualization). +The complete visualization module API can be found here: [api visualization](/docs/api-visualization). ## Summary @@ -217,15 +248,19 @@ df = pd.read_csv( df['pca'] = ( df['text'] .pipe(hero.clean) + .pipe(hero.tokenize) .pipe(hero.tfidf) - .pipe(hero.pca) + .pipe(hero.pca, n_components=3) ) hero.scatterplot(df, col='pca', color='topic', title="PCA BBC Sport news") ``` +![](/img/scatterplot_bbcsport_3d.png) + + ##### Next section By now, you should have understood the main building blocks of texthero. -In the next sections, we will review each module, see how we can tune the default settings and we will show other application where Texthero might come in handy. +In the next sections, we will review each module, see how we can tune the default settings and we will show other applications where Texthero might come in handy. diff --git a/website/static/img/scatterplot_bbcsport_3d.png b/website/static/img/scatterplot_bbcsport_3d.png new file mode 100644 index 00000000..2642d992 Binary files /dev/null and b/website/static/img/scatterplot_bbcsport_3d.png differ
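As a quick sketch of how the clustering functions introduced above fit into the same workflow (assuming `df` is the BBC Sport DataFrame loaded as in the summary; the column names and `n_clusters=5` are illustrative choices, not part of the documentation above), one could color the scatterplot by clusters found with `kmeans` instead of the known `topic` labels:

```python
# Sketch: color the plot by k-means clusters rather than the known 'topic' labels.
# Assumes `df` is the BBC Sport DataFrame from the summary above; column names
# and n_clusters=5 (one cluster per sport) are illustrative.
import texthero as hero

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tokenize)
    .pipe(hero.tfidf)
)
df['kmeans'] = hero.kmeans(df['tfidf'], n_clusters=5)
df['pca'] = hero.pca(df['tfidf'], n_components=3)

hero.scatterplot(df, col='pca', color='kmeans', title="K-means clusters, BBC Sport news")
```

If the clusters broadly line up with the `topic` column, that is a good sign the tfidf representation captures the differences between the five sports.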