diff --git a/README.md b/README.md
index 7085c8b..79364fd 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,9 @@ Currently, only compatible with Japanese inputs.
 
 Please cite [Manabe et al. (2021)](https://doi.org/10.2196/29500) (see the bottom references) if you use this library.
 
+This collection needs more linguistic measures to be complete, of course.
+Issues and Pull Requests are welcome!
+
 limco は著者の態度や心理的特性と関連があるとされるスタイロメトリック言語指標のコレクションです。
 [浅石 (2017)](https://doi.org/10.20651/jslis.63.3_159) にまとめられている日本語の言語指標をベースにしています。
 
@@ -14,6 +17,8 @@ limco は著者の態度や心理的特性と関連があるとされるスタ
 
 このライブラリを使用した場合は、[Manabe et al. (2021)](https://doi.org/10.2196/29500) (下記参考文献) を引用してください。
 
+本ライブラリは言語指標のコレクションとしてはまだまだ不完全です.Issue や PR を歓迎します!
+
 ## Installation / インストール
 
 You need Python 3.9 or later.
@@ -46,83 +51,84 @@ Specify the paths to the resources with the following options.
 
 The following linguistic measures are implemented. / 次の言語指標が実装されています。
 
-- **Percentages of character types**:
+- **Percentages of character types / 文字種の割合**:
   The ratio of hiragana, katakana, and kanji (Chinese characters) to
-  the characters in text, respectively.
+  the characters in text, respectively. / ひらがな、カタカナ、漢字(中国語の文字)のそれぞれの、総文字数に対する割合。
 
 - **Type Token Ratio (TTR)**:
-  The ratio of the different words to the total number of words in text.
+  The ratio of the distinct words to the total number of words in text.
+  We cover several variants of TTRs. / 異なり語数(単語の種類数)を、総単語数で割った値。いくつかの補正バリエーションを実装。
 
-- **Percentage of content words**:
+- **Percentage of content words / 内容語の割合**:
   The ratio of content words (i.e., nouns, verbs, adjectives, and
-  adverbs) to the total number of words in text.
+  adverbs) to the total number of words in text. / 内容語(名詞、動詞、形容詞、副詞)のそれぞれの、総単語数に対する割合。
 
-- **Modifying words and Verb Ratio (MVR)**:
-  The ratio of verbs to adjectives, adverbs, and conjunctions for the
+- **Modifying words and Verb Ratio (MVR) / 相の類に対する用の類の割合**:
+  The ratio of verbs to adjectives, adverbs, and pre-noun adjectivals for the
   words in text. It has been used as one of the indicators of
-  author estimation.
+  author estimation. / 用の類(形容詞、副詞、連体詞)に対する動詞の割合。著者推定の指標として用いられている。
 
-- **Percentage of proper nouns**:
-  The ratio of proper nouns (named entities) to all words in text.
+- **Percentage of proper nouns / 固有名詞の割合**:
+  The ratio of proper nouns (named entities) to all words in text. / 固有名詞のそれぞれの、総単語数に対する割合。
 
-- **Word abstraction**:
+- **Word abstractness / 単語抽象度**:
   The abstraction degrees of the words in text. We specifically used
   the maximum value of the most abstract word, and the average of the
   top five abstract words. The abstraction degrees were obtained
-  from the Japanese word-abstraction dictionary [AWD-J EX](http://sociocom.jp/~data/2019-AWD-J/).
+  from the Japanese word-abstraction dictionary [AWD-J EX](http://sociocom.jp/~data/2019-AWD-J/). / 単語抽象度辞書 AWD-J EX から得られる単語抽象度。最も抽象的な単語の最大値、上位 5 語の平均値を使用。
 
-- **Ratios of emotional words**:
+- **Emotion scores / 感情スコア**:
   The ratios, to all the words in text, of the words that are
   associated with each of the seven kinds of emotions: sadness,
-  anxiety, anger, disgust, trust, surprise, and happiness. The seven
+  anxiety, anger, disgust, trust, surprise, and joy. The seven
   values are transformed to meet the property of probability (each
   value spans between 0 and 1; the sum of all values is to be 1). The
   degree of association with emotion was determined according to the
-  Japanese emotional-word dictionary JIWC.
+  Japanese emotional-word dictionary JIWC. / 感情辞書 JIWC から得られる感情スコア。7 種類の感情(悲しみ、不安、怒り、嫌悪、信頼、驚き、喜び)に対するそれぞれの単語の割合。7 つの値は確率の性質を満たすように変換されている(各値は 0 から 1 の間にあり、合計は 1 になる)。
 
-- **Number of sentences**:
-  The total number of sentences that make up text.
+- **The number of sentences / 総文数**:
+  The total number of sentences that make up text. / 文の総数。
 
-- **Length of sentences**:
+- **Length of sentences / 文の長さ**:
   Descriptive statistics (mean, standard deviation, interquartile,
   minimum, and maximum) for the number of characters in each sentence
   that constitutes text. In particular, the average sentence length
   has been suggested to be linked to the writer’s creative
-  attitude and personality .
+  attitude and personality. / 文の長さの統計量(平均、標準偏差、四分位範囲、最小値、最大値)。特に、平均文長は著者の創造的態度や性格と関連しているとされている。
 
-- **Percentage of conversational sentences**:
+- **Percentage of conversational sentences / 会話文の割合**:
   Percentage of the total number of conversational sentences contained
-  in text.
+  in text. / 会話文(「」『』で括られたテキスト)の総文数に対する割合。
 
-- **Depth of syntax tree**:
+- **Depth of syntax tree / 係り受け構造の深さ**:
   Descriptive statistics calculated for the depth of the dependency
-  tree for each sentence in text.
+  tree for each sentence in text. / 係り受け構造の深さの統計量。
 
-- **Mean of the number of chunks per sentence**:
+- **The number of chunks per sentence / 文ごとの文節数**:
   Descriptive statistics calculated for the average values of the
-  number of chunks for each sentence in text.
+  number of chunks for each sentence in text. / 文ごとの文節数の統計量。
 
-- **Mean of the words per chunk**:
+- **The tokens per chunk / 文節ごとの単語数**:
   Descriptive statistics calculated for the average values of the
-  number of words per chunk in text.
+  number of words per chunk in text. / 文節ごとの単語数の統計量。
 
 ### Summary table
 
-| Stylometric | Sub-measures (value format) |
-| :---------------------------------------- | :-------------------------------------------------------------------------- |
-| Percentages of character types | Hiragana, katakana, and kanji (Chinese characters) (%) |
-| Type Token Ration (TTR) | (%) |
-| Percentages of content words | (%) |
-| Modifying words and Verb Ratio (MVR) | (%) |
-| Percentage of proper nouns | (%) |
-| Word abstraction | The maximum, and the average of the top five abstract words (real number) |
-| Ratios of emotional words | sadness, anxiety, anger, disgust, trust, surprise, and happiness (%) |
-| Number of sentences | (integer) |
-| Length of sentences | mean, standard deviation, interquartile, minimum, and maximum (real number) |
-| Percentage of conversational sentences | (%) |
-| Depth of syntax tree | mean, standard deviation, interquartile, minimum, and maximum (real number) |
-| Mean of the number of chunks per sentence | mean, standard deviation, interquartile, minimum, and maximum (real number) |
-| Mean of the words per chunk | mean, standard deviation, interquartile, minimum, and maximum (real number) |
+| Stylometric | Sub-measures (value format) |
+| :------------------------------------- | :------------------------------------------------------------------------------------------------- |
+| Percentages of character types | Hiragana, katakana, and kanji (Chinese characters) (%) |
+| Type Token Ratio (TTR) | Plain TTR, Guiraud's R, Herdan's C_H, Rubet's k, Maas's a^2, Tuldava's LN, Brunet's W, Dugast's U, |
+| Percentages of content words | |
+| Modifying words and Verb Ratio (MVR) | (%) |
+| Percentage of proper nouns | (%) |
+| Word abstractness | The maximum, and the average of the top five abstract words (real number) |
+| Emotion scores | sadness, anxiety, anger, disgust, trust, surprise, and joy (%) |
+| The number of sentences | (integer) |
+| Length of sentences | mean, standard deviation, interquartile, minimum, and maximum (real number) |
+| Percentage of conversational sentences | (%) |
+| Depth of syntax tree | mean, standard deviation, interquartile, minimum, and maximum (real number) |
+| The number of chunks per sentence | mean, standard deviation, interquartile, minimum, and maximum (real number) |
+| The tokens per chunk | mean, standard deviation, interquartile, minimum, and maximum (real number) |
 
 ---
 
@@ -130,3 +136,7 @@ The following linguistic measures are implemented. / 次の言語指標が実装
 
 - [Asaishi, 2017]: 浅石卓真. 2017. テキストの特徴を計量する指標の概観. 日本図書館情報学会誌, 63(3), 159–169. https://doi.org/10.20651/jslis.63.3_159
 - [Manabe+, 2021]: Masae Manabe, Kongmeng Liew, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki. 2021. Estimation of Psychological Distress in Japanese Youth Through Narrative Writing: Text-Based Stylometric and Sentiment Analyses. JMIR Formative Research, 5(8):e29500. https://doi.org/10.2196/29500
+
+## Developer
+
+- [Shuntaro Yada](https://shuntaroy.com)
diff --git a/limco.py b/limco.py
index 50420b0..6a4df73 100644
--- a/limco.py
+++ b/limco.py
@@ -375,6 +375,7 @@ def from_file(
 def main():
     with warnings.catch_warnings():
         warnings.simplefilter("ignore")
+        # Suppress warnings mainly from numpy for CLI
         fire.Fire(from_file)
 
 
diff --git a/pyproject.toml b/pyproject.toml
index f489e1f..7509414 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,6 +5,11 @@ description = "Linguistic Measure Collection"
 authors = ["Shuntaro Yada "]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/sociocom/limco"
+classifiers = [
+    "Topic :: Text Processing :: Linguistic",
+    "Natural Language :: Japanese"
+]
 
 [tool.poetry.dependencies]
 python = "^3.9"
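
A note on the TTR variants named in the README hunk of this diff: the snippet below is a minimal sketch, not limco's implementation, of three of them under their commonly cited definitions (plain TTR = V / N, Guiraud's R = V / sqrt(N), Herdan's C = log V / log N, where V is the number of distinct tokens and N the total number of tokens). The function name `ttr_variants` and the pre-tokenized input are assumptions made for illustration; limco's own tokenization, API, and exact formulas may differ.

```python
import math
from collections.abc import Sequence


def ttr_variants(tokens: Sequence[str]) -> dict[str, float]:
    """Return three type-token-ratio variants for a tokenized text.

    Hypothetical helper for illustration only; not part of limco's API.
    """
    n_tokens = len(tokens)      # N: total number of tokens
    n_types = len(set(tokens))  # V: number of distinct tokens
    if n_tokens < 2:
        # Herdan's C divides by log(N), which is 0 when N == 1.
        raise ValueError("need at least two tokens")
    return {
        "ttr": n_types / n_tokens,
        "guiraud_r": n_types / math.sqrt(n_tokens),
        "herdan_c": math.log(n_types) / math.log(n_tokens),
    }


# Assumed pre-tokenized input (for example, from a morphological analyzer).
scores = ttr_variants(["猫", "が", "好き", "、", "犬", "が", "好き"])
```

Plain TTR falls as a text gets longer, which is the usual motivation for length-corrected variants such as Guiraud's R and Herdan's C.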