publish to pypi
shuntaroy committed Mar 13, 2023
1 parent 1655e9b commit d6c2b2d
Showing 3 changed files with 58 additions and 42 deletions.
94 changes: 52 additions & 42 deletions README.md
@@ -7,13 +7,18 @@ Currently, only compatible with Japanese inputs.

Please cite [Manabe et al. (2021)](https://doi.org/10.2196/29500) (see the bottom references) if you use this library.

This collection needs more linguistic measures to be complete, of course.
Issues and Pull Requests are welcome!

limco は著者の態度や心理的特性と関連があるとされるスタイロメトリック言語指標のコレクションです。
[浅石 (2017)](https://doi.org/10.20651/jslis.63.3_159) にまとめられている日本語の言語指標をベースにしています。

現在、日本語のみに対応しています。

このライブラリを使用した場合は、[Manabe et al. (2021)](https://doi.org/10.2196/29500) (下記参考文献) を引用してください。

本ライブラリは言語指標のコレクションとしてはまだまだ不完全です.Issue や PR を歓迎します!

## Installation / インストール

You need Python 3.9 or later.
@@ -46,87 +51,92 @@ Specify the paths to the resources with the following options.

The following linguistic measures are implemented. / 次の言語指標が実装されています。

- **Percentages of character types**:
- **Percentages of character types / 文字種の割合**:
The ratio of hiragana, katakana, and kanji (Chinese characters) to
the characters in text, respectively.
the characters in text, respectively. / ひらがな、カタカナ、漢字(中国語の文字)のそれぞれの、総文字数に対する割合。

- **Type Token Ratio (TTR)**:
The ratio of the different words to the total number of words in text.
The ratio of the distinct words to the total number of words in text.
We cover several variants of TTRs. / 異なり語数(単語の種類数)を、総単語数で割った値。いくつかの補正バリエーションを実装。

- **Percentage of content words**:
- **Percentage of content words / 内容語の割合**:
The ratio of content words (i.e., nouns, verbs, adjectives, and
adverbs) to the total number of words in text.
adverbs) to the total number of words in text. / 内容語(名詞、動詞、形容詞、副詞)のそれぞれの、総単語数に対する割合。

- **Modifying words and Verb Ratio (MVR)**:
The ratio of verbs to adjectives, adverbs, and conjunctions for the
- **Modifying words and Verb Ratio (MVR) / 相の類に対する用の類の割合**:
The ratio of verbs to adjectives, adverbs, and pre-noun adjectivals for the
words in text. It has been used as one of the indicators of
author estimation.
author estimation. / 用の類(形容詞、副詞、連体詞)に対する動詞の割合。著者推定の指標として用いられている。

- **Percentage of proper nouns**:
The ratio of proper nouns (named entities) to all words in text.
- **Percentage of proper nouns / 固有名詞の割合**:
The ratio of proper nouns (named entities) to all words in text. / 固有名詞のそれぞれの、総単語数に対する割合。

- **Word abstraction**:
- **Word abstractness / 単語抽象度**:
The abstraction degrees of the words in text. We specifically
used the maximum value of the most abstract word, and the average of
the top five abstract words. The abstraction degrees were obtained
from the Japanese word-abstraction dictionary [AWD-J EX](http://sociocom.jp/~data/2019-AWD-J/).
from the Japanese word-abstraction dictionary [AWD-J EX](http://sociocom.jp/~data/2019-AWD-J/). / 単語抽象度辞書 AWD-J EX から得られる単語抽象度。最も抽象的な単語の最大値、上位 5 語の平均値を使用。

- **Ratios of emotional words**:
- **Emotion scores / 感情スコア**:
The ratios, to all the words in text, of the words that are
associated with each of the seven kinds of emotions: sadness,
anxiety, anger, disgust, trust, surprise, and happiness. The seven
anxiety, anger, disgust, trust, surprise, and joy. The seven
values are transformed to meet the property of probability (each
value spans between 0 and 1; the sum of all values is to be 1). The
degree of association with emotion was determined according to the
Japanese emotional-word dictionary JIWC.
Japanese emotional-word dictionary JIWC. / 感情辞書 JIWC から得られる感情スコア。7 種類の感情(悲しみ、不安、怒り、嫌悪、信頼、驚き、喜び)に対するそれぞれの単語の割合。7 つの値は確率の性質を満たすように変換されている(各値は 0 から 1 の間にあり、合計は 1 になる)。

- **Number of sentences**:
The total number of sentences that make up text.
- **The number of sentences / 総文数**:
The total number of sentences that make up text. / 文の総数。

- **Length of sentences**:
- **Length of sentences / 文の長さ**:
Descriptive statistics (mean, standard deviation, interquartile,
minimum, and maximum) for the number of characters in each sentence
that constitutes text. In particular, the average sentence
length has been suggested to be linked to the writer’s creative
attitude and personality .
attitude and personality. / 文の長さの統計量(平均、標準偏差、四分位範囲、最小値、最大値)。特に、平均文長は著者の創造的態度や性格と関連しているとされている。

- **Percentage of conversational sentences**:
- **Percentage of conversational sentences / 会話文の割合**:
Percentage of the total number of conversational sentences contained
in text.
in text. / 会話文(「」『』で括られたテキスト)の総文数に対する割合。

- **Depth of syntax tree**:
- **Depth of syntax tree / 係り受け構造の深さ**:
Descriptive statistics calculated for the depth of the dependency
tree for each sentence in text.
tree for each sentence in text. / 係り受け構造の深さの統計量。

- **Mean of the number of chunks per sentence**:
- **The number of chunks per sentence / 文ごとの文節数**:
Descriptive statistics calculated for the average values of the
number of chunks for each sentence in text.
number of chunks for each sentence in text. / 文ごとの文節数の統計量。

- **Mean of the words per chunk**:
- **The tokens per chunk / 文節ごとの単語数**:
Descriptive statistics calculated for the average values of the
number of words per chunk in text.
number of words per chunk in text. / 文節ごとの単語数の統計量。
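
Two of the measures listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not limco's actual implementation: the function names `char_type_ratios` and `ttr_variants` are hypothetical, the Unicode ranges used for hiragana, katakana, and kanji are an assumption, and the tokens are assumed to come from a morphological analyzer that is not shown here. Guiraud's R and Herdan's C follow their commonly cited definitions, which may differ in detail from the variants the library computes.

```python
import math


def char_type_ratios(text: str) -> dict[str, float]:
    """Share of hiragana, katakana, and kanji among non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    counts = {"hiragana": 0, "katakana": 0, "kanji": 0}
    for c in chars:
        cp = ord(c)
        # Assumed Unicode blocks; rarer extension blocks are ignored here.
        if 0x3041 <= cp <= 0x309F:
            counts["hiragana"] += 1
        elif 0x30A0 <= cp <= 0x30FF:
            counts["katakana"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            counts["kanji"] += 1
    total = len(chars)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


def ttr_variants(tokens: list[str]) -> dict[str, float]:
    """Plain TTR plus two common length-corrected variants."""
    n = len(tokens)       # total number of words (tokens)
    v = len(set(tokens))  # number of distinct words (types)
    if n == 0:
        raise ValueError("tokens must not be empty")
    return {
        "ttr": v / n,                    # plain type-token ratio
        "guiraud_r": v / math.sqrt(n),   # Guiraud's R = V / sqrt(N)
        # Herdan's C = log V / log N; undefined for N = 1, so fall back to 1.0.
        "herdan_c": math.log(v) / math.log(n) if n > 1 else 1.0,
    }


ratios = char_type_ratios("私はカメラを買った。")
scores = ttr_variants(list("abcdefgh") * 2)  # 16 tokens, 8 types
```

Plain TTR falls as a text gets longer, which is why corrected variants such as the two above are usually reported alongside it.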

### Summary table

| Stylometric | Sub-measures (value format) |
| :---------------------------------------- | :-------------------------------------------------------------------------- |
| Percentages of character types | Hiragana, katakana, and kanji (Chinese characters) (%) |
| Type Token Ratio (TTR) | (%) |
| Percentages of content words | (%) |
| Modifying words and Verb Ratio (MVR) | (%) |
| Percentage of proper nouns | (%) |
| Word abstraction | The maximum, and the average of the top five abstract words (real number) |
| Ratios of emotional words | sadness, anxiety, anger, disgust, trust, surprise, and happiness (%) |
| Number of sentences | (integer) |
| Length of sentences | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| Percentage of conversational sentences | (%) |
| Depth of syntax tree | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| Mean of the number of chunks per sentence | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| Mean of the words per chunk | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| Stylometric | Sub-measures (value format) |
| :------------------------------------- | :------------------------------------------------------------------------------------------------- |
| Percentages of character types | Hiragana, katakana, and kanji (Chinese characters) (%) |
| Type Token Ratio (TTR) | Plain TTR, Guiraud's R, Herdan's C_H, Rubet's k, Maas's a^2, Tuldava's LN, Brunet's W, Dugast's U, |
| Percentages of content words | |
| Modifying words and Verb Ratio (MVR) | (%) |
| Percentage of proper nouns | (%) |
| Word abstractness | The maximum, and the average of the top five abstract words (real number) |
| Emotion scores | sadness, anxiety, anger, disgust, trust, surprise, and joy (%) |
| The number of sentences | (integer) |
| Length of sentences | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| Percentage of conversational sentences | (%) |
| Depth of syntax tree | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| The number of chunks per sentence | mean, standard deviation, interquartile, minimum, and maximum (real number) |
| The tokens per chunk | mean, standard deviation, interquartile, minimum, and maximum (real number) |
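
Several rows in the table report the same five descriptive statistics. The sketch below shows how they might be computed for sentence length, using only the standard library. It is an assumption-laden illustration rather than limco's code: the helper names `split_sentences` and `describe` are hypothetical, sentences are assumed to end at `。`, `!`, or `?`, and the population standard deviation is used because the source does not say which estimator the library applies.

```python
import re
import statistics


def split_sentences(text: str) -> list[str]:
    """Split after each sentence-final mark (assumed: 。 ! ?)."""
    parts = re.split(r"(?<=[。!?])", text)
    return [p.strip() for p in parts if p.strip()]


def describe(values: list[int]) -> dict[str, float]:
    """Mean, standard deviation, interquartile range, minimum, and maximum."""
    if not values:
        raise ValueError("values must not be empty")
    if len(values) >= 2:
        # quantiles() needs at least two data points.
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
    else:
        iqr = 0.0
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population SD (assumption)
        "iqr": iqr,
        "min": min(values),
        "max": max(values),
    }


sentences = split_sentences("今日は晴れ。明日は雨かな?楽しみ!")
# Sentence length is counted in characters, as in the measure description.
stats = describe([len(s) for s in sentences])
```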

---

## References

- [Asaishi, 2017]: 浅石卓真. 2017. テキストの特徴を計量する指標の概観. 日本図書館情報学会誌, 63(3), 159–169. https://doi.org/10.20651/jslis.63.3_159
- [Manabe+, 2021]: Masae Manabe, Kongmeng Liew, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki. 2021. Estimation of Psychological Distress in Japanese Youth Through Narrative Writing: Text-Based Stylometric and Sentiment Analyses. JMIR Formative Research, 5(8):e29500. https://doi.org/10.2196/29500

## Developer

- [Shuntaro Yada](https://shuntaroy.com)
1 change: 1 addition & 0 deletions limco.py
@@ -375,6 +375,7 @@ def from_file(
def main():
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        # Suppress warnings mainly from numpy for CLI
        fire.Fire(from_file)
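
The effect of the `warnings.catch_warnings()` block in `main()` can be checked in isolation. The sketch below uses only the standard library; `noisy_computation` and `run_quietly` are hypothetical stand-ins for the numpy-backed code and the CLI entry point, neither of which is reproduced here.

```python
import warnings


def noisy_computation() -> int:
    # Stand-in for library code that emits a warning while computing a result.
    warnings.warn("illustrative warning", RuntimeWarning)
    return 42


def run_quietly(func):
    # Same pattern as main(): filters changed inside the block are
    # restored automatically when the block exits.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func()


# Record any warning that escapes run_quietly; none should.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = run_quietly(noisy_computation)
```

Because the filter is scoped to the `with` block, code that imports the module as a library still sees warnings; only the CLI path is silenced.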


5 changes: 5 additions & 0 deletions pyproject.toml
@@ -5,6 +5,11 @@ description = "Linguistic Measure Collection"
authors = ["Shuntaro Yada <[email protected]>"]
license = "MIT"
readme = "README.md"
homepage = "https://github.com/sociocom/limco"
classifiers = [
"Topic :: Text Processing :: Linguistic",
"Natural Language :: Japanese"
]

[tool.poetry.dependencies]
python = "^3.9"