Translate the GoEmotions dataset into Korean, then train KoELECTRA on it
June 19, 2020 - When a model was trained with Transformers v2.9.1, special tokens such as [NAME] and [RELIGION] were not applied when the model was later used in a pipeline, even though they had been added. This issue was fixed in Transformers v2.11.0.
Feb 9, 2021 - Trained new models with KoELECTRA-v1 and KoELECTRA-v3 on Transformers v3.5.1 and uploaded them.
A dataset of 58,000 Reddit comments labeled with 28 emotions
- admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, surprise, neutral
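Each comment can carry several of these 28 labels at once, so the task is multi-label rather than multi-class. As a purely illustrative sketch (the TSV layout and the example row below are assumptions based on the original GoEmotions release, not taken from this repository), the targets can be encoded as multi-hot vectors:

```python
# Illustrative only: build a 28-dim multi-hot target from one GoEmotions-style row.
# The "text <TAB> comma-separated label indices <TAB> id" layout is an assumption
# based on the original GoEmotions release, and the example row is made up.
NUM_LABELS = 28  # order as listed above: admiration = 0, ..., neutral = 27

def to_multi_hot(label_field, num_labels=NUM_LABELS):
    vec = [0.0] * num_labels
    for idx in label_field.split(","):
        vec[int(idx)] = 1.0
    return vec

row = "That movie was scary but hilarious\t1,14\tfake_id"  # hypothetical row
text, labels, _ = row.split("\t")
print(to_multi_hot(labels))  # 1.0 at index 1 (amusement) and index 14 (fear)
```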
- torch==1.7.1
- transformers==3.5.1
- googletrans==2.4.1
- attrdict==2.0.1
$ pip3 install -r requirements.txt
🚨 Because the dataset is built from Reddit comments, the quality of the translated text is not good. 🚨
- Korean data was generated with pygoogletrans.
  - Since pygoogletrans v2.4.1 has not been updated on PyPI, installing the library directly from its repository is recommended (as noted in requirements.txt).
  - A 1.5-second interval was kept between API calls.
  - Because a single request can hold at most 5,000 characters, sentences were joined with `\r\n` and sent as one input (see the sketch after the commands below).
  - Because translation fails when a zero-width space (`\u200b`) appears at the start of a sentence, these characters were removed.
- The translated data is already in the `data` directory. If you want to run the translation yourself, execute the commands below.
$ bash download_original_data.sh
$ pip3 install git+git://github.com/ssut/py-googletrans
$ python3 translate_data.py
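For reference, the batching and cleanup described in the list above can be sketched roughly as follows. This is only an illustration that assumes the standard googletrans `Translator` API; `translate_data.py` in the repository is the actual implementation.

```python
import time

from googletrans import Translator  # install from the py-googletrans repository

ZERO_WIDTH_SPACE = "\u200b"
translator = Translator()

def translate_batch(sentences, max_chars=5000, delay=1.5):
    """Join sentences with CRLF up to ~5,000 characters per request and translate en -> ko."""
    results, batch = [], []
    for sent in sentences + [None]:  # None flushes the final batch
        if sent is None or len("\r\n".join(batch + [sent])) > max_chars:
            if batch:
                # Strip zero-width spaces, which otherwise make the API skip the translation.
                text = "\r\n".join(s.replace(ZERO_WIDTH_SPACE, "") for s in batch)
                translated = translator.translate(text, src="en", dest="ko").text
                results.extend(translated.split("\n"))  # simplification: assumes line count is preserved
                time.sleep(delay)  # keep ~1.5 seconds between API calls
            batch = []
        if sent is not None:
            batch.append(sent)
    return results
```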
- The dataset contains the special tokens `[NAME]` and `[RELIGION]`; these were assigned to `[unused0]` and `[unused1]` in `vocab.txt`, respectively.
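One way to realize such a mapping is to overwrite the unused WordPiece slots before the tokenizer is built; the snippet below only sketches that idea (the in-place edit of `vocab.txt` is an assumption, not necessarily how this repository does it):

```python
# Map the dataset's special tokens onto unused WordPiece slots so that
# [NAME] and [RELIGION] each become a single, known vocabulary id.
replacements = {"[unused0]": "[NAME]", "[unused1]": "[RELIGION]"}

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

vocab = [replacements.get(token, token) for token in vocab]

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```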
- Multi-label classification with sigmoid outputs (threshold set to 0.3)
  - See `ElectraForMultiLabelClassification` in `model.py` (a rough sketch follows below).
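`model.py` is the reference; the class below is only a rough, hypothetical sketch of what such a sigmoid-based multi-label head typically looks like, not the repository's code.

```python
import torch.nn as nn
from transformers import ElectraModel, ElectraPreTrainedModel

class MultiLabelHeadSketch(ElectraPreTrainedModel):
    """Hypothetical sketch of a sigmoid multi-label head; see model.py for the real class."""

    def __init__(self, config):
        super().__init__(config)
        self.electra = ElectraModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        # First-token ([CLS]) representation from the ELECTRA discriminator.
        hidden = self.electra(input_ids, attention_mask=attention_mask,
                              token_type_ids=token_type_ids)[0]
        logits = self.classifier(self.dropout(hidden[:, 0]))
        if labels is not None:
            # One independent sigmoid/BCE term per emotion (multi-label, not softmax).
            loss = nn.BCEWithLogitsLoss()(logits, labels.float())
            return loss, logits
        return (logits,)

# At inference time, every label whose sigmoid probability exceeds 0.3 is kept:
# predictions = torch.sigmoid(logits) > 0.3
```

Using independent sigmoids rather than a softmax lets several emotions fire for the same sentence, which is what the 0.3 threshold then filters.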
- For the config, just edit the json files in the `config` directory.
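attrdict is listed in requirements.txt, presumably so these json configs can be read with attribute-style access; the loading sketch below rests on that assumption, and the keys named in the comment are hypothetical (check the actual files under `config/`).

```python
import json

from attrdict import AttrDict

# Load one of the training configs with attribute-style access.
with open("config/koelectra-base.json", encoding="utf-8") as f:
    args = AttrDict(json.load(f))

# The available keys depend on the json file itself, e.g. (hypothetical):
# args.model_name_or_path, args.train_batch_size, args.learning_rate
```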
$ python3 run_goemotions.py --config_file koelectra-base.json
$ python3 run_goemotions.py --config_file koelectra-small.json
Results are measured with Macro F1 (best result).
| Macro F1 (%)       | Dev   | Test  |
| ------------------ | ----- | ----- |
| KoELECTRA-small-v1 | 39.99 | 41.02 |
| KoELECTRA-base-v1  | 42.18 | 44.03 |
| KoELECTRA-small-v3 | 40.27 | 40.85 |
| KoELECTRA-base-v3  | 42.85 | 42.28 |
- A new `MultiLabelPipeline` class was written to make inference for multi-label classification possible.
- The models have been uploaded to the Huggingface s3:
monologg/koelectra-small-v1-goemotions
monologg/koelectra-base-v1-goemotions
monologg/koelectra-small-v3-goemotions
monologg/koelectra-base-v3-goemotions
from multilabel_pipeline import MultiLabelPipeline
from transformers import ElectraTokenizer
from model import ElectraForMultiLabelClassification
from pprint import pprint
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-goemotions")
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-v3-goemotions")
goemotions = MultiLabelPipeline(
model=model,
tokenizer=tokenizer,
threshold=0.3
)
texts = [
"μ ν μ¬λ―Έ μμ§ μμ΅λλ€ ...",
"λλ βμ§κΈ κ°μ₯ ν° λλ €μμ λ΄ μμ μμ μ¬λ κ²β μ΄λΌκ³ λ§νλ€.",
"κ³±μ°½... νμκ°λ° κΈ°λ€λ¦΄ λ§μ μλ!",
"μ μ νλ 곡κ°μ μ μ νλ μ¬λλ€λ‘ μ±μΈλ",
"λ무 μ’μ",
"λ₯λ¬λμ μ§μ¬λμ€μΈ νμμ
λλ€!",
"λ§μμ΄ κΈν΄μ§λ€.",
"μλ μ§μ§ λ€λ€ λ―Έμ³€λ봨γ
γ
γ
",
"κ°λ
ΈμΌ"
]
pprint(goemotions(texts))
# Output
[{'labels': ['disapproval'], 'scores': [0.97151965]},
{'labels': ['fear'], 'scores': [0.9519822]},
{'labels': ['disapproval', 'neutral'], 'scores': [0.452921, 0.5345312]},
{'labels': ['love'], 'scores': [0.8750478]},
{'labels': ['admiration'], 'scores': [0.93127275]},
{'labels': ['love'], 'scores': [0.9093589]},
{'labels': ['nervousness', 'neutral'], 'scores': [0.76960915, 0.33462417]},
{'labels': ['disapproval'], 'scores': [0.95657086]},
{'labels': ['annoyance', 'disgust'], 'scores': [0.39240348, 0.7896941]}]
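For comparison, the thresholding that produces output like the above can be sketched without the pipeline class. This is a hypothetical re-implementation for illustration only; `MultiLabelPipeline` in `multilabel_pipeline.py` remains the reference, and the label order is assumed to match the 28 emotions listed earlier.

```python
import torch
from transformers import ElectraTokenizer

from model import ElectraForMultiLabelClassification

LABELS = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval", "disgust",
    "embarrassment", "excitement", "fear", "gratitude", "grief", "joy", "love",
    "nervousness", "optimism", "pride", "realization", "relief", "remorse",
    "sadness", "surprise", "neutral",
]

def classify(texts, model, tokenizer, threshold=0.3):
    """Keep every emotion whose sigmoid probability exceeds the threshold."""
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc)[0]
    probs = torch.sigmoid(logits)
    outputs = []
    for row in probs:
        keep = (row > threshold).nonzero(as_tuple=True)[0].tolist()
        outputs.append({"labels": [LABELS[i] for i in keep],
                        "scores": [row[i].item() for i in keep]})
    return outputs

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-goemotions")
model = ElectraForMultiLabelClassification.from_pretrained("monologg/koelectra-base-v3-goemotions")
print(classify(["λ무 μ’μ"], model, tokenizer))
```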