CADD_dataset

CADD: A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

=======================================================

Note
- 2021/09/05 New version update
- The data is in the CSV format (encoding='latin_1').
- Please MAKE SURE that you are fully aware of and agree to the ethical guidelines.
- Please DO NOT modify this file directly.
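
The encoding note above matters in practice: a reader that assumes UTF-8 may fail on, or mangle, non-ASCII characters in the files. The following is a minimal sketch using only the Python standard library. The helper name `load_cadd_rows` is introduced here for illustration and is not part of the dataset.

```python
import csv


def load_cadd_rows(path):
    """Read a CSV file into a list of dicts keyed by the file's header row."""
    # encoding="latin_1" follows the note in this README;
    # newline="" is the setting the csv module recommends for file objects.
    with open(path, encoding="latin_1", newline="") as f:
        return list(csv.DictReader(f))
```

A call would look like `rows = load_cadd_rows("path/to/split.csv")`, where the path is a placeholder for one of the files in the `Dataset` folder. All values come back as strings, so label columns need an explicit `int(...)` conversion before numeric use.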

Ethical Guidelines

- Make no attempt to contact any user in the dataset.
- Make no attempt to deanonymize or learn the identity of any user in the dataset.
- Make no attempt to link users in the dataset with any external information (e.g., an account on another website).
- Do not share any portion of the data, including example posts or excerpts from posts, with any other party.

CADD

Dataset overview

| Field | Values | Description |
|---|---|---|
| Title | str: Context | Contextual information (Title + Body) |
| Body | str: Context | Contextual information (Title + Body) |
| Comment | str: Text | A target sentence to be classified. |
| L1: Type | {0,1,2,3} | 0: Non-abusive, 1: Hate speech, 2: Derogatory, 3: Profanity |
| L2: Abusiveness | {0,1} | 0: Non-abusive, 1: Abusive |
| L3: Target | {0,1} | 0: Non-targeted, 1: Targeted |
| L4: Demographic Characteristics | {0,1,2,3,4,5,6,7,8} | 0: None, 1: Gender, 2: Sexual orientation, 3: Race, 4: Religion, 5: Disability, 6: Age, 7: Others, 8: Unclear |
| L5: Implicitness | {0,1} | 0: None, 1: Implicit (Containing implicit attacks.) |
| L6: Profanity | {0,1} | 0: None, 1: Profanity (Containing any words expressing abusiveness.) |
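
To make the integer codes above easier to read during inspection, a small lookup can translate them back to names. This is only a sketch: the code-to-name pairs are transcribed from the table, but the keys `"L1"` through `"L6"` and the helper `decode_labels` are hypothetical, and the actual CSV header names may differ.

```python
# Code-to-name maps transcribed from the label table in this README.
TYPE = {0: "Non-abusive", 1: "Hate speech", 2: "Derogatory", 3: "Profanity"}
ABUSIVENESS = {0: "Non-abusive", 1: "Abusive"}
TARGET = {0: "Non-targeted", 1: "Targeted"}
DEMOGRAPHIC = {
    0: "None", 1: "Gender", 2: "Sexual orientation", 3: "Race", 4: "Religion",
    5: "Disability", 6: "Age", 7: "Others", 8: "Unclear",
}
IMPLICITNESS = {0: "None", 1: "Implicit"}
PROFANITY = {0: "None", 1: "Profanity"}

# Assumption: these column keys are placeholders, not confirmed CSV headers.
LABEL_MAPS = {
    "L1": TYPE, "L2": ABUSIVENESS, "L3": TARGET,
    "L4": DEMOGRAPHIC, "L5": IMPLICITNESS, "L6": PROFANITY,
}


def decode_labels(row):
    """Return a copy of row with each known label code replaced by its name."""
    decoded = dict(row)
    for key, names in LABEL_MAPS.items():
        if key in decoded:
            # Values read from CSV are strings, so convert before the lookup.
            decoded[key] = names[int(decoded[key])]
    return decoded
```

Columns that are not in `LABEL_MAPS`, such as the text fields, pass through unchanged. An out-of-range code raises `KeyError`, which is useful as a sanity check on the data.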

Data statistics

| Type | Train | Validation | Test | Total |
|---|---|---|---|---|
| Hate speech | 2,515 | 388 | 772 | 3,675 |
| Derogatory | 1,632 | 241 | 494 | 2,367 |
| Profanity | 4,595 | 631 | 1,339 | 6,565 |
| Non-abusive | 8,412 | 1,190 | 2,297 | 11,899 |
| All | 17,154 | 2,450 | 4,902 | 24,506 |
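
The counts above can be cross-checked and turned into class shares with a few lines of arithmetic. The sketch below hard-codes the per-class numbers from the table. The names `COUNTS`, `split_totals`, and `class_shares` are illustrative only.

```python
# Per-class counts transcribed from the statistics table: (train, validation, test).
COUNTS = {
    "Hate speech": (2515, 388, 772),
    "Derogatory": (1632, 241, 494),
    "Profanity": (4595, 631, 1339),
    "Non-abusive": (8412, 1190, 2297),
}


def split_totals(counts):
    """Sum each split across classes, returning (train, validation, test)."""
    return tuple(sum(column) for column in zip(*counts.values()))


def class_shares(counts):
    """Fraction of all examples that belongs to each class."""
    total = sum(sum(per_split) for per_split in counts.values())
    return {name: sum(per_split) / total for name, per_split in counts.items()}


train, validation, test = split_totals(COUNTS)
print(train, validation, test, train + validation + test)
for name, share in class_shares(COUNTS).items():
    print(f"{name}: {share:.1%}")
```

The split totals reproduce the "All" row (17,154 / 2,450 / 4,902, summing to 24,506), which corresponds to roughly a 70/10/20 split. Non-abusive comments make up a little under half of the data, so a classifier trained on the L1 labels may benefit from class weighting.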

Annotation scheme

Reference

https://aclanthology.org/2021.conll-1.43.pdf

@inproceedings{song2021large,
  title={{A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit}},
  author={Song, Hoyun and Ryu, Soo Hyun and Lee, Huije and Park, Jong C},
  booktitle={Proceedings of the 25th Conference on Computational Natural Language Learning},
  pages={552--561},
  year={2021}
}

License

These resources are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
