Investigate options for labeled data set #8

cdolfi · 2020-11-12T00:16:55Z

Following the late discovery about the data, the cleaned data set is now very different from the labeled data from September. The benefit of the September data is that it has over 1000 labeled emails contributed by about 20 different people. I am looking for suggestions on how to handle this mismatch. A few options could be:

-disregard the format differences and use the old labeled data
-make a small labeled set from the new data myself to go with the old one
-repeat the process of reaching out for help and hope for a similar response

Here is the old labeled data:
https://docs.google.com/spreadsheets/d/1Th2I1tgG0ivvV-Ubs-7ehVMor__tgCMU3SUpHWfXQmY/edit?usp=sharing

Reach out to me at [email protected] for information to get access to the new clean labeled data from my bucket

Acceptance Criteria:

Come up with a plan for which labeled data to use

MichaelClifford · 2020-11-12T14:53:13Z

@cdolfi is there anything special about the types of offensive language or hate speech on the Fedora mailing list? If not, you could probably look for an external dataset (here is one possible example) that labels emails or tweets as hateful/offensive and use it to train your model. Maybe I don't fully understand the goal of this project : ) but if the goal is to develop a hateful language detector that can then be applied to the Fedora mailing list, it's not clear to me that the mailing list itself would be the best source of training data, since it probably has a low occurrence of offensive language or hate speech (I'm guessing).

cdolfi · 2020-11-12T15:42:53Z

@MichaelClifford The only thing I found to be unique about the mailing list, compared to many data sets online, was the style of communication. People communicate very differently on a mailing list than they do on a Twitter feed. Another option I have been considering is combining a public data set like the one you mention above with a data set specific to communication in a semi-professional setting like the Fedora mailing list. My biggest concern with using a Twitter data set is that a model trained on it will not detect hateful or discriminatory language on the mailing list, because the way it is written is different.
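
One way to check the concern above is to mix the two sources for training while holding back part of the mailing-list labels purely for evaluation, so that any gap caused by the difference in writing style shows up in the scores. The following is only a minimal sketch under assumptions: `build_splits`, the `(text, label)` row format, and the placeholder rows are all hypothetical and are not taken from the actual data sets.

```python
import random


def build_splits(public_rows, mailing_rows, holdout_fraction=0.2, seed=0):
    """Combine two labeled sources and reserve in-domain rows for evaluation.

    Each row is a (text, label) tuple (an assumed format). The returned
    training rows carry a third field naming their source, so per-source
    weighting or analysis stays possible later.
    """
    rng = random.Random(seed)

    # Shuffle a copy of the mailing-list rows so the caller's list is untouched.
    shuffled = list(mailing_rows)
    rng.shuffle(shuffled)

    # Keep at least one in-domain row aside for evaluation.
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]
    in_domain_train = shuffled[n_holdout:]

    # Tag every training row with the source it came from.
    train = [(text, label, "public") for text, label in public_rows]
    train += [(text, label, "mailing_list") for text, label in in_domain_train]
    rng.shuffle(train)
    return train, holdout


# Placeholder rows only; real rows would come from the labeled spreadsheets.
public_rows = [(f"public example {i}", i % 2) for i in range(5)]
mailing_rows = [(f"mailing list example {i}", i % 2) for i in range(10)]

train, holdout = build_splits(public_rows, mailing_rows)
```

A model trained on `train` can then be scored on `holdout` alone. Training once with and once without the public rows gives a rough indication of whether the out-of-domain data helps or hurts on mailing-list text.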
