Investigate options for labeled data set #8

Open
1 task
cdolfi opened this issue Nov 12, 2020 · 2 comments

Comments

@cdolfi
Collaborator

cdolfi commented Nov 12, 2020

Because of the late discovery about the data, the cleaned data set is now very different from the labeled data from September. The benefit of the September data is that it has over 1000 labeled emails, contributed by about 20 different people. I am looking for suggestions on how to handle this mismatch. A few options could be:

  • Disregarding the format differences and using the old labeled data
    - Making a small labeled set from the new data myself to go with the old one
    - Repeating the process of reaching out for help and hoping for a similar response

Here is the old labeled data:
https://docs.google.com/spreadsheets/d/1Th2I1tgG0ivvV-Ubs-7ehVMor__tgCMU3SUpHWfXQmY/edit?usp=sharing

Reach out to me at [email protected] for information on getting access to the new clean labeled data from my bucket.

Acceptance Criteria:

  • Come up with a plan for which labeled data to use
@MichaelClifford
Member

@cdolfi is there anything about the types of offensive language or hate speech that is specific to the Fedora mailing list? If not, you could probably look for an external dataset (here is one possible example) that labels emails or tweets as hateful/offensive and use it to train your model. Maybe I don't fully understand the goal of this project : ) but if the goal is to develop a hateful language detector that can then be applied to the Fedora mailing list, it's not clear to me that the Fedora mailing list would be the best source of training data, since it probably has a low occurrence of things like offensive language or hate speech (I'm guessing).

@cdolfi
Collaborator Author

cdolfi commented Nov 12, 2020

@MichaelClifford The only thing I found to be unique about the mailing list, compared to many data sets online, was the style of communication. People communicate very differently on a mailing list than they do on a Twitter feed. Another option I have been considering is combining a public data set like the one you linked above with a data set specific to communication in a semi-professional setting like the Fedora mailing list. My biggest concern with using a Twitter data set is that a model trained on it will not detect hateful or discriminatory language on the mailing list, because the way it's written is different.
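
A rough sketch of what the mixed-data option could look like, not a worked-out plan. It assumes both data sets can be reduced to (text, label) pairs with the same label meaning; the rows, the `tag_source` and `combine` helpers, and the source names are all made up for illustration.

```python
from collections import Counter


def tag_source(rows, source):
    """Turn (text, label) pairs into dicts that remember where they came from."""
    return [
        {"text": text.strip(), "label": label, "source": source}
        for text, label in rows
    ]


def combine(*datasets):
    """Concatenate tagged data sets, skipping texts already seen (case-insensitive)."""
    seen = set()
    combined = []
    for rows in datasets:
        for row in rows:
            key = row["text"].lower()
            if key in seen:
                continue
            seen.add(key)
            combined.append(row)
    return combined


# Hypothetical placeholder rows: 1 = flagged, 0 = not flagged (assumed label scheme).
public_rows = [("example public post A", 1), ("example public post B", 0)]
mailing_list_rows = [("example mailing list reply", 0), ("Example public post A", 1)]

combined = combine(
    tag_source(public_rows, "public"),
    tag_source(mailing_list_rows, "mailing_list"),
)

# Label counts per source, to see how unbalanced each side is.
counts = Counter((row["source"], row["label"]) for row in combined)
```

Keeping the `source` field would make it possible to evaluate on mailing-list rows only, which is one way to check whether a model trained mostly on public data carries over to the mailing-list style.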

2 participants