With the late discovery about the data, the cleaned data set is now very different from the labeled data from September. The benefit of the September data is that it has over 1,000 labeled emails contributed by about 20 different people. I am looking for suggestions on different ways to handle the data issue. A few options could be:
-disregarding the format differences and using the old set
-making a small labeled set off of the new data myself to go with the old
-repeating the process of reaching out for help and hoping for a similar response
@cdolfi is there anything special about the types of offensive language or hate speech that is specific to the Fedora mailing list? If not, then you could probably look for another external dataset (here is one possible example) that labels emails or tweets as hateful/offensive to train your model. Maybe I don't fully understand the goal of this project : ) but if the goal is developing a hateful-language detector that can then be applied to the Fedora mailing list, it's not clear to me that the mailing list itself would be the best source for training data, since it probably has a low occurrence of things like offensive language or hate speech (I'm guessing).
@MichaelClifford The only thing I found to be unique about the mailing list compared to many data sets online was the style of communication. People communicate very differently on a mailing list than they do on a Twitter feed. Another option I have been considering is using some public data set like the one you linked above together with a data set unique to a semi-professional communication setting like the Fedora mailing list. My biggest concern with using a Twitter data set alone is that it will not detect the hateful or discriminatory language on the mailing list, since the way it's written is different.
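The combined-data idea above can be sketched as a single classifier trained on both corpora. This is only a minimal illustration: the sample texts, label convention (1 = offensive, 0 = not), and TF-IDF + logistic regression pipeline are my assumptions, not the project's actual data or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for a public (e.g. Twitter-style) labeled set ...
public_texts = [
    "you people are worthless",
    "what a beautiful day",
    "get lost, nobody wants you here",
    "congrats on the new release!",
]
public_labels = [1, 0, 1, 0]  # 1 = offensive/hateful, 0 = not

# ... and for a small hand-labeled mailing-list sample, so the model also
# sees the semi-professional register discussed above.
list_texts = [
    "this patch is garbage and so is its author",
    "thanks for the thorough review, merging now",
]
list_labels = [1, 0]

# Train one model on the combined corpus.
texts = public_texts + list_texts
labels = public_labels + list_labels

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["appreciate the quick turnaround on this"]))
```

A real version would weight or oversample the small mailing-list portion so the (much larger) public set does not dominate, and would evaluate on held-out mailing-list emails specifically.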
Here is the old labeled data set:
https://docs.google.com/spreadsheets/d/1Th2I1tgG0ivvV-Ubs-7ehVMor__tgCMU3SUpHWfXQmY/edit?usp=sharing
Reach out to me at [email protected] for information to get access to the new clean labeled data from my bucket
Acceptance Criteria: