Przetak is a library for checking whether a text contains abusive or vulgar speech in Polish. While it is written in Go, it can be used by programs written in many other languages thanks to FFI (Foreign Function Interface).
Przetak is resilient to:
- replicating letters,
- spacing out the words,
- inserting non-letters between letters,
- homograph spoofing, i.e. replacing letters with similar characters.
Also, thanks to its use of character 5-grams, it handles some frequent misspellings and out-of-vocabulary words composed of morphemes with an abusive or vulgar meaning.
Przetak took second place in PolEval 2019, the Polish cyberbullying detection contest. Here is a paper about Przetak, and here are the slides from my presentation at AI & NLP Workshop Day 2019.
First, get the package:
$ go get github.com/MarcinCiura/przetak
Change directory to your ${GOPATH}/src/github.com/MarcinCiura/przetak and run make to build the shared library. Depending on your operating system, the shared library will be called:
- libprzetak.so on Linux,
- libprzetak.dylib on macOS,
- przetak.dll on Windows.
Przetak's evaluate() function returns an integer in which the bits with values 1, 2, and 4 are set, respectively, if the input UTF-8 string contains:
- abusive words,
- vulgar words with negative connotations,
- vulgar words with positive connotations.
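The result is a bit mask, so a single return value can report more than one category at once. The sketch below decodes such a value. The constant names and the `describe` helper are assumptions made for illustration and may not match anything the library exports. The integer passed in stands for a value returned by evaluate().

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the three flag bits described above;
// the library itself may name them differently.
const (
	flagAbusive        = 1 // abusive words
	flagVulgarNegative = 2 // vulgar words with negative connotations
	flagVulgarPositive = 4 // vulgar words with positive connotations
)

// describe turns an evaluate() result into a human-readable summary.
func describe(result int) string {
	var labels []string
	if result&flagAbusive != 0 {
		labels = append(labels, "abusive")
	}
	if result&flagVulgarNegative != 0 {
		labels = append(labels, "vulgar (negative)")
	}
	if result&flagVulgarPositive != 0 {
		labels = append(labels, "vulgar (positive)")
	}
	if len(labels) == 0 {
		return "clean"
	}
	return strings.Join(labels, ", ")
}

func main() {
	// 5 = 1 | 4: abusive words plus vulgar words with positive connotations.
	fmt.Println(describe(5)) // prints "abusive, vulgar (positive)"
	fmt.Println(describe(0)) // prints "clean"
}
```

Testing each bit with a bitwise AND, rather than comparing the whole value, keeps the check correct when several bits are set together.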
The examples directory showcases the use of Przetak directly from Go and from several other programming languages via FFI (Foreign Function Interface).
Marcin Ciura
Przetak is licensed under Apache License, Version 2.0.