A transformer classifier trained on comments from the following 10 popular
subreddits: Bitcoin, CryptoCurrency, hiphopheads, pelotoncycle, soccer,
chelseafc, formula1, movies, popheads, wallstreetbets. The subreddits chosen
all have a popular recurring "discussion" post, and at most 100 comments were
scraped with `praw` from the last 100 discussion posts of each subreddit.
Oddly enough, these were the only subreddits I could find with a recurring
discussion thread that consistently had at least 100 comments in it.
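The scraping loop looked roughly like the sketch below. This is a minimal
illustration, not the repo's actual script: the credentials, the
"Daily Discussion" search query, and the output naming here are assumptions.

```python
import os
import praw

# Assumed credentials; a real script would read these from its own config.
reddit = praw.Reddit(
    client_id="...", client_secret="...", user_agent="reddit_classifier"
)

def scrape_discussions(subreddit_name: str, out_dir: str) -> None:
    """Save up to 100 comments from each of the last 100 discussion
    posts of a subreddit, one comment per .txt file."""
    os.makedirs(out_dir, exist_ok=True)
    subreddit = reddit.subreddit(subreddit_name)
    # The search query is an assumption; each subreddit names its
    # recurring thread slightly differently.
    for submission in subreddit.search("Daily Discussion", sort="new", limit=100):
        submission.comments.replace_more(limit=0)  # drop "load more" stubs
        for comment in submission.comments.list()[:100]:
            path = os.path.join(out_dir, f"{submission.id}_{comment.id}.txt")
            with open(path, "w") as f:
                f.write(comment.body)
```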
All of the scraped comments are found in `./data.tar.gz` and can be unzipped
with `tar -xzvf data.tar.gz`. The unzipped `data` folder has the following
structure:
```
data
├── <subreddit name>
│   ├── test
│   │   ├── <submission_id>_<comment_id>.txt
│   │   └── ...
│   ├── train
│   │   ├── <submission_id>_<comment_id>.txt
│   │   └── ...
│   └── val
│       ├── <submission_id>_<comment_id>.txt
│       └── ...
```
Above, `<submission_id>` and `<comment_id>` refer to
`praw.models.Submission.id` and `praw.models.Comment.id`, and each `.txt`
file contains exactly one comment. Each comment has been lowercased and
stripped of all non-ASCII characters, which is handled by
`src.data.utils.lower_text_and_remove_all_non_asci`.
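A minimal equivalent of that cleaning step, assuming it does nothing beyond
what its name says (the actual implementation is in `src/data/utils.py`):

```python
def lower_text_and_remove_all_non_asci(text: str) -> str:
    """Lowercase a comment and drop every non-ASCII character."""
    lowered = text.lower()
    return lowered.encode("ascii", errors="ignore").decode("ascii")

# e.g. "Héllo WORLD ✨" -> "hllo world "
```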
All of the discussion posts except the movies posts shared a common theme.
The movies discussion posts were each a discussion of a particular movie,
like this post, while the discussion posts from the other subreddits were all
general "daily discussion" type posts, like this soccer post. The latest
comments in the dataset occurred on April 11, 2024, so any comment created
after that date is new to the model.
Below are the results of the transformer on the test data after 40 epochs of
training and validation. The transformer had 6 multi-head attention layers
with 4 heads each and an embedding dimension of 256. A max token length of
256 was also chosen. The full configuration of the experiment is in
`./experiment/config.py`. I saved the best model based on validation
accuracy, which occurred on the 29th epoch.
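In PyTorch terms, an encoder with those hyperparameters might look like the
sketch below. This only illustrates the shape described above and is not the
repo's actual model class; the vocabulary size and mean-pooling choice are
assumptions, and `./experiment/config.py` holds the real settings.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10     # one per subreddit
VOCAB_SIZE = 30_000  # assumption; the real tokenizer defines this
MAX_LEN = 256        # max token length from the config above

class CommentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 256)
        self.pos = nn.Embedding(MAX_LEN, 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(256, NUM_CLASSES)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pool tokens, then classify
```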
There is an app in `./app`. To try it out you will need the weights found
here. The following will clone the repo, download the weights with `gdown`,
and install some of the requirements:
```
git clone https://github.com/nickeisenberg/reddit_classifier.git
cd reddit_classifier
gdown https://drive.google.com/uc?id=1R7JVUWz5h02T8c7UqsWI8pZjkDjfKvKa
mv validation_ckp.pth app
pip install -r rec.app.txt
```
You also need to install the CPU version of PyTorch for your OS, found here.
Then you can start the app with `python3 app`.
Alternatively, you can pull the Docker image and run it with:

```
docker pull nickeisenberg/reddit-comment-classifier:v0.0
docker run -p 5000:5000 nickeisenberg/reddit-comment-classifier:v0.0
```
Here are two comments from April 27th's soccer and chelseafc daily
discussions:

- Soccer: "Dante since turning 35 has already racked up almost 200 appearances for OGC Nice. Honestly fair play to him he was never the greatest CB on the planet but staying at a mid-level club in a top league into your 40's is always very impressive"
- Chelseafc: "Knowing he is for sale , Martín Zubimendi is the defensive midfielder we should have went in for over Caicedo."
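To classify a new comment like these programmatically, something like the
sketch below should work, assuming the app serves HTTP on port 5000 (as the
Docker command above suggests) and exposes a prediction route. The `/predict`
path and the `{"text": ...}` payload are assumptions, not a documented API,
so check `./app` for the actual route.

```python
import requests

comment = (
    "Knowing he is for sale , Martín Zubimendi is the defensive "
    "midfielder we should have went in for over Caicedo."
)

# NOTE: the /predict endpoint and the JSON payload shape are assumptions;
# inspect ./app for the route the server actually exposes.
response = requests.post("http://localhost:5000/predict", json={"text": comment})
print(response.json())
```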