Subreddit Classifier

We used a publicly available dataset of the top 1000 posts from the 50 largest subreddits and trained a model to classify Reddit post titles and text bodies into these 50 subreddits. The dataset is included in this repository in archive/ for training purposes.

This repository also includes a frontend for interacting with the model. It displays the probability estimates for the most likely classifications, along with a warning if the model is likely unsure of its prediction.
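As a rough illustration of how the most likely classifications and the warning could be derived from a vector of class probabilities, here is a minimal sketch. The function name `summarize_prediction`, the `top_k` parameter, and the 0.5 threshold are assumptions made for this example; this README does not state how the project actually decides that the model is unsure.

```python
def summarize_prediction(probabilities, labels, top_k=3, unsure_below=0.5):
    """Return the top_k (label, probability) pairs and an 'unsure' flag.

    Hypothetical helper: the threshold rule is an assumption, not the
    project's documented behavior.
    """
    # Pair each label with its probability and sort from most to least likely.
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    top = ranked[:top_k]
    # Flag the prediction when even the best class falls below the threshold.
    unsure = top[0][1] < unsure_below
    return {"top": top, "unsure": unsure}


result = summarize_prediction([0.1, 0.6, 0.3], ["aww", "news", "science"], top_k=2)
```

A frontend could render `result["top"]` as a ranked list and show the warning whenever `result["unsure"]` is true.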

Setup

Run each cell in the Subreddits.ipynb notebook in order, culminating in a trained model and a running Flask server.
With the Flask server from the final notebook cell running, serve the frontend at frontend/index.html.

Note that our Flask backend is currently hardcoded to localhost:3000, which may need to be changed if that port is in use.
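One way to avoid editing the hardcoded port by hand is to read it from an environment variable and fall back to 3000. The sketch below is only a suggestion: the variable name `SUBREDDIT_API_PORT` and the helper `resolve_port` are hypothetical and are not part of this repository.

```python
import os

DEFAULT_PORT = 3000  # the port this README says is currently hardcoded


def resolve_port(environ=os.environ, default=DEFAULT_PORT):
    """Read the port from the hypothetical SUBREDDIT_API_PORT variable."""
    raw = environ.get("SUBREDDIT_API_PORT")
    if raw is None:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port
```

The result could then be passed to Flask, for example `app.run(port=resolve_port())`. Any address the frontend uses to reach the backend would need to be changed to match.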

Open-Source Code

Our project is built on several open-source libraries, namely:

flask and flask_cors (Web app framework to serve a model prediction API)
nltk (Natural language toolkit we used for extracting words from posts)
numpy
pandas
re (Python standard library regex operations)
scikit-learn
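To give a feel for the word-extraction step mentioned in the list above, here is a minimal stand-in that uses only the standard-library `re` module rather than nltk. The pattern and the `extract_words` helper are assumptions for illustration; the project's actual nltk calls are not shown in this README.

```python
import re

# Lowercase letters, digits, and apostrophes count as word characters here.
WORD_PATTERN = re.compile(r"[a-z0-9']+")


def extract_words(title, body=""):
    """Join a post's title and body, lowercase them, and return the words."""
    text = f"{title} {body}".lower()
    return WORD_PATTERN.findall(text)


words = extract_words("TIL: Octopuses have three hearts!", "Source in comments.")
```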

Additionally, our model is based on initial code from a scikit-learn tutorial on text classification. We used this as a starting point for setting up our model, but we preprocessed the data ourselves, performed our own search over the parameters we felt were useful, and created the Flask server and frontend ourselves.
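For readers unfamiliar with this kind of setup, the sketch below shows a generic scikit-learn text-classification pipeline with a small parameter search. It is not the project's actual model: the choice of `TfidfVectorizer` and `MultinomialNB`, the parameter grid, and the tiny made-up corpus are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus standing in for the real posts in archive/.
texts = [
    "cute puppy sleeping on the couch",
    "my kitten found a cardboard box",
    "puppy meets kitten for the first time",
    "adorable kitten and puppy nap together",
    "python function returns the wrong value",
    "how to debug a segfault in c code",
    "best way to write unit tests in python",
    "code review tips for python function design",
]
labels = ["aww"] * 4 + ["programming"] * 4

# Turn text into TF-IDF features, then classify with naive Bayes.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Hypothetical grid: try unigrams vs. unigrams plus bigrams, and two smoothing values.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# 2-fold cross-validation keeps two posts per class in each fold.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Per-class probability estimates for a new post, in the order of `classes`.
classes = list(search.best_estimator_.classes_)
probabilities = search.predict_proba(["sleepy kitten and puppy"])[0]
```

The per-class probabilities returned by `predict_proba` are the kind of values a prediction API could send to the frontend.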
