Subreddit Classifier

We used a publicly available dataset of the top 1000 posts from each of the 50 largest subreddits to train a model that classifies Reddit post titles and (string) bodies into these fifty subreddits. The dataset is included in this repository under archive/ for training purposes.

Additionally, this repository includes a frontend for interacting with the model. It displays the probability estimates for the most likely classifications, along with a warning when the model appears unsure of its prediction.
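As a rough illustration of the "top classifications plus warning" behavior described above, here is a minimal sketch in plain Python. The function name `summarize_prediction`, the 0.5 threshold, and the example subreddit names and probabilities are all assumptions made for illustration; the actual frontend may rank and flag predictions differently.

```python
from typing import Dict, List, Tuple

# Hypothetical cutoff: if the best class scores below this, flag the prediction as unsure.
UNSURE_THRESHOLD = 0.5


def summarize_prediction(
    probabilities: Dict[str, float],
    top_k: int = 3,
    threshold: float = UNSURE_THRESHOLD,
) -> Tuple[List[Tuple[str, float]], bool]:
    """Return the top_k (label, probability) pairs and an 'unsure' flag."""
    # Sort labels from most to least probable.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:top_k]
    # Warn when there is no prediction or the best one is below the threshold.
    unsure = not top or top[0][1] < threshold
    return top, unsure


# Illustrative probabilities only; these are not results from the real model.
example = {"AskReddit": 0.42, "funny": 0.31, "pics": 0.15, "news": 0.12}
top, unsure = summarize_prediction(example)
```

A fixed threshold on the top probability is only one possible uncertainty signal; the gap between the first and second class would be another reasonable choice.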

Setup

  1. Run each cell in the Subreddits.ipynb notebook in order, culminating in a trained model and a running Flask server.
  2. With the Flask server in the final notebook cell running, serve the frontend at frontend/index.html.

Note that our Flask backend is currently hardcoded to localhost:3000, which you may need to change if that port is already in use.
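One way to avoid editing the hardcoded port by hand is to read it from an environment variable and fall back to 3000. The sketch below is an assumption about how this could be done, not how the notebook currently works; the variable name `BACKEND_PORT` and the helper `resolve_port` are hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 3000  # the port the backend is currently hardcoded to


def resolve_port(
    environ: Optional[Mapping[str, str]] = None,
    variable: str = "BACKEND_PORT",  # hypothetical variable name
) -> int:
    """Return the port from the environment, or DEFAULT_PORT if unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(variable)
    if raw is None or raw.strip() == "":
        return DEFAULT_PORT
    port = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"{variable} must be between 1 and 65535, got {port}")
    return port
```

The result could then be passed to Flask as `app.run(port=resolve_port())`. Any URL the frontend uses to reach the backend would need to be updated to the same port.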

Open-Source Code

Our project is built off of several open-source libraries, namely:

Additionally, our model is based on initial code from a scikit-learn tutorial on text classification. We used this as a starting point for setting up our model, but we preprocessed the data ourselves, ran our own parameter search over the parameters we felt were useful, and created the Flask server and frontend ourselves.
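For readers unfamiliar with that style of setup, the following is a minimal sketch of a scikit-learn text-classification pipeline with a small parameter search. It is not the model in Subreddits.ipynb: the choice of `TfidfVectorizer` and `MultinomialNB`, the parameter grid, and the toy texts and labels are all assumptions made for illustration, and the real notebook may use different estimators and parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in data (assumed); the real training data lives in archive/.
texts = [
    "how do I sort a list in python",
    "python decorators explained",
    "pip install fails on windows",
    "best way to parse json in python",
    "how long to roast a chicken",
    "easy weeknight pasta recipe",
    "why is my bread dough not rising",
    "knife skills for beginners",
]
labels = ["python"] * 4 + ["cooking"] * 4

# Vectorize the text, then classify it.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# Hypothetical grid; keys follow the "<step>__<parameter>" convention.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# cv=2 keeps the toy example small; a real search would use more folds.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

# Per-class probability estimates for a new post title.
probabilities = search.predict_proba(["how do I sort a list in python"])[0]
predicted = search.classes_[probabilities.argmax()]
```

After fitting, `search.best_params_` holds the selected parameter combination, and the per-class probabilities are the kind of estimates a frontend could display.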