This project consists of flair prediction on Reddit posts from r/india (a flair is a label used in some subreddits to categorize posts submitted by users). EDA has been performed on the collected subreddit data. PRAW, the Python Reddit API Wrapper, is used to collect data for the following flairs:
"AskIndia", "Coronavirus", "Non-Political", "Scheduled", "Photography", "Science/Technology", "Politics", "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA".
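As a minimal sketch (not the project's actual collection script), gathering posts per flair with PRAW could look like this; the credentials and the per-flair search query are placeholders:

```python
FLAIRS = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled",
          "Photography", "Science/Technology", "Politics",
          "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]

def collect_posts(limit_per_flair=100):
    """Collect r/india posts for each flair via PRAW (credentials are placeholders)."""
    import praw  # imported here so the sketch is self-contained

    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="flair-scraper")
    subreddit = reddit.subreddit("india")
    rows = []
    for flair in FLAIRS:
        # search restricted to a single flair
        for post in subreddit.search(f'flair:"{flair}"', limit=limit_per_flair):
            rows.append({"title": post.title, "body": post.selftext,
                         "score": post.score, "num_comments": post.num_comments,
                         "flair": flair})
    return rows
```

The rows can then be dumped to a CSV (as the project's "Flair_csv" file is) for the EDA and modeling steps below.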
EDA:
- Most Popular Words (WordCloud)
- Posts with fewer than 10 votes (Histogram)
- Most Popular Posts (Bar Plot)
- Most Commented Posts (Bar Plot)
- No. of Comments vs. Score (Regression Plots)
- Top 10 Authors
- Text Cleaning and Analysis
- Bag of Words on 2 Posts of the Same Flair
- XGBoost Classifier
- Feature Importance for Flair Prediction, etc.
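The "most popular words" step boils down to word-frequency counts, which are what a word cloud is drawn from. A library-free stand-in (the notebook itself uses the wordcloud package; the titles here are made up):

```python
from collections import Counter

# toy post titles standing in for the collected r/india data
titles = ["covid cases rise", "covid lockdown news", "cricket match today"]

# count every word across all titles; a WordCloud sizes words by these counts
counts = Counter(word for title in titles for word in title.split())
print(counts.most_common(2))  # [('covid', 2), ...]
```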
Prediction algorithms include:
- Logistic Regression (sklearn: CountVectorizer -> TfidfTransformer -> Logistic Regression)
- SVM (sklearn: CountVectorizer -> TfidfTransformer -> SVM)
- Naive Bayes (sklearn: CountVectorizer -> TfidfTransformer -> Naive Bayes)
- 1-D Convolution (TensorFlow/Keras: tokenization -> GloVe embeddings -> 1-D Convolution)
- LSTM (TensorFlow/Keras: tokenization -> GloVe embeddings -> LSTM)
- Bidirectional LSTM (TensorFlow/Keras: tokenization -> GloVe embeddings -> Bidirectional LSTM)
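The sklearn variants share the same CountVectorizer -> TfidfTransformer -> classifier shape, so they fit naturally into a Pipeline. A minimal sketch with the Logistic Regression model, trained on a tiny made-up sample for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# toy stand-in for the collected post texts and their flairs
texts = ["covid cases rise in delhi", "lockdown extended coronavirus",
         "team india wins the cricket match", "sports stadium reopens"]
labels = ["Coronavirus", "Coronavirus", "Sports", "Sports"]

clf = Pipeline([
    ("vect", CountVectorizer()),      # raw text -> token counts
    ("tfidf", TfidfTransformer()),    # token counts -> TF-IDF weights
    ("model", LogisticRegression()),  # linear classifier on TF-IDF features
])
clf.fit(texts, labels)
prediction = clf.predict(["coronavirus vaccine news"])
```

Swapping `LogisticRegression()` for `LinearSVC()` or `MultinomialNB()` gives the SVM and Naive Bayes variants with no other changes.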
- Visit: http://13.234.217.64/
- Enter a Reddit post from the r/india subreddit.
- Clone the repository: git clone https://github.com/ankurbhatia24/Reddit_EDA.git
- Edit the file test.txt, adding one r/india post link per line.
- Run the Python file: python3 post_text_file.py
- Clone the repository: git clone https://github.com/ankurbhatia24/Reddit_EDA.git
- Create a virtual environment: virtualenv env
- Activate virtual environment: source env/bin/activate
- Install the requirements: pip3 install -r requirements.txt
- "Flair_csv" is the dataset file. To collect your own data, run "Reddit Data Collection.ipynb", changing the Reddit instance credentials accordingly (read https://towardsdatascience.com/scraping-reddit-with-praw-76efc1d1e1d9 to set up a Reddit app).
- After installing the required libraries, you can test the code in the 'IPYNB' directory.
In 'Data_pre-procesing and Model Evaluation.ipynb', three models are defined after preprocessing of the data.
- The collected Reddit data is cleaned and preprocessed according to each model's needs. The text is cleaned of punctuation, emojis, special characters, stopwords (commonly used words such as "the", "a", "an", "in"), etc.
- The raw text is tokenized (each word/token is assigned a number according to a dictionary) and converted from words to vectors using GloVe embeddings (300d).
- The flairs to be predicted (flairs = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled", "Photography", "Science/Technology", "Politics", "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]) are converted to one-hot encodings.
- Finally, the data is split into training and test sets.
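The preprocessing steps above can be sketched end to end as follows. This is an illustrative version, not the notebook's code: the stopword list is a small subset, the sample posts are made up, and the 80/20 split ratio is an assumption.

```python
import re
from sklearn.model_selection import train_test_split

STOPWORDS = {"the", "a", "an", "in", "is", "of", "and", "to"}
FLAIRS = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled",
          "Photography", "Science/Technology", "Politics",
          "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]
FLAIR_INDEX = {f: i for i, f in enumerate(FLAIRS)}

def clean_text(text):
    """Lowercase, strip punctuation/emojis/special characters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return " ".join(t for t in text.split() if t not in STOPWORDS)

def one_hot(flair):
    """One-hot encode a flair label (pure-Python stand-in for to_categorical)."""
    vec = [0] * len(FLAIRS)
    vec[FLAIR_INDEX[flair]] = 1
    return vec

# toy stand-in for the collected posts and their flairs
posts = ["COVID-19 cases rise in Delhi!!", "Team India wins the match",
         "Street food of Mumbai", "New economic policy announced",
         "Monsoon photography thread"]
labels = ["Coronavirus", "Sports", "Food", "Policy/Economy", "Photography"]

X = [clean_text(p) for p in posts]
y = [one_hot(l) for l in labels]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

For the Keras models, the cleaned text would additionally pass through a tokenizer and a GloVe-initialized embedding layer, as described above.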