This project consists of flair prediction on Reddit posts from r/india (a flair is a label used in some subreddits to categorize posts submitted by users). EDA has been performed on the collected subreddit data. PRAW, the Python Reddit API Wrapper, is used to collect data for the following flairs:
"AskIndia", "Coronavirus", "Non-Political", "Scheduled", "Photography", "Science/Technology", "Politics", "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA".
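As a minimal sketch (not the project's actual collection script), gathering posts per flair with PRAW could look like this; the credentials and the per-flair search query are placeholders:

```python
FLAIRS = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled",
          "Photography", "Science/Technology", "Politics",
          "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]

def collect_posts(limit_per_flair=100):
    """Collect r/india posts for each flair via PRAW (credentials are placeholders)."""
    import praw  # imported here so the sketch is self-contained

    reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                         client_secret="YOUR_CLIENT_SECRET",
                         user_agent="flair-scraper")
    subreddit = reddit.subreddit("india")
    rows = []
    for flair in FLAIRS:
        # search restricted to a single flair
        for post in subreddit.search(f'flair:"{flair}"', limit=limit_per_flair):
            rows.append({"title": post.title, "body": post.selftext,
                         "score": post.score, "num_comments": post.num_comments,
                         "flair": flair})
    return rows
```

The rows can then be dumped to a CSV (as the project's "Flair_csv" file is) for the EDA and modeling steps below.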
EDA:
- Most Popular Words (WordCloud)
- Posts with fewer than 10 votes (Histogram)
- Most Popular Posts (Bar Plot)
- Most Commented Posts (Bar Plot)
- No. of Comments vs. Score (Regression Plots)
- Top 10 Authors
- Text Cleaning and Analysis
- Bag of Words on 2 Posts of the Same Flair
- XGBoost Classifier
- Feature Importance for Flair Prediction, etc.
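The "most popular words" step boils down to word-frequency counts, which are what a word cloud is drawn from. A library-free stand-in (the notebook itself uses the wordcloud package; the titles here are made up):

```python
from collections import Counter

# toy post titles standing in for the collected r/india data
titles = ["covid cases rise", "covid lockdown news", "cricket match today"]

# count every word across all titles; a WordCloud sizes words by these counts
counts = Counter(word for title in titles for word in title.split())
print(counts.most_common(2))  # [('covid', 2), ...]
```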
Prediction algorithms include:
- Logistic Regression (sklearn: CountVectorizer -> TfidfTransformer -> Logistic Regression)
- SVM (sklearn: CountVectorizer -> TfidfTransformer -> SVM)
- Naive Bayes (sklearn: CountVectorizer -> TfidfTransformer -> Naive Bayes)
- 1-D Convolution (TensorFlow/Keras: tokenization -> GloVe embeddings -> 1-D Convolution)
- LSTM (TensorFlow/Keras: tokenization -> GloVe embeddings -> LSTM)
- Bidirectional LSTM (TensorFlow/Keras: tokenization -> GloVe embeddings -> Bidirectional LSTM)
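The sklearn variants share the same CountVectorizer -> TfidfTransformer -> classifier shape, so they fit naturally into a Pipeline. A minimal sketch with the Logistic Regression model, trained on a tiny made-up sample for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# toy stand-in for the collected post texts and their flairs
texts = ["covid cases rise in delhi", "lockdown extended coronavirus",
         "team india wins the cricket match", "sports stadium reopens"]
labels = ["Coronavirus", "Coronavirus", "Sports", "Sports"]

clf = Pipeline([
    ("vect", CountVectorizer()),      # raw text -> token counts
    ("tfidf", TfidfTransformer()),    # token counts -> TF-IDF weights
    ("model", LogisticRegression()),  # linear classifier on TF-IDF features
])
clf.fit(texts, labels)
prediction = clf.predict(["coronavirus vaccine news"])
```

Swapping `LogisticRegression()` for `LinearSVC()` or `MultinomialNB()` gives the SVM and Naive Bayes variants with no other changes.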
- Visit: http://13.234.217.64/
- Enter a Reddit post from the r/india subreddit.
- Clone the repository: git clone https://github.com/ankurbhatia24/Reddit_EDA.git
- Edit the file test.txt, adding one r/india post link per line.
- Run the Python file: python3 post_text_file.py
- Clone the repository: git clone https://github.com/ankurbhatia24/Reddit_EDA.git
- Create a virtual environment: virtualenv env
- Activate virtual environment: source env/bin/activate
- Install the requirements: pip3 install -r requirements.txt
- "Flair_csv" is the dataset file. To collect your own data, run "Reddit Data Collection.ipynb", changing the Reddit instance credentials accordingly (read https://towardsdatascience.com/scraping-reddit-with-praw-76efc1d1e1d9 to set up a Reddit app).
- After installing the required libraries, you can test the code in the 'IPYNB' directory.
In 'Data_pre-procesing and Model Evaluation.ipynb', three models are defined after preprocessing of the data.
- The collected Reddit data is cleaned and preprocessed according to each model's needs. The text is cleaned of punctuation, emojis, special characters, stopwords (commonly used words such as "the", "a", "an", "in"), etc.
- The raw text is tokenized (each word/token is assigned a number according to a dictionary) and converted from words to vectors using GloVe embeddings (300d).
- The flairs to be predicted (flairs = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled", "Photography", "Science/Technology", "Politics", "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]) are converted to one-hot encodings.
- Finally, the data is split into training and test sets.
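The preprocessing steps above can be sketched end to end as follows. This is an illustrative version, not the notebook's code: the stopword list is a small subset, the sample posts are made up, and the 80/20 split ratio is an assumption.

```python
import re
from sklearn.model_selection import train_test_split

STOPWORDS = {"the", "a", "an", "in", "is", "of", "and", "to"}
FLAIRS = ["AskIndia", "Coronavirus", "Non-Political", "Scheduled",
          "Photography", "Science/Technology", "Politics",
          "Business/Finance", "Policy/Economy", "Sports", "Food", "AMA"]
FLAIR_INDEX = {f: i for i, f in enumerate(FLAIRS)}

def clean_text(text):
    """Lowercase, strip punctuation/emojis/special characters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    return " ".join(t for t in text.split() if t not in STOPWORDS)

def one_hot(flair):
    """One-hot encode a flair label (pure-Python stand-in for to_categorical)."""
    vec = [0] * len(FLAIRS)
    vec[FLAIR_INDEX[flair]] = 1
    return vec

# toy stand-in for the collected posts and their flairs
posts = ["COVID-19 cases rise in Delhi!!", "Team India wins the match",
         "Street food of Mumbai", "New economic policy announced",
         "Monsoon photography thread"]
labels = ["Coronavirus", "Sports", "Food", "Policy/Economy", "Photography"]

X = [clean_text(p) for p in posts]
y = [one_hot(l) for l in labels]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```

For the Keras models, the cleaned text would additionally pass through a tokenizer and a GloVe-initialized embedding layer, as described above.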