Spam detection on Twitter
Python school project developed by three students from CentraleSupélec: Hélène, Valentin, and Delphine.
We consider spam to be anything that is not related to a news event or a reaction to one.
Install the dependencies with pip (preferably in a virtual environment):

```bash
pip install -r requirements.txt
```
Add a `config.py` file containing the Twitter API and MongoDB credentials:

```python
import os

# OAuth authentication keys for the Twitter API
ACCESS = (
    "consumer_key",         # placeholder values: replace with
    "consumer_secret",      # your own Twitter API credentials
    "oauth_token",
    "oauth_token_secret",
)
# Restrict tweets to a single language, given by an ISO 639-1 code.
# Language detection is best-effort.
# To fetch all tweets, set LANG = None
LANG = 'fr'
# Directory where tweet files are stored (placeholder paths, replace with your own)
FILEDIR = "/path/to/tweet/files/"
PROJECT_DIR = "/path/to/tweet-noise/"
DATA_DIR = PROJECT_DIR + "data/"
# Number of tweets in each file
FILEBREAK = 1000
PROXY = {'http': '', 'https': ''}
# MongoDB cluster config
MONGODB = {
    "USER": "your_user",              # placeholder values:
    "PASSWORD": "your_password",      # replace with your own cluster settings
    "HOST": "your_cluster_host",
    "PORT": 27017,
    "DATABASE": "your_database"
}
# Features file
current_file = FILEDIR + "tweets_data.csv"

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

# For reading/writing in a Google Spreadsheet
google_api_key_file = ROOT_DIR + '/client_secret.json'
```
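For reference, here is a minimal sketch of how these settings might be used to open a MongoDB connection. It assumes `pymongo` is installed and that tweets are stored in a collection called `tweets`; both are assumptions, not part of the project code.

```python
# Minimal sketch (assumes pymongo); the "tweets" collection name is a guess.
from pymongo import MongoClient

from config import MONGODB

# Build a standard MongoDB connection URI from the config dictionary.
uri = "mongodb://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}".format(**MONGODB)

client = MongoClient(uri)
db = client[MONGODB["DATABASE"]]

# Count the tweets already stored in the (hypothetical) "tweets" collection.
print(db["tweets"].count_documents({}))
```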
- `main.py`: split the dataset into test and train sets.
- Save tweets in MongoDB (`tweetsUpload/`):
    - `streamingAPI.py`: fetch data from the Twitter API and save it in MongoDB
    - `loadLocalTweets.py`: save tweets from a JSON file in MongoDB
- `dataLabelling.py`: small algorithm to ease the data labelling process
- `classification.py`: fetch the features tables and categorize the different features
- Data Visualisation:
    - `featuresAnalysis.py`: plots with matplotlib and seaborn
    - `dataViz.py`: matplotlib tests (TO BE REMOVED)
    - `randomForestVisualization.py`: plot a random forest tree
    - `retweetFavoriteAnalysis.py`
- Classification:
    - `IfClassification.py`: simple if/else classification
    - `scikitClassification.py`: test and compare different scikit-learn classifiers (see the sketch after this list)
    - `KNearest.py`: K Nearest Neighbours classifier
    - `randomForest.py`: Random Forest classifier
    - `supportVectorMachine.py`: Support Vector Machine classifier
    - `classification2.py`
- Classification (npm):
    - `bayesnpm/`: a simple text classifier built with the Bayes npm package
- Features:
    - `dictBuilder.py`
    - `jsonBuilder.py`: fetch data from MongoDB and build two JSON files, one for spam and one for info
    - `arrayBuilder.py`: fetch data from MongoDB and build an array of all the tweets (text and label only)
    - `featuresBuilder.py`: fetch data from MongoDB and build the features table
    - `Keywords.py`: lists of keywords considered as spam words, white words and stop words, plus a list of emojis
    - `Medias.py`: list of media Twitter accounts
    - `clusterFeatures.py`: from a CSV of clustered tweets, return the number of medias, URLs and hashtags per cluster
- Text Clustering:
    - `textClustering.py`: text processing and a TF-IDF vectorizer fitted on a k-means model
    - `doc2vect.py`: build a CSV of vectorized (300x300) tweets fetched from the database
    - `K-means.py`: from a CSV of vectorized tweets, return the cluster predictions made with k-means
- Tweets Clustering
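To make the classifier comparison concrete, here is a hedged sketch of what a script like `scikitClassification.py` could do. The CSV path, the `spam` label column, and the chosen models are assumptions for illustration, not the project's actual code.

```python
# Illustrative comparison of several scikit-learn classifiers on a features CSV.
# Column names and the CSV path are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

df = pd.read_csv("data/tweets_data.csv")
X = df.drop(columns=["spam"])  # assumes the remaining columns are numeric features
y = df["spam"]                 # assumed binary label column

classifiers = {
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf", gamma="scale"),
}

# 5-fold cross-validated accuracy for each model.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```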
1. Fetch data from the Twitter API by running `tweetsUpload/streamingAPI.py`, or upload a JSON file of tweets by running `tweetsUpload/loadLocalTweets.py` (see the streaming sketch after this list).
2. Build the features file by running `features/featuresBuilder.py`, or build a CSV file of tweet texts by running `arrayBuilder.py`.
3. Run `features_analysis.py` from `Data Visualisation/`.
4. Run `classification/scikitClassification.py` to compare different scikit-learn classifiers.
5. Run `testClustering.py` or `tweets-clustering/__init__.py`.
6. Run `main.py`.
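As an illustration of step 1, the streaming upload could look roughly like the sketch below. It assumes `tweepy` 3.x and `pymongo`, and the `tweets` collection name is a guess; the actual `streamingAPI.py` may differ.

```python
# Illustrative sketch only (assumes tweepy 3.x and pymongo).
import tweepy
from pymongo import MongoClient

from config import ACCESS, LANG, MONGODB

class MongoListener(tweepy.StreamListener):
    """Save every incoming tweet as a raw JSON document in MongoDB."""

    def __init__(self, collection):
        super().__init__()
        self.collection = collection

    def on_status(self, status):
        self.collection.insert_one(status._json)

    def on_error(self, status_code):
        return False  # stop streaming on any API error (e.g. rate limiting)

# Reuse the connection settings from config.py ("tweets" is a hypothetical collection).
uri = "mongodb://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}".format(**MONGODB)
collection = MongoClient(uri)[MONGODB["DATABASE"]]["tweets"]

consumer_key, consumer_secret, oauth_token, oauth_token_secret = ACCESS
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(oauth_token, oauth_token_secret)

stream = tweepy.Stream(auth=auth, listener=MongoListener(collection))
stream.sample(languages=[LANG] if LANG else None)  # random sample of public tweets
```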
Structure of a tweet document stored in MongoDB (a field-extraction sketch follows the list):

* : used for analysis
° : to be considered
- _id
- created_at : str
- id : int
- id_str : str *
- text : str *
- source : str
- truncated : bool
- in_reply_to_status_id
- in_reply_to_status_id_str
- in_reply_to_user_id
- in_reply_to_user_id_str
- in_reply_to_screen_name
- user : object
    - id : int
    - id_str : str
    - name : str
    - screen_name : str
    - location : str
    - url : str
    - description : str
    - translator_type : str
    - protected : bool
    - verified : bool *
    - followers_count : int *
    - friends_count : int *
    - listed_count : int
    - favourites_count : int
    - statuses_count : int *
    - created_at : str *
    - utc_offset
    - time_zone
    - geo_enabled : bool
    - lang : str
    - contributors_enabled : bool
    - is_translator : bool
    - profile_background_color : str
    - profile_background_image_url : str
    - profile_background_image_url_https : str
    - profile_background_tile : bool
    - profile_link_color : str
    - profile_sidebar_border_color : str
    - profile_sidebar_fill_color : str
    - profile_text_color : str
    - profile_use_background_image : bool
    - profile_image_url : str
    - profile_image_url_https : str
    - profile_banner_url : str
    - default_profile : bool
    - default_profile_image : bool
    - following
    - follow_request_sent
    - notifications
- geo
- coordinates
- place
- contributors
- is_quote_status : bool
- extended_tweet : object
    - full_text : str
    - display_text_range : list
    - entities : object
        - hashtags : list
        - urls : list
        - user_mentions : list
        - symbols : list
        - media : list
    - extended_entities : object
        - media : list
- quote_count : int
- reply_count : int
- retweet_count : int
- favorite_count : int
- entities : object °
    - hashtags : list *
        - text : str
        - indices : list
    - urls : list
        - url : str
        - expanded_url : str
        - display_url : str
        - indices : list
    - user_mentions : list
        - screen_name : str
        - name : str
        - id : int
        - id_str : str
        - indices : list
    - symbols : list
    - media : list
- hashtags : list *
- favorited : bool
- retweeted : bool
- possibly_sensitive : bool
- filter_level : str
- lang : str
- timestamp_ms : str *
- spam : bool *
- type : str *
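For illustration, the fields marked with `*` above could be flattened into a single feature row roughly as follows. The output keys are made-up names, not necessarily those produced by `featuresBuilder.py`.

```python
# Illustrative sketch: flatten the starred fields of one stored tweet document.
def tweet_to_features(tweet):
    """Map one raw tweet document (as stored in MongoDB) to a flat feature dict."""
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    return {
        "id_str": tweet.get("id_str"),
        "text": tweet.get("text"),
        "verified": user.get("verified"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "statuses_count": user.get("statuses_count"),
        "account_created_at": user.get("created_at"),
        "nb_hashtags": len(entities.get("hashtags", [])),
        "timestamp_ms": tweet.get("timestamp_ms"),
        "spam": tweet.get("spam"),  # manual label added during data labelling
        "type": tweet.get("type"),  # project-specific label stored with the tweet
    }
```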