Twitter activity during protests and mobilization in Belarus (2020)

Description

This repo contains the code used for the collection, descriptive quantitative analysis, and topic modelling of Twitter data related to the 2020 crisis in Belarus. The work was done as part of the MOBILISE research project. Each step of our analysis is described in detail here. The main goal of this repo is to allow researchers to replicate our methodology and to repurpose the code for future projects.

1/ Collection of tweets

Note: it is now faster to collect tweets using the official Tweet Downloader. Input your queries (see belarus_queries.py) and download the CSV files directly. Make sure to check all fields and extensions except polls.
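Queries follow the standard Twitter API V2 search syntax. The query below is a hypothetical illustration of that format, not one taken from belarus_queries.py:

# Hypothetical example of a search query: a group of hashtags OR-ed together,
# excluding retweets and restricted to one language.
example_query = "(#belarus OR #minsk OR #lukashenko) -is:retweet lang:ru"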


Original batch

Early data was collected using the R package rtweet, which requires legacy API V1 credentials from Twitter.

  • rtweet_collect.R was executed weekly. Input your own credentials.

Second batch

Additional data was collected using Twitter API V2 Academic access and the Python library Twarc. We covered missing dates and performed keyword augmentation (new hashtags). A minimal search sketch follows the list below.

  • pip install twarc==2.9.5
  • Run make_queries.py to execute a batch of requests and get one CSV per query group.
  • new_hashtag_list contains the hashtags and keywords used to extend the dataset.
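The sketch below shows roughly what this step does with Twarc: open a client with the Academic bearer token and page through a full-archive search for one query, writing the raw JSON responses to disk. The query, timeframe, and file names are placeholders; make_queries.py adds query batching and CSV output on top of this.

# Minimal full-archive search with the Twarc v2 client (a sketch, not make_queries.py itself).
import datetime
import json

from twarc.client2 import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")  # Academic Research credentials

# Placeholder query and timeframe; the real queries come from belarus_queries.py and new_hashtag_list.
start = datetime.datetime(2020, 8, 1, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2020, 9, 1, tzinfo=datetime.timezone.utc)
pages = client.search_all(query="#belarus2020 lang:ru", start_time=start, end_time=end)

with open("belarus_august.jsonl", "w", encoding="utf-8") as outfile:
    for page in pages:  # each page is one raw API response containing a batch of tweets
        outfile.write(json.dumps(page) + "\n")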

Converting JSON (V2) to (V1) CSV

We use the twarc-csv plugin to flatten the JSON data into a format that can be merged with the CSV produced by rtweet. This requires installing twarc-csv separately (a conceptual sketch of the flattening step follows the list):

  • pip install twarc-csv==0.5.2
  • csv_conversion.ipynb can be used to replicate this step and produce .csv files from the JSON output.
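Conceptually, the conversion flattens each nested V2 response into one row per tweet. The sketch below illustrates the idea with pandas.json_normalize; the repo itself relies on twarc-csv via v2_csv_converter.py, so treat this only as an illustration of what "flattening" means here.

# Illustration only: flatten V2 JSON responses into a CSV with pandas (the repo uses twarc-csv instead).
import json

import pandas as pd

tweets = []
with open("belarus_august.jsonl", encoding="utf-8") as infile:
    for line in infile:
        page = json.loads(line)
        tweets.extend(page.get("data", []))  # tweets sit under the "data" key of each response

df = pd.json_normalize(tweets)  # nested fields become dotted columns, e.g. public_metrics.retweet_count
df.to_csv("belarus_august_flat.csv", index=False)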

Count of tweets

When we only need the number of tweets matching a request (timeframe, language), we make a counts request instead of downloading the tweets.
This is demonstrated in counts.ipynb.
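A counts request returns only the number of matching tweets per time bucket, which is far cheaper than downloading the tweets themselves. The sketch below, with a placeholder query and timeframe, shows what such a request looks like with Twarc; counts.ipynb demonstrates the actual workflow.

# Daily tweet counts for a query over a timeframe (a sketch of a counts request with Twarc).
import datetime

from twarc.client2 import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")
start = datetime.datetime(2020, 8, 1, tzinfo=datetime.timezone.utc)
end = datetime.datetime(2020, 9, 1, tzinfo=datetime.timezone.utc)

total = 0
for page in client.counts_all("#belarus2020 lang:en", start_time=start, end_time=end, granularity="day"):
    for bucket in page["data"]:  # one entry per day, with "start", "end", and "tweet_count"
        print(bucket["start"], bucket["tweet_count"])
        total += bucket["tweet_count"]
print("Total tweets:", total)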

Dependencies:

  • srch_v2.py custom wrapper over Twarc with search functions
  • belarus_queries.py contains the hashtags we followed
  • config.yaml holds the credentials
  • rtweet_conversion.py contains a translation table mapping specific API V2 fields to the rtweet (API V1) format
  • v2_csv_converter.py custom wrapper over twarc-csv to extract specific fields and flatten the JSON into CSV

Example config.yaml:

api_key: 0000000000000000000000000
api_secret: 00000000000000000000000000000000000000000000000000
bearer_token: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
access_token: 00000000000000000000-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
access_token_secret: 000000000000000000000000000000000000000000000
sql_debug: false
rapidapi_key: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
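For reference, a script can read these credentials with PyYAML and hand the bearer token to Twarc; a minimal sketch (srch_v2.py may organize this differently):

# Load credentials from config.yaml and build a v2 client (a minimal sketch).
import yaml

from twarc.client2 import Twarc2

with open("config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

client = Twarc2(bearer_token=config["bearer_token"])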

2/ Data cleaning and exploration

Exploratory data analysis (EDA) and cleaning are performed in the included notebook beltweets_analysis.ipynb.

Included code:

  • Merging data from API V1 and API V2
  • Data cleaning
  • Imputation of missing language labels
  • Visualisations for statistical analysis: tweets over time, users, top retweets...
  • Export of a subset for NLP (tweets in English and Russian)
  • Other early EDA steps that we abandoned (bot detection models...)

Dependencies:

  • rtweet_conversion.py contains a translation table mapping specific API V2 fields to the rtweet (API V1) format
  • dtypes.py contains the dtype scheme used when importing the .csv files with pandas (usage sketched after this list)
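For illustration, the dtype scheme is applied when reading the merged data, roughly as below; the imported variable name is an assumption about dtypes.py, and the file and column names are rtweet-style placeholders.

# Read a tweets CSV with an explicit dtype scheme (assumes dtypes.py exposes a dict named `dtypes`).
import pandas as pd

from dtypes import dtypes  # assumed name of the dtype mapping in dtypes.py

df = pd.read_csv(
    "tweets_merged.csv",          # placeholder file name
    dtype=dtypes,                 # avoids mixed-type columns and silent type guessing
    parse_dates=["created_at"],   # rtweet-style timestamp column
    low_memory=False,
)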

3/ Topic modelling and other NLP analyses

Pre-processing

nlp_preprocessing.py contains most of the text cleaning functions. The main function preprocess() is called before every topic model, with different parameters such as maximum document frequency and n-gram range. See its documentation for an extensive description of the pipeline. Lemmatization, which relies on spaCy, accounts for most of the execution time.
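The sketch below shows the general shape of such a pipeline: spaCy lemmatization followed by document-frequency pruning when building the document-term matrix. It is a simplified stand-in, not the actual preprocess() from nlp_preprocessing.py, and the English model is a placeholder (the real pipeline also handles Russian).

# Simplified preprocessing sketch: spaCy lemmatization + document-frequency pruning.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # placeholder model

def lemmatize(texts):
    """Turn each tweet into a space-joined string of lowercased lemmas, dropping stop words and non-alphabetic tokens."""
    docs = nlp.pipe(texts, batch_size=256)
    return [
        " ".join(tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
        for doc in docs
    ]

def preprocess_sketch(texts, max_df=0.9, min_df=5, ngram_range=(1, 2)):
    """Lemmatize, then build a document-term matrix with vocabulary pruning by document frequency."""
    cleaned = lemmatize(texts)
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, ngram_range=ngram_range)
    dtm = vectorizer.fit_transform(cleaned)
    return cleaned, dtm, vectorizer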

Note: when planning many experiments, nlp_spacy_parser.py was meant to parse the entire dataset once and for all with spaCy (POS tagging, lemmatization, named entity recognition) and to save the result either in an array column of the .csv or in separate .spacy files. The preprocess() function can then be altered to call read_spacy_from_arrays() and load the saved Doc (and its lemmas) for each tweet, instead of performing lemmatization every time.
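A minimal version of that parse-once, reuse-later idea, using spaCy's DocBin for the .spacy-file variant (the actual nlp_spacy_parser.py and read_spacy_from_arrays() may differ):

# Parse all tweets once with spaCy, save the Docs to disk, and reload lemmas later without re-parsing.
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # placeholder model

def parse_and_save(texts, path="tweets.spacy"):
    """Run the full spaCy pipeline once and serialize the resulting Docs."""
    doc_bin = DocBin(store_user_data=True)
    for doc in nlp.pipe(texts, batch_size=256):
        doc_bin.add(doc)
    doc_bin.to_disk(path)

def load_lemmas(path="tweets.spacy"):
    """Reload the parsed Docs and yield the lemmas of each tweet, skipping fresh lemmatization."""
    doc_bin = DocBin().from_disk(path)
    for doc in doc_bin.get_docs(nlp.vocab):
        yield [tok.lemma_ for tok in doc]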

Hyperparameter search

For CTM, hyperparameters were chosen using the OCTIS library.
For instance, the parameters used in TM_run_CTM.py are dropout = 0.09891608522984201 and num_neurons = 300.
The script TM_run_OCTIS_optimization.py provides an example of an early hyperparameter search.
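A condensed version of such a search looks roughly like this; the dataset path, search ranges, and call budget are placeholders, and TM_run_OCTIS_optimization.py should be treated as the reference for the actual setup.

# Bayesian hyperparameter search for CTM with OCTIS (a rough sketch with placeholder settings).
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Categorical, Real

dataset = Dataset()
dataset.load_custom_dataset_from_folder("octis_dataset")  # preprocessed corpus in OCTIS format

model = CTM(num_topics=20)  # K stays fixed during the search

search_space = {
    "dropout": Real(0.0, 0.95),
    "num_neurons": Categorical({100, 200, 300}),
}
coherence = Coherence(texts=dataset.get_corpus(), measure="c_npmi")

optimizer = Optimizer()
result = optimizer.optimize(
    model, dataset, coherence, search_space,
    number_of_call=30,          # optimization budget
    save_path="octis_results/",
)
result.save_to_csv("octis_results/ctm_search.csv")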

Topic models

These scripts run batches of models for multiple subsets of the data (per day or month, and per language) and for multiple values of K topics. They include automatic selection of the "best" K, and they output topics and pyLDAvis visualisations; a schematic best-K loop is sketched after the list below. Links to the required libraries are included.

  • Biterm Topic Model (BTM): see TM_run_BTM.py
  • CTM (OCTIS version) for daily data: run TM_run_CTM_octis.py
  • CTM (original implementation) for daily data: run TM_run_CTM_vanilla.py
  • Mixture models (OCTIS implementation): see TM_run_MM.py
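Schematically, each batch script fits one model per (subset, K) pair, scores it, and keeps the best-scoring K, as sketched below with OCTIS-style calls and placeholder K values; the real scripts add per-day/per-language looping, topic export, and pyLDAvis output.

# Schematic best-K selection for one data subset (placeholder K values; not the exact script logic).
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence

def best_k_for_subset(dataset_folder, k_values=(5, 10, 15, 20)):
    """Fit one CTM per candidate K and return the K with the highest topic coherence."""
    dataset = Dataset()
    dataset.load_custom_dataset_from_folder(dataset_folder)
    coherence = Coherence(texts=dataset.get_corpus(), measure="c_npmi")

    scores = {}
    for k in k_values:
        output = CTM(num_topics=k).train_model(dataset)  # returns topics and topic/document matrices
        scores[k] = coherence.score(output)
    best_k = max(scores, key=scores.get)
    return best_k, scores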

Interpretation and visualization of topics

The comparison of model performance (coherence, diversity) is visualized in the notebooks TM_CTM_octis_viz.ipynb and TM_BTM_viz.ipynb.
They also contain exploratory analysis of the topics, such as plotting topic dynamics.

Sentiment analysis

Sentiment analysis is implemented in nlp_sentdetect.py.
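For orientation only, a generic transformer-based sentiment classifier for multilingual tweets can be run as below; this is not necessarily the approach taken in nlp_sentdetect.py, and the model name is just one publicly available option.

# Generic multilingual tweet sentiment sketch (not necessarily the method used in nlp_sentdetect.py).
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # one publicly available multilingual Twitter model
)

tweets = ["Жыве Беларусь!", "The internet is down again in Minsk."]
for tweet, result in zip(tweets, classifier(tweets)):
    print(result["label"], round(result["score"], 3), tweet)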
