This repository stores the data and code for collecting and analyzing social media data (Twitter and Weibo).
All the code for social media data collection can be found in the Data_Collection folder.
For the Twitter data, we use the Twitter Streaming API to collect tweets. The API secret key and access token are required. Please visit the Twitter Developer Page for more information about registering a Twitter developer account and getting the access key and token.
For the Weibo data, the GitHub repository WeiboSpider offers code to collect Weibo posts based on user IDs. The Weibo crawler requires a registered cookie; the detailed steps for generating the cookie are given here.
We started collecting the Twitter data in May 2018. Please see Tweet_Summary_Table.xlsx for the number of all tweets and geocoded tweets posted in each city.
For the Weibo data, we only collected Weibos between August 2011 and September 2014. The number of geocoded Weibos posted within the bounding boxes of major cities is given in Geocoded_Weibo_Major_Cities_Summaries.csv.
The general data preprocessing involves the following steps:
For the user information:
- Remove the bot accounts. We use the following rules to identify bot accounts (Coşkun & Özturan, 2018), as sketched below:
- whose number of tweets is more than 2 standard deviations away from the mean tweet count of users in that city
- whose most frequent geographic footprint (latitude & longitude pair) accounts for over 60% of all their footprints
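A minimal sketch of these two rules with pandas is given below; the column names (`user_id`, `lat`, `lon`) are assumptions for illustration and may differ from the actual schema used in find_bot_accounts.py.

```python
import pandas as pd

def find_bot_ids(tweets: pd.DataFrame, footprint_share: float = 0.6) -> set:
    """Flag likely bot accounts in one city's tweets.

    Assumes one row per tweet with columns 'user_id', 'lat', and 'lon'.
    """
    counts = tweets.groupby('user_id').size()

    # Rule 1: tweet count more than 2 standard deviations away from the city mean
    outlier_posters = set(counts[(counts - counts.mean()).abs() > 2 * counts.std()].index)

    # Rule 2: the most frequent (lat, lon) pair accounts for over 60% of a user's footprints
    dominant_share = (
        tweets.groupby(['user_id', 'lat', 'lon']).size()
        .groupby('user_id').max()
        / counts
    )
    fixed_location = set(dominant_share[dominant_share > footprint_share].index)

    return outlier_posters | fixed_location
```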
For the spatial and semantic information:
- Remove the messages without latitude and longitude information.
- Keep only the messages posted within a given spatial extent (for instance, the city boundary)
- Keep only the messages posted in specific languages (e.g., English and Chinese)
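The following is a minimal sketch of these filtering steps with pandas, assuming each message has `lat`, `lon`, and `lang` columns and that the city boundary is approximated by its bounding box; the actual filtering logic lives in Tweet_filtering.ipynb.

```python
import pandas as pd

def filter_messages(messages: pd.DataFrame, bbox: tuple, languages=('en', 'zh')) -> pd.DataFrame:
    """Keep geocoded messages inside a bounding box and in the target languages.

    bbox = (min_lon, min_lat, max_lon, max_lat); column names are assumptions.
    """
    min_lon, min_lat, max_lon, max_lat = bbox

    # Drop messages without coordinates
    geocoded = messages.dropna(subset=['lat', 'lon'])

    # Keep messages posted within the bounding box (a proxy for the city boundary)
    in_bbox = geocoded[
        geocoded['lon'].between(min_lon, max_lon) & geocoded['lat'].between(min_lat, max_lat)
    ]

    # Keep messages posted in the preferred languages
    return in_bbox[in_bbox['lang'].isin(languages)]
```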
For the temporal information:
- Convert the time string to a structured Python datetime object
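For example, Twitter's `created_at` string can be converted to a timezone-aware datetime as sketched below; the target time zone here is an assumption, since each city's time zone is stored in cities_bounds.py.

```python
from datetime import datetime
import pytz

# Twitter's 'created_at' strings look like 'Wed May 23 06:01:10 +0000 2018' (UTC)
def parse_created_at(time_string: str, timezone_name: str = 'Asia/Hong_Kong') -> datetime:
    utc_time = datetime.strptime(time_string, '%a %b %d %H:%M:%S %z %Y')
    return utc_time.astimezone(pytz.timezone(timezone_name))

local_time = parse_created_at('Wed May 23 06:01:10 +0000 2018')
```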
The specific steps of data preprocessing are given in Tweet_filtering.ipynb.
All the collected Twitter data is stored on the server. Please log in to the server and visit the following directory to access the tweets posted in each city:
- /home/data_center/Social_Media/happyplacestweets
The Weibo data is currently stored on a hard drive. Please contact the owner of this repository to retrieve the Weibo data.
The Analysis folder contains the code for Twitter & Weibo data analysis.
- Tweet_Data_Description.ipynb gives a description of the collected tweet data and some basic temporal, textual, and spatial analysis.
- Tweet_filtering.ipynb covers the general tweet filtering process, including keeping tweets posted in the preferred languages, keeping the geocoded tweets, and removing the bot accounts.
- cities_bounds.py stores the profile of each city, including the bounding box, time zone, path to the tweet data, and open space shapefile.
- count_tweets.py and count_weibos.py contain the code for counting the tweets or Weibos posted in the cities and their open spaces.
- spatial_analysis.py contains the code for finding the tweets or Weibos posted in a city's open space, based on the geopandas spatial join function (see the sketch after this list).
- find_bot_accounts.py presents functions for finding bot accounts; the relevant papers are listed in the references at the end of this document.
- visualizations.py contains functions for plotting the number of tweets or Weibos posted in each city and its open space.
- The geopandas introduction page presents some basic spatial analysis conducted with geopandas.
- utils.py and data_paths.py store helper functions and data paths for this project.
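As referenced above for spatial_analysis.py, the open-space matching relies on the geopandas spatial join. A minimal sketch is given below; the shapefile path, column names, and sample coordinates are assumptions for illustration only.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical geocoded tweets (the real ones come from Tweet_filtering.ipynb)
tweets = pd.DataFrame({'text': ['hello', 'nice park'],
                       'lat': [22.28, 22.30],
                       'lon': [114.15, 114.17]})

tweets_gdf = gpd.GeoDataFrame(
    tweets,
    geometry=gpd.points_from_xy(tweets['lon'], tweets['lat']),
    crs='EPSG:4326',
)

# Open-space polygons; the shapefile path is an assumption
open_space = gpd.read_file('open_space.shp').to_crs('EPSG:4326')

# Keep tweets that fall inside an open-space polygon
# (geopandas < 0.10 uses op='within' instead of predicate='within')
tweets_in_open_space = gpd.sjoin(tweets_gdf, open_space, how='inner', predicate='within')
print(len(tweets_in_open_space))
```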
The Data_Collection folder contains the code to collect the Twitter data.
The Shapefiles folder contains the shapefiles for the spatial analysis.
- HK_tpu_project: EPSG:2326
The following environments and packages are required for this project:
- Python 3.7 or later (for social media data analysis)
- R 3.4.1 or later (for Twitter data collection)
- Spatial data analysis: geopandas, rasterio, pyproj, ArcGIS, QGIS
- Installing geopandas on Windows is quite complicated, especially with pip. Please check this Stack Overflow post for more information.
- The rasterio package handles computation on raster data.
- pyproj is used for coordinate transformation (see the sketch after this list).
- ArcGIS and QGIS can be used for map editing and simple spatial data analysis.
- Data frame and computation: numpy and pandas
- Time handling: datetime and pytz. A list of pytz time zones is given here
- Visualizations: matplotlib
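As an example of the pyproj coordinate transformation mentioned above, the following sketch converts WGS84 coordinates to the Hong Kong 1980 Grid (EPSG:2326), the projection used by the HK_tpu_project shapefile; the sample coordinates are arbitrary.

```python
from pyproj import Transformer

# Transform WGS84 (EPSG:4326) lon/lat to the Hong Kong 1980 Grid (EPSG:2326)
transformer = Transformer.from_crs('EPSG:4326', 'EPSG:2326', always_xy=True)
easting, northing = transformer.transform(114.1694, 22.3193)  # lon, lat in Hong Kong
print(easting, northing)
```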
The following links point to other data sources for urban data analysis:
- Satellite Imagery Datasets: List of aerial and satellite imagery datasets with annotations for computer vision and deep learning.
- OpenStreetMap: A map of the world with various information, including land use, Points of Interest (POIs), etc. The OSMPythonTools package offers a neat way to access OpenStreetMap through a Python API (see the sketch after this list)
- ChinaAdminDivisionSHP: Chinese administrative division boundaries in Shapefile format, with four levels: country, province, city, district/county
- Hong Kong Monthly Traffic and Transport Digest: The monthly key statistics of HK transportation data
- Hong Kong Census Data: Census data for Hong Kong, including population, median income, etc.
- GoPUP: A very useful tool for collecting data in China, such as Weibo index data, Baidu index data, and Baidu search data
- BBBike OpenStreetMap extractor: An OpenStreetMap data extractor
- New York Open Data: Free public data published by New York City agencies and other partners for research. Other US cities offer similar portals, such as the Chicago Data Portal
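As a quick illustration of the OSMPythonTools API mentioned above, the following sketch counts park polygons in an area via the Overpass API; the place name and tag selector are only examples.

```python
from OSMPythonTools.nominatim import Nominatim
from OSMPythonTools.overpass import Overpass, overpassQueryBuilder

# Resolve the area, then query OpenStreetMap ways tagged as parks
area_id = Nominatim().query('Hong Kong').areaId()
query = overpassQueryBuilder(area=area_id, elementType='way',
                             selector='"leisure"="park"', out='body')
result = Overpass().query(query)
print(result.countElements())
```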
Here is a list of references:
- #europehappinessmap: A Framework for Multi-Lingual Sentiment Analysis via Social Media Big Data (A Twitter Case Study)
- Incorporating twitter-based human activity information in spatial analysis of crashes in urban areas
- WeiboSpider.
- Twitter Developer Page.
- Python – User object in Tweepy.
Please see reference.bib for the BibTeX citations.