
Social Media Data Collection and Analysis

1. Introduction

This repository stores the data and code for collecting and analyzing social media data, including Twitter and Weibo.

2. Data Collection and Preprocessing

2.1 Twitter and Weibo Data Collection

All the code for social media data collection can be found in the Data_Collection folder.

For the Twitter data, we use the Twitter Streaming API to collect tweets; a secret key and access token are required. Please visit the Twitter Developer Page for more information about registering a Twitter developer account and obtaining the access key and token.
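A minimal sketch of streaming geotagged tweets with the tweepy package (assuming Tweepy 4.x; the credentials and the bounding box below are placeholders, not values from this repository):

```python
import tweepy

# Placeholder credentials; obtain real ones from the Twitter Developer Page
CONSUMER_KEY, CONSUMER_SECRET = 'xxx', 'xxx'
ACCESS_TOKEN, ACCESS_TOKEN_SECRET = 'xxx', 'xxx'


class CityStream(tweepy.Stream):
    # Called once for every tweet that matches the filter
    def on_status(self, status):
        print(status.id, status.created_at, status.text)


stream = CityStream(CONSUMER_KEY, CONSUMER_SECRET,
                    ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
# Example bounding box of Hong Kong: [lon_min, lat_min, lon_max, lat_max]
stream.filter(locations=[113.83, 22.15, 114.41, 22.56])
```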

For the Weibo data, the GitHub repository WeiboSpider offers code to collect Weibo data based on user IDs. The Weibo crawler requires a registered cookie; the detailed steps for generating the cookie are given here.

2.2 Data Overview

We started collecting the Twitter data in May 2018. Please see Tweet_Summary_Table.xlsx for more specific information about the number of all tweets & geocoded tweets posted in each city.

For the Weibo data, we only collected Weibos posted between August 2011 and September 2014. The number of geocoded Weibos posted within the bounding boxes of major cities is given in Geocoded_Weibo_Major_Cities_Summaries.csv.

2.3 Data Preprocessing

Data preprocessing involves the following steps:

For the user information:

  • Remove the bot accounts. We use the following rules to identify bot accounts (Coşkun & Özturan, 2018); a sketch is given after this list:
    • accounts whose number of tweets is more than 2 standard deviations away from the mean number of tweets of users from that city
    • accounts whose most frequent geographic footprint (latitude & longitude pair) accounts for over 60% of all their footprints
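A sketch of these two rules with pandas, assuming a tweet-level data frame with 'user_id', 'city', 'lat', and 'lon' columns (the column names are assumptions, not this repository's schema):

```python
import pandas as pd


def find_bot_accounts(df, n_std=2, max_share=0.6):
    """Return the set of user ids flagged as likely bots."""
    # Rule 1: tweet count more than n_std standard deviations
    # away from the city-level mean
    counts = df.groupby(['city', 'user_id']).size().rename('n').reset_index()
    stats = counts.groupby('city')['n'].agg(['mean', 'std'])
    counts = counts.join(stats, on='city')
    too_active = counts.loc[
        (counts['n'] - counts['mean']).abs() > n_std * counts['std'], 'user_id']

    # Rule 2: the most frequent (lat, lon) pair covers more than
    # max_share of a user's footprints
    pair_counts = df.groupby(['user_id', 'lat', 'lon']).size()
    share = pair_counts.groupby('user_id').max() / pair_counts.groupby('user_id').sum()
    too_static = share[share > max_share].index

    return set(too_active) | set(too_static)
```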

For the spatial and semantic information:

  • Remove the messages that do not contain latitude and longitude information.
  • Only consider the messages posted within, for instance, the city boundary.
  • Only extract the messages posted in specific languages (e.g., English and Chinese). A geopandas sketch of these filters is given below.
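A sketch of the spatial and language filters with geopandas (the file names 'tweets.csv' and 'city_boundary.shp' and the column names 'lat', 'lon', and 'lang' are placeholders; the predicate keyword assumes geopandas >= 0.10):

```python
import pandas as pd
import geopandas as gpd

df = pd.read_csv('tweets.csv')
df = df.dropna(subset=['lat', 'lon'])  # drop messages without coordinates

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['lon'], df['lat']),
                       crs='EPSG:4326')
boundary = gpd.read_file('city_boundary.shp').to_crs('EPSG:4326')
gdf = gpd.sjoin(gdf, boundary, predicate='within')  # keep points inside the city
gdf = gdf[gdf['lang'].isin(['en', 'zh'])]           # keep English and Chinese posts
```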

For the temporal information:

  • Convert the time string to a structured Python datetime object, as sketched below
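For example, Twitter's created_at string can be parsed and converted to local time with datetime and pytz (the sample timestamp and the Hong Kong time zone are illustrative):

```python
from datetime import datetime
import pytz

raw = 'Wed May 23 06:01:13 +0000 2018'  # Twitter's created_at format
created_utc = datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')
created_hk = created_utc.astimezone(pytz.timezone('Asia/Hong_Kong'))
print(created_hk)  # 2018-05-23 14:01:13+08:00
```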

The specific steps of data preprocessing are given in Tweet_filtering.ipynb.

2.4 Raw Data Storage

All the collected Twitter data is stored on the server. Please log in to the server and visit the following directory to access the tweets posted in each city:

  • /home/data_center/Social_Media/happyplacestweets

The Weibo data is currently stored on a hard drive. Please contact the owner of this repository to retrieve the Weibo data.

3. Social Media Data Analysis

The Analysis folder contains the code for Twitter & Weibo data analysis.

The Data_Collection folder contains the code to collect the Twitter data.

The Shapefiles folder contains the shapefiles for the spatial analysis.

  • HK_tpu_project: EPSG:2326 (Hong Kong 1980 Grid System); a geopandas sketch for reading and reprojecting it is given below
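A minimal sketch of loading this shapefile and reprojecting it to WGS84 with geopandas (the file name 'HK_tpu_project.shp' is a hypothetical guess at the layer inside the Shapefiles folder):

```python
import geopandas as gpd

tpu = gpd.read_file('Shapefiles/HK_tpu_project.shp')
print(tpu.crs)                     # expected: EPSG:2326 (Hong Kong 1980 Grid)
tpu_wgs84 = tpu.to_crs(epsg=4326)  # reproject to WGS84 for use with lat/lon data
```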

4. Requirements

The following packages and tools should be installed before working with this project:

  • Python 3.7 or later (for social media data analysis)
  • R 3.4.1 or later (for Twitter data collection)
  • Spatial data analysis: geopandas, rasterio, pyproj, ArcGIS, QGIS
    • The installation process for geopandas on Windows is quite complicated, especially with pip. Please check this Stack Overflow post for more information.
    • The rasterio package conducts computation on raster data.
    • pyproj is used for coordinate transformation (see the sketch after this list).
    • ArcGIS and QGIS can be used for map editing and simple spatial data analysis.
  • Data frames and computation: numpy and pandas
  • Time handling: datetime and pytz. A list of pytz time zones is given here
  • Visualizations: matplotlib
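As a quick illustration of the coordinate transformation step, the sketch below projects a WGS84 latitude/longitude pair into the Hong Kong 1980 Grid System used by the TPU shapefile (the sample coordinates are arbitrary):

```python
from pyproj import Transformer

# WGS84 (EPSG:4326) -> Hong Kong 1980 Grid System (EPSG:2326)
transformer = Transformer.from_crs('EPSG:4326', 'EPSG:2326', always_xy=True)
x, y = transformer.transform(114.1694, 22.3193)  # lon, lat near central Hong Kong
print(x, y)
```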

5. Other Data Sources

The following links point to other data sources for urban data analysis:

5.1 World

  • Satellite Imagery Datasets: a list of aerial and satellite imagery datasets with annotations for computer vision and deep learning.
  • OpenStreetMap: a map of the world with various information, including land use, Points of Interest (POIs), etc. The OSMPythonTools package offers a neat way to access OpenStreetMap through a Python API; a sketch is given below.
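A sketch of querying OpenStreetMap POIs with OSMPythonTools (the place name and the amenity tag are arbitrary examples):

```python
from OSMPythonTools.nominatim import Nominatim
from OSMPythonTools.overpass import Overpass, overpassQueryBuilder

# Resolve the Overpass area id for a named place
area_id = Nominatim().query('Hong Kong').areaId()
# Build and run an Overpass query for restaurant nodes inside that area
query = overpassQueryBuilder(area=area_id, elementType='node',
                             selector='"amenity"="restaurant"', out='body')
result = Overpass().query(query)
print(result.countElements())
```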

5.2 Chinese Cities

5.3 Foreign Cities

6. Reference

Here is a list of references:

  1. Coşkun & Özturan (2018). #europehappinessmap: A Framework for Multi-Lingual Sentiment Analysis via Social Media Big Data (A Twitter Case Study)
  2. Incorporating twitter-based human activity information in spatial analysis of crashes in urban areas
  3. WeiboSpider.
  4. Twitter Developer Page.
  5. Python – User object in Tweepy.

Please see reference.bib for the BibTeX citations.
