This repository stores the data and code for collecting and analyzing social media data (Twitter and Weibo).
All the code for social media data collection can be found in the Data_Collection folder.
For the Twitter data, we use the Twitter Streaming API to collect tweets. The API secret key and access token are required. Please visit the Twitter Developer Page for more information about registering a Twitter developer account and getting the access key and token.
For the Weibo data, the GitHub repository WeiboSpider offers code to collect Weibo posts based on user IDs. The Weibo crawler requires a registered cookie; the detailed steps for generating the cookie are given here.
We started collecting the Twitter data in May 2018. Please see Tweet_Summary_Table.xlsx for the number of all tweets and geocoded tweets posted in each city.
For the Weibo data, we only collected Weibos between August 2011 and September 2014. The number of geocoded Weibos posted within the bounding boxes of major cities is given in Geocoded_Weibo_Major_Cities_Summaries.csv.
The general data preprocessing involves the following steps:
For the user information:
- Remove the bot accounts. We use the following rules to identify bot accounts (Coşkun & Özturan, 2018), as sketched below:
- whose number of tweets is more than 2 standard deviations away from the mean tweet count of users in that city
- whose most frequent geographic footprint (latitude & longitude pair) accounts for over 60% of all their footprints
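A minimal sketch of these two rules with pandas is given below; the column names (`user_id`, `lat`, `lon`) are assumptions for illustration and may differ from the actual schema used in find_bot_accounts.py.

```python
import pandas as pd

def find_bot_ids(tweets: pd.DataFrame, footprint_share: float = 0.6) -> set:
    """Flag likely bot accounts in one city's tweets.

    Assumes one row per tweet with columns 'user_id', 'lat', and 'lon'.
    """
    counts = tweets.groupby('user_id').size()

    # Rule 1: tweet count more than 2 standard deviations away from the city mean
    outlier_posters = set(counts[(counts - counts.mean()).abs() > 2 * counts.std()].index)

    # Rule 2: the most frequent (lat, lon) pair accounts for over 60% of a user's footprints
    dominant_share = (
        tweets.groupby(['user_id', 'lat', 'lon']).size()
        .groupby('user_id').max()
        / counts
    )
    fixed_location = set(dominant_share[dominant_share > footprint_share].index)

    return outlier_posters | fixed_location
```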
For the spatial and semantic information:
- Remove the messages without latitude and longitude information.
- Keep only the messages posted within a given spatial extent (for instance, the city boundary)
- Keep only the messages posted in specific languages (e.g., English and Chinese)
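The following is a minimal sketch of these filtering steps with pandas, assuming each message has `lat`, `lon`, and `lang` columns and that the city boundary is approximated by its bounding box; the actual filtering logic lives in Tweet_filtering.ipynb.

```python
import pandas as pd

def filter_messages(messages: pd.DataFrame, bbox: tuple, languages=('en', 'zh')) -> pd.DataFrame:
    """Keep geocoded messages inside a bounding box and in the target languages.

    bbox = (min_lon, min_lat, max_lon, max_lat); column names are assumptions.
    """
    min_lon, min_lat, max_lon, max_lat = bbox

    # Drop messages without coordinates
    geocoded = messages.dropna(subset=['lat', 'lon'])

    # Keep messages posted within the bounding box (a proxy for the city boundary)
    in_bbox = geocoded[
        geocoded['lon'].between(min_lon, max_lon) & geocoded['lat'].between(min_lat, max_lat)
    ]

    # Keep messages posted in the preferred languages
    return in_bbox[in_bbox['lang'].isin(languages)]
```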
For the temporal information:
- Convert the time string to a structured Python datetime object
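For example, Twitter's `created_at` string can be converted to a timezone-aware datetime as sketched below; the target time zone here is an assumption, since each city's time zone is stored in cities_bounds.py.

```python
from datetime import datetime
import pytz

# Twitter's 'created_at' strings look like 'Wed May 23 06:01:10 +0000 2018' (UTC)
def parse_created_at(time_string: str, timezone_name: str = 'Asia/Hong_Kong') -> datetime:
    utc_time = datetime.strptime(time_string, '%a %b %d %H:%M:%S %z %Y')
    return utc_time.astimezone(pytz.timezone(timezone_name))

local_time = parse_created_at('Wed May 23 06:01:10 +0000 2018')
```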
The specific steps of data preprocessing are given in Tweet_filtering.ipynb.
All the collected Twitter data is stored on the server. Please log in to the server and visit the following directory to access the tweets posted in each city:
- /home/data_center/Social_Media/happyplacestweets
The Weibo data is currently stored on a hard drive. Please contact the owner of this repository to retrieve the Weibo data.
The Analysis folder contains the code for Twitter & Weibo data analysis.
- Tweet_Data_Description.ipynb gives a description of the collected tweet data and some basic temporal, textual, and spatial analysis.
- Tweet_filtering.ipynb covers the general tweet filtering process, including keeping tweets posted in the preferred languages, keeping the geocoded tweets, and removing the bot accounts.
- cities_bounds.py stores the profile of each city, including the bounding box, time zone, path to the tweet data, and open space shapefile.
- count_tweets.py and count_weibos.py contain the code for counting the tweets or Weibos posted in the cities and their open spaces.
- spatial_analysis.py contains the code for finding the tweets or Weibos posted in a city's open space, based on the geopandas spatial join function (see the sketch after this list).
- find_bot_accounts.py presents functions for finding bot accounts; the relevant papers are listed in the references at the end of this document.
- visualizations.py contains functions for plotting the number of tweets or Weibos posted in each city and its open space.
- The geopandas introduction page presents some basic spatial analysis conducted with geopandas.
- utils.py and data_paths.py store helper functions and data paths for this project.
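As referenced above for spatial_analysis.py, the open-space matching relies on the geopandas spatial join. A minimal sketch is given below; the shapefile path, column names, and sample coordinates are assumptions for illustration only.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical geocoded tweets (the real ones come from Tweet_filtering.ipynb)
tweets = pd.DataFrame({'text': ['hello', 'nice park'],
                       'lat': [22.28, 22.30],
                       'lon': [114.15, 114.17]})

tweets_gdf = gpd.GeoDataFrame(
    tweets,
    geometry=gpd.points_from_xy(tweets['lon'], tweets['lat']),
    crs='EPSG:4326',
)

# Open-space polygons; the shapefile path is an assumption
open_space = gpd.read_file('open_space.shp').to_crs('EPSG:4326')

# Keep tweets that fall inside an open-space polygon
# (geopandas < 0.10 uses op='within' instead of predicate='within')
tweets_in_open_space = gpd.sjoin(tweets_gdf, open_space, how='inner', predicate='within')
print(len(tweets_in_open_space))
```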
The Data_Collection folder contains the code to collect the Twitter data.
The Shapefiles folder contains the shapefiles for the spatial analysis.
- HK_tpu_project: EPSG:2326
The following environments and packages are required for this project:
- Python 3.7 or later (for social media data analysis)
- R 3.4.1 or later (for Twitter data collection)
- Spatial data analysis: geopandas, rasterio, pyproj, ArcGIS, QGIS
- Installing geopandas on Windows is quite complicated, especially with pip. Please check this Stack Overflow post for more information.
- The rasterio package handles computation on raster data.
- pyproj is used for coordinate transformation (see the sketch after this list).
- ArcGIS and QGIS can be used for map editing and simple spatial data analysis.
- Data frame and computation: numpy and pandas
- Time handling: datetime and pytz. A list of pytz time zones is given here
- Visualizations: matplotlib
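As an example of the pyproj coordinate transformation mentioned above, the following sketch converts WGS84 coordinates to the Hong Kong 1980 Grid (EPSG:2326), the projection used by the HK_tpu_project shapefile; the sample coordinates are arbitrary.

```python
from pyproj import Transformer

# Transform WGS84 (EPSG:4326) lon/lat to the Hong Kong 1980 Grid (EPSG:2326)
transformer = Transformer.from_crs('EPSG:4326', 'EPSG:2326', always_xy=True)
easting, northing = transformer.transform(114.1694, 22.3193)  # lon, lat in Hong Kong
print(easting, northing)
```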
The following links point to other data sources for urban data analysis:
- Satellite Imagery Datasets: List of aerial and satellite imagery datasets with annotations for computer vision and deep learning.
- OpenStreetMap: A map of the world with various information, including land use, Points of Interest (POIs), etc. The OSMPythonTools package offers a neat way to access OpenStreetMap through a Python API (see the sketch after this list)
- ChinaAdminDivisionSHP: Chinese administrative division boundaries in Shapefile format, with four levels: country, province, city, district/county
- Hong Kong Monthly Traffic and Transport Digest: The monthly key statistics of HK transportation data
- Hong Kong Census Data: Census data for Hong Kong, including population, median income, etc.
- GoPUP: A very useful tool for collecting data in China, such as Weibo index data, Baidu index data, and Baidu search data
- BBBike OpenStreetMap extractor: An OpenStreetMap data extractor
- New York Open Data: Free public data published by New York City agencies and other partners for research. Other US cities offer similar portals, such as the Chicago Data Portal
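As a quick illustration of the OSMPythonTools API mentioned above, the following sketch counts park polygons in an area via the Overpass API; the place name and tag selector are only examples.

```python
from OSMPythonTools.nominatim import Nominatim
from OSMPythonTools.overpass import Overpass, overpassQueryBuilder

# Resolve the area, then query OpenStreetMap ways tagged as parks
area_id = Nominatim().query('Hong Kong').areaId()
query = overpassQueryBuilder(area=area_id, elementType='way',
                             selector='"leisure"="park"', out='body')
result = Overpass().query(query)
print(result.countElements())
```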
Here is a list of references:
- #europehappinessmap: A Framework for Multi-Lingual Sentiment Analysis via Social Media Big Data (A Twitter Case Study)
- Incorporating twitter-based human activity information in spatial analysis of crashes in urban areas
- WeiboSpider.
- Twitter Developer Page.
- Python – User object in Tweepy.
Please see reference.bib for the BibTeX citations.