Let's say you're someone who knows some Python and would like to work with streaming data from Twitter using cloud-based software (either because you're curious or, like me, you want practice building a streaming data pipeline from start to finish). I set out to do this very thing, and I found some good high-level resources to help guide me, but most tended to skip over the nitty-gritty details needed to make the thing work.
So I decided to build this pipeline and document everything along the way with the intention of making it a little easier for others to do the same in the future. In the end, I came out with a blueprint for a data pipeline that lets you stream in tweets on topics that interest you. (The chart above, for example, shows a sample run streaming tweets about various stock ticker symbols over the course of a few hours.)
So without further ado, here's my attempt at providing a step-by-step guide to building a cloud-based pipeline for analyzing streaming data from Twitter.
This pipeline mainly relies on: the Twitter API, Python, and Google Cloud Platform.
The Python script in this repo (`stream-to-pubsub.py`) runs on a GCP Compute Engine VM and accesses the Twitter API (credentials are stored in and pulled from Secret Manager), listens for tweets with specific search terms (the script will prompt you to enter which phrases or hashtags you want it to listen for), and sends them to Pub/Sub. Those Pub/Sub messages are then delivered to BigQuery via Dataflow and, finally, can be visualized using tools like Data Studio.
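To make the moving pieces concrete before diving into setup, here's a minimal sketch of what that first leg (tweets in, Pub/Sub messages out) can look like. This is not the actual `stream-to-pubsub.py`; it assumes tweepy 4.x and `google-cloud-pubsub`, and the project ID, topic name, and credentials below are placeholders.

```python
# Minimal sketch of the first leg of the pipeline (not the actual
# stream-to-pubsub.py): listen for tweets matching some search terms with
# tweepy and forward each raw tweet payload to a Pub/Sub topic.
import tweepy
from google.cloud import pubsub_v1

PROJECT_ID = "your-gcp-project"   # placeholder
TOPIC_ID = "twitter"              # placeholder

# In the real script these come from Secret Manager (covered below).
API_KEY = "YOUR-API-KEY"
API_SECRET = "YOUR-API-SECRET"
ACCESS_TOKEN = "YOUR-ACCESS-TOKEN"
ACCESS_TOKEN_SECRET = "YOUR-ACCESS-TOKEN-SECRET"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


class PubSubStream(tweepy.Stream):
    """Publishes each incoming tweet to Pub/Sub as raw JSON bytes."""

    def on_data(self, raw_data):
        data = raw_data if isinstance(raw_data, bytes) else raw_data.encode("utf-8")
        publisher.publish(topic_path, data)

    def on_request_error(self, status_code):
        # Surface HTTP errors (e.g. the 401s discussed in the troubleshooting
        # section) so they show up in the VM's terminal.
        print("Twitter API returned status", status_code)


if __name__ == "__main__":
    terms = input("Enter comma-separated search terms: ").split(",")
    stream = PubSubStream(API_KEY, API_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    stream.filter(track=[t.strip() for t in terms])
```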
Here's a way to visualize the pipeline:
You'll need two things set up: 1) Twitter Developer credentials and 2) a GCP account:
Once you have a Twitter Developer account, you'll need an app, API consumer key (and secret), and access token (and secret).
(You may need to enable the relevant APIs on your GCP account - Compute Engine, GCS, Dataflow, and BigQuery. You can do that using the search bar at the top of the GCP interface.)
Now let's start setting everything up, from the VM to Pub/Sub to the Dataflow job. Most of the work is done upfront with setting up the VM, so there will be a lot of detail in this first section.
The first part of the pipeline involves sending tweets from the Twitter API to Pub/Sub using the Python script `stream-to-pubsub.py`. For this script to run successfully, you'll need to:
- Set up a VM where you can run the script
- Enter your Twitter API credentials in Secret Manager
- Create a Pub/Sub topic to receive messages from the Twitter API
- Under the `Compute Engine` tab on GCP, create a Debian 10 VM. Make sure to allow access to all APIs. (In my case, I set up an `e2-small` machine in zone `us-east4-a`.)
- Now, SSH into the VM. From the command line, check for Python 3, install pip, and set up a packaging tool (in your VM, run each of the commands listed here in Step 1).
- Then install `tweepy`, `google-cloud-secret-manager`, and `google-cloud-pubsub` using `pip3` (a quick import check to confirm the installs is sketched right after this list).
  - Note: I ran into a problem where Secret Manager wouldn't finish installing (others have had the same issue). But I upgraded pip (with `pip3 install --upgrade pip`), reran the install, and it finished quickly.
- Finally, install `git` (`sudo apt install git`) and clone this repo.
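Here's that quick import check: if all three imports succeed in a `python3` shell on the VM, the dependencies the script needs are in place.

```python
# Run inside `python3` on the VM: if all three imports succeed, the
# libraries stream-to-pubsub.py depends on are installed.
import tweepy
from google.cloud import pubsub_v1, secretmanager

print("tweepy version:", tweepy.__version__)
print("Pub/Sub and Secret Manager client libraries imported OK")
```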
Now you need to add your Twitter API access credentials to GCP Secret Manager. Within Secret Manager, create a secret for each of the four credentials the script needs to access the Twitter API and name them accordingly (`twitter-api-key`, `twitter-api-secret`, `twitter-access-token`, and `twitter-access-token-secret`). Lastly, grant the `Secret Manager Secret Accessor` role to the Compute Engine default service account in IAM.
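For reference, this is roughly how a script running on the VM can pull those four secrets at runtime. It's a sketch rather than the repo's exact code; the project ID is a placeholder.

```python
# Read the Twitter credentials out of Secret Manager using the VM's
# default service account (which now has the Secret Accessor role).
from google.cloud import secretmanager

def get_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")

api_key = get_secret("your-gcp-project", "twitter-api-key")
api_secret = get_secret("your-gcp-project", "twitter-api-secret")
access_token = get_secret("your-gcp-project", "twitter-access-token")
access_token_secret = get_secret("your-gcp-project", "twitter-access-token-secret")
```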
Create a Pub/Sub topic and name it accordingly (something like `twitter` will work). Then modify the `project-specs.json` file in this repo using the project and Pub/Sub topic info from your GCP account (you can find that info on the Home tab on GCP).
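I won't guess at the exact fields inside `project-specs.json`, but once the topic exists you can confirm the VM is able to publish to it with a few lines of Python (the project and topic IDs below are placeholders; substitute your own values).

```python
# Publish a throwaway test message to the topic from a python3 shell on the VM.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-gcp-project", "twitter")  # placeholders

future = publisher.publish(topic_path, b'{"text": "hello from the VM"}')
print("Published message id:", future.result())  # .result() raises if publishing failed
```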
Now, try running the script (`python3 stream-to-pubsub.py`) to make sure that it actually works on the VM.

It should look something like this:

If that worked, then move on to the next part.
Now that the VM works and the Python script can run on it, you need a Dataflow job to take the messages that Pub/Sub receives while the streaming script is running, transform them, and send them to BigQuery. To do that, you can use a GCP template for connecting Pub/Sub to BigQuery:

Note that you'll need a GCS bucket to temporarily store files coming in from Dataflow, and you'll also need an empty table in BigQuery (use `bq-schema.json` when creating the BQ table) to receive the streaming data.
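The bucket and table are easiest to create in the console, but if you'd rather script the table, here's one way, assuming `bq-schema.json` is a standard BigQuery schema file and that you've pip-installed `google-cloud-bigquery` (the project, dataset, and table names are placeholders).

```python
# Create the empty receiving table from bq-schema.json.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")   # placeholder project
schema = client.schema_from_json("bq-schema.json")     # list of SchemaField objects

table = bigquery.Table("your-gcp-project.twitter_data.tweets", schema=schema)
client.create_table(table)  # raises if the table already exists
print("Created table:", table.table_id)
```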
Once those are ready, you can start the Dataflow job and run the script from your VM. Wait a few minutes and you should start to see rows populating in your BigQuery table:
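If you'd rather not keep refreshing the console, a quick row-count query (using the same placeholder table name as above) will tell you whether data is landing.

```python
# Count the rows that have arrived so far.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")
query = "SELECT COUNT(*) AS n FROM `your-gcp-project.twitter_data.tweets`"
for row in client.query(query).result():
    print("Rows so far:", row.n)
```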
So far, I've played around with visualizing the data in Data Studio but haven't figured out how to make the time-series chart update continuously as new data arrives. I think there might be a way to do it in Data Studio, but I'll have to keep tinkering.
Receiving 401s when you run the script?
- Make sure you've entered the correct API credential information in Secret Manager (you may need to regenerate your credentials; Twitter seems to reject credentials after a certain period of time, though I'm not sure exactly how long they last).
- If your credentials check out, the problem may be that the VM's internal clock is inaccurate. The VM's clock can be delayed if you pause the VM. Try stopping and restarting the VM. This should reset the clock (run `timedatectl` within the VM and check to see if `System clock synchronized` is set to `yes`). If you do want to put the VM on hold in the future, it may be better to stop it altogether rather than pause it.
Getting connection errors?
This one has been trickier to solve. I've added a few parameters to the script to try to prevent the stream from breaking. The only other thing that might help: if you're accessing the VM through a Chrome browser (like I did), make sure you have the most up-to-date version of Chrome.
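For what it's worth, one common pattern for keeping a stream like this alive (not necessarily what the script itself does) is to wrap the tweepy call in a retry loop; this sketch reuses the placeholder class and credential names from the sketch near the top.

```python
import time

def run_forever(search_terms):
    """Restart the stream after a dropped connection instead of exiting."""
    while True:
        try:
            stream = PubSubStream(API_KEY, API_SECRET,
                                  ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
            # stall_warnings asks Twitter to warn when the client is falling behind
            stream.filter(track=search_terms, stall_warnings=True)
        except Exception as exc:
            print(f"Stream dropped ({exc}); reconnecting in 30 seconds")
            time.sleep(30)
```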
If your Dataflow job doesn't run, double-check that the necessary Dataflow APIs have been enabled (using the command `gcloud services enable dataflow.googleapis.com`).
One thing to keep in mind is that the tweets streamed into this pipeline come from a random sample that's just 1% of all tweets. So if you streamed in tweets containing `#trending` and you got back 10,000 tweets in the first minute, those 10,000 tweets originated in just a 1% random sample of all of Twitter's new content posted in that minute. In other words, the results don't include every single tweet containing #trending; they include every tweet within a 1% sample that contained #trending.
To get this project off the ground, I initially borrowed code from here:
https://github.com/TDehaene/blogposts/tree/master/got_sentiment