This repository implements a data pipeline for African Financial Market data.
The objective of this project is to provide data to equity analysts and investment professionals focusing on African Capital Markets. This project implements an ELT pipeline by scraping data using Scrapy and uploading it to Google Cloud Platform. Analysts can then perform high-level transformations and analysis in BigQuery using dbt and visualize the data using Tableau.
The data is scraped from african-markets.
This project is motivated by my experience as a treasury analyst, which involved analyzing investment instruments on African markets where treasury reserves could be invested for high ROI at low risk. Additionally, as a freelancer on Upwork, I took on a gig as a Research/Investment Analyst, which required access to financial data locked behind paywalls.
This project demonstrates the implementation of a highly scalable and containerized web crawler using Scrapy to scrape data at regular intervals and upload the data to Google Cloud. The scraped data is first loaded into PostgreSQL, then queried and uploaded to Google Cloud Storage, and subsequently to BigQuery.
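The hand-off between PostgreSQL and BigQuery described above typically goes through newline-delimited JSON, the format BigQuery loads natively from Cloud Storage. A minimal sketch of that serialization step (the function name and the row/column shapes are illustrative, not taken from this repo's `extract_to_json.py`):

```python
import json

def rows_to_ndjson(rows, columns):
    """Serialize query results as newline-delimited JSON (NDJSON),
    one JSON object per line, ready for a GCS-to-BigQuery load."""
    return "\n".join(json.dumps(dict(zip(columns, row))) for row in rows)

# Example: rows as they might come back from a database cursor
columns = ["ticker", "company", "price"]
rows = [
    ("SCB", "Standard Chartered Bank Ghana", 19.75),
    ("EQTY", "Equity Group Holdings", 44.25),
]
print(rows_to_ndjson(rows, columns))
```

Each output line is an independent JSON object, which is what lets BigQuery ingest the file in parallel.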
- Terraform: Used to provision the cloud infrastructure, specifically Google Cloud Storage and BigQuery. Cloud Composer could also be provisioned instead of using Airflow, but Airflow is more cost-effective.
- Scrapy: Used to scrape the website and manage the scraping process.
- Airflow: Orchestrates the workflow from data collection to loading data into BigQuery. The Google alternative is Cloud Composer.
- Docker: Containerizes the data pipeline to ensure portability and consistency.
- PGAdmin: Manages the PostgreSQL database.
- Google Cloud Resources:
- Google Cloud Storage: Google's immutable object storage for unstructured data.
- BigQuery: Google's cloud data warehouse.
```shell
mkdir African_Markets_data
cd African_Markets_data
git clone git@github.com:amaboh/african_financial_markets.git
cd african_financial_markets
```
```
African_Markets_data/
|-- airflow/
|   |-- dags/
|   |   |-- __init__.py
|   |   |-- market_data_dag.py
|   |   |-- market_data/
|   |   |   |-- __init__.py
|   |   |   |-- items.py
|   |   |   |-- pipelines.py
|   |   |   |-- settings.py
|   |   |   |-- middlewares.py
|   |   |   |-- extract_to_json.py   # Script for extracting data to JSON
|   |   |   |-- spiders/
|   |   |       |-- __init__.py
|   |   |       |-- companyspider.py
|   |   |       |-- marketspider.py
|   |-- Dockerfile
|   |-- requirements.txt
|   |-- docker-compose.yml
|-- db/
|   |-- Dockerfile
|   |-- custom-entrypoint.sh
|   |-- jobs.sql
|-- terraform/
    |-- main.tf
    |-- variables.tf
    |-- keys/
        |-- gcp-creds.json
```
- Go to Google Cloud Console and create a New Project.
- Go to IAM & Admin, create a service account with Cloud Storage Admin and BigQuery Admin privileges, and download a JSON key for it.
- Move the downloaded key to `terraform/keys/` and rename it to `gcp-creds.json` (this path is included in the `.gitignore` file).
- Go to `terraform/variables.tf`, copy the Project ID from the console, replace "african-indices-420410", and update your region.
```hcl
variable "project" {
  description = "project"
  default     = "african-indices-420410"
}
```
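For orientation, `main.tf` declares the two resources this project provisions. A hedged sketch of what those declarations could look like, assuming a `region` variable exists alongside `project` (the bucket and dataset names here are illustrative, not taken from this repo):

```hcl
resource "google_storage_bucket" "data_lake" {
  name     = "african-markets-data-lake"   # bucket names must be globally unique
  location = var.region
}

resource "google_bigquery_dataset" "markets" {
  dataset_id = "african_markets"
  project    = var.project
  location   = var.region
}
```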
- Provision resources:

```shell
cd terraform/
terraform init
terraform plan
terraform apply
```

Confirm with `yes` and verify the resources (Cloud Storage and BigQuery) in the console.
- Adjust the `docker-compose.yml` file to reference the Google service account key on your local machine:

```yaml
x-airflow-common: &airflow-common
  build: .
  environment: &airflow-common-env
    GOOGLE_APPLICATION_CREDENTIALS: /opt/airflow/keys/gcp-creds.json  # Ensure this path is correct
  volumes:
    - /full/path/to/terraform/keys:/opt/airflow/keys  # Replace with the full path on your machine
```
- Build and run the Docker containers:

```shell
docker-compose build
docker-compose up -d
```
- Go to `localhost:8080` in your web browser and log in with user `airflow` and password `airflow`.
- Go to Admin > Connections > + (add new record) and enter the details:
```json
{
  "Connection Id": "ingest_db_conn_id",
  "Connection Type": "Postgres",
  "Host": "postgres",
  "Database": "ingest_db",
  "Login": "airflow",
  "Password": "airflow"
}
```
- Go to Admin > Connections > + (add new record) and enter the details:

```json
{
  "Connection Id": "google_cloud_default",
  "Project Id": "african-indices-420410",
  "Keyfile Path": "/opt/airflow/keys/gcp-creds.json"
}
```
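As an alternative to clicking through the UI, Airflow can also read connections from `AIRFLOW_CONN_<CONN_ID>` environment variables expressed as URIs. A minimal sketch of building the Postgres connection URI above (the helper function is illustrative, not part of this repo):

```python
from urllib.parse import quote

def postgres_conn_uri(login, password, host, database, port=5432):
    """Build an Airflow-style connection URI. Exporting it as
    AIRFLOW_CONN_INGEST_DB_CONN_ID makes the connection available
    to DAGs without touching the Airflow UI."""
    return f"postgres://{quote(login)}:{quote(password)}@{host}:{port}/{database}"

# Matches the ingest_db_conn_id details entered in the UI above
print(postgres_conn_uri("airflow", "airflow", "postgres", "ingest_db"))
```

You would set the result in the container environment, e.g. in `docker-compose.yml` under `environment:`.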
- From the Airflow UI, trigger the DAG with the Play Button and ensure all processes complete successfully.