Data-Pipelines-with-Airflow

This project introduces automation and monitoring of the ETL pipelines feeding the Sparkify music streaming application's data warehouse, using Apache Airflow.

Key goals are pipelines that are dynamic, built from reusable tasks, easy to monitor, and support easy backfills.

Approach

To complete the project I built custom operators for creating tables in Redshift, staging the data, loading dimension and fact tables, and finally performing data quality checks.
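As a sketch of the kind of logic one of these operators wraps (the function name, table names, and truncate option here are illustrative assumptions, not the project's exact code), a dimension-load task typically just renders an `INSERT ... SELECT`, optionally truncating first so that backfills stay idempotent:

```python
# Sketch of the SQL-building logic a dimension-load operator might wrap.
# Table and query names are illustrative placeholders.

def build_dimension_load_sql(table: str, select_sql: str, truncate: bool = True) -> str:
    """Render the statement(s) a dimension-load task would run in Redshift.

    In truncate-insert mode the table is emptied first, keeping reruns
    idempotent; in append mode rows are simply added.
    """
    statements = []
    if truncate:
        statements.append(f"TRUNCATE TABLE {table};")
    statements.append(f"INSERT INTO {table} {select_sql};")
    return "\n".join(statements)

print(build_dimension_load_sql(
    "users",
    "SELECT DISTINCT userid, first_name, last_name FROM staging_events",
))
```

The same pattern (parameterized table plus a SELECT) is what makes the tasks reusable across all four dimension tables.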

Data Source

The data for the project resides in S3 and consists of JSON logs and JSON metadata:

  1. s3://udacity-dend/log_data
  2. s3://udacity-dend/song_data

The data resides in the us-west-2 region.
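A staging task would load these buckets into Redshift with a COPY statement. The sketch below shows how such a statement might be rendered; the staging table name, IAM role ARN, and JSON option are placeholder assumptions, not values from the project:

```python
# Sketch of the COPY statement a stage-to-Redshift operator might render.
# Table name, IAM role, and JSON option below are illustrative only.

def build_copy_sql(table: str, s3_path: str, iam_role: str,
                   region: str = "us-west-2", json_option: str = "auto") -> str:
    """Render a Redshift COPY that loads JSON data from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )

print(build_copy_sql(
    "staging_songs",
    "s3://udacity-dend/song_data",
    "arn:aws:iam::123456789012:role/myRedshiftRole",  # placeholder ARN
))
```

For the log data, `json_option` would typically point at a JSONPaths file rather than `auto`, since the log schema does not match the table columns one-to-one.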

How to run project

  1. In the airflow/workspace directory, run /opt/airflow/start.sh to launch Airflow
  2. Create an IAM role in the AWS Management Console and save or copy its credentials
  3. Under the Admin tab in Airflow, add a connection named aws_credentials and paste the credentials from above into the login and password fields
  4. Create a Redshift cluster
  5. Add a second connection to Airflow, passing the cluster endpoint to the host field and the password to the password field
  6. Turn the DAG on and trigger it to run
  7. Data quality check

    To confirm execution of the DAG and loading of data into their respective tables, run `select count(*) from {replace_with_table_you_wish_to_check}`

    More than zero records indicates the DAG ran to completion without failure
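The pass/fail rule described above can be expressed as a small helper. This is a minimal sketch of the check logic only; in the actual DAG the counts would come from running the `SELECT COUNT(*)` queries against Redshift through an Airflow hook, whereas here they are passed in directly:

```python
# Sketch of the pass/fail logic behind the data quality check.
# In the real DAG, counts come from "SELECT COUNT(*)" queries run
# against Redshift; here they are supplied as a plain dict.

def validate_counts(counts: dict) -> None:
    """Raise if any table loaded zero records, mirroring the check above."""
    empty = [table for table, n in counts.items() if n == 0]
    if empty:
        raise ValueError(f"Data quality check failed, empty tables: {empty}")

validate_counts({"songplays": 6820, "users": 104})  # non-zero counts pass silently
```

Raising an exception (rather than just logging) is what lets Airflow mark the task as failed and surface the problem in the UI.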

Reference

  - How-to Guide for PostgresOperator
