
Udacity-Data-Engineering-Nanodegree-Capstone-Project

The aim of this project is to build analytics tables from USA immigration data, enriched with other datasets such as airport codes, global land temperatures, and US city demographics. Analysts can then perform analyses such as checking the frequency of immigration by month and year.

A sample query:

select
  count(f.cic_id) as "number of arrivals",
  month,
  year,
  dc.country
from fact_immigration f
join dim_immigrants di on (f.cic_id = di.cic_id)
join dim_countries dc on (di.res = dc.code)
group by year, month, dc.country;

The result:

(screenshot of the sample query result)

Approach

Build an Amazon S3 hosted data lake and an ETL pipeline that loads the raw data from S3, processes it with Spark, and writes the transformed data back to S3 as fact and dimension tables.
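
As a rough sketch of that flow in PySpark (the bucket paths and column choices here are illustrative assumptions, not the exact logic in etl.py):

from pyspark.sql import SparkSession

# Bucket paths are placeholders; point these at your own S3 locations.
RAW_BUCKET = "s3a://my-raw-bucket"
LAKE_BUCKET = "s3a://my-data-lake-bucket"

spark = SparkSession.builder.appName("capstone-etl-sketch").getOrCreate()

# Read the raw immigration data (assumed here to be staged as Parquet;
# the real project reads the SAS source files).
immigration = spark.read.parquet(f"{RAW_BUCKET}/immigration/")

# Example transformation: keep the columns the fact table needs and drop
# duplicate records.
fact_immigration = (
    immigration.select("cic_id", "year", "month")
    .dropDuplicates(["cic_id"])
)

# Write the fact table back to the S3 data lake, partitioned for fast scans.
fact_immigration.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet(f"{LAKE_BUCKET}/fact_immigration/")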

Data Sources

Immigration data

Airport data

US cities demographics

Global land temperature data

Conceptual Data Model

The conceptual data model is a star schema with a central fact table, fact_immigration, and auxiliary dimension tables. A star schema has the following advantages:

  1. Fast aggregations
  2. Simple queries
  3. Denormalized tables, which reduce the number of joins

(star schema diagram)

Run ETL

To run the ETL, enter your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY into the capstone.cfg file, then run the etl.py script to trigger the Spark job that processes and transforms the data into fact and dimension tables. Check the output in your S3 data lake to confirm a successful load.
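
For illustration, this is roughly how etl.py might pick up the credentials from capstone.cfg, assuming an [AWS] section with those two keys (check the script for the exact section and key names it expects):

import configparser
import os

# Read the AWS credentials from capstone.cfg; the [AWS] section name is an
# assumption -- use whatever section/key names etl.py actually expects.
config = configparser.ConfigParser()
config.read("capstone.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]

With the credentials exported, running etl.py starts the Spark job.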

Clean Up Process

The cleanup process involves dropping duplicate rows, reading the SAS labels file to produce dimension tables for city and country codes, and finally checking for missing data in the temperature dataset.
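
A minimal sketch of the duplicate and missing-data checks in PySpark, where immigration and temperature are assumed to be already-loaded DataFrames and "AverageTemperature" is the assumed name of the temperature column:

from pyspark.sql import functions as F

# Drop exact duplicate rows from the immigration data.
immigration_clean = immigration.dropDuplicates()

# Count rows in the temperature dataset with a missing reading, then drop
# them before building the temperature dimension.
missing = temperature.filter(F.col("AverageTemperature").isNull()).count()
print(f"Rows with a missing temperature reading: {missing}")

temperature_clean = temperature.dropna(subset=["AverageTemperature"])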

Ideal Setup

In a production environment I would include a workflow management tool such as Apache Airflow to schedule the DAG runs. Assuming it is a batch pipeline, I would schedule it to run daily at 12:00 AM UTC so as to capture the previous day's activity data. I would also include a Redshift warehouse, allowing analysts to perform further analysis either by querying views directly or through a BI tool set on top of the warehouse.
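
A minimal sketch of such a daily DAG, assuming Airflow 2.x; the dag_id, task name, and the run_etl placeholder are illustrative rather than the project's actual code:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_etl(**context):
    # Placeholder: in practice this would submit the Spark job, e.g. by
    # shelling out to spark-submit or calling into etl.py.
    pass


with DAG(
    dag_id="immigration_etl_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 0 * * *",  # daily at 12:00 AM UTC
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="run_spark_etl", python_callable=run_etl)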

Increasing data by 100x

If the data were increased by 100x, Spark's distributed processing engine would be able to handle it. We could also process the data in partitions so as to stay within memory limits. In addition, instead of running the Spark job on a single machine, we would host Spark on AWS EMR, which allows running petabyte-scale data analytics faster.
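
A rough sketch of that partitioned approach, assuming an existing SparkSession named spark, illustrative bucket paths, and a 2016 data year:

# Raise shuffle parallelism so a much larger dataset is split across many
# more, smaller tasks that each fit in executor memory.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

immigration = spark.read.parquet("s3a://my-raw-bucket/immigration/")

# Process and write the data one month at a time instead of all at once.
for month in range(1, 13):
    chunk = immigration.filter(
        (immigration.year == 2016) & (immigration.month == month)
    )
    chunk.write.mode("append") \
        .partitionBy("year", "month") \
        .parquet("s3a://my-data-lake-bucket/fact_immigration/")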

The data populates a dashboard that must be updated by 7 AM every day

As the immigration data is updated on a monthly basis, I would schedule the monthly DAG to run comfortably before 7 AM, roughly two hours earlier, to append the new data to the tables so that the downstream dashboard task does not fail or load empty tables.
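
A compact sketch of how that deadline could be encoded, reusing the Airflow pattern from the earlier sketch; the 5 AM UTC cron and the two-hour SLA are illustrative values:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_etl(**context):
    # Placeholder for the monthly Spark load.
    pass


with DAG(
    dag_id="immigration_etl_monthly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 5 1 * *",  # 5:00 AM UTC on the 1st of the month
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_spark_etl",
        python_callable=run_etl,
        # Flag the run if it is still going two hours after its scheduled
        # start, i.e. past the 7 AM dashboard deadline.
        sla=timedelta(hours=2),
    )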

The database needs to be accessed by 100+ people

Implement access management by adding users to groups with different data needs, then grant read permissions on the tables each group uses most frequently. Based on group/role, all that is needed when a new analyst or employee joins the company is to add them to one of the predefined groups, and they inherit all access rights associated with that group.
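
For example, on Amazon Redshift this can be done with group-based grants. A rough sketch, where the analysts group, analytics schema, new_analyst user, and connection details are all hypothetical (the same statements could just as well be run from a SQL client instead of through psycopg2):

import os

import psycopg2

# Connection details are placeholders; use your own Redshift endpoint.
conn = psycopg2.connect(
    host="my-cluster.example.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="admin",
    password=os.environ["REDSHIFT_PASSWORD"],
)

statements = [
    # Group-based access: create the group once, grant read access on the
    # schema holding the analytics tables, then add users as they join.
    "CREATE GROUP analysts;",
    "GRANT USAGE ON SCHEMA analytics TO GROUP analysts;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP analysts;",
    "ALTER GROUP analysts ADD USER new_analyst;",
]

with conn, conn.cursor() as cur:
    for sql in statements:
        cur.execute(sql)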

References

ref #1: PySpark split() Column into Multiple Columns

ref #2: How to Turn Python Functions into PySpark Functions (UDF)

ref #3: ZIP Function
