This project applies data warehousing with Redshift and builds an ETL pipeline using Python. The ETL pipeline is mainly responsible for the following tasks:
- Copy data from S3 files to staging tables in Redshift
- Insert data from these staging tables into our modelled fact and dimensional tables within Redshift, suited for analysis
The data is for a demo startup called Sparkify, whose analysts would like to run queries on user song-play activity. The data files contain song data and user activity log data.
- Drop and create the tables before running the ETL:
  python create_tables.py
- Run the ETL:
  python etl.py
- Check that the data has been loaded correctly using the Redshift query editor in the AWS console.
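One simple way to check the load is to count the rows in each table. The sketch below only builds the queries, which can then be pasted into the Redshift query editor. The table names come from this README; the helper name `build_count_queries` is hypothetical and not part of the project.

```python
# Table names as described in this README (2 staging tables, 1 fact, 4 dimensions).
TABLES = [
    "staging_events",
    "staging_songs",
    "songplays",
    "users",
    "songs",
    "artists",
    "time",
]


def build_count_queries(tables):
    """Return one row-count query per table (hypothetical helper)."""
    return [f"SELECT COUNT(*) FROM {table};" for table in tables]


if __name__ == "__main__":
    # Print the queries so they can be run manually in the query editor.
    for query in build_count_queries(TABLES):
        print(query)
```

A count of zero in a table after running etl.py suggests that the corresponding copy or insert step did not load anything.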
Data directories:
- song_data: The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. The contents of one such file look like the following:
  {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
- log_data: The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations. The log files are partitioned by year and month.
  [Image: snapshot of log_data]
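To make the song_data record format concrete, here is a minimal sketch that parses the sample song record shown earlier and splits it into the columns of the songs and artists tables. In this project the split happens in SQL inside Redshift, so this is only an illustration; the function name `split_song_record` and the mapping of `artist_name` to the `name` column are assumptions.

```python
import json

# Sample record copied from the song_data example in this README.
SAMPLE = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)


def split_song_record(line):
    """Split one song_data JSON record into (song, artist) column tuples."""
    record = json.loads(line)
    # Column order follows the songs table: song_id, title, artist_id, year, duration.
    song = (
        record["song_id"],
        record["title"],
        record["artist_id"],
        record["year"],
        record["duration"],
    )
    # Column order follows the artists table: artist_id, name, location, latitude, longitude.
    artist = (
        record["artist_id"],
        record["artist_name"],
        record["artist_location"],
        record["artist_latitude"],
        record["artist_longitude"],
    )
    return song, artist


if __name__ == "__main__":
    song, artist = split_song_record(SAMPLE)
    print(song)
    print(artist)
```

Note that JSON `null` values, such as the artist latitude and longitude here, become `None` in Python.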
- create_tables.py - Script that drops and creates the tables according to the data model. Make sure this script is executed before running the ETL.
- sql_queries.py - A collection of all DDL and DML queries used in the project.
- etl.py - The actual ETL script. It loads data from the data files into Redshift by first copying it to the staging tables, and then inserting from the staging tables into the corresponding fact and dimension tables.
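The relationship between these scripts can be sketched roughly as follows: a module holds lists of queries, and a script loops over them in order. This is a sketch under stated assumptions, not the project's actual code. The names `drop_table_queries`, `create_table_queries`, `run_queries` and `reset_tables` are hypothetical, the table definition is a placeholder, and `sqlite3` stands in for Redshift only so that the example runs locally.

```python
import sqlite3

# Hypothetical stand-ins for the query lists that sql_queries.py might export.
drop_table_queries = ["DROP TABLE IF EXISTS users;"]
create_table_queries = [
    "CREATE TABLE IF NOT EXISTS users (user_id INTEGER, level TEXT);"
]


def run_queries(cur, conn, queries):
    """Execute each query in order, committing after every statement."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def reset_tables(cur, conn):
    """Drop existing tables first, then create them again."""
    run_queries(cur, conn, drop_table_queries)
    run_queries(cur, conn, create_table_queries)


if __name__ == "__main__":
    # In-memory SQLite database used purely as a local stand-in for Redshift.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    reset_tables(cur, conn)
    conn.close()
```

Dropping before creating keeps the script safe to rerun, which is why it has to be executed before the ETL rather than after it.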
We create two staging tables to copy data as-is from the AWS S3 bucket.
- staging_events - holds all user activity from the log files in the 'log_data' directory of the S3 bucket, located at s3://udacity-dend/log_data
- staging_songs - holds all song tracks from the files at the S3 location s3://udacity-dend/song_data
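Loading a staging table comes down to issuing one Redshift COPY statement per table. The sketch below builds such a statement as a string for the two S3 locations listed above. It assumes the `region 'us-west-2'` and `json 'auto'` options suit both datasets, which this README only shows for staging_songs. The function name `build_copy` and the IAM role ARN are hypothetical placeholders.

```python
# Template for a Redshift COPY statement from S3 JSON files.
COPY_TEMPLATE = (
    "copy {table} from '{s3_path}' "
    "iam_role '{iam_role}' region 'us-west-2' json 'auto';"
)


def build_copy(table, s3_path, iam_role):
    """Fill the COPY template for one staging table (hypothetical helper)."""
    return COPY_TEMPLATE.format(table=table, s3_path=s3_path, iam_role=iam_role)


if __name__ == "__main__":
    # Placeholder ARN; a real one would come from the project's configuration.
    role = "arn:aws:iam::123456789012:role/example-redshift-role"
    print(build_copy("staging_songs", "s3://udacity-dend/song_data", role))
    # Assumption: the same options are used for the log files.
    print(build_copy("staging_events", "s3://udacity-dend/log_data", role))
```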
Using the song and log datasets, we create a star schema optimized for queries on song play analysis.
We modelled the data as 1 fact table and 4 corresponding dimension tables:
- songplays - records in the log data associated with song plays, i.e. records with the page NextSong
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
  - user_id, first_name, last_name, gender, level
- songs - songs in the music database
  - song_id, title, artist_id, year, duration
- artists - artists in the music database
  - artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays, broken down into specific units
  - start_time, hour, day, week, month, year, weekday
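As an illustration of how the time dimension relates to a single start_time, the sketch below breaks one timestamp into the units listed for the time table. It assumes the timestamp is given as epoch milliseconds in UTC, which this README does not state, and the function name `time_dimension_row` is hypothetical. In the project itself this breakdown would be done in SQL inside Redshift.

```python
from datetime import datetime, timezone


def time_dimension_row(ts_ms):
    """Break an epoch-milliseconds timestamp into the time table's columns."""
    # Assumption: ts_ms is milliseconds since the Unix epoch, interpreted as UTC.
    start_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),  # Python convention: Monday is 0
    }


if __name__ == "__main__":
    print(time_dimension_row(1_000_000_000_000))
```

Weekday numbering differs between tools, so the values produced here may not match what a SQL engine returns for the same timestamp.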
The ETL process is written in Python, uses Redshift as the data warehouse, and uses the 'psycopg2' library to connect and write to Redshift.
The ETL process is modelled mainly as two components/processors:
- Copy data from S3 files to the staging tables using the Redshift COPY command, e.g.
  copy staging_songs from {SONG_DATA} iam_role {IAM_ROLE} region 'us-west-2' json 'auto';
- Insert data from these staging tables into our modelled fact and dimensional tables within Redshift, suited for analysis, e.g.
  INSERT INTO users (user_id, first_name, last_name, gender, level)
  SELECT user_id, first_name, last_name, gender, level
  FROM staging_events WHERE user_id IS NOT NULL;
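The second step, inserting from a staging table into a dimension table, can be tried out locally. The sketch below runs the users INSERT ... SELECT from this README against an in-memory SQLite database, which stands in for Redshift only so that the example is runnable. The column types and the two sample rows are invented for illustration, and the function name `demo` is hypothetical.

```python
import sqlite3

# The INSERT ... SELECT statement from this README.
INSERT_USERS = (
    "INSERT INTO users (user_id, first_name, last_name, gender, level) "
    "SELECT user_id, first_name, last_name, gender, level "
    "FROM staging_events WHERE user_id IS NOT NULL;"
)

COLUMNS = "user_id INTEGER, first_name TEXT, last_name TEXT, gender TEXT, level TEXT"


def demo():
    """Load invented staging rows, run the insert, and return the users rows."""
    conn = sqlite3.connect(":memory:")  # local stand-in, not Redshift
    cur = conn.cursor()
    cur.execute(f"CREATE TABLE staging_events ({COLUMNS})")
    cur.execute(f"CREATE TABLE users ({COLUMNS})")
    # Invented sample data: one event with a user, one without a user_id.
    rows = [
        (1, "Ada", "Lovelace", "F", "paid"),
        (None, None, None, None, "free"),
    ]
    cur.executemany("INSERT INTO staging_events VALUES (?, ?, ?, ?, ?)", rows)
    cur.execute(INSERT_USERS)
    conn.commit()
    result = cur.execute("SELECT user_id, level FROM users").fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(demo())
```

Only the row with a non-null user_id reaches the users table. Because the same user usually appears in many log events, this statement as written inserts one row per event, so deduplicating (for example with SELECT DISTINCT) may be worth considering.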