Building a Data Warehouse ETL pipeline for Sparkify

Introduction

Sparkify provides music streaming to its users. Song details and user activity details are collected and stored as JSON files. This project applies data warehousing concepts so that Sparkify can run ad hoc queries against that data.

Description

Sparkify is struggling to query its data while it sits in raw JSON files: the analytics team needs to run many queries, and those results also drive the data science team. To make this easier, we build an ETL pipeline that reads the JSON data from S3, loads it into staging tables in Redshift, and then creates fact and dimension tables from those staging tables and inserts the data into them. The fact and dimension tables live in Redshift, whose columnar storage speeds up analytical queries.
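
To make the flow concrete, here is a minimal sketch of such a pipeline driven from Python with psycopg2. It is not the repository's actual etl.py: the query lists are left as placeholders and the config section and key names are assumptions.

    import configparser
    import psycopg2

    # Placeholder query lists: first the S3 -> staging COPY statements,
    # then the staging -> fact/dimension INSERT ... SELECT statements.
    copy_table_queries = []
    insert_table_queries = []

    def run_queries(cur, conn, queries):
        """Execute each SQL statement and commit after it finishes."""
        for query in queries:
            cur.execute(query)
            conn.commit()

    def main():
        # Connection details come from dwh.cfg (section/key names are illustrative)
        config = configparser.ConfigParser()
        config.read('dwh.cfg')
        conn = psycopg2.connect(
            host=config['CLUSTER']['HOST'],
            dbname=config['CLUSTER']['DB_NAME'],
            user=config['CLUSTER']['DB_USER'],
            password=config['CLUSTER']['DB_PASSWORD'],
            port=config['CLUSTER']['DB_PORT'],
        )
        cur = conn.cursor()
        run_queries(cur, conn, copy_table_queries)    # load staging tables from S3
        run_queries(cur, conn, insert_table_queries)  # fill the star schema from staging
        conn.close()

    if __name__ == "__main__":
        main()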

Datasets

Log Dataset

{"artist":"Pavement", "auth":"Logged In", "firstName":"Sylvie", "gender", "F", "itemInSession":0, "lastName":"Cruz", "length":99.16036, "level":"free", "location":"Klamath Falls, OR", "method":"PUT", "page":"NextSong", "registration":"1.541078e+12", "sessionId":345, "song":"Mercy:The Laundromat", "status":200, "ts":1541990258796, "userAgent":"Mozilla/5.0(Macintosh; Intel Mac OS X 10_9_4...)", "userId":10}

Song Dataset

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Schema for fact table and dimensional tables

Fact table - songplays

songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
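
As an illustration only, a Redshift definition of this fact table could look like the sketch below, written in the style of a query-definitions module; the IDENTITY surrogate key, the data types, and the distribution/sort keys are assumptions rather than the repository's exact DDL.

    # Hypothetical DDL for the fact table; column types and keys are one possible choice.
    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id  INT IDENTITY(0, 1) PRIMARY KEY,
            start_time   TIMESTAMP NOT NULL SORTKEY,
            user_id      INT NOT NULL,
            level        VARCHAR,
            song_id      VARCHAR DISTKEY,
            artist_id    VARCHAR,
            session_id   INT,
            location     VARCHAR,
            user_agent   VARCHAR
        );
    """

Distributing on song_id and sorting on start_time is just one reasonable choice for time-based analytical queries.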

Dimensional tables - users, songs, artists, time

users - user_id, first_name, last_name, gender, level
songs - song_id, title, artist_id, year, duration
artists - artist_id, name, location, latitude, longitude
time - start_time, hour, day, week, month, year, weekday
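
Each of these tables is filled with an INSERT ... SELECT from the staging tables listed in the next section. The fact table is the most involved case, since it joins the two staging tables on song title and artist name and converts the millisecond epoch timestamp; the sketch below shows the rough shape, not the repository's exact SQL (songplay_id is assumed to be auto-generated).

    # Hypothetical insert for the fact table, joining events to song metadata.
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, e.location, e.userAgent
        FROM stagingevents e
        JOIN stagingsongs s
          ON e.song = s.title
         AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """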

Schema for staging tables - stagingevents, stagingsongs

stagingevents - event_id, artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userAgent, userId

stagingsongs - num_songs, artist_id, artist_latitude, artist_longitude, artist_location, artist_name, song_id, title, duration, year
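
The staging tables are loaded straight from the JSON files in S3 with Redshift COPY commands, as mentioned in the description. The sketch below keeps the same query-definitions style; the bucket paths, JSONPaths file, and IAM role ARN are placeholders, not the project's real values.

    # Hypothetical COPY statements for bulk-loading the staging tables from S3.
    staging_events_copy = """
        COPY stagingevents
        FROM 's3://example-bucket/log_data'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://example-bucket/log_json_path.json';
    """

    staging_songs_copy = """
        COPY stagingsongs
        FROM 's3://example-bucket/song_data'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """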

Steps required to run the project

  1. Configuration setup - Fill in dwh.cfg with the information needed to start a Redshift cluster (a sample layout is sketched after this list)
  2. Run create_redshift_cluster - Run this Jupyter notebook to create the cluster
  3. Run create_tables.py - Run this Python script to drop and recreate the tables
  4. Run etl.py - Run this Python script to execute the ETL pipeline that loads the staging tables and fills the fact and dimension tables
  5. Run test_analytics.py - Run this Python script to execute some basic analytical queries
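
Step 1 refers to a configparser-style INI file. The layout below is only an illustration of the kind of values needed (cluster connection details, an IAM role that can read from S3, and the dataset paths); the actual dwh.cfg in the repository may use different section and key names.

    [CLUSTER]
    HOST = <redshift-cluster-endpoint>
    DB_NAME = <database-name>
    DB_USER = <database-user>
    DB_PASSWORD = <database-password>
    DB_PORT = 5439

    [IAM_ROLE]
    ARN = <iam-role-arn-with-s3-read-access>

    [S3]
    LOG_DATA = s3://<bucket>/log_data
    SONG_DATA = s3://<bucket>/song_data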

Don't forget to run the last steps in the Jupyter notebook to delete the cluster.
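
The notebook handles the teardown, but for reference, deleting a Redshift cluster from Python with boto3 looks roughly like this; the cluster identifier and region are placeholders.

    import boto3

    # Tear down the cluster without keeping a final snapshot
    # (the identifier and region below are placeholders).
    redshift = boto3.client('redshift', region_name='us-west-2')
    redshift.delete_cluster(
        ClusterIdentifier='dwh-cluster',
        SkipFinalClusterSnapshot=True,
    )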
