Sparkify Data Lake | ETL Pipeline

Summary

Project Overview

In this project, data is extracted from an AWS S3 bucket, processed with Spark to create fact and dimension tables, and the resulting tables are loaded back into S3. The whole process runs inside a Spark session.

Prerequisites

  1. Install Python 3
  2. Install PySpark: pip install pyspark (the os and pyspark.sql modules used by the script ship with Python and PySpark)
    Optional:
  3. Jupyter Notebook
  4. PyCharm

ETL Pipeline

  1. Read data from S3

    • Song data: s3://udacity-dend/song_data
    • Log data: s3://udacity-dend/log_data
  2. Transform the data using Spark

    • Create five tables (a minimal PySpark sketch follows this list)

    Fact Table

    songplays - song play records derived from the log data
    Fields - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

    Dimension Tables

    users - users in the app
    Fields - user_id, first_name, last_name, gender, level

    songs - songs in the database
    Fields - song_id, title, artist_id, year, duration

    artists - artists in the database
    Fields - artist_id, name, location, latitude, longitude

    time - timestamps of records in songplays broken down into specific units
    Fields - start_time, hour, day, week, month, year, weekday
  3. Load the fact and dimension tables back into S3
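To make the steps concrete, here is a minimal PySpark sketch of the pipeline. It is an illustration rather than the exact contents of ETL.py: the wildcard path depth, the NextSong page filter, the output bucket (s3a://my-output-bucket/), and the partitioning columns are assumptions, and only the songs and time tables are shown.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkifyDataLake").getOrCreate()

# 1. Read the raw JSON data from S3 (the path globs are assumptions)
song_data = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")
log_data = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")

# 2. Transform: songs dimension table from the song data
songs_table = song_data.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# Time dimension table from the log data's millisecond timestamps
# (filtering on page == "NextSong" is an assumption about the log format)
time_table = (
    log_data
    .filter(F.col("page") == "NextSong")
    .withColumn("start_time", F.to_timestamp(F.col("ts") / 1000))
    .select(
        "start_time",
        F.hour("start_time").alias("hour"),
        F.dayofmonth("start_time").alias("day"),
        F.weekofyear("start_time").alias("week"),
        F.month("start_time").alias("month"),
        F.year("start_time").alias("year"),
        F.dayofweek("start_time").alias("weekday"),
    )
    .dropDuplicates(["start_time"])
)

# 3. Load: write the tables back to S3 as parquet files
# (the output bucket and partition columns are illustrative)
songs_table.write.mode("overwrite").partitionBy("year", "artist_id").parquet("s3a://my-output-bucket/songs/")
time_table.write.mode("overwrite").partitionBy("year", "month").parquet("s3a://my-output-bucket/time/")

The users, artists, and songplays tables follow the same select/deduplicate/write pattern on the fields listed above.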

Setup Instructions:

  1. Populate the dwh.cfg config file with your AWS credentials (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY):

[KEY]
AWS_ACCESS_KEY_ID=xxx
AWS_SECRET_ACCESS_KEY=xxx

  2. Run ETL.py
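For reference, here is a minimal sketch of how ETL.py could pick up these credentials with configparser and export them as environment variables before creating the Spark session, assuming the file is named dwh.cfg and uses the [KEY] section shown above.

import configparser
import os

# Read the AWS credentials from dwh.cfg ([KEY] section shown above)
config = configparser.ConfigParser()
config.read("dwh.cfg")

# Export them so Spark's S3 connector can authenticate
os.environ["AWS_ACCESS_KEY_ID"] = config["KEY"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["KEY"]["AWS_SECRET_ACCESS_KEY"]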
