
E-Commerce-DataPipeline


Project Overview

This project builds an event-driven data ingestion and transformation pipeline for e-commerce transactional data. It uses AWS services such as S3, Lambda, Glue, Redshift, EventBridge, and SNS to ingest, transform, validate, and upsert data into Amazon Redshift for analytics.


Architectural Diagram

Architecture Design


Key Steps

1. Dimension Tables with Sample Records and Fact Table

  • Pre-load these dimension tables into Redshift as part of the setup process (a DDL sketch follows this list).

  • Products Dimension Table (dim_products):

    • Columns: product_id, product_name, category, price, supplier_id

  • Customers Dimension Table (dim_customers):

    • Columns: customer_id, first_name, last_name, email, membership_level

  • Transactions Fact Table (fact_transactions):

    • Columns: transaction_id, customer_id, customer_email, product_id, product_name, total_price, transaction_date, payment_type, status
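
A minimal sketch of pre-loading these tables through the Redshift Data API with boto3; the cluster identifier, database, user, and column data types are assumptions to adjust for your environment.

```python
import boto3

# Assumed identifiers -- replace with your own cluster, database, and user.
CLUSTER_ID = "ecommerce-redshift-cluster"
DATABASE = "dev"
DB_USER = "awsuser"

# Column names follow the README; the data types are assumptions.
DDL_STATEMENTS = [
    """
    CREATE TABLE IF NOT EXISTS dim_products (
        product_id   INT,
        product_name VARCHAR(100),
        category     VARCHAR(50),
        price        DECIMAL(10, 2),
        supplier_id  INT
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS dim_customers (
        customer_id      INT,
        first_name       VARCHAR(50),
        last_name        VARCHAR(50),
        email            VARCHAR(100),
        membership_level VARCHAR(20)
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS fact_transactions (
        transaction_id   VARCHAR(50),
        customer_id      INT,
        customer_email   VARCHAR(100),
        product_id       INT,
        product_name     VARCHAR(100),
        total_price      DECIMAL(10, 2),
        transaction_date DATE,
        payment_type     VARCHAR(20),
        status           VARCHAR(20)
    )
    """,
]

client = boto3.client("redshift-data")

# Run the DDL as one ordered batch through the Redshift Data API.
client.batch_execute_statement(
    ClusterIdentifier=CLUSTER_ID,
    Database=DATABASE,
    DbUser=DB_USER,
    Sqls=DDL_STATEMENTS,
)
```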

2. Mock Data Generation

  • Generate mock transaction data for customers and products with a Lambda function whose code is built and deployed through AWS CodeBuild. The Lambda generates the data and uploads CSV files to S3 using Hive-style partitioning (a sketch of the handler follows below).

    • CODE BUILD: Codebuild

    • Mock CSV files generated by the Lambda (whose code is deployed from the GitHub repo via the CI/CD setup with AWS CodeBuild) are stored in S3 using the following Hive-style partitioning: s3://your-bucket/transactions/year=2023/month=03/day=15/transactions_2023-03-15.csv.

    RawCSVFiles
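
A minimal sketch of such a generator Lambda, assuming the bucket name from the example path above; the row count, value ranges, and file naming are illustrative only.

```python
import boto3
import csv
import io
import random
import uuid
from datetime import date

BUCKET = "your-bucket"  # assumed bucket name, as in the path shown above

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Generate mock transactions and upload them as a Hive-partitioned CSV."""
    today = date.today()
    rows = []
    for _ in range(100):
        rows.append({
            "transaction_id": str(uuid.uuid4()),
            "customer_id": random.randint(1, 50),
            "customer_email": f"customer{random.randint(1, 50)}@example.com",
            "product_id": random.randint(1, 20),
            "product_name": f"product_{random.randint(1, 20)}",
            "total_price": round(random.uniform(5, 500), 2),
            "transaction_date": today.isoformat(),
            "payment_type": random.choice(["card", "paypal", "cod"]),
            "status": random.choice(["completed", "pending", "refunded"]),
        })

    # Write the rows to an in-memory CSV buffer.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

    # Hive-style partitioning: year=YYYY/month=MM/day=DD.
    key = (
        f"transactions/year={today:%Y}/month={today:%m}/day={today:%d}/"
        f"transactions_{today.isoformat()}.csv"
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    return {"uploaded": key, "rows": len(rows)}
```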


3. Create AWS Glue Crawlers

  • Create a Glue crawler for the S3 input directory (data stored in Hive-style layout with multiple partitions); see the crawler-creation sketch after this list.
  • Create a Glue crawler for the fact_transactions Redshift table. NOTE: a Glue connection to Redshift must be created first, and two points are important to remember:
    • The security group associated with Redshift must allow traffic on the Redshift port 5439.
    • The VPC associated with Redshift must have an S3 endpoint defined.

crawlers
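
Both crawlers can also be created with boto3 as sketched below; the role ARN, Glue database names, bucket path, and connection name are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Placeholder role -- the crawler role needs access to S3 and the Glue connection.
ROLE_ARN = "arn:aws:iam::123456789012:role/GlueCrawlerRole"

# Crawler over the Hive-partitioned raw CSV files in S3.
glue.create_crawler(
    Name="ecommerce-raw-transactions-crawler",
    Role=ROLE_ARN,
    DatabaseName="ecommerce_raw",
    Targets={"S3Targets": [{"Path": "s3://your-bucket/transactions/"}]},
)

# Crawler over the fact_transactions table in Redshift, via the Glue JDBC connection.
glue.create_crawler(
    Name="ecommerce-redshift-fact-crawler",
    Role=ROLE_ARN,
    DatabaseName="ecommerce_dwh",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "redshift-connection",  # Glue connection to Redshift
                "Path": "dev/public/fact_transactions",
            }
        ]
    },
)
```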


4. Create Glue ETL Flow

  • GlueETL
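
A minimal sketch of the Glue ETL script, assuming the catalog database/table names from the crawler step and a Glue connection named redshift-connection: it reads the raw partitioned CSVs, applies a basic validation filter, and loads the result into a Redshift staging table that step 10 merges from.

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled raw transactions from the Glue Data Catalog
# (database and table names are assumptions based on the crawler setup above).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="ecommerce_raw", table_name="transactions"
)

# Basic validation: drop rows without a transaction_id or with a non-positive total.
df = raw.toDF().dropna(subset=["transaction_id"]).filter("total_price > 0")
validated = DynamicFrame.fromDF(df, glue_context, "validated")

# Load the validated rows into a Redshift staging table via the Glue connection;
# step 10 merges this staging table into fact_transactions.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=validated,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "stage_transactions", "database": "dev"},
    redshift_tmp_dir="s3://your-bucket/glue-temp/",
)

job.commit()
```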

5. Create a Lambda to start the Glue job when new data lands in S3:

  • s3lambdaGlueJob
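
A minimal sketch of this trigger Lambda, assuming an S3 ObjectCreated notification and a Glue job named ecommerce-transactions-etl (both names are placeholders).

```python
import boto3

glue = boto3.client("glue")

GLUE_JOB_NAME = "ecommerce-transactions-etl"  # assumed job name

def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; starts the Glue ETL job."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = glue.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        print(f"Started Glue job run {response['JobRunId']} for s3://{bucket}/{key}")
    return {"status": "ok"}
```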

6. Create SNS Topic:

  • Create an SNS topic to send an email when the Glue ETL job finishes; the same notification also drives the final Lambda that archives the data.
    • s3lambdaGlueJob
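
A sketch of the topic setup with boto3; the topic name, email address, and Lambda ARN are placeholders, and the subscribed Lambda additionally needs a resource-based permission that lets SNS invoke it.

```python
import boto3

sns = boto3.client("sns")

# Assumed topic name.
topic = sns.create_topic(Name="glue-etl-job-status")
topic_arn = topic["TopicArn"]

# Email subscription for job-status notifications (the recipient must confirm it).
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="data-team@example.com")

# The archival Lambda (step 8) subscribes to the same topic so it runs as soon as
# the job-status notification is published. Placeholder function ARN.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="lambda",
    Endpoint="arn:aws:lambda:us-east-1:123456789012:function:archive-raw-transactions",
)
print(topic_arn)
```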

7. Create EventBridge Rule:

  • Create an EventBridge rule that publishes to the SNS topic whenever the Glue job changes state.
    • EventRuleforSNS
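
A sketch of the rule with boto3, reusing the job and topic names assumed earlier; it matches aws.glue "Glue Job State Change" events and forwards them to the SNS topic.

```python
import boto3
import json

events = boto3.client("events")

# Match Glue job state-change events for the ETL job (job name is an assumption).
rule_name = "glue-etl-job-state-change"
events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {
            "jobName": ["ecommerce-transactions-etl"],
            "state": ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"],
        },
    }),
    State="ENABLED",
)

# Send matching events to the SNS topic created in step 6 (placeholder ARN).
# The topic's access policy must allow events.amazonaws.com to publish to it.
events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "sns-glue-status",
        "Arn": "arn:aws:sns:us-east-1:123456789012:glue-etl-job-status",
    }],
)
```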

8. Create a Lambda to archive data after the SNS notification:

  • s3lambdaGlueJob
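
A minimal sketch of the archival Lambda, assuming it is subscribed to the SNS topic and that the source and archive bucket names are placeholders; it copies the processed CSVs into the archive bucket and deletes them from the raw bucket.

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "your-bucket"           # raw transactions bucket (assumed)
ARCHIVE_BUCKET = "your-archive-bucket"  # archive bucket from step 9 (assumed)
PREFIX = "transactions/"

def lambda_handler(event, context):
    """Triggered by the SNS job-status notification; moves processed CSVs to the archive bucket."""
    archived = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Copy under the same Hive-partitioned key, then delete the source object.
            s3.copy_object(
                Bucket=ARCHIVE_BUCKET,
                Key=key,
                CopySource={"Bucket": SOURCE_BUCKET, "Key": key},
            )
            s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)
            archived.append(key)
    return {"archived": archived}
```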

9. Create an S3 bucket to archive the data:

  • dataArchive

10. Final fact_transactions table in Redshift:

  • The data ingestion process follows the UPSERT method: new rows are loaded into a staging table and merged into fact_transactions (see the sketch below).
    • fact_ouput
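
A sketch of the staging-table merge behind the UPSERT, run here through the Redshift Data API with the cluster, database, user, and staging-table names assumed earlier: matching rows are deleted from fact_transactions, the staging rows are inserted, and the staging table is truncated.

```python
import boto3

client = boto3.client("redshift-data")

# Merge the freshly loaded staging rows into fact_transactions:
# delete any existing rows with the same transaction_id, then insert the new ones.
UPSERT_SQL = [
    """
    DELETE FROM fact_transactions
    USING stage_transactions
    WHERE fact_transactions.transaction_id = stage_transactions.transaction_id
    """,
    "INSERT INTO fact_transactions SELECT * FROM stage_transactions",
    "TRUNCATE TABLE stage_transactions",
]

# batch_execute_statement runs the statements in order as a single transaction.
client.batch_execute_statement(
    ClusterIdentifier="ecommerce-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sqls=UPSERT_SQL,
)
```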
