This project is an overview of building an event-driven data ingestion and transformation pipeline for e-commerce transactional data. We will design a system using AWS services such as S3, Lambda, Glue, Redshift, and SNS to ingest, transform, validate, and upsert data into Amazon Redshift for analytical purposes.
- Pre-load these dimension tables into Redshift as part of the setup process (a DDL sketch follows the table list below):
- Products Dimension Table (dim_products):
- Customers Dimension Table (dim_customers):
- Transactions Fact Table (fact_transactions):
  - Columns: transaction_id, customer_id, customer_email, product_id, product_name, total_price, transaction_date, payment_type, status
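A minimal sketch, in Python with boto3, of how the fact table could be created through the Redshift Data API during setup; the cluster identifier, database, user, and all column types are assumptions, since only the column names are given above. dim_products and dim_customers would be pre-loaded the same way.

```python
# Sketch only: cluster/database/user names and column types are assumptions.
import boto3

redshift_data = boto3.client("redshift-data")

CREATE_FACT_TRANSACTIONS = """
CREATE TABLE IF NOT EXISTS fact_transactions (
    transaction_id   VARCHAR(64),
    customer_id      VARCHAR(64),
    customer_email   VARCHAR(256),
    product_id       VARCHAR(64),
    product_name     VARCHAR(256),
    total_price      DECIMAL(10, 2),
    transaction_date DATE,
    payment_type     VARCHAR(32),
    status           VARCHAR(32)
);
"""

def preload_tables():
    # Submit the DDL asynchronously; dim_products and dim_customers would be
    # created and seeded the same way as part of the setup process.
    response = redshift_data.execute_statement(
        ClusterIdentifier="your-redshift-cluster",  # assumption
        Database="dev",                             # assumption
        DbUser="awsuser",                           # assumption
        Sql=CREATE_FACT_TRANSACTIONS,
    )
    return response["Id"]
```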
- Generate mock data for the customers based on the products. Use AWS CodeBuild to deploy the Lambda. This Lambda function will generate the data and upload CSV files to S3 using Hive-style partitioning (see the sketch after the next item).
- The mock CSV files generated by the Lambda (whose code is updated from a GitHub repo by the CI/CD setup with AWS CodeBuild) are stored using the following Hive-style partitioning in S3: s3://your-bucket/transactions/year=2023/month=03/day=15/transactions_2023-03-15.csv
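A minimal sketch of the mock-data Lambda, assuming a BUCKET environment variable and illustrative customer/product values; the column set follows fact_transactions above and the output key follows the Hive-style layout shown in the previous item.

```python
# Sketch only: bucket name, row count, and mock values are assumptions.
import csv
import io
import os
import random
import uuid
from datetime import date

import boto3

s3 = boto3.client("s3")
BUCKET = os.environ.get("BUCKET", "your-bucket")  # assumption

COLUMNS = ["transaction_id", "customer_id", "customer_email", "product_id",
           "product_name", "total_price", "transaction_date", "payment_type", "status"]

def lambda_handler(event, context):
    today = date.today()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    for _ in range(100):
        writer.writerow([
            str(uuid.uuid4()),                           # transaction_id
            f"CUST-{random.randint(1, 50):04d}",         # customer_id
            f"user{random.randint(1, 50)}@example.com",  # customer_email
            f"PROD-{random.randint(1, 20):04d}",         # product_id
            f"Product {random.randint(1, 20)}",          # product_name
            round(random.uniform(5.0, 500.0), 2),        # total_price
            today.isoformat(),                           # transaction_date
            random.choice(["card", "paypal", "wallet"]),
            random.choice(["completed", "pending", "refunded"]),
        ])
    # Hive-style partitioned key: transactions/year=YYYY/month=MM/day=DD/...
    key = (f"transactions/year={today.year}/month={today.month:02d}/"
           f"day={today.day:02d}/transactions_{today.isoformat()}.csv")
    s3.put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    return {"uploaded": f"s3://{BUCKET}/{key}"}
```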
- Create a Glue crawler for the S3 bucket input file directory (data stored in Hive style with multiple partitions).
- Create a Glue crawler for the fact_transactions Redshift table (a crawler sketch follows the note below).
NOTE: we have to create a Glue connection to Redshift, and there are two important points to remember:
- The security group associated with Redshift should allow inbound traffic on the Redshift port 5439.
- The VPC associated with Redshift should have an S3 endpoint defined.
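A minimal sketch of creating the two crawlers with boto3; the IAM role, Glue catalog database, and the Glue connection name ("redshift-connection") are placeholders from my own assumptions, and the Glue connection to Redshift must already exist and satisfy the security-group and S3-endpoint requirements above.

```python
# Sketch only: role ARN, catalog database, and connection name are assumptions.
import boto3

glue = boto3.client("glue")

# Crawler 1: S3 input directory with Hive-style partitions (year=/month=/day=).
glue.create_crawler(
    Name="transactions-s3-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumption
    DatabaseName="ecommerce_catalog",                        # assumption
    Targets={"S3Targets": [{"Path": "s3://your-bucket/transactions/"}]},
)

# Crawler 2: the fact_transactions table in Redshift, via the Glue JDBC connection.
glue.create_crawler(
    Name="fact-transactions-redshift-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumption
    DatabaseName="ecommerce_catalog",
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-connection",             # assumption
        "Path": "dev/public/fact_transactions",              # database/schema/table
    }]},
)
```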
- Create an SNS topic to send emails when the Glue ETL job is done; the same notification can also be used to trigger the final Lambda that archives the data.
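A minimal sketch of the SNS setup, assuming placeholder e-mail and Lambda ARN values; the same topic carries the job-completion e-mail and can invoke the final archiving Lambda (which additionally needs a resource policy allowing SNS to invoke it).

```python
# Sketch only: e-mail address and Lambda ARN are assumptions.
import boto3

sns = boto3.client("sns")

topic_arn = sns.create_topic(Name="glue-etl-job-complete")["TopicArn"]

# E-mail notification when the Glue ETL job finishes.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="email",
    Endpoint="data-team@example.com",  # assumption
)

# The final Lambda that archives the processed data can listen on the same topic.
sns.subscribe(
    TopicArn=topic_arn,
    Protocol="lambda",
    Endpoint="arn:aws:lambda:us-east-1:123456789012:function:archive-data",  # assumption
)
```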