This project extracts, transforms, and loads (ETL) data into an AWS-based data lake. It includes linting, tests, containerization, and infrastructure-as-code with Terraform.

## Features
- Extract data from multiple sources (APIs, databases, CSV files)
- Transform data (cleaning, normalization, type conversion)
- Load data as Parquet files into Amazon S3 (see the sketch after this list)
- AWS Glue integration for ETL orchestration
- Terraform for AWS infrastructure provisioning
- CI/CD pipeline with GitHub Actions
- Dockerized environment for consistency
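
The extract-transform-load flow is small enough to sketch end to end. The snippet below is a minimal illustration, not the project's actual code: it assumes pandas, pyarrow, and boto3 are installed, and the file path, bucket name, and column names are all hypothetical.

```python
import io

import boto3
import pandas as pd


def extract(csv_path: str) -> pd.DataFrame:
    """Extract: read from one of the supported sources (a CSV file here)."""
    return pd.read_csv(csv_path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: cleaning, normalization, and type conversion."""
    df = df.dropna(subset=["id"])                         # cleaning
    df["name"] = df["name"].str.strip().str.lower()       # normalization
    df["created_at"] = pd.to_datetime(df["created_at"])   # type conversion
    return df


def load(df: pd.DataFrame, bucket: str, key: str) -> None:
    """Load: write the frame as Parquet to S3 (Parquet support via pyarrow)."""
    buffer = io.BytesIO()
    df.to_parquet(buffer, engine="pyarrow", index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())


if __name__ == "__main__":
    load(transform(extract("data/input.csv")), "my-datalake-bucket", "raw/input.parquet")
```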
## Prerequisites

- Python 3.9+
- AWS CLI configured with necessary permissions
- Terraform installed for infrastructure deployment
- Docker (optional, for containerized execution)
## Installation

Install the Python dependencies:

```bash
pip install -r requirements.txt
```
## Infrastructure

Provision the AWS resources with Terraform:

```bash
cd infrastructure
terraform init
terraform apply
```
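
Once `terraform apply` finishes, a quick smoke check can confirm the S3 bucket is reachable. This is a sketch, not part of the repo: the bucket name below is hypothetical, and in practice you would read it from a Terraform output.

```python
import boto3
from botocore.exceptions import ClientError


def bucket_exists(bucket: str) -> bool:
    """Return True if the bucket exists and the caller's credentials can reach it."""
    try:
        boto3.client("s3").head_bucket(Bucket=bucket)
        return True
    except ClientError:
        return False


# Hypothetical name; fetch the real one with `terraform output`.
print(bucket_exists("my-datalake-bucket"))
```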
## Usage

```bash
make run           # Run the ETL pipeline
make test          # Run the test suite
make lint          # Run the linters
make docker-build  # Build the Docker image
make docker-run    # Run the pipeline inside the container
```
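
As an example of what `make test` exercises, here is a minimal pytest-style test for the transform step. It is a sketch under the same assumptions as the pipeline snippet above: the `etl.pipeline` module path, the `transform` function, and the column names are all hypothetical.

```python
import pandas as pd

from etl.pipeline import transform  # hypothetical module path


def test_transform_cleans_normalizes_and_converts_types():
    raw = pd.DataFrame({
        "id": [1, None],
        "name": ["  Alice ", "Bob"],
        "created_at": ["2024-01-01", "2024-01-02"],
    })
    out = transform(raw)
    assert len(out) == 1                                            # row missing an id is dropped
    assert out.loc[0, "name"] == "alice"                            # trimmed and lower-cased
    assert pd.api.types.is_datetime64_any_dtype(out["created_at"])  # converted to datetime
```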