This Apache Airflow-based ETL (Extract, Transform, Load) pipeline scrapes Flipkart product data every hour, cleans it, and loads the result into PostgreSQL and MinIO. The whole stack runs under Docker Compose.
- Hourly Data Extraction: Automated scraping of Flipkart product data
- Docker Compose Support: Easy setup and deployment
- Robust ETL Process: Comprehensive data extraction, transformation, and loading
- Scalable Architecture: Flexible pipeline design for seamless data management
- Docker Desktop
- Docker CLI
- Python 3.8+
- Apache Airflow 2.x
- Clone the repository:
git clone https://github.com/SachinPrasanth777/Flipkart-ETL-Pipeline
- Build and start the services:
docker-compose up --build
- Access the services:
  - Airflow Web UI
    - URL: http://localhost:8080
    - Username: airflow
    - Password: airflow
  - PgAdmin
    - URL: http://localhost:5050
    - Username: [email protected]
    - Password: root
  - MinIO
    - URL: http://localhost:9090
    - Username: airflow1234
    - Password: airflow1234
ETL-Pipeline/
│
├── dags/
│ ├── functions/
│ │ ├── constants.py
│ │ └── functions.py
│ └── task.py
│
├── docker-compose.yml
├── Dockerfile
└── requirements.txt
dags/functions/constants.py contains the constants used across the ETL pipeline, such as:
- Scraping configurations
- Database connection parameters
- Predefined paths and URLs
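
For illustration only, a constants module of this kind might look like the sketch below. Every name and default value here is a hypothetical placeholder, not the repository's actual configuration. Reading connection parameters from environment variables is an assumption that keeps credentials out of the source.

```python
import os

# Scraping configuration (hypothetical values).
BASE_URL = "https://www.flipkart.com"
REQUEST_TIMEOUT_SECONDS = 10
MAX_RETRIES = 3

# Database connection parameters, read from the environment with
# placeholder defaults (the variable names are assumptions).
POSTGRES_CONFIG = {
    "host": os.environ.get("POSTGRES_HOST", "localhost"),
    "port": int(os.environ.get("POSTGRES_PORT", "5432")),
    "dbname": os.environ.get("POSTGRES_DB", "products"),
}

# Predefined path for intermediate output (hypothetical).
RAW_DATA_PATH = "/tmp/flipkart_raw.csv"
```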
dags/functions/functions.py includes utility functions that support the ETL process:
- Data preprocessing methods
- Scraping helpers
- Data validation functions
- Logging and error handling utilities
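
As one possible shape for a scraping helper with logging and error handling, the sketch below retries a failing call with exponential backoff. The function name `retry_with_backoff` and its parameters are hypothetical and are not taken from the repository.

```python
import logging
import time

logger = logging.getLogger("etl")


def retry_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The last exception is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
```

The `sleep` argument is injectable so the helper can be exercised in tests without actually waiting.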
Modify dags/task.py and the supporting files in dags/functions/ to:
- Adjust scraping parameters
- Configure target product categories
- Set up data storage locations
- Extract
  - Scrape product data from Flipkart
  - Handle rate limiting and anti-scraping measures
  - Capture product details, prices, and ratings
- Transform
  - Clean and normalize scraped data
  - Remove duplicates
  - Convert data types
- Load
  - Store processed data in PostgreSQL
  - Store and update the same data in MinIO
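
The transform step described above can be sketched in plain Python as follows. This is a minimal illustration, assuming each scraped record is a dict with string fields `name`, `price`, and `rating`; those field names and the helper functions are hypothetical, not the repository's actual schema.

```python
import re


def parse_price(text):
    """Convert a price string such as '₹1,299' to a float, or None."""
    cleaned = re.sub(r"[^\d.]", "", str(text or ""))
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_rating(text):
    """Convert a rating string such as '4.3' to a float, or None."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def transform(records):
    """Clean, type-convert, and de-duplicate raw product records."""
    seen = set()
    cleaned = []
    for record in records:
        # Normalize whitespace in the product name.
        name = " ".join((record.get("name") or "").split())
        price = parse_price(record.get("price"))
        if not name or price is None:
            continue  # drop records missing required fields
        key = (name.lower(), price)
        if key in seen:
            continue  # skip duplicates (same name and price)
        seen.add(key)
        cleaned.append(
            {"name": name, "price": price, "rating": parse_rating(record.get("rating"))}
        )
    return cleaned
```

Treating records with the same case-insensitive name and price as duplicates is a design choice made for this sketch; a real pipeline might key on a product ID instead.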
- Frequency: Hourly data updates
- Configurable Intervals: Change the schedule in the DAG definition
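
A hedged sketch of how an hourly schedule is typically declared in an Airflow 2.x DAG is shown below. The DAG ID, task names, and placeholder callables are hypothetical and do not reflect the contents of dags/task.py. The `schedule` argument assumes Airflow 2.4 or later; earlier 2.x releases use `schedule_interval` instead.

```python
from datetime import datetime, timedelta

SCHEDULE = "@hourly"  # Airflow preset, equivalent to the cron string "0 * * * *"
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}


def extract():
    """Placeholder for the scraping step."""


def transform():
    """Placeholder for the cleaning step."""


def load():
    """Placeholder for the PostgreSQL / MinIO load step."""


def build_dag():
    # Imported here so this sketch can be read and tested without Airflow installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="flipkart_etl_example",  # hypothetical ID
        schedule=SCHEDULE,  # use schedule_interval on Airflow < 2.4
        start_date=datetime(2024, 1, 1),
        catchup=False,  # do not backfill missed hourly runs
        default_args=DEFAULT_ARGS,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        load_task = PythonOperator(task_id="load", python_callable=load)
        # Run the three steps in sequence.
        extract_task >> transform_task >> load_task
    return dag
```

In an actual DAG file the result would be assigned at module level (`dag = build_dag()`) so the scheduler can discover it. Changing `SCHEDULE` to another preset or cron string changes the run interval.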
- Airflow Web UI for task tracking
- Detailed logging
- Task success/failure notifications
- PgAdmin UI for database monitoring
- MinIO UI for viewing buckets
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Distributed under the MIT License.
Sachin Prasanth
- GitHub: @SachinPrasanth777
Disclaimer: Ensure compliance with Flipkart's terms of service and robots.txt when scraping data.