To run the pipelines, a few setup steps need to be done:
- Set up a virtual environment (Python >= 3.7)
cd <path to the project>/stuart_casestudy
pyenv virtualenv <python-version> <virtualenv-name>
pyenv activate <virtualenv-name>
- Install the Python dependencies
cd <path to the project>/stuart_casestudy
pip install -r requirements.txt
- Set the environment variable $PYTHONPATH
export PYTHONPATH=<path to the project>/stuart_casestudy/
- Run docker-compose to get a pseudo data warehouse in the backend
cd <path to the project>/stuart_casestudy
docker-compose up
- Create a schema in the data warehouse (a Python alternative is sketched after these steps)
CREATE SCHEMA IF NOT EXISTS kaggle;
- Run the pipeline script dag/run.py
cd <path to the project>/stuart_casestudy/dag
python run.py
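As a side note, the schema-creation step can also be done from Python instead of a SQL client. The sketch below is a minimal example, assuming the docker-compose service exposes a local PostgreSQL instance; the user, password, port and database in the URL are placeholders, so adjust them to match docker-compose.yml. It needs sqlalchemy plus a Postgres driver such as psycopg2.

```python
from sqlalchemy import create_engine, text

# Placeholder connection URL for the local docker-compose warehouse;
# replace user/password/port/database with the values from docker-compose.yml.
engine = create_engine("postgresql+psycopg2://postgres:postgres@localhost:5432/postgres")

# engine.begin() opens a transaction and commits it on success.
with engine.begin() as conn:
    conn.execute(text("CREATE SCHEMA IF NOT EXISTS kaggle"))
```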
The idea is to tackle the challenge of modularising the pipeline so that it can be reused and reproduced, and to provide a visualization. There are folders and files in the root directory; I am going to introduce them one by one.
- etl/
  - Contains all the pipeline actions (Extraction, Transformation, Load).
  - To achieve modularity, $PYTHONPATH needs to be set so that it includes the location of this etl package.
- test/
  - Adds test cases for the methods of the ETL actions (see the sketch after this list).
- dag/
  - Contains all the pipeline scripts.
- notebook/
  - Provides simple data insights with visualization in a Python notebook.
- docker-compose.yml
  - Since the raw data is saved to the file system, the database is only ephemeral, so I decided to use a docker container to give us more flexibility in the future.
- requirements.txt
  - Python libraries that need to be installed before running the pipeline.
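To illustrate the test folder and the modular imports that $PYTHONPATH enables, here is a minimal pytest sketch. The module path etl.Transformation.transform and the helper filter_invalid_driver_age are hypothetical names used only for illustration; swap in the actual ETL methods.

```python
# test/test_transformation.py -- a sketch, not the project's actual test suite.
import pandas as pd

# Hypothetical import: resolvable because $PYTHONPATH points at the project root.
from etl.Transformation.transform import filter_invalid_driver_age


def test_filters_out_drivers_below_15():
    raw = pd.DataFrame({"driver_age": [10, 25, 40]})
    cleaned = filter_invalid_driver_age(raw, min_age=15)
    # Only sensible ages should survive the transformation.
    assert (cleaned["driver_age"] >= 15).all()
```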
Please note that the storage solution here still uses local files for this presentation, so the files from Kaggle end up in ./etl/Extraction/data/<source_name>/.
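On the Extraction side, the Kaggle Python client can download a dataset straight into that directory. The snippet below is only a sketch: the dataset slug and the target folder are placeholders, and it assumes a valid API token in ~/.kaggle/kaggle.json.

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the token from ~/.kaggle/kaggle.json

# Placeholder dataset slug and target folder; the real extraction step decides <source_name>.
api.dataset_download_files(
    "some-owner/some-traffic-accident-dataset",
    path="./etl/Extraction/data/some_source",
    unzip=True,
)
```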
- Use Python because there are a lot of packages related to data engineering.
- Use Pandas because it is faster and easier to develop with.
- Use Pandera because it integrates with Pandas seamlessly (see the sketch below).
- Data consistency
- Schema consistency
- Able to rewind to a previous version of the source dataset
- Unit tests on some of the actions in the ETL
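As a concrete example of the schema- and data-consistency checks, here is a minimal Pandera sketch; the column names, types and allowed values are illustrative, not the real dataset schema.

```python
import pandera as pa

# Illustrative schema; replace the columns and checks with the real source columns.
accident_schema = pa.DataFrameSchema(
    {
        "driver_age": pa.Column(int, checks=pa.Check.ge(15), nullable=False),
        "animal": pa.Column(str, checks=pa.Check.isin(["cat", "dog", "deer"]), nullable=True),
    }
)


def validate_accidents(df):
    # Raises pandera.errors.SchemaError when the DataFrame drifts from the expected schema.
    return accident_schema.validate(df)
```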
Here I would have more discussion with someone who has better knowledge of traffic accidents to determine correctness. That said, for now I would benchmark the dataset by profiling the source data (I will go with Great Expectations), including the checks below; a sketch follows the list:
- Data row count (raise an error if it deviates by more than 20%)
- Null data (raise an error if more than 10% of the values are null)
- Categorical data definition (e.g. Animal should include cat, dog, ...; if an iphone shows up in the column, this should raise an error)
- Common sense: for example, I noticed there are drivers with an age below 15, which doesn't make sense, so I filter them out.
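A rough sketch of these checks with Great Expectations, using the classic from_pandas interface; the file path, column names and the baseline row count are placeholders.

```python
import great_expectations as ge
import pandas as pd

df = pd.read_csv("./etl/Extraction/data/some_source/accidents.csv")  # placeholder path
gdf = ge.from_pandas(df)

# Row count within +/-20% of a previously profiled baseline (baseline value is illustrative).
baseline_rows = 100_000
gdf.expect_table_row_count_to_be_between(int(baseline_rows * 0.8), int(baseline_rows * 1.2))

# No more than 10% nulls in a column.
gdf.expect_column_values_to_not_be_null("driver_age", mostly=0.9)

# Categorical values stay within the known set.
gdf.expect_column_values_to_be_in_set("animal", ["cat", "dog", "deer"])

# Common-sense bound on driver age.
gdf.expect_column_values_to_be_between("driver_age", min_value=15, max_value=120)

# Collect the overall result (success flag plus per-expectation details).
results = gdf.validate()
```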
The main idea of this project is to present the concept of how I would handle these tasks; it does not mean I would do it this way in production. It is just a simple project for demonstration purposes.
- For DAGs I would probably use Airflow because it is Python-based and able to scale (K8s or CeleryExecutor).
- For Transformation I would use Snowflake/Redshift + dbt, or PySpark (dbt to me is a game changer that I would definitely like to get more hands-on experience with in the future).
- For Load & Extraction I would in general use Fivetran or Stitch; however, in this case, with the Kaggle API, I would still use my own scripts.
- For the storage solution I would definitely go with S3; furthermore, instead of CSV files, I would convert them to Parquet or split them into smaller files first before storing them in S3 (see the sketch after this list).
- For Visualization I would use Tableau or Power BI; Tableau can easily connect to a variety of sources to fetch data from the data mart and build dashboards without the pain.
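To illustrate the storage point above, here is a minimal sketch of converting a CSV extract to Parquet and uploading it to S3; the bucket name, key and file paths are placeholders, and it assumes pandas with pyarrow plus boto3 with AWS credentials configured.

```python
import boto3
import pandas as pd

# Placeholder local path; in practice this would be the Kaggle extract.
df = pd.read_csv("./etl/Extraction/data/some_source/accidents.csv")

# Parquet is columnar and compressed, so it is cheaper to store and faster to query than CSV.
df.to_parquet("accidents.parquet", index=False)

# Placeholder bucket and key.
s3 = boto3.client("s3")
s3.upload_file("accidents.parquet", "my-raw-data-bucket", "kaggle/accidents/accidents.parquet")
```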