- Project Charter
- Directory Structure
- Clone the Repository
- Random Forest Model Pipeline
- Customer Churn Predictor App
- Backlog
Vision:
Customer attrition refers to the loss of customers by a business. Whether a customer is a one-time purchaser or a loyalty program member, customers eventually churn rather than remain active indefinitely. Losing customers is undesirable because, in most cases, retaining an existing customer costs less than acquiring a new one. Companies across industries such as telecom, insurance, and restaurants analyze customer attrition to gain deeper insight into churn. This project specifically aims to help a telecom company make reliable predictions of customer churn so that the company can implement remedial actions for customer retention.
Mission:
This project uses the Telco customer churn data compiled by BlastChar on Kaggle.com (https://www.kaggle.com/blastchar/telco-customer-churn). The dataset describes a telecom company that provides services to 7,043 customers in California. For each customer, it records a binary indicator of whether the customer has churned, along with demographic predictors such as gender and service predictors such as the monthly charge. The dataset has 7,043 rows and 21 columns.
This project enables business stakeholders at the telecom company to predict whether a given customer will churn and get the associated predicted probability of churn by entering parameters such as the customer’s gender, contract term, and monthly charge. Based on the predicted results, business stakeholders can then implement plans to retain customers who are likely to churn. The prediction is based on a supervised machine learning model trained and validated on historical customer attrition data.
Success criteria:
- Model performance metric: 80% cross-validated correct classification accuracy on the training data
- Business outcome metrics: 10% decrease in customer attrition rate in the month following the deployment of the project
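For concreteness, the 80% cross-validated accuracy criterion above can be checked with a scikit-learn snippet along the lines below. This is only an illustrative sketch: the feature matrix `X`, labels `y`, and hyperparameters are placeholders rather than the project's actual configuration.

```python
# Illustrative sketch: check the 80% cross-validated accuracy target.
# X (features) and y (churn labels) are placeholders for the featurized training data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def meets_accuracy_target(X, y, target=0.80, folds=5):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return scores.mean() >= target
```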
├── README.md <- You are here
├── app
│ ├── templates/ <- HTML files that are templated and change based on a set of inputs
│ ├── Dockerfile_App <- Dockerfile for building image to run app
│ ├── Dockerfile_Pipeline <- Dockerfile for building image to run the random forest model pipeline
│
├── config <- Directory for configuration files
│ ├── logging/ <- Configuration of python loggers
│ ├── .aws <- Configurations for AWS and RDS
│ ├── flaskconfig.py <- Configurations for Flask API
│ ├── config.yml <- Configurations for developing and evaluating the model
│ ├── reproducibility_test_config.yml <- Configurations for reproducibility tests
│
├── data <- Folder that contains data used or generated.
│
├── deliverables/ <- Presentation Slide
│
├── eda/ <- Folder that contains exploratory data analysis outputs
│
├── models/ <- Trained model objects, model predictions, feature importance and model evaluations
│
├── src/ <- Source code for the project
│
├── test/ <- Files necessary for running unit tests and reproducibility tests
│ ├── reproducibility_true/ <- Expected files for reproducibility tests
│ ├── unit_test_true <- Input and expected files for unit tests
│
├── app.py <- Flask wrapper for running the model
├── run.py <- Simplifies the execution of one or more of the src scripts
├── requirements.txt <- Python package dependencies
To run the app, you first need to clone the repo to your local machine by running the following bash commands.
# clone the repo
git clone [email protected]:LSQI15/Customer-Churn-Predictor-App.git
# update working directory
cd Customer-Churn-Predictor-App
By default:
- all data-related files (raw data, preprocessed data, and featurized data) will be saved in the `data` folder
- all EDA-related files (histograms, correlation heatmap, and summary statistics table) will be saved in the `eda` folder
- all model-related files (train/test data, random forest model, predictions, feature importance, and model evaluations) will be saved in the `models` folder
If you want to change the file paths (e.g., store data-related files in `data2`), edit the `DATA_PATH`, `EDA_PATH`, and `MODEL_PATH` variables at the top of the `Makefile` by running the following bash command:
vi Makefile
Note: type `i` to enter insert mode and make changes to the file. After editing, press `ESC` to exit insert mode and type `:wq` to save the changes.
If you change the default file path configurations, please also update all `test_dir` entries in `config/reproducibility_test_config.yml` accordingly so that the reproducibility tests run as expected.
vi config/reproducibility_test_config.yml
Note: as above, type `i` to enter insert mode, make your changes, press `ESC`, and type `:wq` to save and quit.
The Dockerfile for running the model pipeline is in the `app/` folder. To build the image, run:
docker build -f app/Dockerfile_Pipeline -t customer_churn .
To run the entire model pipeline (clean the working directory, download data, preprocess data, create features, conduct exploratory data analysis, train a random forest model, and evaluate model performance) using default configurations, run:
Note: you will need to set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` if you haven't done so.
export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY --mount type=bind,source="$(pwd)",target=/app/ customer_churn all_pipeline
As indicated in section 1, after running the above command, all data-related files (raw data, preprocessed data, and featurized data) will be stored in `data/` by default. All model-related files, such as EDA graphs, the training set, the test set, the random forest model, and model evaluations, can be found in `models/`.
The first step in the model pipeline is to clean the working directory. The following command will remove any existing files in the `data`, `eda`, and `models` folders so that you have a fresh working directory in which to execute the model pipeline.
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn clean
To download the raw data from the default S3 bucket, run the following bash command. Unless you specified another directory in the Makefile, the raw data will be downloaded to the `data` folder by default.
Note: you will need to set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` if you haven't done so.
export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY --mount type=bind,source="$(pwd)",target=/app/ customer_churn download
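Conceptually, the download step amounts to fetching the raw CSV from S3, roughly as in the sketch below. This is only an illustration: the bucket name, object key, and local path are placeholders, since the actual defaults live in the Makefile and configuration files.

```python
# Illustrative boto3 sketch of the S3 download step; bucket, key, and paths are placeholders.
import boto3

def download_raw_data(bucket="<your-s3-bucket>", key="<raw-data-key>.csv",
                      dest="data/data.csv"):
    # Credentials are read from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY in the environment
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest)
```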
The following command will preprocess the raw data (an illustrative pandas sketch of these steps follows the command below). Specifically, it will:
- process the TotalCharges column: convert spaces to NaN and drop the 11 rows with missing values
- process the SeniorCitizen column: convert the binary 1/0 values to Yes/No
- replace 'No internet service' with 'No' in the following 6 columns: 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'
- drop the customerID column

By default, the preprocessed file will be saved in the `data` folder.
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn preprocess
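The transformations listed above correspond roughly to the pandas sketch below; the column names come from the Telco dataset, but the function name and file paths are illustrative rather than the project's actual code.

```python
# Illustrative pandas sketch of the preprocessing step; paths and function name are placeholders.
import numpy as np
import pandas as pd

def preprocess(raw_path="data/data.csv", out_path="data/preprocessed.csv"):
    df = pd.read_csv(raw_path)
    # TotalCharges: blank strings -> NaN, then drop the 11 rows with missing values
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"].replace(" ", np.nan))
    df = df.dropna(subset=["TotalCharges"])
    # SeniorCitizen: 1/0 -> Yes/No
    df["SeniorCitizen"] = df["SeniorCitizen"].map({1: "Yes", 0: "No"})
    # Collapse 'No internet service' to 'No' for the six service columns
    service_cols = ["OnlineSecurity", "OnlineBackup", "DeviceProtection",
                    "TechSupport", "StreamingTV", "StreamingMovies"]
    df[service_cols] = df[service_cols].replace("No internet service", "No")
    # Drop the identifier column
    df = df.drop(columns=["customerID"])
    df.to_csv(out_path, index=False)
```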
The following bash command will featurize the preprocessed data and save the featurized data in the `data` folder by default (an illustrative pandas sketch of these encodings follows the command below). Specifically, it will:
- encode the binary response variable and binary features
- one-hot encode multi-category features
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn feature
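In pandas terms, the featurization amounts to something like the sketch below; the exact column lists and output path are assumptions, not the project's configuration.

```python
# Illustrative sketch of the featurization step; column lists and paths are assumptions.
import pandas as pd

def featurize(in_path="data/preprocessed.csv", out_path="data/features.csv"):
    df = pd.read_csv(in_path)
    # Binary response and binary features -> 1/0
    binary_cols = ["Churn", "gender", "SeniorCitizen", "Partner", "Dependents",
                   "PhoneService", "PaperlessBilling"]
    for col in binary_cols:
        df[col] = df[col].map({"Yes": 1, "No": 0, "Male": 1, "Female": 0})
    # Multi-category features -> one-hot indicator columns
    df = pd.get_dummies(df)
    df.to_csv(out_path, index=False)
```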
The following command will conduct exploratory data analysis, producing:
- a correlation heatmap of the variables
- summary statistics for each variable
- histograms of predictors colored by whether or not the customer churned

By default, EDA results will be saved in the `eda` folder (an illustrative sketch of these outputs follows the command below).
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn eda
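The EDA outputs described above could be produced with a sketch like this one using pandas, seaborn, and matplotlib; the output directory, file names, and the particular predictor plotted are illustrative.

```python
# Illustrative sketch of the EDA step; output paths and the chosen predictor are placeholders.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def run_eda(in_path="data/preprocessed.csv", out_dir="eda"):
    df = pd.read_csv(in_path)
    # Summary statistics for each variable
    df.describe(include="all").to_csv(f"{out_dir}/summary_statistics.csv")
    # Correlation heatmap among numeric variables
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
    plt.savefig(f"{out_dir}/correlation_heatmap.png")
    plt.close()
    # Histogram of one predictor, colored by churn status
    sns.histplot(data=df, x="MonthlyCharges", hue="Churn")
    plt.savefig(f"{out_dir}/MonthlyCharges_histogram.png")
    plt.close()
```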
The next step in the model pipeline is to train a random forest model on the featurized data. The following command will:
- perform a train-test split
- build a random forest model on the training set
- make predictions on the test set
- calculate and visualize feature importance

By default, all of these output files will be saved in the `models` folder (an illustrative scikit-learn sketch follows the command below).
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn random_forest
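An illustrative scikit-learn version of this step is sketched below; the split ratio, hyperparameters, and paths are placeholders, since the real values come from config/config.yml.

```python
# Illustrative sketch of the training step; hyperparameters and paths are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_random_forest(features_path="data/features.csv"):
    df = pd.read_csv(features_path)
    X, y = df.drop(columns=["Churn"]), df["Churn"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    predictions = pd.DataFrame({
        "predicted_prob": model.predict_proba(X_test)[:, 1],
        "predicted_class": model.predict(X_test),
    })
    # Feature importance, one value per column in X
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return model, predictions, y_test, importance
```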
The final step in the model pipeline is to evaluate model performance. The following bash command will:
- calculate the AUC and accuracy
- create a confusion matrix
- generate a classification report

By default, all of these model evaluation metrics will be saved in the `models` folder (an illustrative scikit-learn sketch follows the command below).
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn evaluate
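The metrics listed above map onto scikit-learn roughly as follows; this is an illustrative sketch, not the project's evaluation script.

```python
# Illustrative sketch of the evaluation step using scikit-learn metrics.
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

def evaluate(y_test, predicted_class, predicted_prob):
    auc = roc_auc_score(y_test, predicted_prob)         # AUC from predicted probabilities
    accuracy = accuracy_score(y_test, predicted_class)  # correct classification rate
    cm = confusion_matrix(y_test, predicted_class)
    report = classification_report(y_test, predicted_class)
    return auc, accuracy, cm, report
```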
To conduct the reproducibility tests, run the following bash command. The reproducibility tests compare the output files generated during each stage of model development against the expected outputs saved in `test/reproducibility_true/`.
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn reproducibility_test
Note: if you changed the default file paths `DATA_PATH`, `EDA_PATH`, and `MODEL_PATH` in the Makefile, please make sure you also updated all `test_dir` entries in `config/reproducibility_test_config.yml` accordingly. If not, please edit them and then rebuild the Docker image before running the above command.
vi config/reproducibility_test_config.yml
docker build -f app/Dockerfile_Pipeline -t customer_churn .
The unit testing module tests 9 functions used during different stages of the model pipeline. Each function will be tested twice - once with valid input and once with invalid input. This leads to a total of 18 tests. To conduct unit tests, run the following bash command.
docker run --mount type=bind,source="$(pwd)",target=/app/ customer_churn unit_test
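For a sense of the valid/invalid pattern, each test pair looks roughly like the pytest sketch below; the module path `src.preprocess`, the function `preprocess_data`, and its expected behavior are assumptions used only for illustration, not the project's actual tests.

```python
# Illustrative pytest sketch of the valid/invalid test pattern; the imported module,
# function name, and expected behavior are assumptions, not the project's actual tests.
import pandas as pd
import pytest

from src.preprocess import preprocess_data  # hypothetical module and function


def test_preprocess_data_valid_input():
    df = pd.DataFrame({"TotalCharges": ["10.5", " "], "customerID": ["a", "b"]})
    result = preprocess_data(df)
    assert "customerID" not in result.columns


def test_preprocess_data_invalid_input():
    with pytest.raises(TypeError):
        preprocess_data("not a dataframe")
```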
Please edit the `config/flaskconfig.py` file if you want to change the SQLite database name, the host, the port number, etc. Otherwise, the Flask app will use the following default configurations:
PORT = 5000
HOST = "0.0.0.0"
LOCAL_DATABASE="customer.db"
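For reference, a minimal `config/flaskconfig.py` along these lines would produce the defaults above; only the three variables listed above come from the project, and the `SQLALCHEMY_DATABASE_URI` fallback logic is an assumption.

```python
# Illustrative sketch of config/flaskconfig.py; the three defaults above are the
# project's, while the SQLALCHEMY_DATABASE_URI fallback logic is an assumption.
import os

PORT = 5000
HOST = "0.0.0.0"
LOCAL_DATABASE = "customer.db"

# Use the RDS connection string if provided, otherwise fall back to a local SQLite file
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SQLALCHEMY_DATABASE_URI", f"sqlite:///{LOCAL_DATABASE}")
```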
The Dockerfile for running the app is in the `app/` folder. To build the image, run:
docker build -f app/Dockerfile_App -t predictor_app .
Note: If you want to use the default RDS instance to run the app, go directly to Step 4. Running the App. Otherwise, please follow the instructions below to initialize an RDS or SQLite database.
To store user inputs and their corresponding predictions, you can use either a local SQLite database or an AWS RDS database. By default, the following commands will create a table named `customer` in the database of your choice. You can also conduct an initial ingestion if you would like to do so.
Note: you will need to set the environment variable `SQLALCHEMY_DATABASE_URI` if you haven't done so.
export SQLALCHEMY_DATABASE_URI=<YOUR_SQLALCHEMY_DATABASE_URI>
docker run --mount type=bind,source="$(pwd)",target=/app/ -e SQLALCHEMY_DATABASE_URI predictor_app run.py create_db
docker run --mount type=bind,source="$(pwd)",target=/app/ -e SQLALCHEMY_DATABASE_URI predictor_app run.py initial_ingest
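Conceptually, the create_db step defines and creates the `customer` table, roughly as in the SQLAlchemy sketch below; the column names and types beyond those mentioned in this README are assumptions.

```python
# Illustrative SQLAlchemy sketch of the create_db step; columns beyond those
# mentioned in this README (gender, contract, monthly charge, prediction) are assumptions.
import os

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Customer(Base):
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)
    gender = Column(String(10))
    contract = Column(String(20))
    monthly_charges = Column(Float)
    predicted_churn = Column(Integer)
    predicted_probability = Column(Float)


def create_db():
    # SQLALCHEMY_DATABASE_URI points at either the RDS instance or a local SQLite file
    engine = create_engine(os.environ["SQLALCHEMY_DATABASE_URI"])
    Base.metadata.create_all(engine)
```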
After initializing the database, run the customer churn predictor app by entering the following commands.
Note: you will need to set the environment variable `SQLALCHEMY_DATABASE_URI` if you haven't done so.
export SQLALCHEMY_DATABASE_URI=<YOUR_SQLALCHEMY_DATABASE_URI>
docker run --mount type=bind,source="$(pwd)",target=/app/ -e SQLALCHEMY_DATABASE_URI -p 5000:5000 --name myapp predictor_app app.py
By default, the web app will be running on a local host at http://0.0.0.0:5000/ in your browser. You can press `CTRL+C` at any time to quit.
After using the app, please enter the following bash command to remove the Docker container.
docker rm myapp
Main Initiative:
Deploy a machine learning model to help business stakeholders identify customers who are likely to churn. By applying this model, business stakeholders can take remedial actions for customer retention in advance and ultimately decrease the customer attrition rate.
Initiative 1: Data Manipulation & Exploratory Data Analysis
- Data overview and descriptive summary statistics (1 point)
- Explore each individual variable (4 points)
- check the distribution and examine outliers, missing values, etc
- clean the data and generate new features if necessary
- Assess the relationship between each predictor and the response variable (1 point)
- visualize through a matrix of scatterplots
- Assess the interaction among variables (1 point)
- visualize the correlation matrix
- Based on EDA, derive useful insights into customer churn, if any (1 point)
Initiative 2: Model Development
- Model building
- Split data into the training set and the test set (0 points)
- Build a logistic regression model as the baseline model (1 point)
- Train random forest models (4 points)
- Export variable importance and derive useful insights if any (2 points)
- Model evaluation
- Evaluate the model performance based on metrics such as correct classification rate and F1 score (2 points)
- Pick the best model based on performance metrics (0 points)
Initiative 3: Product Development
- Product building
- Construct data pipeline (2 points)
- Use an S3 bucket to store the raw source data
- Web app (Flask) Development (8 points - needs to be broken down more when it comes to execution)
- Design and build user interface
- Achieve all desired functionalities
- Construct data pipeline (2 points)
- Product testing and refinement
- Conduct unit tests to evaluate each functionality and fix bugs (8 points - needs to be broken down more when it comes to execution)
- Enhance functionality and refine user interface if time allows (4 points)
- Final roll-out (2 points)