Customer_segmentation

(unsupervised machine learning modeling)

Scenario:

Our client is an insurance company that has provided health insurance to its customers and now wants to expand its line with vehicle insurance products. For this reason, existing customers were surveyed about their potential interest in buying a vehicle insurance policy from the same company. The customers' responses were collected and compiled into a single file. We were asked to segment the customers who showed an interest in buying the new product. The project reveals the similarities and dissimilarities that customers share within the clusters they are assigned to. Our client will thus be able to better distinguish between customer profiles and to build and conduct marketing campaigns more effectively.

Data

The data was taken from Kaggle: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction?select=sample_submission.csv

The data set consists of almost 400K unique observations and 12 columns: ['Id', 'Gender', 'Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Policy_Sales_Channel', 'Vintage', 'Response']
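As a rough illustration of the preparation step implied here, the sketch below keeps only interested customers and one-hot encodes the vehicle-age category. It is not the notebook code: the toy values are invented, the encoding `Response == 1` for "interested" is an assumption, and the category labels for `Vehicle_Age` are assumed as well.

```python
import pandas as pd

# Toy frame with a few of the columns listed above; all values are invented.
df = pd.DataFrame({
    "Age": [23, 47, 35, 52],
    "Vehicle_Age": ["< 1 Year", "1-2 Year", "> 2 Years", "1-2 Year"],  # assumed labels
    "Response": [1, 0, 1, 1],
})

# Keep only customers who expressed interest (assumed encoding: Response == 1).
interested = df[df["Response"] == 1].drop(columns="Response")

# One-hot encode the categorical vehicle-age column so it can feed PCA/K-means.
encoded = pd.get_dummies(interested, columns=["Vehicle_Age"])
print(encoded.columns.tolist())
```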

Principal Components Analysis(PCA)

PCA is a method for extracting the important variables from a large set of variables in a data set. It derives a low-dimensional set of features from a high-dimensional data set with the aim of capturing as much information as possible.
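The idea can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the analysis from the notebooks; the function name `pca` and the demo matrix are assumptions made for the example.

```python
import numpy as np

def pca(X):
    """Return (components, explained_variance_ratio, scores) for data matrix X."""
    # Standardize each column so features on large scales do not dominate.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # SVD of the standardized data: rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained_variance = S**2 / (len(Z) - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = Z @ Vt.T  # coordinates of each observation in component space
    return Vt, ratio, scores

# Synthetic stand-in data (NOT the insurance data): 200 rows, 4 features,
# where the second feature is strongly correlated with the first.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

components, ratio, scores = pca(X_demo)
print(ratio.round(3))  # share of variance explained by each component
```

Each row of `components` holds the feature weights (loadings) of one component, which is what a feature-effect plot visualizes.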

Visualisation.

1. Scree plot

The plot shows that PCA 1 explains 41.65% of the variance, more than any subsequent component. In other words, PCA 1 captures the largest single share of the information in the features. The rule of thumb is to capture 70-80% of the variance, so we pick the first 3 of the 11 principal components. This reduces the dimensionality of the data set almost fourfold.
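The selection rule can be written as a small helper. This is a sketch under stated assumptions: `n_components_for` is a hypothetical name, and in the example list only the first ratio (0.4165) comes from the text; the remaining values are made-up placeholders, not the project's results.

```python
from itertools import accumulate

def n_components_for(ratios, threshold=0.70):
    """Smallest number of leading components whose cumulative share >= threshold."""
    for n, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return n
    return len(ratios)  # threshold never reached: keep everything

# Only 0.4165 is taken from the text; the other values are invented placeholders.
illustrative_ratios = [0.4165, 0.20, 0.15, 0.10, 0.08, 0.0535]
print(n_components_for(illustrative_ratios))  # 3 for these placeholder values
```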

2. Effect of features on each component

We can see the influence of the features on each component. PCA 1 places the most weight on Age and approximately equal weight on Vehicle Age (1-2 years and more than 2 years), with much less weight on Vehicle Age of less than a year. Hence this component roughly corresponds to the overall customer decision. PCA 2 places the most weight on Vintage, and PCA 3 on the Previously Insured feature. Keep in mind, however, that PCA 1 carries the most variance.

PCA Biplot

A biplot contains a lot of useful information. Since our client is based in the USA, we expect they will want to know how the features are distributed across states. Overall, we see that Age, Vehicle Damage and Vehicle Age (1-2 years) are located close to each other, while Vintage is far from these three. This indicates that the first three variables are correlated with each other (older customers more often have damaged cars that are 1-2 years old), and that Vintage is less correlated with them. We can also examine differences between states. The states with the largest PCA 1 scores, such as Maryland, West Virginia and Alabama, have relatively more older customers with damaged cars that are 1-2 years old, while Wisconsin, Oregon and New York, with negative PCA 1 scores, have relatively more younger customers who own cars less than a year old.

Clustering

Clustering looks for homogeneous subgroups among the observations. K-means clustering is a simple and elegant approach for partitioning the data set into K distinct, non-overlapping clusters. To perform K-means clustering, we must first specify the desired number of clusters K. The silhouette score and the Yellowbrick scores both clearly reach their maximum at K = 3 clusters.
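A minimal sketch of choosing K by silhouette score is shown below. It assumes scikit-learn is available and runs on synthetic blobs rather than the insurance data; the helper name `best_k_by_silhouette` and the candidate range of K are illustrative choices, not taken from the notebooks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_values=range(2, 7), seed=0):
    """Fit K-means for each candidate K and return (best K, {K: silhouette})."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in data: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = best_k_by_silhouette(X_demo)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

On real data the maximum is usually less pronounced than on these blobs, so it is worth inspecting the full score dictionary and not only the winning K.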

Visualization

The scatter plot shows how the customers are clustered.

I assigned a cluster ID to each observation in the data set. Now we can examine the relationships among the most heavily weighted features identified by PCA.
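Attaching the labels and profiling each cluster might look like the following. This is a sketch with pandas on invented toy rows; the `labels` list stands in for the output of a fitted K-means model, and the column name `cluster` is an assumption.

```python
import pandas as pd

# Toy rows standing in for the surveyed customers; all values are invented.
df = pd.DataFrame({
    "Age": [23, 25, 47, 52, 35, 38],
    "Vintage": [40, 60, 200, 220, 120, 140],
})
labels = [0, 0, 1, 1, 2, 2]  # hypothetical K-means output, one label per row

# Attach the cluster ID to every observation.
df["cluster"] = labels

# Per-cluster averages of the features that PCA weighted most heavily.
profile = df.groupby("cluster")[["Age", "Vintage"]].mean()
print(profile)
```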

Silhouette plot

Based on the silhouette score, I decided to use K = 3 clusters.

Box plot

We also observe a positive correlation between Age and Vehicle Damage: as Age increases, the rate of damaged vehicles rises as well.

All customers who answered Yes are split into three subgroups by behavioral and demographic segmentation.

I have assigned names to the subgroups: 'Young achievers' is cluster #0, 'Financially mature' is cluster #1, and 'Ho Hum' is cluster #2.

Geographical segmentation of the groups across the country.

Young achievers	Ho Hum	Financially mature
Arkansas	Alabama	Arizona
California	Alaska	Colorado
Delaware	Maryland	Connecticut
Florida	Michigan	Georgia
Hawaii	Tennessee	Illinois
Idaho	Puerto Rico	Indiana
Kentucky	West Virginia	Iowa
Louisiana		Kansas
Maine		Massachusetts
Minnesota		Nebraska
Mississippi		New Jersey
Missouri		New York
Montana		North Dakota
Nevada		Oklahoma
New Hampshire		Oregon
New Mexico		Utah
North Carolina		Wisconsin
Ohio
Pennsylvania
South Carolina
Texas
Vermont
Virginia
Washington
Wyoming


evgenygrobov/Customer_clustering.
