Sparkify: Predict User Churn for a Music Streaming Service

An Apache Spark ML project that predicts user churn for a music streaming service by training a classifier on features extracted from user activity logs.

Overview

Sparkify is a digital music service. Many users stream their favorite songs on Sparkify every day, either on the free tier, which places advertisements between songs, or on the premium subscription, where they stream music freely but pay a flat monthly rate. Users can upgrade, downgrade, or cancel their service at any time.

Our job is to mine the customer data and build an appropriate model to predict customer churn, following these steps:

Clean data: fill the NaN values, correct the data types, and drop the outliers.
EDA: explore the data to examine the feature distributions and their correlation with the key label (churn).
Feature engineering: extract customer features and customer-behavior features; apply standard scaling (StandardScaler) to the numerical features.
Train and evaluate models: I train logistic regression, random forest, and gradient-boosted trees classifiers as baseline models, then tune the best of them. Note that the data is imbalanced because churned customers are a small minority, so the F1 score is used as the metric for model performance.
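To illustrate why F1 is a more informative metric than accuracy on imbalanced data, here is a minimal pure-Python sketch. The labels are hypothetical (2 churned users out of 10), and the function computes F1 for the churn class only; the project itself may report a differently averaged variant (for example, a weighted F1 across both classes).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (churn) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct churn predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = churned, 0 = stayed (2 churned users out of 10).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A model that never predicts churn.
never_churn = [0] * 10
# A model that finds one of the two churned users and raises one false alarm.
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

acc_never, f1_never = accuracy(y_true, never_churn), f1_score(y_true, never_churn)
acc_hit, f1_hit = accuracy(y_true, one_hit), f1_score(y_true, one_hit)
print(acc_never, f1_never)
print(acc_hit, f1_hit)
```

Both hypothetical models reach the same accuracy, but only F1 reveals that the first one never identifies a churned user.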

Installation

Apache Spark 2.x
Python 3.5+
PySpark ML
Jupyter
Pandas
Numpy
Matplotlib
Seaborn

Features used in the Models

Churn is defined by the Cancellation Confirmation events in medium_sparkify_event_data.json: a user who has such an event is treated as churned.

The following 11 features are used to build the models:
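As a rough illustration of this definition, the sketch below labels users from a list of event records in plain Python. The field names `userId` and `page` and the sample events are assumptions made for the example, not taken from the notebook; the actual project does this step in Spark.

```python
# Page value that marks churn, as named in the definition above.
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {user_id: 1 if the user has a churn event, else 0}."""
    labels = {}
    for event in events:
        user = event["userId"]  # assumed field name
        is_churn = int(event["page"] == CHURN_PAGE)  # assumed field name
        # Once a user is flagged as churned, keep the flag.
        labels[user] = max(labels.get(user, 0), is_churn)
    return labels


# Hypothetical event log for two users.
events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]

churn_labels = label_churn(events)
print(churn_labels)
```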

Average number of songs listened to per session
Total number of songs listened to per user
Number of Add Friend transactions
Number of Add Playlist transactions
Number of Thumbs Down transactions
Number of Thumbs Up transactions
Registration duration (days): time between the user's registration date and the date of their last event.
Total listening time
Gender
Account level
Downgraded Event
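To make a few of these features concrete, here is a small plain-Python sketch that derives the total songs, average songs per session, and registration duration for each user. The field names (`userId`, `sessionId`, `page`, `ts`, `registration`), the `NextSong` page value, and the millisecond timestamps are assumptions for illustration; the notebook's Spark implementation may differ, for instance in how sessions without songs are counted.

```python
from collections import defaultdict

MS_PER_DAY = 24 * 60 * 60 * 1000  # assumes timestamps are in milliseconds


def user_features(events):
    """Aggregate per-user features from a flat list of event records."""
    songs_per_session = defaultdict(lambda: defaultdict(int))
    last_ts = {}
    registration = {}
    for e in events:
        user = e["userId"]
        if e["page"] == "NextSong":  # assumed marker for a played song
            songs_per_session[user][e["sessionId"]] += 1
        last_ts[user] = max(last_ts.get(user, e["ts"]), e["ts"])
        registration[user] = e["registration"]

    features = {}
    for user in last_ts:
        sessions = songs_per_session[user]
        total = sum(sessions.values())
        features[user] = {
            "total_songs": total,
            # Averaged over sessions that contain at least one song.
            "avg_songs_per_session": total / len(sessions) if sessions else 0.0,
            # Days between registration and the user's last event.
            "register_days": (last_ts[user] - registration[user]) / MS_PER_DAY,
        }
    return features


# Hypothetical events: one user, two sessions, registered at ts = 0.
D = MS_PER_DAY
events = [
    {"userId": "u1", "sessionId": 1, "page": "NextSong", "ts": 1 * D, "registration": 0},
    {"userId": "u1", "sessionId": 1, "page": "NextSong", "ts": 1 * D + 60000, "registration": 0},
    {"userId": "u1", "sessionId": 1, "page": "Thumbs Up", "ts": 1 * D + 120000, "registration": 0},
    {"userId": "u1", "sessionId": 2, "page": "NextSong", "ts": 3 * D, "registration": 0},
]

feats = user_features(events)
print(feats["u1"])
```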

Results

The accuracy, F1 score, and training time of the three classification models (Logistic Regression, Random Forest Classifier, and Gradient Boosted Trees) are:

Model	Accuracy	F1-score	Training Time
Logistic Regression	0.744	0.672	13.9 s
Random Forest	0.796	0.758	12.76 s
Gradient Boosted Trees	0.827	0.799	31.33 s

Gradient Boosted Trees has the best F1 score, so I chose it for the next steps.

After further hyperparameter tuning of the GBTClassifier, we get the following metrics:

Model	Accuracy	F1-score
Gradient Boosted Trees	0.828	0.818

This is the final model, which will be used to predict user churn for the streaming service.

References

A summarised analysis is available on Medium. The dataset was provided by Udacity.
