churnPrediction.Rmd

---
output: html_document
---
## Churn Prediction

The churn rate, also known as the rate of attrition, is the percentage of subscribers to a service who discontinue their subscriptions to that service within a given time period. For a company to expand its clientele, its growth rate, as measured by the number of new customers, must exceed its churn rate.

Attribute Name | Description
------------- | ---------------------------------------------------
State | categorical, for the 50 states and the District of Columbia
VoiceMail Plan | dichotomous categorical, yes or no
Account length | integer-valued, how long account has been active
Number of voice mail messages | integer-valued
Area code | categorical
Total day minutes | continuous, minutes customer used service during the day
Phone number | essentially a surrogate for customer ID
Total day calls | integer-valued
International Plan | dichotomous categorical, yes or no
Total day charge | continuous, perhaps based on foregoing two variables
Total evening minutes | continuous, minutes customer used service during the evening.
Total night charge | continuous, perhaps based on foregoing two variables
Total evening calls | integer-valued
Total international minutes | continuous, minutes customer used service to make international calls.
Total evening charge | continuous, perhaps based on foregoing two variables
Total international calls | integer-valued
Total night minutes | continuous, minutes customer used service during the night
Total international charge | continuous, perhaps based on foregoing two variables
Total night calls | integer-valued
Number of calls to customer service | integer-valued
Churn | Label indicating if customer churned

```{r installAndLoadPackages, warning=FALSE}
#-------------------------Package Requirements--------------------------------------------
#required packages: ggplot2, randomForest, RWeka, dplyr
installedPackages = installed.packages()
installedPackages = installedPackages[,1]
requiredPackages = as.matrix(c('ggplot2','randomForest','RWeka','dplyr'))
installPackages<-function(package){
  searchResult<- grep(paste(package,"$",sep = ""),installedPackages)
  #print(length(searchResult))
  if(length(searchResult) == 0){
    print (paste(package,"not installed"))
    print("Downloading and Installing the package")
    install.packages(package)
  }
}
loadPackages<-function(package){
  print (paste("Loading",package))
  require(package,character.only = TRUE)
}
installingPackages <- apply(requiredPackages, 1, installPackages)
loadingPackages <- apply(requiredPackages, 1, loadPackages)
```

```{r loadData, echo=TRUE}
# Check for the Directory of the R language
getwd()
#copy the churn_tel.csv to this location, view the files present in the directory
dir()
#import the file in the above directory, then read it here
churn <- read.csv("churn_tel.csv")
# Compactly Display the Structure of churn dataset
str(churn)
#Names of all the attributes in the data set
names(churn)
#summary of the dataset
summary(churn)
#Omit and row which has missing value(there are none), it returns 0 rows
churn[!complete.cases(churn),]
```

```{r contAttrExploration, echo=TRUE}
#Histograms (run them one by one)

hist(
  churn$Day.Mins,
  border = "blue",
  col = "green",
  main = "Histogram for Day Minutes",
  xlab = "Day minutes"
)

hist(
  churn$Day.Calls,
  border = "blue",
  col = "green",
  main = "Histogram for Day Calls",
  xlab = "Day Calls"
)

hist(
  churn$Day.Charge,
  border = "blue",
  col = "green",
  main = "Histogram for Day Charge",
  xlab = "Day Charge"
)

hist(
  churn$Eve.Mins,
  border = "blue",
  col = "green",
  main = "Histogram for Eve Minutes",
  xlab = "Eve minutes"
)

hist(
  churn$Eve.Calls,
  border = "blue",
  col = "green",
  main = "Histogram for Eve Calls",
  xlab = "Eve Calls"
)

hist(
  churn$Eve.Charge,
  border = "blue",
  col = "green",
  main = "Histogram for Eve Charge",
  xlab = "Eve Charge"
)

hist(
  churn$Night.Mins,
  border = "blue",
  col = "green",
  main = "Histogram for Night Minutes",
  xlab = "Night minutes"
)

hist(
  churn$Night.Calls,
  border = "blue",
  col = "green",
  main = "Histogram for Night Calls",
  xlab = "Night Calls"
)

hist(
  churn$Night.Charge,
  border = "blue",
  col = "green",
  main = "Histogram for Night Charge",
  xlab = "Night Charge"
)

hist(
  churn$Intl.Mins,
  border = "blue",
  col = "green",
  main = "Histogram for Night Minutes",
  xlab = "International minutes"
)

hist(
  churn$Intl.Calls,
  border = "blue",
  col = "green",
  main = "Histogram for Night Calls",
  xlab = "International Calls"
)

hist(
  churn$Intl.Charge,
  border = "blue",
  col = "green",
  main = "Histogram for Night Charge",
  xlab = "International Charge"
)


hist(
  churn$CustServ.Calls,
  border = "blue",
  col = "green",
  main = "Histogram for Customer Service Calls",
  xlab = "Customer Service Calls"
)


#scatter plot amongst the seemingly similar variables(continuous)

churnScatter1 <- churn[, c("Day.Mins", "Day.Calls", "Day.Charge")]
colnames(churnScatter1) <-
  c("Day minutes", "Day Calls", "Day Charge")
plot(churnScatter1)

churnScatter2 <- churn[, c("Eve.Mins", "Eve.Calls", "Eve.Charge")]
colnames(churnScatter2) <-
  c("Eve minutes", "Eve Calls", "Eve Charge")
plot(churnScatter2)


churnScatter3 <-
  churn[, c("Intl.Mins", "Intl.Calls", "Intl.Charge")]
colnames(churnScatter3) <-
  c("Intl minutes", "Intl Calls", "Intl Charge")
plot(churnScatter3)

churnScatter4 <-
  churn[, c("Night.Mins", "Night.Calls", "Night.Charge")]
colnames(churnScatter4) <-
  c("Night minutes", "Night Calls", "Night Charge")
plot(churnScatter4)

#Correlation matrix (description in the document)
#On the basis of corelation we eliminate 4 variables, since there were a linear
#function of other 4 variables


cor(churnScatter1[sapply(churnScatter1, is.numeric)])
cor(churnScatter2[sapply(churnScatter2, is.numeric)])
cor(churnScatter3[sapply(churnScatter2, is.numeric)])
cor(churnScatter4[sapply(churnScatter2, is.numeric)])

DayMinsChurn=lm(churn$Day.Mins~churn$Churn)
summary(DayMinsChurn)
plot(churn$Day.Mins~churn$Churn)

EveMinsChurn=lm(churn$Eve.Mins~churn$Churn)
summary(EveMinsChurn)
plot(churn$Eve.Mins~churn$Churn)

NightMinsChurn=lm(churn$Night.Mins~churn$Churn)
summary(NightMinsChurn)
plot(churn$Night.Mins~churn$Churn)

IntlMinsChurn=lm(churn$Intl.Mins~churn$Churn)
summary(IntlMinsChurn)
plot(churn$Intl.Mins~churn$Churn)

#churn predictions based on Charges

DayChargeChurn=lm(churn$Day.Charge~churn$Churn)
summary(DayChargeChurn)
plot(churn$Day.Charge~churn$Churn)

NightChargeChurn=lm(churn$Night.Charge~churn$Churn)
summary(NightChargeChurn)
plot(churn$Night.Charge~churn$Churn)

IntlChargeChurn=lm(churn$Intl.Charge~churn$Churn)
summary(IntlChargeChurn)
plot(churn$Intl.Charge~churn$Churn)

EveChargeChurn=lm(churn$Eve.Charge~churn$Churn)
summary(EveChargeChurn)
plot(churn$Eve.Charge~churn$Churn)

#churn predictions based on calls

DayCallsChurn=lm(churn$Day.Calls~churn$Churn)
summary(DayCallsChurn)
plot(churn$Day.Calls~churn$Churn)

NightCallsChurn=lm(churn$Night.Calls~churn$Churn)
summary(NightCallsChurn)
plot(churn$Night.Calls~churn$Churn)

EvenCallsChurn=lm(churn$Eve.Calls~churn$Churn)
summary(EvenCallsChurn)
plot(churn$Eve.Calls~churn$Churn)

IntlCallsChurn=lm(churn$Intl.Calls~churn$Churn)
summary(IntlCallsChurn)
plot(churn$Intl.Calls~churn$Churn)

CustServChurn=lm(churn$CustServ.Calls~churn$Churn)
summary(CustServChurn)
plot(churn$CustServ.Calls~churn$Churn)


#---------------------Graphical Evidence to retain ------------------------------
#---------------------above variables(Customer Service Call)---------------------
#Customer Service Calls vs Churn

ggplot() +
  geom_bar(data = churn,
           aes(
             x = factor(churn$CustServ.Calls),
             fill = factor(churn$Churn)
           ),
           position = "fill") +
  scale_x_discrete("Customer Service Calls") +
  scale_y_continuous("Percent") +
  guides(fill = guide_legend(title = "Churn")) +
  scale_fill_manual(values = c("green", "red"))
#Conclusion: Customer Service Calls is predictive of churn

```

```{r tTests, echo=TRUE}
#If the p-value is greater than .1, it will not be predictive of churn

t.test(churn$Intl.Calls ~ churn$Churn)
#Retain International Calls

t.test(churn$Day.Calls ~ churn$Churn)
#Eliminate Day Calls Calls

t.test(churn$Night.Calls ~ churn$Churn)
#Eliminate Night Calls

t.test(churn$Eve.Calls ~ churn$Churn)
#Eliminate Eve Calls

t.test(churn$Day.Mins ~ churn$Churn)
#Retain Day Minutes

t.test(churn$Eve.Mins ~ churn$Churn)
#retain Eve minutes

t.test(churn$Night.Mins ~ churn$Churn)
#retain Night minutes

t.test(churn$CustServ.Calls ~ churn$Churn)
#retain Customer Service Calls

#conclusion: Retain Intl Calls, Eve minutes, Nigh, Eve, Day Minutes
#, Customer Service Calls
```

```{r catAttrExploration, echo=TRUE}

#Intl Plan

#table for counts of Churn and International Plan
countsIntlPlan <- table(churn$Churn,
                        churn$Int.l.Plan,
                        dnn = c("Churn", "International Plan"))


#Pie chart wich shows that people who have international plan may churn
slices <- c(countsIntlPlan[1, 2] , countsIntlPlan[2, 2])
lbls <- c("churn: False", "churn: True")
pct <- round(slices / sum(slices) * 100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls, "%", sep = "") # ad % to labels
pie(slices,
    labels = lbls,
    col = rainbow(length(lbls)),
    main = "People Having International Plan")


#Overlayed bar chart
barplot(
  countsIntlPlan,
  legend = rownames(countsIntlPlan),
  col = c("blue", "red"),
  ylim = c(0, 3300),
  ylab = "Count",
  xlab = "International Plan",
  main = "Comparison Bar Chart:
  Churn Proportions by International Plan"
)
box(which = "plot",
    lty = "solid",
    col = "black")


#Clustered Bar Chart of Churn and Intl Plan with legend
barplot(
  t(countsIntlPlan),
  col = c("blue", "green"),
  ylim = c(0, 3300),
  ylab = "Counts",
  xlab = "Churn",
  main = "International Plan Count by Churn",
  beside = TRUE
)
legend(
  "topright",
  c(rownames(countsIntlPlan)),
  col = c("blue", "green"),
  pch = 15,
  title = "Intl Plan"
)
box(which = "plot",
    lty = "solid",
    col = "black")


#Vmail Plan
#weak evidence, but still vmail plan may be predictive
#'cause we can see int row.margin[2,1] and row.margin[2,2]
#that the people who dont have the vmail plan and will churn % = 84
#have vmail and will churn % = 16

countsVmailPlan <- table(churn$Churn, churn$VMail.Plan,
                         dnn = c("Churn", "Vmail Plan"))


row.margin <- round(prop.table(countsVmailPlan, margin = 1), 4)*100
row.margin

#Vmail message's histogram gives us a spike
#For the analysis we say that If Voice Mail Messages > 0 then VoiceMailMessages_Flag = 1;
#otherwise VoiceMailMessages_Flag = 0
#it reveals that it is similar to the vmal plan, hence we can eliminate vmail message

churn$flag[churn$VMail.Message>0] <- 1
churn$flag[churn$VMail.Message<=0] <- 0
table(churn$flag,churn$Churn)
#------------------Multivariate relationships-------------------------------------

#cust serv calls vs day calls
#Conclusion: hiher the
qplot(churn$Day.Mins,
      churn$CustServ.Calls,
      data = churn,
      colour = churn$Churn.)


#Day min Vs Eve min
#conclusion: Higher the day min and evening min, more the churn
qplot(churn$Eve.Mins,churn$Day.Mins,
      data = churn,
      colour = churn$Churn., xlab = "Evening Minutes",
      ylab= "Day Minutes")


```

```{r randomForestTest, echo=TRUE} 
#creates a new file firstforest to test against the test data
#----------------------------------------------------------------------------------

#churn <- read.csv("churn_tel.csv")
churnTrain <- churn[800:3300, ]
churnTest <- churn[1:500, ]


fit <-
  randomForest(
    as.factor(Churn.) ~ Int.l.Plan + VMail.Plan + CustServ.Calls + Day.Mins +
      Eve.Mins + VMail.Message + Night.Mins + Intl.Mins + Intl.Calls,
    data = churnTrain,
    importance = TRUE,
    ntree = 500,
    nodesize = 3
  )
Prediction <- predict(fit, churnTest)
print(nrow(churnTest))
submit <- data.frame(id = churnTest$Phone, Churn = Prediction)
write.csv(submit, file = "firstforest.csv", row.names = FALSE)
fr <- read.csv("firstforest.csv")
count <- table(churnTest$Churn, fr$Churn)
count

#Accuracy of Random Forest(using pie chart)


slices <- c(count[1, 1] + count[2, 2], count[1, 2] + count[2, 1])
lbls <-  c("correct prediction", "incorrect prediction")
pct <- round(slices / sum(slices) * 100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls, "%", sep = "") # ad % to labels
pie(slices,
    labels = lbls,
    col = rainbow(length(lbls)),
    main = "Random Forest Accuracy")


```

```{r j48decisionTree, echo=TRUE}
#creates a new file decisionTree to test against the test data
#----------------------------------------------------------------------------------

#churn <- read.csv("churn_tel.csv")
churnTrain2 <- churn[800:3300, ]
churnTest2 <- churn[1:500, ]


decisionTree <- J48(`Churn.` ~ ., data = churnTrain2)
prediction_tree <- predict(decisionTree,  churnTest2)
count2 <- table(churnTest2$Churn, prediction_tree)
count2


#Accuracy of J48(using pie chart)


slices <-
  c(count2[1, 1] + count2[2, 2], count2[1, 2] + count2[2, 1])
lbls <- c("correct prediction", "incorrect prediction")
pct <- round(slices / sum(slices) * 100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls, "%", sep = "") # ad % to labels
pie(slices,
    labels = lbls,
    col = rainbow(length(lbls)),
    main = "J48 Decision Tree Accuracy")

```