---
title: "Study of tennis performance on different surfaces and factors that affect wins"
author:
- "Alejandro Paredes La Torre, Liangcheng (Jay) Liu"
- "Nzarama Michaella Desire Kouadio, Jahnavi Maddhuri"
subtitle: "2024-12-15"
format: pdf
header-includes:
- \usepackage{float}
- \usepackage{authblk}
- \floatplacement{table}{H}
execute:
echo: false
geometry: margin=0.8in
---
## Abstract
This study explores how player rankings and aces affect tennis match outcomes using Association of Tennis Professionals (ATP) data. The first research question examines the impact of ranking differences on match duration. The second investigates the relationship between the number of aces a player hits and that player's odds of winning.
The methodology combines exploratory data analysis with linear and logistic regression models, supported by visualizations. The findings highlight the connection between player rankings, aces, and match outcomes, while emphasizing the interactive effect of surface type.
Results show that larger ranking gaps are associated with shorter matches, though surface type and match conditions also influence duration. Additionally, hitting more aces is associated with higher odds of winning, with surface types such as clay and hard courts playing a significant role in this relationship.
## Introduction
A substantial body of research has explored the prediction of tennis match outcomes using statistical models, highlighting the importance of player attributes and match statistics. Early studies, such as those by Newton and Keller (2005)\[2\], O'Malley (2008)\[3\], and Riddle (1988)\[4\], demonstrate that under the assumption of independent and identically distributed (iid) point outcomes on a player's serve, the probability of winning a match can be derived from the probabilities of winning points on serve.
Kovalchik (2016) \[5\] conducted a comparison of 11 published tennis prediction models, categorizing them into three classes: point-based models relying on the iid assumption, regression-based models, and paired comparison models. The study found that while point-based models had lower accuracy and higher log loss, regression and paired comparison models generally outperformed them.
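For reference, under the iid assumption these point-based models build match-win probabilities from the probability $p$ that the server wins a single service point (with $q = 1 - p$); the classical probability of holding a service game, for example, is

$$
P(\text{hold serve}) = p^4\left(1 + 4q + 10q^2\right) + 20\,p^3 q^3 \cdot \frac{p^2}{1 - 2pq},
$$

where the first term counts games won before deuce and the second accounts for games decided after reaching deuce.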
## Methods
### Data and preprocessing
The dataset utilized in this study is the Tennis ATP Dataset curated by Jeff Sackmann (Sackmann, 2021) \[1\]. This dataset serves as a comprehensive repository of professional tennis data, encompassing a wide range of player information, historical rankings, match outcomes, and statistical metrics. Specifically, it includes a player file containing detailed biographical data, such as unique player identifiers, names, handedness, birth dates, nationalities, and physical attributes like height. Additionally, ranking files provide a historical record of ATP rankings over time, while the results file covers match outcomes across tour-level, challenger, and futures events. This dataset forms a robust foundation for exploring various aspects of professional tennis performance and trends.
The selected subset includes ATP match data from 2014-2024, covering challenger matches as well as tour-level and tournament class A events such as the Davis Cup, Roland Garros, and others. The records from this period comprise 116,103 matches, each described by 49 variables.
The initial collection of data contains features at the match level, so each record holds information for both the winner and the loser. To analyze the effect of covariates on match wins, this structure was reshaped to the player level, and a binary feature, win, was added that records the outcome of the match from the perspective of each player (win or loss).
To improve the quality of the data, the rank and rank-points fields of players without a ranking were set to zero, since these correspond to unranked players new to these tournaments. Missing player heights were imputed using the average height of players from the same country of birth.
Furthermore, matches for which the number of aces is missing for the winner or the loser also lack the other match statistics, so those records were filtered out. The same rationale applies to serving games, which likewise signal records with missing information overall; those records were also excluded from the analysis data.
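The cleaning steps above can be condensed into a short dplyr pipeline. The following display-only sketch (not executed when the report is rendered) uses the player-level column names defined in the Results code:

```{r}
#| eval: false
#| echo: true
# Condensed sketch of the preprocessing described above
tennis_long <- tennis_long %>%
  mutate(
    rank        = if_else(is.na(rank), 0, rank),                # unranked players -> 0
    rank_points = if_else(is.na(rank_points), 0, rank_points)
  ) %>%
  group_by(player_country) %>%
  mutate(
    player_height = if_else(is.na(player_height),
                            mean(player_height, na.rm = TRUE),  # impute by country-of-birth mean
                            player_height)
  ) %>%
  ungroup() %>%
  filter(!is.na(aces) & !is.na(serve_games))                    # drop records missing serve statistics
```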
### Variable selection
Taking as reference previous research on the most relevant features involved in the outcome of a tennis match (Newton & Keller, 2005 \[2\]; O'Malley, 2008 \[3\]; Kovalchik, 2016 \[5\]), a pre-selection of variables was made while accounting for the limitations of the available data. To further refine feature selection, exploratory data analysis was conducted using correlation plots, box plots, and scatterplots.
### Modeling and evaluation
The present study examines the drivers of match duration using linear regression, with a focus on inference, and identifies the principal factors behind a match win using logistic regression to estimate the probability of winning. The Variance Inflation Factor (VIF) was used to test for multicollinearity. For the linear regression task, the model assumptions are checked via residuals-vs-fitted plots and normal Q-Q plots, and model performance is evaluated using the adjusted R-squared. For the logistic regression, accuracy, precision, recall (sensitivity), specificity, and the F1 score are used to evaluate classification performance.
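A minimal sketch of this evaluation workflow is shown below (display only; `fitted_lm`, `fitted_glm`, and `observed_win` are placeholder names for the fitted models and observed outcomes):

```{r}
#| eval: false
#| echo: true
library(car)    # vif()
library(caret)  # confusionMatrix()

vif(fitted_glm)                    # multicollinearity check
plot(fitted_lm, which = 1)         # residuals vs. fitted (linearity, constant variance)
plot(fitted_lm, which = 2)         # normal Q-Q plot (normality of residuals)
summary(fitted_lm)$adj.r.squared   # adjusted R-squared

# Classification metrics for the logistic model
pred <- factor(ifelse(predict(fitted_glm, type = "response") > 0.5, "Win", "Loss"),
               levels = c("Loss", "Win"))
confusionMatrix(pred, observed_win)  # accuracy, sensitivity, specificity, etc.
```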
## Results
```{r load-packages, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
library(dplyr)
library(Hmisc)
library(cowplot)
library(corrplot)
library(ggplot2)
library(modelsummary)
library(car)
library(conflicted)
library(glmnet)
conflict_prefer("filter", "dplyr")
```
```{r data-1, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
#tennis_qualy_chall <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_qual_chall_2023.csv")
#tennis_futures <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_futures_2023.csv")
#tennis_match <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_2023.csv")
#tennis_player <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_players.csv")
# Combine the datasets using rbind
#all_matches <- rbind(tennis_qualy_chall, tennis_futures, tennis_match)
# Define the base URL for each type of match data
base_urls <- list(
qualy_chall = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_",
#futures = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_futures_",
match = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_"
)
# Get the current year and create a sequence covering the last 10 years
years <- (as.numeric(format(Sys.Date(), "%Y")) - 10):as.numeric(format(Sys.Date(), "%Y"))
# Create a function to download and combine data for all years and match types
download_data <- function(base_url, years) {
data_list <- lapply(years, function(year) {
url <- paste0(base_url, year, ".csv")
tryCatch(
read.csv(url),
error = function(e) {
message("Failed to download: ", url)
NULL
}
)
})
# Combine all years into a single data frame
do.call(rbind, data_list)
}
# Download and combine data for each match type
tennis_qualy_chall <- download_data(base_urls$qualy_chall, years)
#tennis_futures <- download_data(base_urls$futures, years)
tennis_match <- download_data(base_urls$match, years)
# Define a function to standardize column types
standardize_columns <- function(df) {
df %>%
mutate(
tourney_level = as.character(tourney_level),
winner_seed = as.character(winner_seed),
loser_seed = as.character(loser_seed)
) # Convert `tourney_level` to character
}
# Apply the function to each dataset
tennis_qualy_chall <- standardize_columns(tennis_qualy_chall)
#tennis_futures <- standardize_columns(tennis_futures)
tennis_match <- standardize_columns(tennis_match)
# Combine all datasets
tennis <- bind_rows(tennis_qualy_chall,tennis_match) #tennis_futures,
# View the combined dataset
head(tennis)
dim(tennis)
#tennis %>% filter(tourney_level == 'G')
#colnames(tennis)
sum(is.na(tennis$minutes))
tennis %>%
filter(is.na(winner_rank))
#colnames(tennis$l_ace)
#sum(is.na(tennis_futures$l_ace))
```
### Overview of exploratory data analysis
### **Research question 1: Effect of the difference in ranking on match length in minutes**
### **Research question 2: Influence of aces and court surface type on match outcome**
The results for the final fitted model for win prediction are presented in Annex I, since the table is large. The selected variables were included along with an interaction term between aces and the type of surface. Multiple iterations were performed to find the best model, and multicollinearity evaluations were used to assess it.
$$
\begin{aligned}
\log \left( \frac{P(\text{win})}{1 - P(\text{win})} \right) ={} & \beta_0 \\
&+ \beta_1 \cdot \text{draw size} \\
&+ \beta_2 \cdot \text{tourney level} \\
&+ \beta_3 \cdot \text{match num} \\
&+ \beta_4 \cdot \text{player hand} \\
&+ \beta_5 \cdot \text{player height} \\
&+ \beta_6 \cdot \text{player age} \\
&+ \beta_7 \cdot \text{rank} \\
&+ \beta_8 \cdot \text{rank points} \\
&+ \beta_9 \cdot (\text{aces} \times \text{surface}) \\
&+ \beta_{10} \cdot \text{double faults} \\
&+ \beta_{11} \cdot \text{serve points} \\
&+ \beta_{12} \cdot \text{first serves} \\
&+ \beta_{13} \cdot \text{first serves points won} \\
&+ \beta_{14} \cdot \text{second serves points won} \\
&+ \beta_{15} \cdot \text{break points saved} \\
&+ \beta_{16} \cdot \text{break points faced} \\
&+ \beta_{17} \cdot \text{match month}
\end{aligned}
$$
```{r , message=FALSE, warning=FALSE}
# Create a long-format dataset for both winners and losers
tennis_long <- tennis %>%
filter(!is.na(minutes) & !is.na(winner_rank) & !is.na(loser_rank))%>%
mutate(
win = 1,
player_id = winner_id,
player_name = winner_name,
player_seed = winner_seed,
player_entry = winner_entry,
player_hand = winner_hand,
player_ht = winner_ht,
player_ioc = winner_ioc,
player_age = winner_age,
aces = w_ace,
df = w_df,
svpt = w_svpt,
first_in = w_1stIn,
first_won = w_1stWon,
second_won = w_2ndWon,
svgms = w_SvGms,
bp_saved = w_bpSaved,
bp_faced = w_bpFaced,
rank = winner_rank,
rank_points = winner_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
) %>%
bind_rows(
# Create rows for the loser
tennis %>%
mutate(
win = 0,
player_id = loser_id,
player_name = loser_name,
player_seed = loser_seed,
player_entry = loser_entry,
player_hand = loser_hand,
player_ht = loser_ht,
player_ioc = loser_ioc,
player_age = loser_age,
aces = l_ace,
df = l_df,
svpt = l_svpt,
first_in = l_1stIn,
first_won = l_1stWon,
second_won = l_2ndWon,
svgms = l_SvGms,
bp_saved = l_bpSaved,
bp_faced = l_bpFaced,
rank = loser_rank,
rank_points = loser_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
)
)
# Create new columns for year, month, and day
tennis_long <- tennis_long %>%
mutate(
match_year = substr(tourney_date, 1, 4), # Extract the first 4 characters as the year
match_month = substr(tourney_date, 5, 6) # Extract the 5th and 6th characters as the month
)
tennis_long <- tennis_long %>%
rename(
player_height = player_ht,
double_faults = df,
player_country = player_ioc,
serve_points = svpt,
first_serves=first_in,
first_serves_points_won=first_won,
second_serves_points_won=second_won,
serve_games=svgms,
break_points_saved=bp_saved,
break_points_faced=bp_faced
)
# Convert Win to a factor with appropriate labels
tennis_long <- tennis_long |>
mutate(
win = factor(win, levels = c(0, 1), labels = c("Loss", "Win")),
rank = if_else(is.na(rank), 0, rank),
rank_points = if_else(is.na(rank_points), 0, rank_points)
)
tennis_long <- tennis_long %>%
group_by(player_country) %>%
mutate(
player_height = if_else(is.na(player_height),
mean(player_height, na.rm = TRUE),
player_height)
) %>%
ungroup()
tennis_long <- tennis_long |>
mutate(
player_height = if_else(is.na(player_height), mean(player_height, na.rm = TRUE), player_height)
)
tennis_long <- tennis_long %>%
filter(!is.na(aces) & !is.na(player_age) & !is.na(serve_games))
# Get the number of NAs in each column
#na_count <- sapply(tennis_long, function(x) sum(is.na(x)))
# Display the result
#na_count
#dim(tennis_long)
#tennis_long %>%
#202
tennis_mod1 <- glm(win ~ draw_size + tourney_level + match_num + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + serve_points + first_serves + first_serves_points_won + second_serves_points_won + break_points_saved + break_points_faced + match_month,
data=tennis_long,
family="binomial") #player_country+
#summary(tennis_mod1)
#exp(tennis_mod1$coefficients)
# serve_games
# Extracting coefficients, standard errors, and p-values
coefficients <- summary(tennis_mod1)$coefficients
#exp_coefficients <- exp(coefficients[, 1]) # Exponentiate coefficients
# Get confidence intervals
#conf_int <- confint(tennis_mod1)
# Create a data frame for the model summary
summary_df <- data.frame(
Term = rownames(coefficients),
Estimate = coefficients[, 1],
`Standard Error` = coefficients[, 2],
`z-value` = coefficients[, 3],
`P-value` = coefficients[, 4]
#`Exp(Estimate)` = exp_coefficients,
#`CI Lower` = conf_int[, 1],
#`CI Upper` = conf_int[, 2]
)
# Create the kable table
library(knitr)
kable(summary_df, caption = "Logistic Regression Model Summary", digits = 3)
```
```{r}
# Convert tourney level to ordinal variable:
## 0 = D (Davis Cup): Team competition, often less prestigious on an individual level.
## 1 = C (Challengers): Lower-tier tournaments below the main ATP Tour.
## 2 = A (Tour-Level Events): Regular ATP events not part of Masters 1000s, Grand Slams, or Finals.
## 3 = M (Masters 1000s): High-prestige, top-tier ATP tournaments after Grand Slams.
## 4 = G (Grand Slams): The most prestigious tournaments in tennis (Australian Open, French Open, Wimbledon, US Open).
## 5 = F (Tour Finals and Other Season-Ending Events): Exclusive tournaments like the ATP Finals, featuring only the top-ranked players of the season.
tourney_mapping <- c("D" = 0, "C" = 1, "A" = 2, "M" = 3, "G" = 4, "F" = 5)
tennis_long$tourney_level_ord <- as.numeric(tourney_mapping[tennis_long$tourney_level])
# Use VIF to assess multicollinearity amongst all viable variables
# 1. Create a model with no interaction terms and all viable variables
tennis_mod_no_interaction <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + serve_points + first_serves + first_serves_points_won + second_serves_points_won + break_points_saved + break_points_faced + aces + surface,
data=tennis_long,
family="binomial")
vif(tennis_mod_no_interaction)
# 1.1. Results: High VIF score for serve_points, first_serves, first_serves_points_won, second_serves_points_won, break_points_saved, break_points_faced.
# 1.2. Combine/create ratios
tennis_long$first_serve_win_ratio = tennis_long$first_serves_points_won/tennis_long$serve_points
tennis_long$second_serve_win_ratio = tennis_long$second_serves_points_won/(
tennis_long$serve_points - tennis_long$first_serves)
tennis_long$break_pt_save_ratio = tennis_long$break_points_saved/tennis_long$break_points_faced
##tennis[71196,]: loser serve counts do not make sense; drop rows with an infinite second_serve_win_ratio from tennis_long
tennis_long <- tennis_long[!is.infinite(tennis_long$second_serve_win_ratio), ]
# 2. Calculate VIF for new variables
mod_new_var <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface,
data=tennis_long,
family="binomial")
vif(mod_new_var)
# 2.1. RESULTS: draw_size and tourney_level_ord have high VIF scores
tennis_long %>%
group_by(tourney_level_ord) %>%
summarize(mean = mean(draw_size), min = min(draw_size), max = max(draw_size))
# 2.2. Remove tourney_level as more granular info comes from draw_size
# 3. Final model
mod_limit <- glm(win ~ draw_size + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface, data=tennis_long, family="binomial")
vif(mod_limit)
```
```{r}
# Model based on above identified columns and lasso regularization to limit variables and prevent overfitting:
tennis_long_clean <- tennis_long[complete.cases(tennis_long[, c("win", "draw_size", "player_hand", "player_height",
"player_age", "rank", "rank_points", "double_faults",
"first_serve_win_ratio", "second_serve_win_ratio",
"break_pt_save_ratio", "aces", "surface")]), ]
formula = win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio
# Recreate the design matrix and target variable with the cleaned data
X <- model.matrix(formula, data = tennis_long_clean)[, -1] # Remove the intercept column
y <- tennis_long_clean$win
# Fit the logistic regression model with Lasso regularization using glmnet
lasso_model <- glmnet(X, y, family = "binomial", alpha = 1)
# Print the model details
print(lasso_model)
# You can use cv.glmnet for cross-validation to find the best lambda
cv_lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)
# Plot the cross-validation results
plot(cv_lasso_model)
# Get the best lambda (the optimal penalty)
best_lambda <- cv_lasso_model$lambda.min
print(best_lambda)
# Fit the model using the best lambda
final_lasso_model <- glmnet(X, y, family = "binomial", alpha = 1, lambda = best_lambda)
# Print final model coefficients
print(coef(final_lasso_model))
```
```{r}
# Final Model:
m = glm(win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + break_pt_save_ratio,
data=tennis_long_clean,
family="binomial")
# Extract coefficients, p-values, and standard errors
coef_summary <- summary(m)
# Get the coefficients, standard errors, and p-values
coefficients <- as.numeric(coef_summary$coefficients[, "Estimate"])
standard_errors <- as.numeric(coef_summary$coefficients[, "Std. Error"])
p_values <- as.numeric(coef_summary$coefficients[, "Pr(>|z|)"])
# Exponentiate the coefficients to get odds ratios
exp_coefficients <- exp(coefficients)
# Create a data frame to organize the results
coeff_df <- data.frame(
term = rownames(coef_summary$coefficients),
coefficient = coefficients,
exp_coefficient = exp_coefficients,
p_value = p_values,
stringsAsFactors = FALSE
)
# Display the table with kable
colnames(coeff_df) <- c("Variable", "Coefficient", "Odds Ratio", "P-value")
kable(coeff_df, caption = "Logistic Regression Coefficients, Odds Ratios, and p-values")
```
Focusing on the variables of interest, aces and surface, and holding every other variable constant, exponentiating the coefficients yields an odds ratio of roughly 0.94 per additional ace on the reference surface. The clay and hard surface terms were statistically significant: playing on clay multiplies the odds of winning by about 4.52, and playing on a hard court by about 2.5, relative to the reference surface.
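Because aces enter the model through an interaction with surface, the 0.94 figure applies only to the reference surface; for any other surface the per-ace odds ratio combines the main effect with the corresponding interaction coefficient, for example on clay:

$$
\text{OR}_{\text{ace} \mid \text{clay}} = \exp\!\left(\hat{\beta}_{\text{aces}} + \hat{\beta}_{\text{aces} \times \text{clay}}\right)
$$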
```{r , message=FALSE, warning=FALSE}
# Load necessary libraries
library(caret) # For confusionMatrix
library(knitr) # For kable
library(kableExtra)
# Generate predictions
pred_probs <- predict(tennis_mod1, type = "response")
pred_binary <- ifelse(pred_probs > 0.5, 1, 0)
pred_binary <- factor(pred_binary, levels = c(0, 1), labels = c("Loss", "Win"))
# Create a confusion matrix
conf_matrix <- confusionMatrix(pred_binary, tennis_long$win)
conf_matrix_table <- conf_matrix$table
conf_matrix_df <- as.data.frame.matrix(conf_matrix_table)
#conf_matrix_df <- cbind(Actual = rownames(conf_matrix_df), conf_matrix_df)
conf_matrix_latex <- conf_matrix_df %>%
kable("latex", booktabs = TRUE,
caption = "Confusion Matrix: Actual vs. Predicted Satisfaction",
align = "c") %>%
kable_styling(latex_options = c("striped", "hold_position"))
# Extract and prepare the metrics for display
metrics <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Specificity", "Sensitivity", "Positive Predictive Value", "Negative Predictive Value"),
Value = c(
conf_matrix$overall['Accuracy'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['F1'],
conf_matrix$byClass['Specificity'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Neg Pred Value']
)
)
# Display the metrics using kable
#cat("\nMetrics:\n")
kable(metrics, format = "markdown", col.names = c("Metric", "Value"))
```
The performance of the logistic regression model was evaluated using standard classification metrics, including accuracy, precision, recall, and F1 score. The model achieved an accuracy of 0.8117, indicating that approximately 81.17% of predictions matched the true outcomes. The precision of the model was 0.8230, reflecting its ability to correctly identify positive cases while minimizing false positives. The recall was measured at 0.8155, demonstrating the model's capability to correctly identify a high proportion of actual positive cases. Finally, the F1 score, a harmonic mean of precision and recall, was calculated to be 0.8192, indicating a balanced performance between these two metrics. Together, these results suggest the model performs reliably in predicting match outcomes based on the given features.
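The reported F1 score is the harmonic mean of the reported precision and recall:

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.8230 \times 0.8155}{0.8230 + 0.8155} \approx 0.8192
$$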
## Conclusion
## References
\[1\] Sackmann, J. (n.d.). Tennis databases, files, and algorithms \[Data set\]. Tennis Abstract. Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Based on a work at https://github.com/JeffSackmann.
\[2\] Newton, P. K., & Keller, J. B. (2005). Probability of winning at tennis I. Theory and data. Studies in Applied Mathematics, 114(3), 241-269.
\[3\] O'Malley, A. J. (2008). Probability formulas and statistical analysis in tennis. Journal of Quantitative Analysis in Sports, 4(2).
\[4\] Riddle, L. H. (1988). Probability models for tennis scoring systems. Journal of the Royal Statistical Society Series C: Applied Statistics, 37(1), 63-75.
\[5\] Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3), 127-138.