---
title: "Study of tennis performance on different surfaces and factors that affect wins"
author:
- "Alejandro Paredes La Torre, Liangcheng (Jay) Liu"
- "Nzarama Michaella Desire Kouadio, Jahnavi Maddhuri"
subtitle: "2024-12-15"
format: pdf
header-includes:
- \usepackage{float}
- \usepackage{authblk}
- \floatplacement{table}{H}
execute:
echo: false
geometry: margin=0.8in
---
## Abstract
This study explores how player rankings and aces affect tennis match outcomes using Association of Tennis Professionals (ATP) data. The first research question examines the impact of ranking differences on match duration. The second investigates the relationship between the number of aces a player hits and that player's odds of winning.
The methodology combines exploratory data analysis with linear and logistic regression models, supported by visualizations. The findings highlight the connection between player rankings, aces, and match outcomes, while emphasizing the interactive effect of surface type.
Results show that larger ranking gaps are associated with shorter matches, though surface type and match conditions also influence duration. Additionally, hitting more aces is associated with higher odds of winning, with surface types such as clay and hard courts playing a significant role in this relationship.
## Introduction
A substantial body of research has explored the prediction of tennis match outcomes using statistical models, highlighting the importance of player attributes and match statistics. Early studies, such as those by Newton and Keller (2005)\[2\], O'Malley (2008)\[3\], and Riddle (1988)\[4\], demonstrate that under the assumption of independent and identically distributed (iid) point outcomes on a player's serve, the probability of winning a match can be derived from the probabilities of winning points on serve.
Kovalchik (2016) \[5\] conducted a comparison of 11 published tennis prediction models, categorizing them into three classes: point-based models relying on the iid assumption, regression-based models, and paired comparison models. The study found that while point-based models had lower accuracy and higher log loss, regression and paired comparison models generally outperformed them.
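For reference, under the iid assumption these point-based models build match-win probabilities from the probability $p$ that the server wins a single service point (with $q = 1 - p$); the classical probability of holding a service game, for example, is

$$
P(\text{hold serve}) = p^4\left(1 + 4q + 10q^2\right) + 20\,p^3 q^3 \cdot \frac{p^2}{1 - 2pq},
$$

where the first term counts games won before deuce and the second accounts for games decided after reaching deuce.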
## Methods
### Data and preprocessing
The dataset utilized in this study is the Tennis ATP Dataset curated by Jeff Sackmann (Sackmann, 2021) \[1\]. This dataset serves as a comprehensive repository of professional tennis data, encompassing a wide range of player information, historical rankings, match outcomes, and statistical metrics. Specifically, it includes a player file containing detailed biographical data, such as unique player identifiers, names, handedness, birth dates, nationalities, and physical attributes like height. Additionally, ranking files provide a historical record of ATP rankings over time, while the results file covers match outcomes across tour-level, challenger, and futures events. This dataset forms a robust foundation for exploring various aspects of professional tennis performance and trends.
The selected subset includes ATP match data from 2014-2024, covering challenger matches as well as tour-level and tournament class A events such as the Davis Cup, Roland Garros, and others. The records from this period comprise 116,103 matches, each described by 49 variables.
The initial collection of data contains features at the match level, so each record holds information for both the winner and the loser. To analyze the effect of covariates on match wins, this structure was reshaped to the player level, and a binary feature, win, was added that records the outcome of the match from the perspective of each player (win or loss).
To improve the quality of the data, the rank and rank-points fields of players without a ranking were set to zero, since these correspond to unranked players new to these tournaments. Missing player heights were imputed using the average height of players from the same country of birth.
Furthermore, matches for which the number of aces is missing for the winner or the loser also lack the other match statistics, so those records were filtered out. The same rationale applies to serving games, which likewise signal records with missing information overall; those records were also excluded from the analysis data.
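The cleaning steps above can be condensed into a short dplyr pipeline. The following display-only sketch (not executed when the report is rendered) uses the player-level column names defined in the Results code:

```{r}
#| eval: false
#| echo: true
# Condensed sketch of the preprocessing described above
tennis_long <- tennis_long %>%
  mutate(
    rank        = if_else(is.na(rank), 0, rank),                # unranked players -> 0
    rank_points = if_else(is.na(rank_points), 0, rank_points)
  ) %>%
  group_by(player_country) %>%
  mutate(
    player_height = if_else(is.na(player_height),
                            mean(player_height, na.rm = TRUE),  # impute by country-of-birth mean
                            player_height)
  ) %>%
  ungroup() %>%
  filter(!is.na(aces) & !is.na(serve_games))                    # drop records missing serve statistics
```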
### Variable selection
Taking as reference previous research on the most relevant features involved in the outcome of a tennis match (Newton & Keller, 2005 \[2\]; O'Malley, 2008 \[3\]; Kovalchik, 2016 \[5\]), a pre-selection of variables was made while accounting for the limitations of the available data. To further refine feature selection, exploratory data analysis was conducted using correlation plots, box plots, and scatterplots.
### Modeling and evaluation
The present study examines the drivers of match duration using linear regression, with a focus on inference, and identifies the principal factors behind a match win using logistic regression to estimate the probability of winning. The Variance Inflation Factor (VIF) was used to test for multicollinearity. For the linear regression task, the model assumptions are checked via residuals-vs-fitted plots and normal Q-Q plots, and model performance is evaluated using the adjusted R-squared. For the logistic regression, accuracy, precision, recall (sensitivity), specificity, and the F1 score are used to evaluate classification performance.
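A minimal sketch of this evaluation workflow is shown below (display only; `fitted_lm`, `fitted_glm`, and `observed_win` are placeholder names for the fitted models and observed outcomes):

```{r}
#| eval: false
#| echo: true
library(car)    # vif()
library(caret)  # confusionMatrix()

vif(fitted_glm)                    # multicollinearity check
plot(fitted_lm, which = 1)         # residuals vs. fitted (linearity, constant variance)
plot(fitted_lm, which = 2)         # normal Q-Q plot (normality of residuals)
summary(fitted_lm)$adj.r.squared   # adjusted R-squared

# Classification metrics for the logistic model
pred <- factor(ifelse(predict(fitted_glm, type = "response") > 0.5, "Win", "Loss"),
               levels = c("Loss", "Win"))
confusionMatrix(pred, observed_win)  # accuracy, sensitivity, specificity, etc.
```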
## Results
```{r load-packages, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
library(dplyr)
library(Hmisc)
library(cowplot)
library(corrplot)
library(ggplot2)
library(modelsummary)
library(car)
library(conflicted)
library(glmnet)
conflict_prefer("filter", "dplyr")
```
```{r data-1, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
#tennis_qualy_chall <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_qual_chall_2023.csv")
#tennis_futures <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_futures_2023.csv")
#tennis_match <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_2023.csv")
#tennis_player <- read.csv("https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_players.csv")
# Combine the datasets using rbind
#all_matches <- rbind(tennis_qualy_chall, tennis_futures, tennis_match)
# Define the base URL for each type of match data
base_urls <- list(
qualy_chall = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_",
#futures = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_futures_",
match = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_"
)
# Get the current year and create a sequence covering the last 10 years
years <- (as.numeric(format(Sys.Date(), "%Y")) - 10):as.numeric(format(Sys.Date(), "%Y"))
# Create a function to download and combine data for all years and match types
download_data <- function(base_url, years) {
data_list <- lapply(years, function(year) {
url <- paste0(base_url, year, ".csv")
tryCatch(
read.csv(url),
error = function(e) {
message("Failed to download: ", url)
NULL
}
)
})
# Combine all years into a single data frame
do.call(rbind, data_list)
}
# Download and combine data for each match type
tennis_qualy_chall <- download_data(base_urls$qualy_chall, years)
#tennis_futures <- download_data(base_urls$futures, years)
tennis_match <- download_data(base_urls$match, years)
# Define a function to standardize column types
standardize_columns <- function(df) {
df %>%
mutate(
tourney_level = as.character(tourney_level),
winner_seed = as.character(winner_seed),
loser_seed = as.character(loser_seed)
) # Convert `tourney_level` to character
}
# Apply the function to each dataset
tennis_qualy_chall <- standardize_columns(tennis_qualy_chall)
#tennis_futures <- standardize_columns(tennis_futures)
tennis_match <- standardize_columns(tennis_match)
# Combine all datasets
tennis <- bind_rows(tennis_qualy_chall,tennis_match) #tennis_futures,
# View the combined dataset
head(tennis)
dim(tennis)
#tennis %>% filter(tourney_level == 'G')
#colnames(tennis)
sum(is.na(tennis$minutes))
tennis %>%
filter(is.na(winner_rank))
#colnames(tennis$l_ace)
#sum(is.na(tennis_futures$l_ace))
```
### Overview of exploratory data analysis
### **Research question 1: Effect of the difference in ranking on match length in minutes**
### **Research question 2: Influence of aces and court surface type on match outcome**
The results for the final fitted model for win prediction are presented in Annex I, since the table is large. The selected variables were included along with an interaction term between aces and the type of surface. Multiple iterations were performed to find the best model, and multicollinearity evaluations were used to assess it.
$$
\begin{aligned}
\log \left( \frac{P(\text{win})}{1 - P(\text{win})} \right) ={} & \beta_0 \\
&+ \beta_1 \cdot \text{draw size} \\
&+ \beta_2 \cdot \text{tourney level} \\
&+ \beta_3 \cdot \text{match num} \\
&+ \beta_4 \cdot \text{player hand} \\
&+ \beta_5 \cdot \text{player height} \\
&+ \beta_6 \cdot \text{player age} \\
&+ \beta_7 \cdot \text{rank} \\
&+ \beta_8 \cdot \text{rank points} \\
&+ \beta_9 \cdot (\text{aces} \times \text{surface}) \\
&+ \beta_{10} \cdot \text{double faults} \\
&+ \beta_{11} \cdot \text{serve points} \\
&+ \beta_{12} \cdot \text{first serves} \\
&+ \beta_{13} \cdot \text{first serves points won} \\
&+ \beta_{14} \cdot \text{second serves points won} \\
&+ \beta_{15} \cdot \text{break points saved} \\
&+ \beta_{16} \cdot \text{break points faced} \\
&+ \beta_{17} \cdot \text{match month}
\end{aligned}
$$
```{r , message=FALSE, warning=FALSE}
# Create a long-format dataset for both winners and losers
tennis_long <- tennis %>%
filter(!is.na(minutes) & !is.na(winner_rank) & !is.na(loser_rank))%>%
mutate(
win = 1,
player_id = winner_id,
player_name = winner_name,
player_seed = winner_seed,
player_entry = winner_entry,
player_hand = winner_hand,
player_ht = winner_ht,
player_ioc = winner_ioc,
player_age = winner_age,
aces = w_ace,
df = w_df,
svpt = w_svpt,
first_in = w_1stIn,
first_won = w_1stWon,
second_won = w_2ndWon,
svgms = w_SvGms,
bp_saved = w_bpSaved,
bp_faced = w_bpFaced,
rank = winner_rank,
rank_points = winner_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
) %>%
bind_rows(
# Create rows for the loser
tennis %>%
mutate(
win = 0,
player_id = loser_id,
player_name = loser_name,
player_seed = loser_seed,
player_entry = loser_entry,
player_hand = loser_hand,
player_ht = loser_ht,
player_ioc = loser_ioc,
player_age = loser_age,
aces = l_ace,
df = l_df,
svpt = l_svpt,
first_in = l_1stIn,
first_won = l_1stWon,
second_won = l_2ndWon,
svgms = l_SvGms,
bp_saved = l_bpSaved,
bp_faced = l_bpFaced,
rank = loser_rank,
rank_points = loser_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
)
)
# Create new columns for year, month, and day
tennis_long <- tennis_long %>%
mutate(
match_year = substr(tourney_date, 1, 4), # Extract the first 4 characters as the year
match_month = substr(tourney_date, 5, 6) # Extract the 5th and 6th characters as the month
)
tennis_long <- tennis_long %>%
rename(
player_height = player_ht,
double_faults = df,
player_country = player_ioc,
serve_points = svpt,
first_serves=first_in,
first_serves_points_won=first_won,
second_serves_points_won=second_won,
serve_games=svgms,
break_points_saved=bp_saved,
break_points_faced=bp_faced
)
# Convert Win to a factor with appropriate labels
tennis_long <- tennis_long |>
mutate(
win = factor(win, levels = c(0, 1), labels = c("Loss", "Win")),
rank = if_else(is.na(rank), 0, rank),
rank_points = if_else(is.na(rank_points), 0, rank_points)
)
tennis_long <- tennis_long %>%
group_by(player_country) %>%
mutate(
player_height = if_else(is.na(player_height),
mean(player_height, na.rm = TRUE),
player_height)
) %>%
ungroup()
tennis_long <- tennis_long |>
mutate(
player_height = if_else(is.na(player_height), mean(player_height, na.rm = TRUE), player_height)
)
tennis_long <- tennis_long %>%
filter(!is.na(aces) & !is.na(player_age) & !is.na(serve_games))
# Get the number of NAs in each column
#na_count <- sapply(tennis_long, function(x) sum(is.na(x)))
# Display the result
#na_count
#dim(tennis_long)
#tennis_long %>%
#202
tennis_mod1 <- glm(win ~ draw_size + tourney_level + match_num + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + serve_points + first_serves + first_serves_points_won + second_serves_points_won + break_points_saved + break_points_faced + match_month,
data=tennis_long,
family="binomial") #player_country+
#summary(tennis_mod1)
#exp(tennis_mod1$coefficients)
# serve_games
# Extracting coefficients, standard errors, and p-values
coefficients <- summary(tennis_mod1)$coefficients
#exp_coefficients <- exp(coefficients[, 1]) # Exponentiate coefficients
# Get confidence intervals
#conf_int <- confint(tennis_mod1)
# Create a data frame for the model summary
summary_df <- data.frame(
Term = rownames(coefficients),
Estimate = coefficients[, 1],
`Standard Error` = coefficients[, 2],
`z-value` = coefficients[, 3],
`P-value` = coefficients[, 4]
#`Exp(Estimate)` = exp_coefficients,
#`CI Lower` = conf_int[, 1],
#`CI Upper` = conf_int[, 2]
)
# Create the kable table
library(knitr)
kable(summary_df, caption = "Logistic Regression Model Summary", digits = 3)
```
```{r}
# Convert tourney level to ordinal variable:
## 0 = D (Davis Cup): Team competition, often less prestigious on an individual level.
## 1 = C (Challengers): Lower-tier tournaments below the main ATP Tour.
## 2 = A (Tour-Level Events): Regular ATP events not part of Masters 1000s, Grand Slams, or Finals.
## 3 = M (Masters 1000s): High-prestige, top-tier ATP tournaments after Grand Slams.
## 4 = G (Grand Slams): The most prestigious tournaments in tennis (Australian Open, French Open, Wimbledon, US Open).
## 5 = F (Tour Finals and Other Season-Ending Events): Exclusive tournaments like the ATP Finals, featuring only the top-ranked players of the season.
tourney_mapping <- c("D" = 0, "C" = 1, "A" = 2, "M" = 3, "G" = 4, "F" = 5)
tennis_long$tourney_level_ord <- as.numeric(tourney_mapping[tennis_long$tourney_level])
# Use VIF to assess multicollinearity amongst all viable variables
# 1. Create a model with no interaction terms and all viable variables
tennis_mod_no_interaction <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + serve_points + first_serves + first_serves_points_won + second_serves_points_won + break_points_saved + break_points_faced + aces + surface,
data=tennis_long,
family="binomial")
vif(tennis_mod_no_interaction)
# 1.1. Results: High VIF score for serve_points, first_serves, first_serves_points_won, second_serves_points_won, break_points_saved, break_points_faced.
# 1.2. Combine/create ratios
tennis_long$first_serve_win_ratio = tennis_long$first_serves_points_won/tennis_long$serve_points
tennis_long$second_serve_win_ratio = tennis_long$second_serves_points_won/(
tennis_long$serve_points - tennis_long$first_serves)
tennis_long$break_pt_save_ratio = tennis_long$break_points_saved/tennis_long$break_points_faced
##tennis[71196,]: loser serve counts do not make sense; drop rows with an infinite second_serve_win_ratio from tennis_long
tennis_long <- tennis_long[!is.infinite(tennis_long$second_serve_win_ratio), ]
# 2. Calculate VIF for new variables
mod_new_var <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface,
data=tennis_long,
family="binomial")
vif(mod_new_var)
# 2.1. RESULTS: draw_size and tourney_level_ord have high VIF scores
tennis_long %>%
group_by(tourney_level_ord) %>%
summarize(mean = mean(draw_size), min = min(draw_size), max = max(draw_size))
# 2.2. Remove tourney_level as more granular info comes from draw_size
# 3. Final model
mod_limit <- glm(win ~ draw_size + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface, data=tennis_long, family="binomial")
vif(mod_limit)
```
```{r}
# Model based on above identified columns and lasso regularization to limit variables and prevent overfitting:
tennis_long_clean <- tennis_long[complete.cases(tennis_long[, c("win", "draw_size", "player_hand", "player_height",
"player_age", "rank", "rank_points", "double_faults",
"first_serve_win_ratio", "second_serve_win_ratio",
"break_pt_save_ratio", "aces", "surface")]), ]
formula = win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio
# Recreate the design matrix and target variable with the cleaned data
X <- model.matrix(formula, data = tennis_long_clean)[, -1] # Remove the intercept column
y <- tennis_long_clean$win
# Fit the logistic regression model with Lasso regularization using glmnet
lasso_model <- glmnet(X, y, family = "binomial", alpha = 1)
# Print the model details
print(lasso_model)
# You can use cv.glmnet for cross-validation to find the best lambda
cv_lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)
# Plot the cross-validation results
plot(cv_lasso_model)
# Get the best lambda (the optimal penalty)
best_lambda <- cv_lasso_model$lambda.min
print(best_lambda)
# Fit the model using the best lambda
final_lasso_model <- glmnet(X, y, family = "binomial", alpha = 1, lambda = best_lambda)
# Print final model coefficients
print(coef(final_lasso_model))
```
```{r}
# Final Model:
m = glm(win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + break_pt_save_ratio,
data=tennis_long_clean,
family="binomial")
# Extract coefficients, p-values, and standard errors
coef_summary <- summary(m)
# Get the coefficients, standard errors, and p-values
coefficients <- as.numeric(coef_summary$coefficients[, "Estimate"])
standard_errors <- as.numeric(coef_summary$coefficients[, "Std. Error"])
p_values <- as.numeric(coef_summary$coefficients[, "Pr(>|z|)"])
# Exponentiate the coefficients to get odds ratios
exp_coefficients <- exp(coefficients)
# Create a data frame to organize the results
coeff_df <- data.frame(
term = rownames(coef_summary$coefficients),
coefficient = coefficients,
exp_coefficient = exp_coefficients,
p_value = p_values,
stringsAsFactors = FALSE
)
# Display the table with kable
colnames(coeff_df) <- c("Variable", "Coefficient", "Odds Ratio", "P-value")
kable(coeff_df, caption = "Logistic Regression Coefficients, Odds Ratios, and p-values")
```
Focusing on the variables of interest, aces and surface, and holding every other variable constant, exponentiating the coefficients yields an odds ratio of roughly 0.94 per additional ace on the reference surface. The clay and hard surface terms were statistically significant: playing on clay multiplies the odds of winning by about 4.52, and playing on a hard court by about 2.5, relative to the reference surface.
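Because aces enter the model through an interaction with surface, the 0.94 figure applies only to the reference surface; for any other surface the per-ace odds ratio combines the main effect with the corresponding interaction coefficient, for example on clay:

$$
\text{OR}_{\text{ace} \mid \text{clay}} = \exp\!\left(\hat{\beta}_{\text{aces}} + \hat{\beta}_{\text{aces} \times \text{clay}}\right)
$$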
```{r , message=FALSE, warning=FALSE}
# Load necessary libraries
library(caret) # For confusionMatrix
library(knitr) # For kable
library(kableExtra)
# Generate predictions
pred_probs <- predict(tennis_mod1, type = "response")
pred_binary <- ifelse(pred_probs > 0.5, 1, 0)
pred_binary <- factor(pred_binary, levels = c(0, 1), labels = c("Loss", "Win"))
# Create a confusion matrix
conf_matrix <- confusionMatrix(pred_binary, tennis_long$win)
conf_matrix_table <- conf_matrix$table
conf_matrix_df <- as.data.frame.matrix(conf_matrix_table)
#conf_matrix_df <- cbind(Actual = rownames(conf_matrix_df), conf_matrix_df)
conf_matrix_latex <- conf_matrix_df %>%
kable("latex", booktabs = TRUE,
caption = "Confusion Matrix: Actual vs. Predicted Satisfaction",
align = "c") %>%
kable_styling(latex_options = c("striped", "hold_position"))
# Extract and prepare the metrics for display
metrics <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Specificity", "Sensitivity", "Positive Predictive Value", "Negative Predictive Value"),
Value = c(
conf_matrix$overall['Accuracy'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['F1'],
conf_matrix$byClass['Specificity'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Neg Pred Value']
)
)
# Display the metrics using kable
#cat("\nMetrics:\n")
kable(metrics, format = "markdown", col.names = c("Metric", "Value"))
```
The performance of the logistic regression model was evaluated using standard classification metrics, including accuracy, precision, recall, and F1 score. The model achieved an accuracy of 0.8117, indicating that approximately 81.17% of predictions matched the true outcomes. The precision of the model was 0.8230, reflecting its ability to correctly identify positive cases while minimizing false positives. The recall was measured at 0.8155, demonstrating the model's capability to correctly identify a high proportion of actual positive cases. Finally, the F1 score, a harmonic mean of precision and recall, was calculated to be 0.8192, indicating a balanced performance between these two metrics. Together, these results suggest the model performs reliably in predicting match outcomes based on the given features.
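The reported F1 score is the harmonic mean of the reported precision and recall:

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times 0.8230 \times 0.8155}{0.8230 + 0.8155} \approx 0.8192
$$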
## Conclusion
## References
\[1\] Sackmann, J. (n.d.). Tennis databases, files, and algorithms \[Data set\]. Tennis Abstract. Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Based on a work at https://github.com/JeffSackmann.
\[2\] Newton, P. K., & Keller, J. B. (2005). Probability of winning at tennis I. Theory and data. Studies in Applied Mathematics, 114(3), 241-269.
\[3\] O'Malley, A. J. (2008). Probability formulas and statistical analysis in tennis. Journal of Quantitative Analysis in Sports, 4(2).
\[4\] Riddle, L. H. (1988). Probability models for tennis scoring systems. Journal of the Royal Statistical Society Series C: Applied Statistics, 37(1), 63-75.
\[5\] Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3), 127-138.