-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathIDS702_Final_Tennis_Official_20241214.qmd
896 lines (715 loc) · 52.9 KB
/
IDS702_Final_Tennis_Official_20241214.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
---
title: "Study of Factors Impacting Tennis Match Durations and Winning Outcomes"
author:
- "Alejandro Paredes La Torre, Liangcheng (Jay) Liu"
- "Nzarama Kouadio, Jahnavi Maddhuri"
subtitle: "2024-12-15"
format: pdf
header-includes:
- \usepackage{float}
- \usepackage{authblk}
- \floatplacement{table}{H}
execute:
echo: false
geometry: margin=0.8in
---
## Abstract
Modeling the key factors that influence professional tennis match outcomes is crucial for athletes, coaches, and researchers seeking to optimize performance and enhance player development. This study utilizes ATP data to examine how player rankings and aces impact match duration and outcomes. The analysis reveals that larger ranking disparities are associated with shorter match duration, suggesting that players with higher rankings are more likely to win quickly. In addition, the study investigates the relationship between the number of aces a player hits and their probability of winning, finding that more aces significantly improve a player's chances of victory. The effect of aces varies across different court surfaces, with clay courts showing the strongest influence. This research provides valuable insights into the complex dynamics of tennis matches and highlights the importance of surface type, player ranking, and performance metrics like aces in shaping match outcomes.
## Introduction
Predicting outcomes in sports has long been a central focus for athletes, teams, and industries alike. Statistical science, when applied to sports, plays a pivotal role in optimizing an athlete’s performance, making it a subject of significant interest for both researchers and commercial enterprises. In professional tennis, analyzing the factors that influence match outcomes offers insights that extend beyond simple predictions, encompassing strategic advancements and broader applications.
Such insights have practical implications for refining rankings (Klaassen and Magnus, 2003)[1], ratings (Kovalchik, 2016)[2], and seedings(Boulier et. al 1999)[3], serving as a foundation for performance analysis and strategy optimization. Predictive models are pivotal in industries like sports betting, where they drive decision-making and risk management (Foley-Train, 2014)[4]. In academic research, these models contribute to methodological improvements, as seen in the work of Štrumbelj & Vračar (2012)[5], who explored probabilistic approaches to predicting match outcomes.
Analyzing professional tennis data bridges practical applications and theoretical advancements. A growing body of literature underscores the role of data-driven approaches in uncovering intricate patterns within tennis. For instance, Boulier and Stekler (1999)[6] demonstrated how statistical models could be used to assess player performance, while Lasek, Szlávik, and Bhulai (2013)[8] examined predictive methods for ranking systems. McHale and Davies (2007)[9] explored the influence of match dynamics on outcomes, further bridging theoretical advancements with practical applications.These studies highlight how data-driven approaches can uncover nuanced patterns in sports, offering a deeper understanding of the game and its predictive dynamics
Early foundational studies, such as those by Newton and Keller (2005)[9], O'Malley (2008)[10], and Riddle (1988)[11], established that under the assumption of independent and identically distributed (iid) point outcomes on a player's serve, the probability of winning a match could be derived from serve-point probabilities. These studies laid the groundwork for subsequent research in probabilistic modeling of tennis outcomes.
Kovalchik (2016) [12] conducted a comparison of 11 published tennis prediction models, categorizing them into three classes: point-based models relying on the iid assumption, regression-based models, and paired comparison models. The study found that point-based models had lower accuracy and higher log loss, regression and paired comparison models generally outperformed them.
Furthermore, given the importance of player rankings and performance metrics in tennis prediction models, it is essential to understand the context provided by the ATP (Association of Tennis Professionals). This institution is the principal governing body of men's professional tennis. It oversees the ATP Tour, which features the highest level of men's tennis tournaments worldwide, including Grand Slams, Masters 1000 events, and other competitive circuits. The ATP rankings system, introduced in 1973, is used to evaluate and rank players based on their performance in sanctioned tournaments over a rolling 52-week period, serving as a critical metric for seedings and qualifications (ATP Tour, n.d.)[13].
Building on these studies, this research uses the Tennis ATP dataset (Sackmann, 2021)[14], a collection of information for professional tennis matches and professional players, to investigate two specific questions:
1. How does the difference in player rankings influence the duration of a tennis match?
2. How do the number of aces and the court surface type affect a player's odds of winning a match?
## Methods
### Data and preprocessing
The dataset used in this study is the Tennis ATP Dataset curated by Jeff Sackmann (2021) [1]. This dataset serves as a comprehensive repository of professional tennis data, encompassing a wide range of player information, historical rankings, match outcomes, and statistical metrics. This dataset serves as a valuable resource for analyzing trends and performance in professional tennis, forming the basis for addressing the research questions in this study.
The time frame selected includes ATP match data from 2014-2024, the subsets chosen are challenger matches and professional and tournament class A matches such as Davis Cup, Roland Garros and others. The records from this period consist in 116,103 matches where each match has 49 variables.
Pertaining the first research question, to assess the effect of factors on the match length, the difference in the number of aces, rankings, and ranking points between the winner and loser were calculated. These perfomance metrics aim to provide more tangible and interpretable predictors for the model by focusing on measurable aspects of the players performance. Grouping variables into performance metrics also helps to frame the analysis in a way that aligns with the context of the sport and makes the regressors more meaningful and insightful. Based on a VIF analysis, highly correlated variables were addressed by introducing derived metrics, such as the first break point win ratio, to mitigate multicollinearity and improve model stability.
To address the issue of missing values, a multiple imputation approach was applied using the mice package. This method generates several imputed datasets to account for the uncertainty inherent in estimating missing values, thereby enhancing the robustness of subsequent analyses. Predictive mean matching (PMM) was employed as the imputation technique, combining regression-based prediction with donor-based imputation. This ensures that imputed values align with the observed data distribution, remaining realistic and within plausible ranges.
In respect to the second research question, which focuses on analyzing match outcomes, additional data processing steps were taken. The original dataset, which contains match-level features with information on both the winner and loser in a single record, was restructured. This transformation reorganized the data at the player level, ensuring that each record represents either a win or a loss for a given player. The dimensions for this transformed dataset are 218321 records and 30 variables. This adjustment allows for a more focused analysis of the factors influencing match outcomes. Furthermore, in order to deal with multicolinearity new derived variables such as the first serve win ratio, calculated from two underlying variables, were used to merge highly correlated variables, the procedure being explained in detail in the results section.
For the second research question, 4,739 records (2%) with missing values in the aces column were excluded. This decision was based on the observation that missing data in aces systematically coincided with missing values for all other match statistics. A similar approach was applied to the serving games variable 4,740 records (2%), as its missing data also indicated the absence of other critical match details, further supporting exclusion. For the remaining variables without systematic missingness, a multiple imputation technique was applied, informed by the results of the imputation method used in the first research question.
### Variable selection
Taking as reference previous research regarding the most relevant features involved in the outcome of a tennis match (Newton et al., 2005[9]; O'Malley 2008[10]; Kovalchik, 2016[11]) a pre selection was made observing the limitation of the available data. Furthermore, adding and interaction term for the model was supported by the work of Sklenička, J. (2024), who modeled matches outcome using as factors the different type of surfaces. In order to further refine the process of feature selection exploratory data analysis was conducted using correlation plots, box plots and scatterplots.
### Model fitting and evaluation
In regard to the first research question, Multiple linear regression (MLR) was used to analyze the factors influencing match duration. Variance Inflation Factor (VIF) revealed collinearity among some variables, prompting the creation of ratios for breakpoints saved and faced to address this issue. Assumptions of linear regression, such as normality, homoscedasticity, and linearity, were evaluated using diagnostic plots. To address deviations from normality, a log transformation was applied to the target variable, match duration, which significantly improved the distribution of residuals. Residual versus fitted plots and Q-Q plots were used to assess model assumptions and ensure the validity of results. R-squared was used to evaluate the overall performance of the linear regression model. All statistical analyses were performed using R programming language (version 4.3.1)
Regarding the second research question a logistic regression model was developed to predict the outcome of a tennis match, with the dependent variable being a binary outcome representing a win or loss. A series of predictor variables were selected based on relevant literature and expert knowledge, including player attributes, tournament characteristics, and performance metrics. The initial model included a variety of raw variables such as player height, age, rank, double faults, aces, serve statistics, and tournament-level indicators. However, multicollinearity among these predictors was detected using Variance Inflation Factor (VIF) scores, which were notably high for several variables related to serve and break-point statistics. To mitigate multicollinearity, new derived variables such as the first serve win ratio, second serve win ratio, and break-point save ratio were created. Multiple imputation was applied to handle missing data, followed by the use of pooled regression coefficients to account for uncertainty in the imputed data. The receiver operating characteristic (ROC) curve and area under the curve (AUC) were computed to evaluate the classification performance. Finally, binary predictions were generated, and a confusion matrix was produced, along with several classification metrics including accuracy, precision, recall, F1 score, specificity, and sensitivity.
## Results
```{r load-packages, message = FALSE, warning = FALSE, echo = FALSE}
library(tidyverse)
library(dplyr)
library(tidyverse)
library(Hmisc)
library(cowplot)
library(corrplot)
library(ggplot2)
library(modelsummary)
library(car)
library(knitr)
library(conflicted)
library(glmnet)
conflict_prefer("filter", "dplyr")
```
```{r data-1, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Define the base URL for each type of match data
base_urls <- list(
qualy_chall = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_qual_chall_",
#futures = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_futures_",
match = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_"
)
# Get the current year and create a sequence for the last 5 years
years <- (as.numeric(format(Sys.Date(), "%Y")) - 10):as.numeric(format(Sys.Date(), "%Y"))
# Create a function to download and combine data for all years and match types
download_data <- function(base_url, years) {
data_list <- lapply(years, function(year) {
url <- paste0(base_url, year, ".csv")
tryCatch(
read.csv(url),
error = function(e) {
message("Failed to download: ", url)
NULL
}
)
})
# Combine all years into a single data frame
do.call(rbind, data_list)
}
# Download and combine data for each match type
tennis_qualy_chall <- download_data(base_urls$qualy_chall, years)
#tennis_futures <- download_data(base_urls$futures, years)
tennis_match <- download_data(base_urls$match, years)
# Define a function to standardize column types
standardize_columns <- function(df) {
df %>%
mutate(
tourney_level = as.character(tourney_level),
winner_seed = as.character(winner_seed),
loser_seed = as.character(loser_seed)
)
}
# Apply the function to each dataset
tennis_qualy_chall <- standardize_columns(tennis_qualy_chall)
tennis_match <- standardize_columns(tennis_match)
# Combine all datasets
tennis <- bind_rows(tennis_qualy_chall,tennis_match) #tennis_futures,
tennis <- tennis %>%
mutate(
winner_hand = case_when(
winner_hand == "" | winner_hand == "U" ~ "A", # Convert "" or "U" to "A" for winner
TRUE ~ winner_hand # Keep the original value for other cases
),
loser_hand = case_when(
loser_hand == "" | loser_hand == "U" ~ "A", # Convert "" or "U" to "A" for loser
TRUE ~ loser_hand # Keep the original value for other cases
)
)
tourney_mapping <- c("D" = 0, "C" = 1, "A" = 2, "M" = 3, "G" = 4, "F" = 5)
tennis$tourney_level_ord <- as.numeric(tourney_mapping[tennis$tourney_level])
tennis$w_bpSave_ratio <- tennis$w_bpSaved / tennis$w_bpFaced
tennis$l_bpSave_ratio <- tennis$l_bpSaved / tennis$l_bpFaced
# View the combined dataset
head(tennis)
dim(tennis)
sum(is.na(tennis$minutes))
tennis <- tennis %>%
filter(!is.na(minutes) & minutes>0)
# Subset the rows where winner_ht is NA
#missing_height_players <- tennis %>%
#filter(is.na(winner_ht)) %>%
#distinct(winner_name)
#table(tennis$winner_hand)
sum(is.na(tennis$minutes))
sum(is.na(tennis$winner_rank))
sum(is.na(tennis$loser_rank))
sum(is.na(tennis$winner_ht))
sum(is.na(tennis$loser_ht))
tennis %>% filter(is.na(loser_ht))
```
```{r , message=FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Create a long-format dataset for both winners and losers
tennis_long <- tennis %>%
filter(!is.na(minutes)) %>%#& !is.na(winner_rank) & !is.na(loser_rank))%>%
mutate(
win = 1,
player_id = winner_id,
player_name = winner_name,
player_seed = winner_seed,
player_entry = winner_entry,
player_hand = winner_hand,
player_ht = winner_ht,
player_ioc = winner_ioc,
player_age = winner_age,
aces = w_ace,
df = w_df,
svpt = w_svpt,
first_in = w_1stIn,
first_won = w_1stWon,
second_won = w_2ndWon,
svgms = w_SvGms,
bp_saved = w_bpSaved,
bp_faced = w_bpFaced,
rank = winner_rank,
rank_points = winner_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
) %>%
bind_rows(
# Create rows for the loser
tennis %>%
mutate(
win = 0,
player_id = loser_id,
player_name = loser_name,
player_seed = loser_seed,
player_entry = loser_entry,
player_hand = loser_hand,
player_ht = loser_ht,
player_ioc = loser_ioc,
player_age = loser_age,
aces = l_ace,
df = l_df,
svpt = l_svpt,
first_in = l_1stIn,
first_won = l_1stWon,
second_won = l_2ndWon,
svgms = l_SvGms,
bp_saved = l_bpSaved,
bp_faced = l_bpFaced,
rank = loser_rank,
rank_points = loser_rank_points,
score = score,
tourney_id = tourney_id,
tourney_name = tourney_name,
surface = surface,
draw_size = draw_size,
tourney_level = tourney_level,
tourney_date = tourney_date,
match_num = match_num
) %>%
select(
tourney_id, tourney_name, surface, draw_size, tourney_level, tourney_date,
match_num, player_id, player_seed, player_entry, player_name, player_hand,
player_ht, player_ioc, player_age, score, rank, rank_points, aces, df, svpt,
first_in, first_won, second_won, svgms, bp_saved, bp_faced, win
)
)
# Create new columns for year, month, and day
tennis_long <- tennis_long %>%
mutate(
match_year = substr(tourney_date, 1, 4), # Extract the first 4 characters as the year
match_month = substr(tourney_date, 5, 6) # Extract the 5th and 6th characters as the month
)
tennis_long <- tennis_long %>%
rename(
player_height = player_ht,
double_faults = df,
player_country = player_ioc,
serve_points = svpt,
first_serves=first_in,
first_serves_points_won=first_won,
second_serves_points_won=second_won,
serve_games=svgms,
break_points_saved=bp_saved,
break_points_faced=bp_faced
)
# Convert Win to a factor with appropriate labels
tennis_long <- tennis_long |>
mutate(
win = factor(win, levels = c(0, 1), labels = c("Loss", "Win")),
rank = if_else(is.na(rank), 0, rank),
rank_points = if_else(is.na(rank_points), 0, rank_points)
)
tennis_long <- tennis_long %>%
filter(!is.na(aces) & !is.na(player_age) & !is.na(serve_games))
# Get the number of NAs in each column
#na_count <- sapply(tennis_long, function(x) sum(is.na(x)))
# Display the result
#na_count
#dim(tennis_long)
#tennis_long %>%
#202
sum(is.na(tennis_long$player_height))
sum(is.na(tennis_long$serve_games))
sum(is.na(tennis_long$draw_size))
sum(is.na(tennis_long$winner_rank))
sum(is.na(tennis_long$draw_size))
```
### Overview of key variables of interest
```{r, message=FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
tennis_long %>%
group_by(win) %>%
dplyr::summarize(count = n())
tennis %>%
group_by(surface) %>%
dplyr::summarize(count = n(), proportion = n()/116050)
```
The dataset shows a fairly balanced distribution of players who win (101,746) versus those who lose (111,621), indicating minimal information gain from either outcome. Regarding surface types, most matches are played on Hard (53.5%) or Clay (41.1%) courts, with Grass accounting for 5% and Carpet only 0.2%. The distribution of key continuous variables is examined below. Missing values are present in 9% of the minutes variable and 2% of the aces variable, primarily corresponding to canceled or rescheduled matches.
```{r, warning=FALSE, echo=FALSE}
summary_table <- data.frame(
Variable = c("minutes", "rank", "aces"),
Min = round(c(min(tennis$minutes, na.rm = TRUE), min(tennis_long$rank, na.rm = TRUE), min(tennis_long$aces, na.rm = TRUE)), 2),
Mean = round(c(mean(tennis$minutes, na.rm = TRUE), mean(tennis_long$rank, na.rm = TRUE), mean(tennis_long$aces, na.rm = TRUE)), 2),
Median = round(c(median(tennis$minutes, na.rm = TRUE), median(tennis_long$rank, na.rm = TRUE), median(tennis_long$aces, na.rm = TRUE)), 2),
`Standard Deviation` = round(c(sd(tennis$minutes, na.rm = TRUE), sd(tennis_long$rank, na.rm = TRUE), sd(tennis_long$aces, na.rm = TRUE)), 2),
Max = round(c(max(tennis$minutes, na.rm = TRUE), max(tennis_long$rank, na.rm = TRUE), max(tennis_long$aces, na.rm = TRUE)), 2)
)
#`Q1` = c(quantile(tennis$minutes, 0.25, na.rm = TRUE), quantile(tennis_long$rank, 0.25, na.rm = TRUE), quantile(tennis_long$aces, 0.25, na.rm = TRUE)),
#`Q3` = c(quantile(tennis$minutes, 0.75, na.rm = TRUE), quantile(tennis_long$rank, 0.75, na.rm = TRUE), quantile(tennis_long$aces, 0.75, na.rm = TRUE)),
kable(summary_table, caption = "Summary Statistics for Variables")
```
### **Research question 1: Effects of the difference in ranking over the length in minutes for a tennis match**
```{r, message=FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Count the number of NAs in each column
missing_summary <- tennis %>%
summarise(across(everything(), ~ sum(is.na(.))))
missing_summary
# Summary statistics of the variables
summary(tennis)
# Creating new binary indicators
tennis <- tennis %>%
mutate(
winner_seeded = ifelse(is.na(winner_seed), 0, 1),
loser_seeded = ifelse(is.na(loser_seed), 0, 1)
)
# Calculate the percentage of missing values in rank_diff
percent_missing_minutes<- sum(is.na(tennis$minutes)) / nrow(tennis) * 100
# Print the result
print(percent_missing_minutes)
```
```{r ,warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Player Performance Metrics: Create performance difference metrics (e.g., difference in aces, rank, and rank points):
tennis <- tennis %>%
mutate(
diff_aces = abs(w_ace - l_ace),
diff_rank = abs(loser_rank - winner_rank),
diff_rank_points = abs(loser_rank_points - winner_rank_points)
)
# Categorical Variables: Convert categorical variables like surface and tourney_level to factors:
tennis <- tennis %>%
mutate(
surface = as.factor(surface),
tourney_level = as.factor(tourney_level)
)
# Cap or remove extreme values in minutes, winner_ht, and loser_ht:
tennis <- tennis %>%
mutate(
minutes = ifelse(minutes > 300, 300, minutes),
winner_ht = ifelse(winner_ht > 210, 210, winner_ht),
loser_ht = ifelse(loser_ht > 210, 210, loser_ht)
)
#winner_seed
#loser_seed
#15720
```
```{r, message = FALSE, warning=FALSE, echo=FALSE, results = 'hide'}
#imputation
library(mice)
library(sjlabelled)
library(tidyverse)
# Exclude columns and check data
tennis$surface <- factor(tennis$surface, levels = c("Carpet", "Clay", "Grass", "Hard"))
tennis$surface <- relevel(tennis$surface, ref = "Carpet")
tennis.sub <- tennis %>%
select(minutes , diff_rank , diff_aces , winner_ht , loser_ht , surface ,
tourney_level_ord, tourney_level , winner_age , loser_age , w_df , l_df ,
w_bpSave_ratio , l_bpSave_ratio) # Exclude irrelevant columns
tennis.sub <- unlabel(tennis.sub) #unlabel the data (labels cause problem for the mice function)
tennis.imp <- mice(tennis.sub, m=5, method="pmm", print=FALSE, seed = 123)
tennis.comp <- complete(tennis.imp, "long", include=TRUE) #stack the imputed values into one variable and include he observed values
tennis.comp$winner_ht.comp <- cci(tennis$winner_ht) #create an indicator for completeness
glimpse(tennis.imp)
#ggplot(tennis.comp, aes(x=factor(.imp), y=winner_ht, fill=winner_ht.comp))+
# geom_boxplot()
```
Match duration in tennis is influenced by a variety of factors, including the interaction between the tournament level and rank difference, as well as player performance, court surface, and physical characteristics. During model development, we assessed multicollinearity among predictors using the Variance Inflation Factor (VIF). High collinearity was identified between the breakpoints saved and breakpoints faced for winners and losers. To address this issue and improve model stability, these variables were combined into ratios of breakpoints saved and ratios of breakpoints faced. This transformation not only reduced redundancy but also provided a clearer representation of the relative performance between winners and losers during matches.
When considering the interaction between rank difference and tournament level, the effect varies depending on competitiveness. At level C (Intermediate), the interaction term is significant, with match duration increasing slightly as rank disparity grows, approximately 0.01%. In contrast, at level G (Advanced Recreational), the interaction term is also significant but negative, showing that match duration decreases by 0.01% as rank disparity increases. These results indicate that at intermediate levels, skill differences can prolong matches due to competitive balance, while at higher levels like G, greater rank disparities lead to quicker resolutions due to clearer dominance. For levels D (Beginner), F (Competitive), and M (Professional), the interaction terms are not significant, suggesting that rank disparity does not meaningfully influence match duration at these levels.
Differences in player performance, such as aces, also influence match duration. For every additional ace difference between players, match duration increases by 1% (95% CI: 0.9%, 1.1%, p < 0.001). This indicates that strong serving performances often lead to prolonged matches, as players must engage in more games and sets. Our study also highlights how service errors extend gameplay, likely due to increased rally lengths or additional points. Each additional double fault by the winner increases match length by approximately 5.1% (95% CI: 5.0%, 5.2%, p < 0.001).
Furthermore, the ratio of breakpoints saved by players shows contrasting effects. Winners who save more breakpoints reduce match duration by 12.6% (95% CI: -13.3%, -11.9%, p < 0.001), likely due to their ability to close out critical points effectively. On the other hand, losers who save more breakpoints increase match length by 48.1% (95% CI: 47.2%, 49.1%, p < 0.001). This demonstrates how defensive efforts by losers prolong gameplay, as they manage to stave off losing points but fail to secure victory. The court surface plays a pivotal role in match duration. Matches on clay courts last significantly longer, with duration increasing by 14% (95% CI: 10.1%, 17.8%, p < 0.001). However other types of surfaces have shown no significant impact on length of the match.
The overall model adjusted R-squared is 0.31.
```{r, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# try with log transformed minutes
# for (i in 1:5) {
# tennis.imp$imp$minutes[[i]] <- ifelse(tennis.imp$imp$minutes[[i]] == 0, 0.1, tennis.imp$imp$minutes[[i]])
# }
imp.log_mods <- with(tennis.imp, lm(log(minutes) ~ diff_rank*tourney_level + diff_aces + winner_ht + loser_ht +
surface + winner_age + loser_age + w_df + l_df +
w_bpSave_ratio + l_bpSave_ratio))
summary(pool(imp.log_mods))
```
```{r, warning=FALSE, echo=FALSE}
library(dplyr)
library(broom)
library(knitr)
# Tidy the pooled model
tidy_model <- broom::tidy(pool(imp.log_mods))
# Manually compute confidence intervals
z_value <- 1.96 # For 95% CI
formatted_results <- tidy_model %>%
filter(term != "(Intercept)") %>% # Exclude the intercept
mutate(
Lower_CI = estimate - z_value * std.error,
Upper_CI = estimate + z_value * std.error,
`95% CI` = paste0(round(Lower_CI, 2), ", ", round(Upper_CI, 2)) # Combine CI
) %>%
dplyr::select(term, estimate, std.error, `95% CI`, p.value) %>%
rename(
Predictors = term,
Coefficient = estimate,
`Standard Error` = std.error,
`P-Value` = p.value
)
# Exclude the intercept row
#formatted_results <- formatted_results[formatted_results$term != "(Intercept)", ]
# Display the table with kable
formatted_results %>%
mutate(
Coefficient = round(Coefficient, 4),
`Standard Error` = round(`Standard Error`, 2),
#`P-Value` = round(`P-Value`, 2)
`P-Value` = ifelse(`P-Value` <0.001, "<0.001", round(`P-Value`, 3)) # Change p-value of 0 to <0.001
) %>%
knitr::kable(
caption = "Linear Regression Model Results",
digits = 4,
format = "markdown",
align = c("l", "r", "r", "r", "r") #, "r"
)
```
```{r, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Fit the models on each imputed dataset
model_list <- with(tennis.imp, lm(log(minutes) ~ diff_rank * tourney_level +
diff_aces + winner_ht + loser_ht +
surface + winner_age + loser_age +
w_df + l_df + w_bpSave_ratio + l_bpSave_ratio))
# Extract R-squared for each model
r2_values <- sapply(model_list$analyses, function(mod) summary(mod)$r.squared)
# Average the R-squared values
mean_r2 <- mean(r2_values)
mean_r2
```
It is important to examine the underlying assumptions of our model to ensure that the regression estimates are both reliable and valid, providing meaningful insights into the data. The model assumptions is assessed by residual vs. fitted plot. The log transformation was applied to our target variable (minutes) in order to reduce skewness, stabilize variance, and improve the normality of the residuals, ensuring the assumptions of linear regression are better met. Even though, we observe slight deviations in the normality curve and a pattern that suggests non linearity, the assumptions are not severely violated, which is acceptable for reliable linear regression estimates.
```{r, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
par(mfrow = c(2,2))
# Pool the results of the imputed models
pooled_model <- pool(imp.log_mods)
# Extract residuals and fitted values from each imputed model
residuals_list <- lapply(1:5, function(i) {
model <- with(complete(tennis.imp, i), lm(log(minutes) ~ diff_rank*tourney_level + diff_aces + winner_ht + loser_ht +
surface + winner_age + loser_age + w_df + l_df +
w_bpSave_ratio + l_bpSave_ratio))
data.frame(
residuals = residuals(model),
fitted = fitted(model),
imputation = i
)
})
# Combine the residuals and fitted values
residuals_combined <- do.call(rbind, residuals_list)
# Create diagnostic plots using ggplot2
ggplot(residuals_combined, aes(x = fitted, y = residuals, color = factor(imputation))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Residuals vs Fitted Values", x = "Fitted Values", y = "Residuals") +
theme_minimal()
ggplot(residuals_combined, aes(sample = residuals, color = factor(imputation))) +
stat_qq() +
stat_qq_line() +
labs(title = "Normal Q-Q Plot", x = "Theoretical Quantiles", y = "Sample Quantiles") +
theme_minimal()
```
```{r, fig.height=8, fig.width=10, warning=FALSE, echo=FALSE}
# Extract completed data frame from mice object
tennis_imp_df <- complete(tennis.imp, 1) # Choose the first imputed dataset
# Plot with the extracted data
ggplot(tennis_imp_df, aes(x = diff_rank, y = minutes, color = tourney_level)) +
geom_point(alpha = 0.5) +
facet_wrap(~ tourney_level) +
theme_minimal() +
labs(
title = "Scatterplot of Rank Difference vs Match Duration by Tournament Level",
x = "Rank Difference",
y = "Match Duration (Minutes)",
color = "Tournament Level"
) +
theme(
plot.title = element_text(hjust = 0.5),
legend.position = "bottom"
)
```
This scatterplot shows the relationship between Rank Difference (x-axis) and Match Duration in Minutes (y-axis), faceted by Tournament Level. In each tournament level (A, C, D, F, G, M), Rank Difference is skewed towards lower values (concentrated near 0), indicating that matches are typically played between players with similar ranks. Plus, Matches with small rank differences (competitive matches) tend to have longer durations.
### **Research question 2: Aces and court surface type influence in match outcome**
```{r, message = FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
conflicts_prefer(dplyr::summarize)
# Convert tourney level to ordinal variable:
## 0 = D (Davis Cup): Team competition, often less prestigious on an individual level.
## 1 = C (Challengers): Lower-tier tournaments below the main ATP Tour.
## 2 = A (Tour-Level Events): Regular ATP events not part of Masters 1000s, Grand Slams, or Finals.
## 3 = M (Masters 1000s): High-prestige, top-tier ATP tournaments after Grand Slams.
## 4 = G (Grand Slams): The most prestigious tournaments in tennis (Australian Open, French Open, Wimbledon, US Open).
## 5 = F (Tour Finals and Other Season-Ending Events): Exclusive tournaments like the ATP Finals, featuring only the top-ranked players of the season.
tourney_mapping <- c("D" = 0, "C" = 1, "A" = 2, "M" = 3, "G" = 4, "F" = 5)
tennis_long$tourney_level_ord <- as.numeric(tourney_mapping[tennis_long$tourney_level])
# Use VIF to assess multicollinearity amongst all viable variables
# 1. Create a model with no interaction terms and all viable variables
tennis_mod_no_interaction <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + serve_points + first_serves + first_serves_points_won + second_serves_points_won + break_points_saved + break_points_faced + aces + surface,
data=tennis_long,
family="binomial")
vif(tennis_mod_no_interaction)
# 1.1. Results: High VIF score for serve_points, first_serves, first_serves_points_won, second_serves_points_won, break_points_saved, break_points_faced.
# 1.2. Combine/create ratios
tennis_long$first_serve_win_ratio = tennis_long$first_serves_points_won/tennis_long$serve_points
tennis_long$second_serve_win_ratio = tennis_long$second_serves_points_won/(
tennis_long$serve_points - tennis_long$first_serves)
tennis_long$break_pt_save_ratio = tennis_long$break_points_saved/tennis_long$break_points_faced
##tennis[71196,] l_ serve counts do not make sense. Remove that column from tennis_long
tennis_long <- tennis_long[!is.infinite(tennis_long$second_serve_win_ratio), ]
# 2. Calculate VIF for new variables
mod_new_var <- glm(win ~ draw_size + tourney_level_ord + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface,
data=tennis_long,
family="binomial")
vif(mod_new_var)
# 2.1. RESULTS: draw_size and tourney_level_ord have high VIF scores
tennis_long %>%
group_by(tourney_level_ord) %>%
summarize(mean = mean(draw_size), min = min(draw_size), max = max(draw_size))
# 2.2. Remove tourney_level as more granular info comes from draw_size
# 3. Final model
mod_limit <- glm(win ~ draw_size + player_hand + player_height + player_age + rank + rank_points + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio + aces + surface, data=tennis_long, family="binomial")
vif(mod_limit)
tennis_long %>%
group_by(tourney_level_ord) %>%
summarize(mean = mean(draw_size), min = min(draw_size), max = max(draw_size))
```
To structure this model the relevant variables in related literature as well as the variables of interest along with the interaction term of the type of surface were included. To evaluate multicollinearity, the VIF score was used on an initial logistic regression model including all the variables selected. The raw variables representing a player's total serve points, number of first serve points made, number of first serve points won, number of second serve points won, number of break points faced, number of break points saved, total draw size in the tournament and the tournament level were all highly correlated. It was found that total number of serves attempted, first serves attempted, first serve points won and second serve points one were highly correlated with VIF scores ranging from 9 to 72.
To continue to capture this information, but limit multicollinearity, we combine these into the total first serve points won ratio which is a ratio of the total first serve points won to the total first serves attempted. Similarly, the total second serve points won ratio is the ratio of the total second serve points won to the difference between the total serves attempted and the total first serves attempted. The new VIF scores for these two ratios were less than 1.3 Similarly, the break points saved ratio is used in place of the overall counts. Lastly, between draw size and tournament level, draw size is only used as there is more granular information derived from draw size than tournament level.
```{r, message = FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
# Model based on above identified columns and lasso regularization to limit variables and prevent overfitting:
#tennis_long_clean <- tennis_long[complete.cases(tennis_long[, c("win", "draw_size", "player_hand", "player_height",
# "player_age", "rank", "rank_points", "double_faults",
# "first_serve_win_ratio", "second_serve_win_ratio",
# "break_pt_save_ratio", "aces", "surface")]), ]
tennis_long_clean <- tennis_long[, c("win", "draw_size", "player_hand", "player_height",
"player_age", "rank", "rank_points", "double_faults",
"first_serve_win_ratio", "second_serve_win_ratio",
"break_pt_save_ratio", "aces", "surface")]#), ]
formula = win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + first_serve_win_ratio + second_serve_win_ratio + break_pt_save_ratio
# Recreate the design matrix and target variable with the cleaned data
X <- model.matrix(formula, data = tennis_long_clean)[, -1] # Remove the intercept column
y <- tennis_long_clean$win
# Fit the logistic regression model with Lasso regularization using glmnet
#lasso_model <- glmnet(X, y, family = "binomial", alpha = 1)
# Print the model details
#print(lasso_model)
# You can use cv.glmnet for cross-validation to find the best lambda
#cv_lasso_model <- cv.glmnet(X, y, family = "binomial", alpha = 1)
# Plot the cross-validation results
#plot(cv_lasso_model)
# Get the best lambda (the optimal penalty)
#best_lambda <- cv_lasso_model$lambda.min
#print(best_lambda)
# Fit the model using the best lambda
#final_lasso_model <- glmnet(X, y, family = "binomial", alpha = 1, lambda = best_lambda)
# Print final model coefficients
#print(coef(final_lasso_model))
```
Mentioned in the methodology section of this document, a multiple imputation technique with the mice package was applied to handle non systematic missing data correspondent to the variable height. Presented below is a summary of the final logistic regression model.
```{r, message = FALSE, warning=FALSE, echo=FALSE}
# Load necessary libraries
library(mice)
library(sjlabelled)
library(tidyverse)
library(caret) # For confusionMatrix
library(knitr) # For kable
library(kableExtra)
# Load necessary libraries
library(pROC)
library(ggplot2)
tennis_long_clean <- tennis_long_clean %>%
mutate(break_pt_save_ratio = break_pt_save_ratio*100)
# Exclude columns and check data
tennis_long.sub <- tennis_long_clean %>%
select(win, draw_size , player_hand, player_height, player_age, rank, rank_points, aces, surface, double_faults, break_pt_save_ratio) # Exclude irrelevant columns
tennis_long.sub$player_hand <- as.factor(tennis_long.sub$player_hand)
tennis_long.sub$surface <- as.factor(tennis_long.sub$surface)
tennis_long.sub <- unlabel(tennis_long.sub) #unlabel the data (labels cause problem for the mice function)
tennis_long.imp <- mice(tennis_long.sub, m=5, method="pmm", print=FALSE, seed = 123)
tennis_long.comp <- complete(tennis_long.imp, "long", include=TRUE) #stack the imputed values into one variable and include the observed values
tennis_long.comp$player_height.comp <- cci(tennis_long_clean$player_height) #create an indicator for completeness
#ggplot(tennis_long.comp, aes(x=factor(.imp), y=player_height, fill=player_height.comp))+
# geom_boxplot()
imp.mods_long <- with(tennis_long.imp, glm(win ~ draw_size + player_hand+ player_height + player_age + rank + rank_points + aces*surface + double_faults + break_pt_save_ratio,
family="binomial"))
coef_summary <- summary(pool(imp.mods_long))
# Extract key information for the table
coefficients <- coef_summary$estimate
std_errors <- coef_summary$std.error
exp_coefficients <- exp(coefficients) # Calculate odds ratios
lower_ci <- exp(coefficients - 1.96 * std_errors) # Lower bound of 95% CI
upper_ci <- exp(coefficients + 1.96 * std_errors) # Upper bound of 95% CI
p_values <- coef_summary$p.value
# Combine coefficients, standard errors, and split confidence intervals into separate columns
coeff_df <- data.frame(
term = coef_summary$term,
coefficient = round(coefficients, 2),
std_error = round(std_errors, 2), # Add standard error
exp_coefficient = round(exp_coefficients, 2),
#lower_ci = round(lower_ci, 2), # Separate lower CI
#upper_ci = round(upper_ci, 2), # Separate upper CI
ci = paste0(round(lower_ci, 2), ",", round(upper_ci, 2)), # Combine CI into one column
#p_value = round(p_values, 2),
p_value = ifelse(p_values <0.001, "<0.001", round(p_values, 3)),
stringsAsFactors = FALSE
)
# Exclude the intercept row
coeff_df <- coeff_df[coeff_df$term != "(Intercept)", ]
# Rename columns for clarity
colnames(coeff_df) <- c("Variable", "Coefficient", "Standard Error", "Odds Ratio", "95% CI", "P-value") #"Upper 95% CI",
# Create the table and align all columns to the right
merged_table <- kable(
coeff_df,
caption = "Logistic Regression Coefficients, Odds Ratios, Confidence Intervals, Standard Errors, and P-values",
align = c("l", "r", "r", "r", "r", "r"), #, "r"
escape = TRUE # Escape LaTeX-sensitive characters
) %>%
kable_styling(full_width = FALSE)
merged_table
```
Analyzing the variables of interest, aces and surface, considering that the variable aces is statistically significant while the interation terms are not, we have a combined effect of 1.11 times increase in odds of winning for every extra ace point when the match is disputed on a clay surface, with a combined confidence interval between 1.03 and 1.19. Similarly, for matches played on Grass the model has a combined effect increasing the odds of winning by 1.07 times. The impact of every extra ace point while playing on hard surfaces represent a combine effect of 1.09 increase in the odds of winning the match.
Another statistically significant variable impacting the odds of winning is the referred to right vs left-handed player. Specifically being Right or Left Handed (as opposed to ambidextrous, the base level) is associated with approximately 1.27 times and 1.29 times increase in odds of winning respectively. Furthermore, the break point serve ration explains an effect of 1.03 times the increase in odds of winning, holding the other variables constant. Finally, statistically significant variables related to the player attributes showed that an increase of a year for a player's age reduces the odds of winning by 2%, while for every extra inch for player the odds of winning reduce by 2%.
```{r, message=FALSE, warning=FALSE,fig.width=5,fig.height=3}
# Load necessary libraries
library(mice)
library(sjlabelled)
library(tidyverse)
library(caret) # For confusionMatrix
library(knitr) # For kable
library(kableExtra)
# Load necessary libraries
library(pROC)
library(ggplot2)
# Generate predictions for each imputed dataset
pred_probs <- complete(tennis_long.imp, "all") %>%
lapply(function(data) {
glm_model <- glm(win ~ draw_size + player_hand + player_height + player_age +
rank + rank_points + aces * surface + double_faults + break_pt_save_ratio,
family = "binomial", data = data)
predict(glm_model, newdata = data, type = "response")
})
# Aggregate predictions by averaging across imputations
mean_pred_probs <- Reduce("+", pred_probs) / length(pred_probs)
# Step 3: Compute ROC and AUC using original win labels
roc_curve <- roc(tennis_long_clean$win, mean_pred_probs)
# Step 4: Plot the ROC curve
roc_data <- data.frame(
FPR = 1 - roc_curve$specificities,
TPR = roc_curve$sensitivities
)
auc_value <- auc(roc_curve)
auc_text <- paste("AUC:", round(auc_value, 2))
ggplot(roc_data, aes(x = FPR, y = TPR)) +
geom_line(color = "#006400", size = 1) +
geom_abline(linetype = "dashed", color = "green") +
annotate("text", x = 0.6, y = 0.2, label = auc_text, size = 5, color = "black") +
labs(title = "ROC Curve", x = "False Positive Rate", y = "True Positive Rate") +
theme_minimal()
```
```{r, message = FALSE, warning=FALSE, echo=FALSE, include=FALSE, results = 'hide'}
library(caret) # For confusionMatrix
library(knitr) # For kable
library(kableExtra)
# Generate predictions
# Generate predictions for each imputed dataset
pred_probs <- complete(tennis_long.imp, "all") %>%
lapply(function(data) {
glm_model <- glm(win ~ draw_size + player_hand + player_height + player_age +
rank + rank_points + aces * surface + double_faults + break_pt_save_ratio,
family = "binomial", data = data)
predict(glm_model, newdata = data, type = "response")
})
# Aggregate predictions by averaging across imputations
mean_pred_probs <- Reduce("+", pred_probs) / length(pred_probs)
pred_binary <- ifelse(mean_pred_probs > 0.5, 1, 0)
pred_binary <- factor(pred_binary, levels = c(0, 1), labels = c("Loss", "Win"))
# Create a confusion matrix
conf_matrix <- confusionMatrix(pred_binary, tennis_long_clean$win)
conf_matrix_table <- conf_matrix$table
conf_matrix_df <- as.data.frame.matrix(conf_matrix_table)
#conf_matrix_df <- cbind(Actual = rownames(conf_matrix_df), conf_matrix_df)
conf_matrix_latex <- conf_matrix_df %>%
kable("latex", booktabs = TRUE,
caption = "Confusion Matrix: Actual vs. Predicted Satisfaction",
align = "c") %>%
kable_styling(latex_options = c("striped", "hold_position"))
# Extract and prepare the metrics for display
metrics <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1 Score", "Specificity", "Sensitivity", "Positive Predictive Value", "Negative Predictive Value"),
Value = c(
conf_matrix$overall['Accuracy'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['F1'],
conf_matrix$byClass['Specificity'],
conf_matrix$byClass['Sensitivity'],
conf_matrix$byClass['Pos Pred Value'],
conf_matrix$byClass['Neg Pred Value']
)
)
# Display the metrics using kable
#metrics
#cat("\nMetrics:\n")
#kable(metrics, format = "markdown", col.names = c("Metric", "Value"))
```
The performance of the logistic regression model was evaluated using standard classification metrics. The model achieved an accuracy of 0.68. The precision of the model was 0.68, reflecting its ability to correctly identify positive cases while minimizing false positives. The recall was measured at 0.70, demonstrating the model's capability to correctly identify a high proportion of actual positive cases. Finally, the F1 score, a harmonic mean of precision and recall, was calculated to be 0.8192, indicating a balanced performance between these two metrics. Together, these results suggest the model performs reliably in predicting match outcomes based on the given features.
Finally, shown below, a plot showing the average of aces achieved in a game is slightly higher in respect to the average of aces by Losers. This supports the model estimates and the slight increase in odds of winning (by 1.1 times) for every additional ace played.
```{r, message = FALSE, warning=FALSE, echo=FALSE,fig.width=5,fig.height=3}
# Create the boxplot with a green gradient color scale
ggplot(tennis_long_clean, aes(x = win, y = aces, fill = surface)) +
geom_boxplot() +
scale_fill_manual(values = c("Clay" = "#006400", "Grass" = "#00FF00", "Hard" = "#228B22")) + # Green shades for surfaces
labs(
title = "Distribution of Aces for Winners vs Losers by Surface",
x = "Match Outcome",
y = "Number of Aces",
fill = "Surface"
) +
theme_minimal() +
theme(legend.position = "right") +
theme(plot.margin = margin(10, 10, 10, 10)) # Adjust margins to make the plot smaller
```
## Conclusion
This study identified key factors influencing tennis match durations, emphasizing the combined effects of tournament level and rank difference. The interaction between these two variables reveals that player ranking disparities influence match duration under specific tournament conditions. In intermediate tournaments (level C), greater rank disparity slightly increases match duration, suggesting that skill differences lead to more extended, balanced matches. Conversely, in advanced tournaments (level G), rank disparity reduces match duration, reflecting a quicker resolution due to clearer dominance in skill. At other levels, the interaction effect is not significant, indicating that rank difference plays a minimal role in determining match duration. Other significant predictors included differences in player performance, such as the number of aces and breakpoints saved, both of which revealed nuanced relationships with match duration. Physical characteristics, such as player height, also contributed, albeit modestly, suggesting a strategic role in gameplay.
For match outcomes, the logistic regression model identified aces, court surface, player handedness, draw size, and breakpoint save ratio as significant predictors. An additional ace increased the odds of winning across surfaces, with the strongest effects observed on clay (1.11 times), followed by hard (1.09 times) and grass (1.07 times). The breakpoint serve ratio also contributed, with an incremental effect of 1.03 times on winning odds. The player's attributes such as height or age showed that every extra unit of increase of the respective variables reduce the odds of winning. The logistic regression model demonstrated robust predictive performance, achieving an accuracy of 68%, a precision of 68%, and a recall of 70%. The F1 score of 0.82 reflects a balanced capacity to minimize false positives and false negatives.
Despite these insights, the study faced limitations. The reliance on imputed data may have introduced bias, while the linear regression model's assumptions, such as normality, were not perfectly met despite transformations. Plus, for the multiple linear model the R squared explained 31% of the variance in match duration (R² = 0.31), leaving room for unaccounted influences such as weather, fatigue or external conditions, potentially leaving room for unexplored contributors to match durations. These limitations may have influenced the precision of the model's estimates
Future research should address these limitations by refining data collection, especially for lower-level tournaments, and incorporating additional variables. By exploring these dimensions, future analyses could provide an even more comprehensive understanding of the factors shaping match durations, aiding in tournament planning and player preparation.
## References
[1] Klaassen, F. J., & Magnus, J. R. (2003). Forecasting the winner of a tennis match. European Journal of Operational Research, 148(2), 257-267.
[2] Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3), 127-138.
[3] Boulier, B. L., & Stekler, H. O. (1999). Are sports seedings good predictors?: an evaluation. International Journal of Forecasting, 15(1), 83-91.
[4] Foley-Train, J. (2014). Sports betting: Commercial and integrity issues. Report prepared for the Association of British Bookmakers, European Gaming and Betting Association, European Sport Security Association and Remote Gambling Association. Retrieved January, 21, 2015.
[5] Štrumbelj, E., & Vračar, P. (2012). Simulating a basketball match with a homogeneous Markov model and forecasting the outcome. International Journal of Forecasting, 28(2), 532-542.
[6] Boulier, B. L., & Stekler, H. O. (1999). Are sports seedings good predictors?: an evaluation. International Journal of Forecasting, 15(1), 83-91.
[7] Lasek, J., Szlávik, Z., & Bhulai, S. (2013). The predictive power of ranking systems in association football. International Journal of Applied Pattern Recognition, 1(1), 27-46.
[8] McHale, I., & Davies, S. (2007). Statistical analysis of the effectiveness of the FIFA world rankings. In Statistical thinking in sports (pp. 89-102). Chapman and Hall/CRC.
[9] Newton, P. K., & Keller, J. B. (2005). Probability of winning at tennis I. Theory and data. Studies in applied Mathematics, 114(3), 241-269.
[10] O'Malley, A. J. (2008). Probability formulas and statistical analysis in tennis. Journal of Quantitative Analysis in Sports, 4(2).
[11] Riddle, L. H. (1988). Probability models for tennis scoring systems. Journal of the Royal Statistical Society Series C: Applied Statistics, 37(1), 63-75.
[12] Kovalchik, S. A. (2016). Searching for the GOAT of tennis win prediction. Journal of Quantitative Analysis in Sports, 12(3), 127-138.
[13] ATP Tour. (n.d.). About the ATP. Retrieved from https://www.atptour.com
[14] Sackmann, J. (n.d.). Tennis databases, files, and algorithms \[Data set\]. Tennis Abstract. Licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Based on a work at https://github.com/JeffSackmann.
[15] Sklenička, J. (2024). Predicting the outcomes of tennis matches. How important is the factor of different surfaces?.