all_t20_internationals_eda_from_(2005_2024).py
# -*- coding: utf-8 -*-
"""All T20 Internationals EDA FROM (2005 - 2024)
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/#fileId=https%3A//storage.googleapis.com/kaggle-colab-exported-notebooks/all-t20-internationals-eda-from-2005-2024-45ff7d5c-82ea-4961-a645-c47bee91292d.ipynb%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com/20240229/auto/storage/goog4_request%26X-Goog-Date%3D20240229T164054Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3Da4da9b4cd9abf8a3f2661d7603ef9408d5f62e9c907a4172a67c203f3a964bbc2da81d28bd5cecf36a1b8e8a46329c651b56bea69a6be0876277957dd5b341023a9944ab00fbe8f2f9f2ec4e730c98ab97238613bdd8a2b4f6ff057af7979583eb04861f178a20b1146d1c62adddcc5986063add96745cf0e6fdd89e3c4f695708d2b35f0df392f5e3698a7f572ae1fcc0236a5a445a7a03b0e98e1b8901fe57e305188f98bd2cd48375222787f695c27d537da42e4c666a0e0ef7b5594ca9d215c0bda4a50cbb75b644da76ad74d91134a3018bb1cd21ca8184fd35ba1ef879a637493fcbfdeb35f18e6310085e1716888bda7538f80334527ec69bb827840a
# <b>Exploratory Data Analysis on Cricket T20 Internationals From 2005-2024</b>
<h1 style="font-family: 'poppins'; font-weight: bold; color: Green;">👨💻Author: Irfan Ullah Khan</h1>
[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/programmarself)
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/programmarself)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/irfan-ullah-khan-4a2871208/)
[![YouTube](https://img.shields.io/badge/YouTube-Profile-red?style=for-the-badge&logo=youtube)](https://www.youtube.com/@irfanullahkhan7748)
[![Facebook](https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook)](https://www.facebook.com/programmar.person.5)
[![TikTok](https://img.shields.io/badge/TikTok-Profile-black?style=for-the-badge&logo=tiktok)](https://www.tiktok.com/@world_changing_words)
[![Twitter/X](https://img.shields.io/badge/Twitter-Profile-blue?style=for-the-badge&logo=twitter)](https://twitter.com/programmarself)
[![Instagram](https://img.shields.io/badge/Instagram-Profile-blue?style=for-the-badge&logo=instagram)](https://www.instagram.com/programmar.person.5/)
[![Email](https://img.shields.io/badge/Email-Contact%20Me-red?style=for-the-badge&logo=email)](mailto:[email protected])
### <b>About Dataset</b>
This extensive dataset serves as a comprehensive repository of historical information regarding T20 International (T20I) cricket matches dating back to the inception of the format. T20I cricket is renowned for its thrilling encounters, and this dataset meticulously documents the particulars of these matches. It stands as a valuable resource for cricket enthusiasts, statisticians, and analysts eager to delve into and dissect T20I cricket data.
<b> Key Features:</b>
- Match Details: Thorough information pertaining to each T20I match, encompassing match date, location, and format.
- Teams and Players: In-depth details about the participating teams, encompassing player names, roles, and batting/bowling statistics.
- Match Outcomes: Insights into match results, encompassing the victorious team and the margin of victory.
- Player of the Match: Recognition of the standout player in each T20I match.
- Umpires and Match Referees: Particulars of the officials responsible for overseeing the match.
- Toss Details: Revelations about the toss winner's decisions, which can significantly influence the game's trajectory.
- Venue Information: Location specifics, including stadium name, city, and country.
# Importing Libraries
"""
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
"""# Loading datasets"""
# All matches records
matches_data = pd.read_csv("/content/t20i_Matches_Data.csv")
# Players information
players_info = pd.read_csv("/content/players_info.csv")
#Batting Stats
batting_data = pd.read_csv("/content/t20i_Batting_Card.csv")
#Bowling Stats
bowling_data = pd.read_csv("/content/t20i_Bowling_Card.csv")
"""### Understanding and cleaning matches data"""
matches_data.head()
matches_data.columns
matches_data.shape
# Renaming columns: replace spaces with underscores and convert to lowercase
new_col = [col_name.replace(" ", "_").lower() for col_name in matches_data.columns.to_list()]
matches_data.columns = new_col
matches_data.columns
matches_data.head()
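"""A vectorized alternative to the renaming step above, sketched on a hypothetical empty frame. The two column names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical frame standing in for matches_data
demo = pd.DataFrame(columns=["Match ID", "Match Venue (Country)"])

# Same transformation as the loop: spaces -> underscores, then lowercase
demo.columns = demo.columns.str.replace(" ", "_", regex=False).str.lower()

print(list(demo.columns))
```

Both forms should give the same result; the string accessor simply avoids building the list by hand.
"""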
#Looking for null values
matches_data.isna().sum()
# Dropping Redundant columns
matches_data.drop(columns=['match_referee', 'umpire_1', 'umpire_2'], inplace=True)
matches_data.head(2)
matches_data.sort_values('t20i_match_no', inplace=True)
matches_data
# Converting date values to datetime format, i.e. Timestamp
matches_data['match_date'] = pd.to_datetime(matches_data['match_date'])
# Finding out time range of data
print(matches_data['match_date'].min())
print(matches_data['match_date'].max())
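"""If some date strings might be malformed, one option is to coerce them to `NaT` and count the failures before trusting the min/max range. A minimal sketch on hypothetical values (the series below is invented, not read from the CSV):

```python
import pandas as pd

# Hypothetical raw values; the second entry is deliberately malformed
raw = pd.Series(["2005-02-17", "not a date", "2024-01-05"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.isna().sum())          # number of entries that failed to parse
print(parsed.min(), parsed.max())   # min and max skip NaT values
```
"""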
"""### Understanding & Cleaning player information data"""
players_info.head()
players_info.shape
players_info.isna().sum()
players_info['gender'].value_counts()
# Dropping Redundant Columns
players_info.drop(columns=['gender', 'image_url', 'image_metadata'], inplace=True)
players_info.head()
"""### Understanding & Cleaning batting stats data"""
batting_data.shape
batting_data.head(20)
batting_data.tail(20)
batting_data.sample(10)
#Checking null values
batting_data.isna().sum()
# Dropping rows with null values in runs, balls, fours, sixes, strikeRate
batting_data.dropna(subset=['runs','balls','fours','sixes', 'strikeRate'], inplace=True)
batting_data.isna().sum()
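"""Before dropping rows in place, it can help to see how many rows a `dropna(subset=...)` call would remove. A small sketch on toy data, assuming nothing about the real row counts:

```python
import pandas as pd

# Toy batting rows; None marks a missing value
demo = pd.DataFrame({
    "batsman": [1, 2, 3],
    "runs": [10, None, 7],
    "balls": [8, None, None],
})

before = len(demo)
# A row is dropped if ANY of the listed columns is missing (how="any" is the default)
cleaned = demo.dropna(subset=["runs", "balls"])
removed = before - len(cleaned)

print(removed)
```
"""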
batting_data
batting_data.rename(columns= {'Match ID' : 'match_id'}, inplace=True)
batting_data
"""### Understanding & Cleaning bowling stats data"""
bowling_data
bowling_data.sample(20)
# Checking null values
bowling_data.isna().sum()
# Verifying null rows
bowling_data[bowling_data['team'].isna()]
# Dropping these rows
bowling_data.dropna(subset=['team'], inplace=True)
bowling_data.isna().sum()
bowling_data[bowling_data['dots'].isna()]
# No need to drop these
bowling_data.rename(columns= {'Match ID' : 'match_id', 'bowler id' : 'bowler_id'}, inplace=True)
bowling_data
"""# <b>Performing Exploratory Data Analysis (EDA) to find useful insights and answer analytical questions</b>
# Which country has hosted the most number of matches?
"""
matches_data.head(2)
matches_data['match_venue_(country)'].value_counts().nlargest(10)
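"""A minimal sketch, on toy data, of deriving a most-to-fewest category order from `value_counts()`. The resulting list could be passed to `sns.countplot` through its `order` argument so that the largest host is drawn first. The venue values are placeholders.

```python
import pandas as pd

# Toy stand-in for matches_data; the venue values are illustrative only
demo = pd.DataFrame({"match_venue_(country)": ["B", "A", "B", "C", "B", "A"]})

# value_counts() sorts by frequency in descending order by default
counts = demo["match_venue_(country)"].value_counts()

# Category order from most to fewest matches
order = counts.index.tolist()
print(order)
```
"""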
# Setting figure size
plt.figure( figsize = (15,5) )
# Setting figure style or theme
sns.set_style('darkgrid')
# Plotting chart
ax = sns.countplot(data=matches_data, x='match_venue_(country)', palette='hls')
# Setting labels with each bar count
for container in ax.containers:
    ax.bar_label(container, label_type="edge", padding=1, size=9, color="black")
# Customizations
plt.tick_params('x', rotation=90)
plt.xlabel("Match venues (countries)")
plt.ylabel("Matches")
plt.title("No. of matches hosted by countries")
plt.yticks([0,25,50,75,100,125,150,175,200,225])
# Show chart
plt.show()
"""<b> From the chart, we can see that the United Arab Emirates has hosted the most T20I matches so far, i.e. 236 </b>
# Top 10 players with most man-of-the-match awards
"""
matches_data.head(2)
# Grouping the dataframe by the 'mom_player' column and counting the number of awards per player
motm_players = matches_data.groupby('mom_player')[['match_id']].count().rename(columns={'match_id' : 'awards'})
# Merging the result dataframe with player info on each unique player id
motm_with_names = motm_players.merge(players_info, left_index=True, right_on='player_id')
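"""`merge` performs an inner join by default, so any award winner whose id is missing from the players table would drop out silently. One way to check for that, sketched on made-up ids and names, is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy award counts indexed by player id (ids and names are made up)
awards = pd.DataFrame({"awards": [3, 1]}, index=pd.Index([101, 999], name="mom_player"))
players = pd.DataFrame({"player_id": [101, 102], "player_name": ["Player A", "Player B"]})

# A left merge keeps every award row; the _merge column flags ids
# that have no match in the players table
checked = awards.reset_index().merge(
    players, left_on="mom_player", right_on="player_id", how="left", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["mom_player", "awards"]])
```
"""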
# Setting size and theme
plt.figure(figsize=(12,6))
sns.set_style('whitegrid')
# Plotting by 10 largest no. of awards
ax = sns.barplot(x='player_name', y='awards', data=motm_with_names.nlargest(10,'awards'), palette='viridis')
# Customizations
ax.tick_params('x', rotation=45)
ax.set_title("Top 10 players with most MotM awards in T20Is", fontweight = 'bold', fontsize = 15)
ax.set_xlabel('Players', fontweight = 'bold', fontsize = 13)
ax.set_ylabel('No. of awards', fontweight = 'bold', fontsize = 13)
# Setting bar label on each container
for container in ax.containers:
    ax.bar_label(container, padding=-20, size=12, color='white')
"""<b> Virat Kohli has won the most Man-of-the-Match awards, followed by Mohammad Nabi and Sikandar Raza</b>
# Highest wicket takers in T20 Internationals
"""
bowling_data.head()
# Grouping data by bowler ids and aggregating balls, wickets and economy
bowlers_record = bowling_data.groupby('bowler_id')[['balls', 'wickets', 'economy']].agg(
{'balls' : 'sum',
'wickets' : 'sum',
'economy' : 'mean'})
# Extracting top 10 bowlers by wickets
top_10_bowlers = bowlers_record.nlargest(10, 'wickets')
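"""Note that the mean of per-match economy values is not the same quantity as a career economy rate (total runs conceded per over bowled), because short spells weigh as much as long ones. A small worked sketch follows; the `conceded` column name is an assumption for illustration and is not confirmed by the dataset.

```python
import pandas as pd

# Two hypothetical spells by one bowler; "conceded" (runs given away)
# is an assumed column name
spells = pd.DataFrame({
    "bowler_id": [1, 1],
    "balls": [24, 6],
    "conceded": [20, 15],
})
spells["economy"] = spells["conceded"] / (spells["balls"] / 6)  # runs per over

grouped = spells.groupby("bowler_id")
# Unweighted mean of the per-spell economies: (5.0 + 15.0) / 2
mean_of_spells = grouped["economy"].mean()
# Career economy: 35 runs over 30 balls, i.e. 5 overs
career = grouped["conceded"].sum() / (grouped["balls"].sum() / 6)

print(mean_of_spells.loc[1], career.loc[1])
```

Which figure is more appropriate depends on the question being asked; the sketch only shows that the two can differ.
"""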
# Merging dataframe with players info
top_10_bowlers_names = top_10_bowlers.merge(players_info, left_index=True, right_on='player_id')
# Setting size and theme
plt.figure(figsize=(8,5))
sns.set_style('darkgrid')
# Plotting with balls bowled on x-axis and economy on y-axis
ax = sns.scatterplot(x='balls', y='economy', data=top_10_bowlers_names, size='wickets', hue='player_name', sizes=(100,200))
# Customizing legend
plt.legend(bbox_to_anchor=(1.07, 1), loc='upper left', borderaxespad=0)
# Annotating points on the plot with wickets and economy
for lab, row in top_10_bowlers_names.iterrows():
    ax.annotate(f"W: { int(row['wickets']) }\nE: { round(row['economy'],2) }", xy=(row['balls']+15, row['economy']-0.025))
# Customizing the chart
ax.set_title("Highest wicket takers in T20I", fontweight = 'bold', fontsize=15)
ax.set_xlabel("Balls bowled", fontweight = 'bold')
ax.set_ylabel("Economy", fontweight = 'bold')
plt.show()
"""<b><li>Among the top 10 wicket-takers, Rashid Khan is the most economical bowler, with 130 wickets.<br>
<b><li>Tim Southee has taken the most wickets so far, but at a high economy rate.
# Highest run scorers in T20 Internationals
"""
batting_data.head()
batsman_records = batting_data.pivot_table(values=['runs', 'balls', 'fours', 'sixes', 'strikeRate', 'isOut'], index='batsman',
aggfunc={'runs': 'sum', 'balls' : 'sum', 'fours' : 'sum', 'sixes' : 'sum', 'strikeRate' : 'mean', 'isOut' : 'sum'})
top_10_batsman = batsman_records.nlargest(10,'runs').reset_index()
top_10_batsman['batting_avg'] = top_10_batsman['runs'] / top_10_batsman['isOut']
top_10_batsman_names = top_10_batsman.merge(players_info, left_on='batsman', right_on='player_id')
top_10_batsman_names
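"""Batting average is runs divided by dismissals, so a batter with zero dismissals would produce a division by zero. That is unlikely among the top 10 run scorers, but if the same formula were applied to all batters, one possible guard is sketched below on toy records (the player keys are placeholders):

```python
import pandas as pd

# Toy aggregated records; the second batter was never dismissed
records = pd.DataFrame({"runs": [120, 45], "isOut": [4, 0]}, index=["p1", "p2"])

# where() keeps positive dismissal counts and turns zeros into NaN,
# so the average is left undefined instead of becoming infinite
dismissals = records["isOut"].where(records["isOut"] > 0)
records["batting_avg"] = records["runs"] / dismissals

print(records)
```
"""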
# Setting size and theme
plt.figure(figsize=(8,5))
sns.set_style('whitegrid')
# Plotting with runs scored on x-axis and strike rate on y-axis
ax = sns.scatterplot(x='runs', y='strikeRate', data=top_10_batsman_names, size='batting_avg',
hue='player_name', sizes=(100,250), palette='Paired')
# Customizing legend
plt.legend(bbox_to_anchor=(1.07, 1), loc='upper left', borderaxespad=0)
# Annotating points on the plot with runs and batting average
for lab, row in top_10_batsman_names.iterrows():
    ax.annotate(f"R: { int(row['runs']) }\nA: {round(row['batting_avg'],2)}", xy=(row['runs']+30, row['strikeRate']-0.5))
# Customizing the chart
ax.set_title("Highest run scorers in T20I", fontweight = 'bold', fontsize=15)
ax.set_xlabel("Runs scored", fontweight = 'bold')
ax.set_ylabel("Strike rate", fontweight = 'bold')
plt.show()
"""<b><li>The highest run scorer is Virat Kohli, with 4008 runs at an average of 52.74.<br><li> Jos Buttler has the highest strike rate.<br><li> Babar Azam has scored 3485 runs at an average of 41.49.
# Who hit most sixes?
"""
batsman_records.head()
# Filtering the top 5 players by sixes and merging the result with the player info dataframe
top_5_sixes = batsman_records.nlargest(5,'sixes')
top_5_sixes_names = top_5_sixes.merge(players_info, left_index=True, right_on='player_id')
# Setting size and theme
plt.figure(figsize=(8,6))
sns.set_style('darkgrid')
# Plotting values and setting title and labels
ax = sns.barplot(x='player_name', y='sixes', data=top_5_sixes_names, palette='husl')
ax.set_title("Most sixes in T20Is", fontweight = 'bold', fontsize=15)
ax.set_xlabel("Players", fontweight = 'bold', fontsize=13)
ax.set_ylabel("Sixes", fontweight = 'bold', fontsize=13)
# Setting label on each bar
for container in ax.containers:
    ax.bar_label(container, padding=-30, fontsize = 17, color='white')
"""<b><li> Rohit Sharma has hit the most sixes so far, i.e. 182
# Most successful teams in T20Is
"""
matches_data.head(2)
# Grouping by match winners and counting them
matches_win_countries = matches_data.pivot_table(values='match_id', index='match_winner', aggfunc='count').reset_index()
# Renaming and extracting 10 largest
top_10_countries = matches_win_countries.rename(columns={'match_id':'matches_won', 'match_winner' : 'country'}).nlargest(10,'matches_won')
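"""The win counts above could also be obtained with `value_counts()`. A sketch on toy data (team names and ids are placeholders) suggesting that both routes skip matches with no recorded winner:

```python
import pandas as pd

# Toy results table; None models a match without a winner
demo = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "match_winner": ["X", "Y", "X", None],
})

# Route 1: pivot_table with a count aggregation, as in the code above
by_pivot = demo.pivot_table(values="match_id", index="match_winner", aggfunc="count")["match_id"]
# Route 2: value_counts(), which drops missing values by default
by_counts = demo["match_winner"].value_counts()

print(by_counts.nlargest(2))
```
"""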
plt.figure(figsize=(10,6))
sns.set_style('darkgrid')
# Plotting
ax =sns.barplot(x='matches_won', y='country', data=top_10_countries, label="Won", palette='rocket')
# Customizations
ax.set_title("Most successful teams in T20Is", fontweight = 'bold', fontsize=15)
ax.set_xlabel("Matches won", fontweight = 'bold', fontsize=13)
ax.set_ylabel("Teams", fontweight = 'bold', fontsize=13)
# Setting label on each bar
for container in ax.containers:
    ax.bar_label(container, padding=-35, fontsize = 17, color='white')
plt.show()
"""# **That's all for now!**"""