Deepak Karawande June 13, 2021
This vignette introduces how to query REST API endpoints from R and perform exploratory data analysis with several R packages. We’ll be using the NHL records REST API endpoints, which are documented at https://gitlab.com/dword4/nhlapi/-/blob/master/records-api.md
The following packages were used to access the REST API endpoints and for exploratory data analysis and presentation.
library("httr")
library("jsonlite")
library("tidyverse")
library("kableExtra")
The NHLAPI project documents REST API endpoints that expose various data points for historical NHL games.
For this project, I accessed 7 different endpoints documented by NHLAPI to fetch information about the NHL:
1. Franchise summary
2. Franchise details
3. Total stats for franchise
4. Season records
5. Skater records
6. Admin history and retired numbers
7. Team stats
The GET function from the httr package was used to fetch data through the REST API. The response was then converted into an R data frame using the content function from httr and the fromJSON function from jsonlite.
# Base url to access NHLAPI
get_baseURL <- function() { return("https://records.nhl.com/site/api") }
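# Helper function to fetch data from a REST API endpoint and parse the JSON body.
# Returns the parsed list; the records are in its `data` element as a data frame.
fetch_DF <- function(api_url) {
  get_response <- GET(api_url)
  # Fail early with a clear error on HTTP 4xx/5xx responses.
  stop_for_status(get_response)
  json_contents <- content(get_response, "text", encoding = "UTF-8")
  list_res <- fromJSON(json_contents, flatten = TRUE)
  return(list_res)
}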
# Get all franchises
get_all_franchise <- function() {
  tab_name <- "franchise"
  full_url <- paste0(get_baseURL(), "/", tab_name)
  franchise_res <- fetch_DF(full_url)
  franchise_res$data
}
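To see what the parsing step does without calling the network, here is a minimal sketch that runs fromJSON on a hand-written payload. The payload is made up for illustration: the `data` element matches what the functions above read, while the `total` field and the sample values are assumptions.

```r
library(jsonlite)

# Hypothetical payload shaped like a records-API response:
# a "data" array of objects plus an assumed "total" count.
sample_json <- '{"data":[{"id":1,"teamCommonName":"Canadiens"},{"id":6,"teamCommonName":"Bruins"}],"total":2}'

parsed <- fromJSON(sample_json, flatten = TRUE)

# The array of objects is simplified into a data frame, one row per object.
franchise_df <- parsed$data
str(franchise_df)
```

Because fromJSON simplifies an array of flat objects into a data frame, the `data` element can be passed straight to dplyr verbs.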
By default, the NHL REST API returns data for all franchises or teams. Using helper functions, I added the ability for the user to provide an id or a name to fetch data for a specific franchise or team. If neither an id nor a name is passed, the query functions return data for all franchises or teams.
# Returns the franchise id. If fran_id is given it is returned as-is; otherwise the
# team common name is matched against fran_name. Returns -1 if the name is not found
# and NA if neither argument is provided.
get_franchise_id <- function(fran_id=NA, fran_name=NA) {
  retValue <- NA
  if (!is.na(fran_id)) {
    return(fran_id)
  } else if (!is.na(fran_name)) {
    # get id from name
    franchise <- get_all_franchise() %>% filter(teamCommonName == fran_name)
    if (nrow(franchise) == 1) {
      retValue <- franchise$id
    } else {
      # we couldn't find the franchise id for the team name specified.
      retValue <- -1
    }
  }
  return(retValue)
}
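The lookup rule can be tried offline with a small stand-in table. This is only a sketch: `franchise_toy` and `lookup_id` are hypothetical names, and the rows are illustrative rather than fetched from the API.

```r
# Toy stand-in for the table returned by get_all_franchise(); values are illustrative only.
franchise_toy <- data.frame(
  id = c(1, 6, 38),
  teamCommonName = c("Canadiens", "Bruins", "Golden Knights"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the lookup rule:
# an explicit id wins, then the name is matched, else NA (no filter) or -1 (not found).
lookup_id <- function(tbl, fran_id = NA, fran_name = NA) {
  if (!is.na(fran_id)) return(fran_id)
  if (is.na(fran_name)) return(NA)
  hit <- tbl$id[tbl$teamCommonName == fran_name]
  if (length(hit) == 1) hit else -1
}

lookup_id(franchise_toy, fran_name = "Bruins")   # 6
lookup_id(franchise_toy, fran_name = "Unknown")  # -1
lookup_id(franchise_toy)                         # NA
```

Returning NA for "no filter requested" and -1 for "name not found" lets the caller tell the two cases apart.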
With the help of the functions above, I created an R function to fetch data from the desired NHL records endpoint using either the franchise id or the team common name. If neither is provided, the API returns data for all franchises. A similar set of functions was used to query the NHL team stats REST API endpoint.
# Get NHL records as a data.frame for a given team (by id or name) or for all teams.
# If the team is not found, an empty data.frame is returned.
get_NHL_records <- function(tab_name, fran_id=NA, fran_name=NA) {
  full_url <- ""
  id_filter <- ""
  id <- get_franchise_id(fran_id, fran_name)
  if (!is.na(id)) {
    if (id != -1) {
      # Fetch information for the given franchise id. The franchise and
      # franchise-detail tables filter on "id"; the others on "franchiseId".
      if (tab_name == "franchise" || tab_name == "franchise-detail") {
        id_filter <- paste0("cayenneExp=id=", id)
      } else {
        id_filter <- paste0("cayenneExp=franchiseId=", id)
      }
      full_url <- paste0(get_baseURL(), "/", tab_name, "?", id_filter)
    } else {
      # Couldn't find a franchise by the name provided.
      return(data.frame())
    }
  } else {
    # fetch franchise information for all teams
    full_url <- paste0(get_baseURL(), "/", tab_name)
  }
  # fetch NHL records
  records_res <- fetch_DF(full_url)
  return(records_res$data)
}
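To inspect the URLs this logic produces without making a request, the URL construction can be separated out. A minimal sketch, where `build_records_url` is a hypothetical helper that is not part of the original code:

```r
base_url <- "https://records.nhl.com/site/api"

# Hypothetical helper: builds the URL only, so it can be checked without a network call.
build_records_url <- function(tab_name, id = NA) {
  url <- paste0(base_url, "/", tab_name)
  if (is.na(id)) return(url)
  # franchise and franchise-detail filter on "id"; the other tables filter on "franchiseId".
  key <- if (tab_name %in% c("franchise", "franchise-detail")) "id" else "franchiseId"
  paste0(url, "?cayenneExp=", key, "=", id)
}

build_records_url("franchise")
# "https://records.nhl.com/site/api/franchise"
build_records_url("franchise-detail", 6)
# "https://records.nhl.com/site/api/franchise-detail?cayenneExp=id=6"
build_records_url("franchise-season-records", 6)
# "https://records.nhl.com/site/api/franchise-season-records?cayenneExp=franchiseId=6"
```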
Using switch, I created a wrapper function that simplifies fetching data from all NHL REST API endpoints relevant to this project.
NHL_wrapper_api <- function(command, fran_id=NA, fran_name=NA, team_id=NA, team_name=NA) {
  result <- switch(
    command,
    "get_franchise" = get_NHL_records("franchise", fran_id, fran_name),
    "get_total_stats" = get_NHL_records("franchise-team-totals", fran_id, fran_name),
    "get_season_records" = get_NHL_records("franchise-season-records", fran_id, fran_name),
    "get_goalie_records" = get_NHL_records("franchise-goalie-records", fran_id, fran_name),
    "get_skater_records" = get_NHL_records("franchise-skater-records", fran_id, fran_name),
    "get_franchise_detail" = get_NHL_records("franchise-detail", fran_id, fran_name),
    "get_stats_for_team" = get_NHL_stats(team_id, team_name),
    # Default branch: unknown command. Only the commands handled above are listed.
    stop(
      "command not found, available commands are:\n",
      paste(
        "get_franchise",
        "get_total_stats",
        "get_season_records",
        "get_goalie_records",
        "get_skater_records",
        "get_franchise_detail",
        "get_stats_for_team",
        sep = ", "
      )
    )
  )
  return(result)
}
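A minimal sketch of how switch() treats an unmatched command: the unnamed final argument acts as the default branch. The function `command_to_table` is a hypothetical, cut-down dispatcher that maps commands to table names instead of calling the API.

```r
# Hypothetical dispatcher: each command maps to a table name; the unnamed last
# argument is the default that switch() falls back to for any other string.
command_to_table <- function(command) {
  switch(
    command,
    "get_franchise" = "franchise",
    "get_total_stats" = "franchise-team-totals",
    stop("command not found: ", command)
  )
}

command_to_table("get_total_stats")
# "franchise-team-totals"

# tryCatch turns the error raised by the default branch into a readable message.
tryCatch(command_to_table("get_everything"), error = function(e) conditionMessage(e))
# "command not found: get_everything"
```

Only the matching branch is evaluated, so an unknown command never triggers a request.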
To explore the data, I started by combining the franchise summary and franchise detail data using inner_join to get a basic tabular view of all franchises. The summary table was rendered using the kable function from knitr.
# Fetch franchise and franchise detail
franchise <- as_tibble(NHL_wrapper_api(command="get_franchise"))
franchise_details <- as_tibble(NHL_wrapper_api(command="get_franchise_detail"))
# Combine franchise and franchise detail using inner join.
franchise_joined <- inner_join(franchise, franchise_details, by="id") %>%
  select(id, heroImageUrl, teamCommonName, active)
# Make image URLs renderable; format only the rows that actually have a URL
# so both sides of the assignment have the same length.
has_image <- !is.na(franchise_joined$heroImageUrl)
franchise_joined$heroImageUrl[has_image] <- sprintf("![](%s){width=100px}", franchise_joined$heroImageUrl[has_image])
# print
franchise_joined %>%
  knitr::kable(caption="Franchise Summary Preview") %>%
  kable_styling()
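When a column contains missing values, a subset assignment behaves most predictably if both sides have the same length. A minimal sketch with made-up URLs (the example.com addresses are placeholders, not real image locations):

```r
# Illustrative URLs only; NA marks a franchise without a hero image.
urls <- c("https://example.com/a.png", NA, "https://example.com/c.png")

has_url <- !is.na(urls)
# Format only the non-missing entries so both sides of the assignment have equal length.
urls[has_url] <- sprintf("![](%s){width=100px}", urls[has_url])
urls
```

The NA entry stays NA, and each remaining URL is wrapped in Markdown image syntax in its original position.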
There are many ways to compute or add variables to the dataset you are working with. I used the group_by and summarise functions to create two new variables, totalWins and totalLosses, for each unique combination of franchise id and team. I also computed the proportion of wins for each combination as totalWins / (totalWins + totalLosses), rounded to two decimals.
# Calculate the proportion of wins per franchise and team
Team_total_stats <- as_tibble(NHL_wrapper_api(command="get_total_stats"))
Team_total_stats %>%
  select(teamName, franchiseId, wins, losses) %>%
  group_by(franchiseId, teamName) %>%
  summarise(totalWins = sum(wins), totalLosses = sum(losses)) %>%
  mutate(perWins = round(totalWins/(totalWins+totalLosses), 2)) %>%
  arrange(desc(perWins)) %>%
  head() %>%
  knitr::kable(caption="Win Proportions by Franchise & Team") %>%
  kable_styling()
franchiseId | teamName | totalWins | totalLosses | perWins |
---|---|---|---|---|
38 | Vegas Golden Knights | 210 | 120 | 0.64 |
1 | Montréal Canadiens | 3917 | 2623 | 0.60 |
15 | Dallas Stars | 1189 | 833 | 0.59 |
16 | Philadelphia Flyers | 2310 | 1670 | 0.58 |
27 | Colorado Avalanche | 1131 | 822 | 0.58 |
6 | Boston Bruins | 3573 | 2740 | 0.57 |
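The aggregation can be checked by hand on a few made-up rows. This is a sketch with toy values only; the team names and counts are not taken from the API.

```r
library(dplyr)

# Made-up rows: one franchise/team pair can span several rows.
toy_totals <- tibble(
  franchiseId = c(1, 1, 6, 6),
  teamName    = c("Team A", "Team A", "Team B", "Team B"),
  wins        = c(30, 10, 20, 5),
  losses      = c(10, 10, 20, 5)
)

win_summary <- toy_totals %>%
  group_by(franchiseId, teamName) %>%
  summarise(totalWins = sum(wins), totalLosses = sum(losses), .groups = "drop") %>%
  mutate(perWins = round(totalWins / (totalWins + totalLosses), 2)) %>%
  arrange(desc(perWins))

win_summary
```

Team A totals 40 wins and 20 losses, giving 40 / 60 = 0.67; Team B totals 25 and 25, giving 0.50. Ties are not part of the denominator in this calculation.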
Contingency tables were created for the total goals scored by skaters, broken down by position and franchise. Two separate tables were created, one for active players and one for inactive players.
# Create a table of total goals by franchise and skater position
skaters <- as_tibble(NHL_wrapper_api(command="get_skater_records"))
skaters$positionCode <- factor(skaters$positionCode)
skaters$franchiseName <- factor(skaters$franchiseName)
# Relabel the position codes; the labels follow the alphabetical order of the factor levels.
levels(skaters$positionCode) <- c("Center Forward", "Defenseman", "Left Wing Forward", "Right Wing Forward")
skaters %>%
  filter(activePlayer == TRUE) %>%
  group_by(franchiseName, positionCode) %>%
  summarise(TotalGoals=sum(goals)) %>%
  spread(positionCode, TotalGoals) %>%
  head() %>%
  knitr::kable(caption="Active Players: Goal Counts by Franchise and Position") %>%
  kable_styling()
franchiseName | Center Forward | Defenseman | Left Wing Forward | Right Wing Forward |
---|---|---|---|---|
Anaheim Ducks | 677 | 237 | 308 | 797 |
Arizona Coyotes | 294 | 351 | 200 | 272 |
Boston Bruins | 1020 | 359 | 663 | 420 |
Buffalo Sabres | 450 | 217 | 377 | 218 |
Calgary Flames | 600 | 308 | 434 | 96 |
Carolina Hurricanes | 970 | 244 | 471 | 110 |
skaters %>%
  filter(activePlayer == FALSE) %>%
  group_by(franchiseName, positionCode) %>%
  summarise(TotalGoals=sum(goals)) %>%
  spread(positionCode, TotalGoals) %>%
  head() %>%
  knitr::kable(caption="Inactive Players: Goal Counts by Franchise and Position") %>%
  kable_styling()
franchiseName | Center Forward | Defenseman | Left Wing Forward | Right Wing Forward |
---|---|---|---|---|
Anaheim Ducks | 982 | 622 | 1084 | 915 |
Arizona Coyotes | 2615 | 1203 | 2162 | 2563 |
Boston Bruins | 5667 | 2658 | 4986 | 5271 |
Brooklyn Americans | 519 | 202 | 519 | 403 |
Buffalo Sabres | 3516 | 1346 | 2852 | 3413 |
Calgary Flames | 3436 | 1497 | 2295 | 3596 |
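The long-to-wide reshaping behind these tables can also be written with pivot_wider from tidyr, which plays the same role here as spread. A minimal sketch on toy data (the franchise names and goal counts are made up):

```r
library(dplyr)
library(tidyr)

toy_skaters <- tibble(
  franchiseName = c("Team A", "Team A", "Team A", "Team B"),
  positionCode  = c("C", "D", "C", "C"),
  goals         = c(10, 3, 5, 7)
)

goal_table <- toy_skaters %>%
  group_by(franchiseName, positionCode) %>%
  summarise(TotalGoals = sum(goals), .groups = "drop") %>%
  # One column per position; combinations with no rows become NA.
  pivot_wider(names_from = positionCode, values_from = TotalGoals)

goal_table
```

Team A gets C = 15 and D = 3, while Team B gets C = 7 and NA for D because it has no defenseman rows in the toy data.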
Numerical summaries were created for goals, gamesPlayed, mostGoalsOneGame, and mostGoalsOneSeason for skaters in each position, as follows:
# Numerical summaries: one table of summary statistics for the given position
sakters_table <- function(pos){
  data <- skaters %>%
    filter(positionCode == pos) %>%
    select(goals, gamesPlayed, mostGoalsOneGame, mostGoalsOneSeason)
  # apply() runs summary() on each column and returns a statistics-by-variable matrix.
  kable(apply(data, 2, summary), caption = paste("Summary Goals by Position", pos), digits = 1) %>%
    kable_styling()
}
sakters_table("Center Forward")
Statistic | goals | gamesPlayed | mostGoalsOneGame | mostGoalsOneSeason |
---|---|---|---|---|
Min. | 0.0 | 1.0 | 0.0 | 0.0 |
1st Qu. | 1.0 | 15.0 | 1.0 | 1.0 |
Median | 6.0 | 55.0 | 1.0 | 5.0 |
Mean | 27.6 | 121.2 | 1.4 | 9.4 |
3rd Qu. | 26.0 | 148.0 | 2.0 | 14.0 |
Max. | 692.0 | 1607.0 | 7.0 | 92.0 |
sakters_table("Defenseman")
Statistic | goals | gamesPlayed | mostGoalsOneGame | mostGoalsOneSeason |
---|---|---|---|---|
Min. | 0.0 | 1 | 0.0 | 0.0 |
1st Qu. | 0.0 | 15 | 0.0 | 0.0 |
Median | 2.0 | 56 | 1.0 | 2.0 |
Mean | 8.7 | 117 | 0.9 | 3.4 |
3rd Qu. | 8.0 | 150 | 1.0 | 5.0 |
Max. | 395.0 | 1564 | 5.0 | 48.0 |
sakters_table("Left Wing Forward")
Statistic | goals | gamesPlayed | mostGoalsOneGame | mostGoalsOneSeason |
---|---|---|---|---|
Min. | 0.0 | 1.0 | 0.0 | 0.0 |
1st Qu. | 1.0 | 14.0 | 1.0 | 1.0 |
Median | 5.0 | 52.0 | 1.0 | 4.0 |
Mean | 24.1 | 110.5 | 1.4 | 8.8 |
3rd Qu. | 24.0 | 141.0 | 2.0 | 14.0 |
Max. | 730.0 | 1436.0 | 6.0 | 65.0 |
sakters_table("Right Wing Forward")
Statistic | goals | gamesPlayed | mostGoalsOneGame | mostGoalsOneSeason |
---|---|---|---|---|
Min. | 0.0 | 1.0 | 0.0 | 0.0 |
1st Qu. | 1.0 | 15.0 | 1.0 | 1.0 |
Median | 6.0 | 53.0 | 1.0 | 5.0 |
Mean | 28.2 | 117.9 | 1.5 | 9.9 |
3rd Qu. | 27.0 | 146.5 | 2.0 | 15.0 |
Max. | 786.0 | 1687.0 | 5.0 | 86.0 |
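The per-position calls above can also be collapsed into a single pass by splitting the data on position first. A minimal base R sketch on toy data (the values are made up, and only two of the four variables are used to keep it short):

```r
toy <- data.frame(
  positionCode = c("C", "C", "C", "D", "D"),
  goals        = c(0, 6, 30, 0, 4),
  gamesPlayed  = c(1, 55, 200, 10, 90)
)

# One matrix of summary statistics per position, keyed by position code.
by_position <- lapply(
  split(toy[, c("goals", "gamesPlayed")], toy$positionCode),
  function(d) apply(d, 2, summary)
)

by_position$C
```

Each list element has the same statistic-by-variable layout as the tables above, so it can be passed to kable in a loop.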
The ggplot2 package supports creating clear plots to describe data.
To create a bar plot of total goals by skater position, I used the geom_bar function from ggplot2 with stat="identity" so that the supplied y value sets the bar height.
skatersData <- skaters %>%
  group_by(positionCode) %>%
  summarise(TotalGoals=sum(goals), TotalGames=sum(gamesPlayed))
ggplot(skatersData, aes(x = positionCode, y = TotalGoals)) +
  geom_bar(stat="identity") +
  ggtitle("Bar Plot: Total Goals by PositionCode of Skaters")
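ggplot2 also provides geom_col, which draws bars whose heights are taken directly from the y aesthetic, so it can stand in for geom_bar(stat="identity"). A minimal sketch on toy data (the goal counts are made up):

```r
library(ggplot2)

toy_goals <- data.frame(
  positionCode = c("Center", "Defenseman", "Left Wing", "Right Wing"),
  TotalGoals   = c(120, 40, 95, 110)
)

# geom_col() uses the y values as bar heights without any counting step.
p_col <- ggplot(toy_goals, aes(x = positionCode, y = TotalGoals)) +
  geom_col() +
  ggtitle("Total Goals by Position (toy data)")
```

Printing `p_col` renders the plot.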
A histogram with an overlaid density curve is created using geom_histogram and geom_density for the most saves in one game by a goalie (mostSavesOneGame).
# fetch goalie data
goalie <- as_tibble(NHL_wrapper_api(command="get_goalie_records"))
# keep only the mostSavesOneGame and activePlayer columns
goalie_msg <- select(goalie, mostSavesOneGame, activePlayer)
goalie_msg$activePlayer <- factor(goalie_msg$activePlayer)
# factor levels of a logical are FALSE, TRUE, in that order
levels(goalie_msg$activePlayer) <- c("Inactive", "Active")
# remove NA rows
goalie_msg <- na.omit(goalie_msg)
# after_stat(density) scales the histogram to density so it matches the density curve
ggplot(goalie_msg, aes(x = mostSavesOneGame, y = after_stat(density))) +
  geom_histogram(bins = 20) +
  ggtitle("Histogram of Most Saves by a Goalie in One Game") +
  ylab("Density") +
  geom_density(col = "red", lwd = 3, adjust = 0.4)
Using a facet_wrap layer, the same plot of most saves in one game is drawn separately for active and inactive goalies, as follows:
ggplot(goalie_msg, aes(x = mostSavesOneGame, y = after_stat(density))) +
  geom_histogram(bins = 20) +
  facet_wrap(~activePlayer) +
  ggtitle("Histograms of Most Saves in One Game: Active vs Inactive Goalies") +
  ylab("Density") +
  geom_density(col = "red", lwd = 3, adjust = 0.4)
A box plot of points for active and inactive franchises is created with the geom_boxplot layer.
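A minimal sketch of such a box plot, using toy data. The column names `status` and `points` and all values are assumptions for illustration, not the columns or data used in the original analysis.

```r
library(ggplot2)

# Toy data; the column names "status" and "points" are assumptions for illustration.
toy_points <- data.frame(
  status = rep(c("Active", "Inactive"), each = 5),
  points = c(900, 1200, 1500, 2100, 2600, 50, 120, 300, 410, 800)
)

# One box per franchise status, showing the spread of points within each group.
p_box <- ggplot(toy_points, aes(x = status, y = points)) +
  geom_boxplot() +
  ggtitle("Points by Franchise Status (toy data)")
```

Printing `p_box` renders the plot.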
The geom_point layer function creates a scatter plot with ggplot. Here, wins for active and inactive franchises are plotted with a fitted linear model line.
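A minimal sketch of a scatter plot with a fitted linear model line, using toy data. The choice of games played for the x axis, the column names, and all values are assumptions for illustration.

```r
library(ggplot2)

# Toy data; "gamesPlayed" on the x axis is an assumed choice, not taken from the original.
toy_wins <- data.frame(
  status      = rep(c("Active", "Inactive"), each = 4),
  gamesPlayed = c(500, 1500, 3000, 6000, 100, 300, 600, 900),
  wins        = c(220, 700, 1450, 3000, 30, 110, 250, 400)
)

# geom_smooth(method = "lm") fits one straight line per colour group.
p_scatter <- ggplot(toy_wins, aes(x = gamesPlayed, y = wins, colour = status)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing `p_scatter` renders the plot. Setting `se = FALSE` hides the confidence band, which is of little use with this few points.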