Wins Above Replacement

I developed this framework for Wins Above Replacement (referred to henceforth as WAR) in 2017 and its antecedent K¹ a year prior. The time required to properly document the methodology was unfortunately hard to come by until now. I am a proponent of reproducible research and hence I have decided to publish this work in notebook form with the inclusion of my R code so that those inclined may follow along. Herein, I will outline the governing principles, the construction and the results of my model for WAR.

WAR, first developed by sabermetricians in baseball circles, is an evaluative metric intended to approximate a player’s impact in units of wins added. A player’s contribution is measured against what a replacement level player is expected to offer in the same circumstances. Replacement level is not uniquely defined, but it is commonly equated to players earning a league-minimum salary. I opt to echo this paradigm. The manner by which wins added are calculated is more abstract still, and is wholly dependent on the sport in question. One can propose a simplistic and incomplete version of wins added in hockey by isolating the shooting component. Such a model may consider a skater’s ability to score goals by comparing their shooting percentage with that of an average replacement level skater. In this example, wins added would be given by:

\[ Wins\ Added = \alpha\times((Sh\%-Sh\%_{Replacement})\times SoG) \]

where $\alpha$ is a conversion factor for goals to wins. Evidently, this framework is flawed and incomplete. But, we can expand on this general idea to properly incorporate and adequately measure the ways by which a player can impact the occurrence of goals and wins.

The K framework proposed 4 ways by which a player can influence the occurrence of goals:

The rate of shots for or against a team
The likelihood of a shot becoming a goal
The rate of penalties taken for or against a team
The game state or context of imminent play

However, revision led me to separate the second item into two distinct components: impact on shot quality and shooting. Originally, it was suggested that a player’s impact on the likelihood of a goal being scored on a given shot could be measured if the player was not the shooter or goaltender. While this notion holds true in theory, insufficient information exists relating to the role of on-ice actors on the results of shots to draw reliable conclusions. Instead, we make the assumption that all skaters involved in play can impact shot quality (as measured by xG²), but only those acting as the shooter can impact the possibility of scoring.

I propose a framework for WAR in hockey using the following distinct components:

Shot Rates
- For
- Against
Shot Quality
- For
- Against
Shooting
Goaltending
Penalties
- For
- Against
Zone Transitions
- Offensive
- Neutral
- Defensive

where it is paramount that no overlap occurs between a player’s impact measured on one component and another.

It is important to note that player impact should be independent of contextual factors that may influence results yet fall outside of the player’s control. Such elements in hockey are: zone starts, game state and score effects³. Indeed, we aim here to isolate the effect had by players on observed results from those of external factors such that players who are exposed to different game situations may be easily compared.

There exist many instances of people using regression to isolate the partial impacts of players in various sports. In the NHL, such examples include Brian MacDonald’s Adjusted Plus-Minus⁴, Michael Schuckers’ Total Hockey Rating (THoR)⁵, and Gramacy et al.’s logistic regression approach⁶. I also employed this strategy with K, which took the form of a composite of several regressions.

Regression is a statistical method used to represent a dependent variable as a function of one or more independent variables. This representation is highly interpretable as the partial effects of each independent variable are given by their respective coefficients in a linear combination. Consider the form:

\[ y \sim \beta_{1}x_{1}+\beta_{2}x_{2}+ ... +\beta_{n}x_{n}+\beta_{0} \]

where $y$ is modelled as a function of n variables $[x_{1}, x_{2},...,x_{n}]$. By using an optimization procedure (typically Ordinary Least Squares) it is possible to find values for the n+1 coefficients $[\beta_{0},\beta_{1},...,\beta_{n}]$ that produce the best fit to the dependent variable $y$. Hence, if the nth independent variable $x_{n}$ increases by 1, it is expected that the dependent variable $y$ increase by $\beta_{n}$. To illustrate how this method can be applied to WAR, we can reformulate the example given earlier for scoring impact. The starting assumption might be: the probability of scoring on a shot is a function of the shooter and the goaltender. The coefficient associated with a given player acting as shooter would then give the player’s partial impact on shooting percentage, independent from the role of goaltenders. Some maneuvering is required in order to structure this process as a regression problem, but this will be shown in a later section.

By expanding on this regression framework, one can develop ways to isolate and measure the impact of each player on processes directly or indirectly resonsible for the occurence of goals. In order to accomplish this, all players must be present as variables in the regression alongside contextual features like those listed afore. As I will show, a binary vector of length m equal to the total number of players involved in the regression can be constructed, where the mth entry is 1 if the player at position m is involved in the play and 0 otherwise. The desired coefficients resulting from OLS optimization will exist as a vector of the same length multiplied by the binary array. Formally:

\[ Let\ X = \begin{bmatrix} x_1\\ x_2 \\...\\x_m \end{bmatrix} and\ \beta=\begin{bmatrix} \beta_1\\ \beta_2\\...\\\beta_m \end{bmatrix}\\\therefore X^T\beta = X\cdot\beta \equiv \sum_{i=1}^{m}x_i\beta_i\]

So, for any given observation of a dependant variable $y$, the regression simplifies to a linear combination of the coefficients belonging to players whose positions in the binary vector $X$ are nonzero plus the remaining independant variables.

With these concepts in place, we have a solid foundation on which to build a WAR model.

print("Hello, world!")

## [1] "Hello, world!"

Next, I will begin to demonstrate how my WAR model was constructed in R while explaining each step in as much detail as possible. Note that I have made Play-By-Play files available in .CSV form at the following URLs:

2017-2018: http://fenwicka.com/shiny/pbp_20172018.csv
2016-2017: http://fenwicka.com/shiny/pbp_20162017.csv
2015-2016: http://fenwicka.com/shiny/pbp_20152016.csv
2014-2015: http://fenwicka.com/shiny/pbp_20142015.csv
2013-2014: http://fenwicka.com/shiny/pbp_20132014.csv
2012-2013: http://fenwicka.com/shiny/pbp_20122013.csv
2011-2012: http://fenwicka.com/shiny/pbp_20112012.csv
2010-2011: http://fenwicka.com/shiny/pbp_20102011.csv
2009-2010: http://fenwicka.com/shiny/pbp_20092010.csv
2008-2009: http://fenwicka.com/shiny/pbp_20082009.csv
2007-2008: http://fenwicka.com/shiny/pbp_20072008.csv

If you wish to follow along in your own R environment, you can easily load this season’s play-by-play as follows:

pbp <- read.csv(url("http://fenwicka.com/shiny/pbp_20172018.csv"))

Additionally, the following packages are required:

## Dependencies
require(dplyr)
require(RSQLite)
require(doMC)
require(Kmisc)
require(survival)
require(glmnet)
require(caret)
require(Matrix)

These custom functions will also be used:

## Functions
# Meta
nabs <- function(x) {
  
  return(as.numeric(as.character(x)))
  
}

# General
is_on <- function(player, pbp, venue) {

  if(tolower(venue) == "home") {

    vect <- 1*(pbp$home_on_1 == player) +
            1*(pbp$home_on_2 == player) +
            1*(pbp$home_on_3 == player) +
            1*(pbp$home_on_4 == player) +
            1*(pbp$home_on_5 == player) +
            1*(pbp$home_on_6 == player)

  } else if (tolower(venue) == "away") {

    vect <- 1*(pbp$away_on_1 == player) +
            1*(pbp$away_on_2 == player) +
            1*(pbp$away_on_3 == player) +
            1*(pbp$away_on_4 == player) +
            1*(pbp$away_on_5 == player) +
            1*(pbp$away_on_6 == player)

  }

  vect[which(is.na(vect) == TRUE)] <- 0
  vect[which(vect > 1)] <- 1

  return(vect)

}

is_on2 <- function(player, pbp, ref) {

  if(tolower(ref) == "for") {

    vect <- (1*(pbp$home_on_1 == player) +
             1*(pbp$home_on_2 == player) +
             1*(pbp$home_on_3 == player) +
             1*(pbp$home_on_4 == player) +
             1*(pbp$home_on_5 == player) +
             1*(pbp$home_on_6 == player)
             )*(pbp$event_team == pbp$home_team) +
            (1*(pbp$away_on_1 == player) +
             1*(pbp$away_on_2 == player) +
             1*(pbp$away_on_3 == player) +
             1*(pbp$away_on_4 == player) +
             1*(pbp$away_on_5 == player) +
             1*(pbp$away_on_6 == player)
             )*(pbp$event_team == pbp$away_team)

  } else if (tolower(ref) == "against") {

    vect <- (1*(pbp$home_on_1 == player) +
             1*(pbp$home_on_2 == player) +
             1*(pbp$home_on_3 == player) +
             1*(pbp$home_on_4 == player) +
             1*(pbp$home_on_5 == player) +
             1*(pbp$home_on_6 == player)
             )*(pbp$event_team == pbp$away_team) +
            (1*(pbp$away_on_1 == player) +
             1*(pbp$away_on_2 == player) +
             1*(pbp$away_on_3 == player) +
             1*(pbp$away_on_4 == player) +
             1*(pbp$away_on_5 == player) +
             1*(pbp$away_on_6 == player)
             )*(pbp$event_team == pbp$home_team)

  }

  vect[which(is.na(vect) == TRUE)] <- 0
  vect[which(vect > 1)] <- 1

  return(vect)

}

player_dummy <- function(player, pbp, type) {

  if(tolower(type) == "shooting") {

    vect <- 1*(as.character(pbp$event_player_1) == player)

  } else if(tolower(type) == "pens") {

    vect <- 1*(as.character(pbp$player) == player)

  }

  vect[which(is.na(vect) == TRUE)] <- 0
  vect[which(vect > 1)] <- 1

  return(vect)

}

OR2dProb <- function(OR, prob1) {

  prob2 <- (OR - 1)*(prob1/(1 - prob1))/
           (1 + (OR - 1)*(prob1/(1 - prob1)))

  return(prob2)

}

To begin, I load one full season (including playoffs) of Play-By-Play data from my database. This action is equivalent to reading the .CSV file from URL as coded above.

## Select Season
season <- "20172018"


## Load data
# Connect to database
conn <- dbConnect(SQLite(), "~/corsica_data/corsica.sqlite")

# Query PBP
pbp_query <- dbSendQuery(conn, paste("SELECT * FROM pbp WHERE season == '",
                                     season,
                                     "'",
                                     sep = ""
                                     )
                         )

pbp <- fetch(pbp_query, -1)

# Disconnect
dbDisconnect(conn)

The Play-By-Play object lists all events recorded by the NHL, as well as shift changes made by players. Additional fields are appended to this table to give a comprehensive summary of game events. This is the starting point for the majority of analyses.

## NOT RUN
colnames(sample_pbp)

##  [1] "game_id"             "event_index"         "season"             
##  [4] "game_date"           "session"             "game_period"        
##  [7] "game_seconds"        "event_type"          "event_description"  
## [10] "event_detail"        "event_team"          "event_player_1"     
## [13] "event_player_2"      "event_player_3"      "event_length"       
## [16] "coords_x"            "coords_y"            "players_substituted"
## [19] "home_on_1"           "home_on_2"           "home_on_3"          
## [22] "home_on_4"           "home_on_5"           "home_on_6"          
## [25] "away_on_1"           "away_on_2"           "away_on_3"          
## [28] "away_on_4"           "away_on_5"           "away_on_6"          
## [31] "home_goalie"         "away_goalie"         "home_team"          
## [34] "away_team"           "home_skaters"        "away_skaters"       
## [37] "home_score"          "away_score"          "game_score_state"   
## [40] "game_strength_state" "highlight_code"      "event_distance"     
## [43] "event_angle"         "event_rinkside"      "event_circle"       
## [46] "event_zone"          "home_zone"           "prob_goal"          
## [49] "prob_save"           "adj_home_corsi"      "adj_home_fenwick"   
## [52] "adj_home_shot"       "adj_home_goal"       "adj_away_corsi"     
## [55] "adj_away_fenwick"    "adj_away_shot"       "adj_away_goal"

Finally, roster data along with a list of replacement level players are loaded:

# Query roster
conn <- dbConnect(SQLite(), "~/corsica_data/raw.sqlite")
roster_query <- dbSendQuery(conn, paste("SELECT * FROM roster WHERE season == '",
                                        season,
                                        "'",
                                        sep = ""
                                        )
                            )

roster <- fetch(roster_query, -1)

# Disconnect
dbDisconnect(conn)

# Replacement players
load("/srv/shiny-server/models/replacement_players.RData")

replacement_players$player <- ds.fix_names(replacement_players$player)

freps <- as.character(na.omit(replacement_players$player[which(replacement_players$position == "F" & replacement_players$season == season)]))
dreps <- as.character(na.omit(replacement_players$player[which(replacement_players$position == "D" & replacement_players$season == season)]))
greps <- as.character(na.omit(replacement_players$player[which(replacement_players$position == "G" & replacement_players$season == season)]))

These objects can be loaded into your R session thusly:

load(url("http://fenwicka.com/shiny/war_roster_data.RData"))

This data will serve to produce a list of players for which we will compute WAR.

## Prepare data
# Get goalie list
glist <- unique(c(pbp$home_goalie,
                  pbp$away_goalie
                  )
                )

# Get player list
plist <- c(pbp$home_on_1,
           pbp$home_on_2,
           pbp$home_on_3,
           pbp$home_on_4,
           pbp$home_on_5,
           pbp$home_on_6,
           pbp$away_on_1,
           pbp$away_on_2,
           pbp$away_on_3,
           pbp$away_on_4,
           pbp$away_on_5,
           pbp$away_on_6,
           freps,
           dreps
           ) %>%
  na.omit()

data.frame(player = plist) %>%
  group_by(player) %>%
  summarise(events = n()) %>%
  filter({events > 1000 & player %in% glist == FALSE} |
         player %in% c(freps, dreps)
         ) %>%
  data.frame() ->
  player_df

goalie_df <- data.frame(player = glist,
                        position = "G"
                        )

# Player positions
roster %>%
  filter(player_position != "G") %>%
  group_by(player_name) %>%
  summarise(position = paste(unique(player_position), collapse = "/")) %>%
  data.frame() ->
  position_ref

position_ref$position[which(grepl("C|L|R", position_ref$position) == FALSE)] <- "D"
position_ref$position[which(grepl("C|L|R", position_ref$position) == TRUE)] <- "F"

player_df$position <- position_ref$position[match(player_df$player, position_ref$player)]

For skaters, a limit of 1,000 Play-By-Play events is used to omit players with insufficient data from analysis.

As outlined previously, my proposed WAR framework is comprised of 6 major components. Before proceeding, it is vital to properly verify that these can exist as distinct non-overapping processes. Absent this condition, the system is prone to double-counting contributions by players and thus is flawed.

Between shot rates and shot quality, it is necessary only to satisfy:

\[ Goals\ Added \approx (Shots\ Added \times Shot\ Value) + (Shots \times Mean\ Value\ Added) \]

This only means isolating players’ partial contributions towards the mean goal probability of shots they are involved with by being present on the ice as a member of the attacking or defending team. To this end, we will leverage the existing Expected Goals² model used on Corsica.

However, a third component - the individual impact of shooters - complicates things further. Because the value associated to the approximate quality of shots has already been attributed to players on the ice, the value added from shooting talent must exist independently. We achieve this by including xG as a control variable in the shooting regression.

Lastly, I’ll show how zone transitions are measured separately from the effect of zone starts on each of the 5 other WAR components. Accounting for zone starts prevents players who are awarded a disproportionate quantity of offensive zone starts from being unfairly rated relative to those who begin a majority of shifts in the defensive zone. While this feature is desired, it does not address the fact that players themselves can impact the placement of faceoffs. Consider the value collectively gained by a 5-skater unit that begins a shift in the defensive zone and ends the shift with an ensuing offensive-zone faceoff. It is desired to divide this value for all such zone transitions between all players active on the ice and allow the impact of zone starts to exist as a distinct entity.

Given this assurance that the proposed framework satisfies the requirement of non-overlapping processes, we can proceed to establishing methods for calculating these player contributions. The first such component is shot rates. Shots will henceforth be defined as all unblocked shot attempts. Blocked shots are excluded from this analysis due to coordinate missingness, making it impossible to measure shot quality. A first pass at formulating Shot Rates WAR may take the form:

\[ WAR_{Shot\ Rates} = \alpha\times(Shots\ Added_{Player} - Shots\ Added_{Replacement})\times Shot\ Value \] where $\alpha$ is a conversion factor from goals to wins and Shot Value is equal to the mean goal value of an unblocked shot attempt. While calculating these constants is trivial, determining Shots Added is not.

As was proposed afore, a regression approach will be employed here with skaters acting as predictors of some yet undetermined response. Coefficients for each skater in player_df are sought such that they represent each player’s individual impact on the response variable and that this impact can be converted to units of shots.

When developping K, I opted to adapt the Cox Proportional Hazards method proposed by A.C. Thomas and Sam Ventura⁷. In survival analysis a Cox regression can be used to measure the impacts of various factors on the time until death. Thomas et al. proposed:

There are, at a minimum, two opposing processes in a hockey game: the home team tries to score on the away team, and vice versa. Both of these events are relatively rare compared to the number of observed event intervals, so that it is natural to model these as competing stochastic processes. Predictors that modulate these processes can be the teams in the game, the score of the game, the players on the ice, or some other combination, and the same predictors appear in each process.

and further, that the Cox model contains a time-independent component $h_1(X)$ where $X$ can represent various predictors including but not limited to players on the ice. Using this general concept, we’ll shift our attention to the rate of shots by the home and away teams.

Let the rate of shots by the home team be given by:

\[ \rho^h = e^{\alpha^h + X\cdot\beta} \] where $\alpha^h$ is the baseline rate of home shots, $X$ is binary vector of indicators and $\beta$ is a coefficient vector of the same length. The impact of the nth predictor on the home team shot rate is given by:

\[ \delta_n = e^{\beta_n} \]

where $\delta_n$ is a multiplier on the baseline rate. Equivalently, the away team shot rate can be represented in identical fashion.

We aim to solve these equations by using a proportional hazards regression with skaters’ on-ice indicators included in $X$ alongside various contextual features. The dot product $X\cdot\beta$ from the equation given for the home shot rate can be rewritten as:

\[\sum_{p}{X_p^h\Omega_p + X_p^a\Delta_p}+\sum_{s}{X_s\sigma_s}\]

where $\Omega$ and $\Delta$ are the offensive and defensive contributions respectively of p players and $\sigma$ are the impacts of s game state indicators. To satisfy the Cox model, we must construct an indicator matrix $X$ and a response matrix $Y$ for both the home and away processes meeting these conditions:

The row length of $X$ and $Y$ are both equal to the number of observations corresponding to uninterrupted periods of play where all indicators remain unchanged
$Y$ contains both the length in seconds of each observation and a binary response equal to 1 if the observation ends in a shot and 0 otherwise
The column length of $X$ is equal to the number of desired indicators, or $2p + s$

This is to say that observations are censored by either a change in any of the predictors (including any player substitution) or a shot being taken by the team under observation. In addition to the skater on-ice indicators, the following game state variables are considered (with $*$ denoting interactions):

Skater advantage
Score advantage
Zone start
Score advantage * game time
Zone start * time elapsed since faceoff

Some additional data preparation was required before building these matrices:

# Add event zone
pbp$event_zone <- gsub(". [zZ]one", "", unlist(lapply(regmatches(as.character(pbp$event_description), gregexpr("[a-zA-Z]{3}. [zZ]one", as.character(pbp$event_description))), na_if_null)))
pbp$event_zone[which(pbp$event_type == "BLOCK" & pbp$event_zone == "Def")] <- "Off"

pbp$home_zone <- pbp$event_zone
pbp$home_zone[which(pbp$event_team == pbp$away_team & pbp$event_zone == "Off")] <- "Def"
pbp$home_zone[which(pbp$event_team == pbp$away_team & pbp$event_zone == "Def")] <- "Off"

# Prepare PBP
pbp %>%
  filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states,
         event_type %in% c(st.fenwick_events,
                           "ON",
                           "PENL",
                           "FAC"
                           )
         ) %>%
  arrange(game_id, event_index) %>%
  mutate(FAC_ref = cumsum(event_type == "FAC")) %>%
  group_by(game_id, FAC_ref) %>%
  arrange(event_index) %>%
  mutate(seconds_since = nabs(game_seconds) - min(nabs(game_seconds)),
         is_home_team = 1*(event_type %in% st.corsi_events & event_team == home_team),
         home_zonestart = 1*(first(home_zone) == "Def") + 2*(first(home_zone) == "Neu") + 3*(first(home_zone) == "Off"),
         home_score_adv = home_score - away_score
         ) %>%
  ungroup() %>%
  arrange(game_id, event_index) %>%
  data.frame() ->
  newpbp

And further:

## Shot Rates
newpbp %>%
  filter(event_type %in% c(st.fenwick_events,
                           "ON",
                           "FAC"
                           ),
         !is.na(home_zonestart)
         ) %>%
  group_by(game_id) %>%
  arrange(event_index) %>%
  mutate(elapsed = game_seconds - lag(game_seconds, 1)) %>%
  ungroup() %>%
  filter(!is.na(elapsed)) %>%
  arrange(game_id, event_index) %>%
  data.frame() ->
  rates_pbp

Skaters’ on-ice indicator matrices for both home and away teams are created thusly:

# Build home matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on,
               rates_pbp,
               venue = "Home"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  home_on_mat

colnames(home_on_mat) <- paste(player_df$player, "home", sep = "_")

# Build away matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on,
               rates_pbp,
               venue = "Away"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  away_on_mat

colnames(away_on_mat) <- paste(player_df$player, "away", sep = "_")

And the final design matrix is completed by adding the remaining game state indicators:

# Build design matrix
rates_pbp$game_strength_state <- as.factor(rates_pbp$game_strength_state)
rates_pbp$home_zonestart <- as.factor(rates_pbp$home_zonestart)

rates_pbp$home_hazard <- 1*(rates_pbp$event_type %in% st.fenwick_events & rates_pbp$event_team == rates_pbp$home_team)
rates_pbp$away_hazard <- 1*(rates_pbp$event_type %in% st.fenwick_events & rates_pbp$event_team == rates_pbp$away_team)

design_rates <- dummyVars(home_hazard ~
                          game_strength_state +
                          home_score_adv*game_seconds +
                          home_zonestart*seconds_since,
                          data = rates_pbp,
                          contrasts = TRUE,
                          fullRank = FALSE
                          )

rates_mat <- cBind(predict(design_rates,
                           rates_pbp
                           ) %>%
                   Matrix(sparse = TRUE),
                   home_on_mat,
                   away_on_mat
                   )

Finally, we perform the home and away regressions:

# Model home rates
home_index <- which({rates_pbp$event_type %in% st.fenwick_events == FALSE & rates_pbp$elapsed > 0} |
                    rates_pbp$home_hazard > 0
                    )

rates_pbp_home <- rates_pbp[home_index, ]
rates_mat_home <- rates_mat[home_index, ]

rates_pbp_home$elapsed[which(rates_pbp_home$elapsed <= 0)] <- 0.1
home_surv <- Surv(rates_pbp_home$elapsed, rates_pbp_home$home_hazard)

# registerDoMC(cores = 2)
cox_home_rates <- cv.glmnet(x = as.matrix(rates_mat_home),
                            y = home_surv,
                            family = "cox",
                            nfolds = 4,
                            alpha = 1,
                            parallel = FALSE
                            )

home_rate_coefs <- data.frame(var = colnames(rates_mat_home),
                              coeff = matrix(exp(coef(cox_home_rates, s = "lambda.min")))
                              )

# Model away rates
away_index <- which({rates_pbp$event_type %in% st.fenwick_events == FALSE & rates_pbp$elapsed > 0} |
                    rates_pbp$away_hazard > 0
                    )

rates_pbp_away <- rates_pbp[away_index, ]
rates_mat_away <- rates_mat[away_index, ]

rates_pbp_away$elapsed[which(rates_pbp_away$elapsed <= 0)] <- 0.1
away_surv <- Surv(rates_pbp_away$elapsed, rates_pbp_away$away_hazard)

# registerDoMC(cores = 2)
cox_away_rates <- cv.glmnet(x = as.matrix(rates_mat_away),
                            y = away_surv,
                            family = "cox",
                            nfolds = 4,
                            alpha = 1,
                            parallel = FALSE
                            )

away_rate_coefs <- data.frame(var = colnames(rates_mat_away),
                              coeff = matrix(exp(coef(cox_away_rates, s = "lambda.min")))
                              )

I’ve opted to use regularization in the form of the LASSO penalty. This method simultaneously performs shrinkage and variable selection, meaning in our case that the measured impacts of skaters and game states on shot rates are reduced by a common factor and in some cases set to 0. The motivation for this is to avoid overfitting on observed results, especially in cases where skaters have not played enough to accumulate reliable amounts of data. Cross-validation is employed to find a suitable penalty term. The use of regularization improves the model’s ability to predict shot rates in out-of-sample data at the cost of assigning nonzero coefficients to a minority of skaters - roughly 40%. This suggests that the process is driven in large part by exceptional players and that those nearer to the median league talent have little influence comparatively.

As shown previously, the exponentiated player coefficients are interpretable as a shot rate multiplier. In order to obtain shots added, the difference $(e^\beta - 1) \times Baseline\ Rate$ must be multiplied by Time On Ice. Offensive and Defensive shots added can be computed from both home and away regressions and added together before conversion to goals and subsequently wins. For this, we require baseline shot rates, average shot value and skater TOI.

# Baseline rates
pbp %>%
  filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
  summarise(TOI = sum(na.omit(event_length))/60,
            home_FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
            away_FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
            home_GF = sum(event_type == "GOAL" & event_team == home_team),
            away_GF = sum(event_type == "GOAL" & event_team == away_team)
            ) %>%
  mutate(home_rate = home_FF/TOI,
         away_rate = away_FF/TOI,
         home_fsh = home_GF/home_FF,
         away_fsh = away_GF/away_FF,
         total_fsh = (home_GF + away_GF)/(home_FF + away_FF)
         ) %>%
  data.frame() ->
  baseline_rates

# Player stats
bind_rows(
  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_1) %>%
    rename(player = home_on_1) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_2) %>%
    rename(player = home_on_2) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_3) %>%
    rename(player = home_on_3) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_4) %>%
    rename(player = home_on_4) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_5) %>%
    rename(player = home_on_5) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(home_on_6) %>%
    rename(player = home_on_6) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == home_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == away_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_1) %>%
    rename(player = away_on_1) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_2) %>%
    rename(player = away_on_2) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_3) %>%
    rename(player = away_on_3) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_4) %>%
    rename(player = away_on_4) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_5) %>%
    rename(player = away_on_5) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              ),

  pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(away_on_6) %>%
    rename(player = away_on_6) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              FF = sum(event_type %in% st.fenwick_events & event_team == away_team),
              FA = sum(event_type %in% st.fenwick_events & event_team == home_team),
              iFF = sum(na.omit(event_type %in% st.fenwick_events & event_player_1 == player))
              )
) %>%
  data.frame() %>%
  group_by(player) %>%
  summarise(HTOI = sum(na.omit(TOI*(venue == "home"))),
            ATOI = sum(na.omit(TOI*(venue == "away"))),
            FF = sum(FF),
            FA = sum(FA),
            iFF = sum(iFF)
            ) %>%
  data.frame() ->
  skater_stats

And we can subsequently calculate goals added:

home_rate_coefs %>%
  filter(grepl("_away|_home", var) == TRUE) %>%
  mutate(player = gsub("_away|_home", "", var),
         home = 1*(grepl("_home", var) == TRUE)
         ) %>%
  data.frame() ->
  home_rate_coefs

away_rate_coefs %>%
  filter(grepl("_away|_home", var) == TRUE) %>%
  mutate(player = gsub("_away|_home", "", var),
         away = 1*(grepl("_away", var) == TRUE)
         ) %>%
  data.frame() ->
  away_rate_coefs

home_rate_coefs %>%
  group_by(player) %>%
  summarise(RF_home = sum((home == 1)*coeff),
            RA_away = sum((home == 0)*coeff)
            ) %>%
  data.frame() ->
  home_rates_sum

away_rate_coefs %>%
  group_by(player) %>%
  summarise(RF_away = sum((away == 1)*coeff),
            RA_home = sum((away == 0)*coeff)
            ) %>%
  data.frame() ->
  away_rates_sum

merge(home_rates_sum,
      away_rates_sum,
      by.x = c("player"),
      by.y = c("player")
      ) %>%
  data.frame() %>%
  mutate(HTOI = skater_stats$HTOI[match(player, skater_stats$player)],
         ATOI = skater_stats$ATOI[match(player, skater_stats$player)],
         GA_RF = (RF_home - 1)*baseline_rates$home_rate*HTOI*baseline_rates$home_fsh +
                  (RF_away - 1)*baseline_rates$away_rate*ATOI*baseline_rates$away_fsh,
         GA_RA = (1 - RA_home)*baseline_rates$away_rate*HTOI*baseline_rates$away_fsh +
                  (1 - RA_away)*baseline_rates$home_rate*ATOI*baseline_rates$home_fsh,
         GA_Rates = GA_RF + GA_RA
         ) %>%
  data.frame() ->
  rates_sum

The final conversion to WAR will be performed lastly, once all goals added components have been computed.

The second element of WAR is a measure of players’ impacts on shot quality. Recall that shots added were multiplied by the average goal value of unblocked shots in the above example. This leaves any perceived difference between the quality of shots and the average unaccounted. It is assumed that all skaters on the ice can exert an influence on shot quality.

Because shot quality is already quantified in the Play-By-Play, we need only treat the numeric vector pbp$prob_goal as a response variable is a linear regression to obtain coefficients interpretable as xG added by the presense of a given indicator variable. Formally:

\[Let\ xG = \beta_0 + X\cdot\beta = \beta_0 + \sum_p {X_p^o\Omega_p + X_p^d\Delta_p} + \sum_s {X_s\sigma_s}\]

where $X^o$ is the indicator matrix for skaters on the Offensive or shooting team and $X^d$ for skaters on the Defensive team. Total xG added per shot by a player $p$ is then a sum of offensive xG added and defensive xG added, or:

\[xG\ Added = \Omega_p - \Delta_p\] Note that this can be modelled as a single process in contrast to the previous problem. The indicator matrices are constructed similarly, with game state variables designed to reflect the shooting team’s point of reference. Additionally, a variable is_home_team is included to account for home ice advantage.

## Shot Quality
newpbp %>%
  filter(event_type %in% st.fenwick_events,
         !is.na(home_zonestart),
         !is.na(prob_goal),
         !{event_team == home_team & is.na(away_goalie) == TRUE},
         !{event_team == away_team & is.na(home_goalie) == TRUE}
         ) %>%
  mutate(is_home_team = 1*(event_team == home_team)) %>%
  data.frame() ->
  qual_pbp

# Build for matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on2,
               qual_pbp,
               ref = "For"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  for_on_mat

colnames(for_on_mat) <- paste(player_df$player, "for", sep = "_")

# Build against matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on2,
               qual_pbp,
               ref = "Against"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  against_on_mat

colnames(against_on_mat) <- paste(player_df$player, "against", sep = "_")

# Build design matrix
qual_pbp$shooter_strength_state <- ifelse(qual_pbp$is_home_team == 1,
                                          qual_pbp$game_strength_state,
                                          str_rev(qual_pbp$game_strength_state)
                                          )

qual_pbp$shooter_score_adv <- ifelse(qual_pbp$is_home_team == 1,
                                     qual_pbp$home_score - qual_pbp$away_score,
                                     qual_pbp$away_score - qual_pbp$home_score
                                     )

qual_pbp$shooter_strength_state <- as.factor(qual_pbp$shooter_strength_state)

design_qual <- dummyVars(prob_goal ~
                         shooter_strength_state +
                         is_home_team +
                         shooter_score_adv*game_seconds,
                         data = qual_pbp,
                         contrasts = TRUE,
                         fullRank = FALSE
                         )

qual_mat <- cBind(predict(design_qual,
                          qual_pbp
                          ) %>%
                  Matrix(sparse = TRUE),
                  for_on_mat,
                  against_on_mat
                  )

As explained for the shot rates regression, LASSO regularization was used with a cross-validation procedure to select the penalty term:

# registerDoMC(cores = 2)
glm_qual <- cv.glmnet(x = as.matrix(qual_mat),
                      y = as.numeric(qual_pbp$prob_goal),
                      family = "gaussian",
                      nfolds = 8,
                      alpha = 1,
                      parallel = FALSE
                      )

qual_coefs <- data.frame(var = c("Intercept", colnames(qual_mat)),
                         coeff = matrix(coef(glm_qual, s = "lambda.min"))
                         )

Finally, goals added are obtained by multiplying the coefficients obtained by the quantity of Fenwick events observed for and against skaters while on the ice:

qual_coefs %>%
  filter(grepl("_for|_against", var) == TRUE) %>%
  mutate(player = gsub("_for|_against", "", var),
         is_for = 1*(grepl("_for", var) == TRUE)
         ) %>%
  data.frame() ->
  qual_coefs

qual_coefs %>%
  group_by(player) %>%
  summarise(QF = sum((is_for == 1)*coeff),
            QA = sum((is_for == 0)*coeff)
            ) %>%
  mutate(FF = skater_stats$FF[match(player, skater_stats$player)],
         FA = skater_stats$FA[match(player, skater_stats$player)],
         GA_QF = QF*FF,
         GA_QA = -QA*FA,
         GA_Qual = GA_QF + GA_QA
         ) %>%
  data.frame() ->
  qual_sum

The Shooting and Goaltending components of WAR can be modelled as competing processes. In simple terms, a player acting as shooter attempts to score while a player acting as goalie simultaneously attempts to prevent a goal. This paradigm lends itself to performing one single regression to produce coefficients for both the shooting and goaltending elements of WAR.

Because shot quality itself has already been accounted for and its value dispersed among skaters, the shooting process must be adjusted as to exist independently from this prior feature. This is easily accomplished with the inclusion of xG as a regression variable. The nature of the competing shooter/goalie game and the binary identity that one can attribute to shots as either a goal (1) or a non-goal (0) both suggest the use of logistic regression. In accordance with this fact, we’ll model shots as a binary response to the linear combination of various independent variables including shooter and goaltender identifiers and xG, passed through the logistic function:

\[f(x)=\frac{1}{1+e^{-x}}\]

as to obtain values bounded between 0 and 1 and thus interpretable as probabilities. Logistic regression has the property that its coefficients reflect the odds ratio, or the impact had by a given variable on the odds of a response equal to 1. In our case, we will aim to unpack the results of the shooting regression in order to find the partial impacts of players acting as a shooter or a goaltender on the likelihood of a goal being scored.

In practice, assumptions we can confidently make about the nature of shooting in hockey will interfere with the aforementioned possibility of obtaining both shooter and goalie ratings from a single regression. That is, we intuit that goaltenders do not significantly influence the probability of shots hitting or missing the net, while skaters do. Hence, the same regression will be performed in double, with the single exception that one will use all unblocked shots as observations and the other will only consider shots on goal.

The former case will serve to produce shooter coefficients. The data is prepared as follows:

## Shooting
newpbp %>%
  filter(event_type %in% st.fenwick_events,
         !{{is.na(home_goalie) == TRUE & event_team == away_team} |
           {is.na(away_goalie) == TRUE & event_team == home_team}
           },
         !is.na(prob_goal),
         event_player_1 %in% player_df$player
         ) %>%
  mutate(is_home_team = 1*(event_team == home_team)) %>%
  data.frame() ->
  shooting_pbp

# Build design matrix
shooting_pbp$goalie <- ifelse(shooting_pbp$is_home_team == 1,
                              shooting_pbp$away_goalie,
                              shooting_pbp$home_goalie
                              )

shooting_pbp$shooter_strength_state <- ifelse(shooting_pbp$is_home_team == 1,
                                              shooting_pbp$game_strength_state,
                                              str_rev(shooting_pbp$game_strength_state)
                                              )

shooting_pbp$shooter_score_adv <- ifelse(shooting_pbp$is_home_team == 1,
                                         shooting_pbp$home_score - shooting_pbp$away_score,
                                         shooting_pbp$away_score - shooting_pbp$home_score
                                         )

shooting_pbp$shooter_strength_state <- as.factor(shooting_pbp$shooter_strength_state)
shooting_pbp$event_player_1 <- as.factor(shooting_pbp$event_player_1)
shooting_pbp$goalie <- as.factor(shooting_pbp$goalie)
shooting_pbp$is_goal <- as.factor(1 + 1*(shooting_pbp$event_type == "GOAL"))

do.call("cbind",
        lapply(as.list(player_df$player),
               player_dummy,
               shooting_pbp,
               type = "shooting"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  shooter_mat

colnames(shooter_mat) <- paste("event_player_1", player_df$player, sep = ".")

design_shooting <- dummyVars(as.factor(is_goal) ~
                             shooter_strength_state +
                             is_home_team +
                             shooter_score_adv*game_seconds +
                             goalie +
                             prob_goal,
                             data = shooting_pbp,
                             contrasts = TRUE,
                             fullRank = FALSE
                             )

shooting_mat <- cbind(predict(design_shooting,
                              shooting_pbp
                              ) %>%
                        Matrix(sparse = TRUE),
                      shooter_mat
                      )

Then, the regularized regression:

# registerDoMC(cores = 2)
glm_shooting <- cv.glmnet(x = as.matrix(shooting_mat),
                          y = as.factor(shooting_pbp$is_goal),
                          family = "binomial",
                          nfolds = 8,
                          alpha = 1,
                          parallel = FALSE
                          )

shooting_coefs <- data.frame(var = c("Intercept", colnames(shooting_mat)),
                             coeff = matrix(exp(coef(glm_shooting, s = "lambda.min")))
                             )

Per the form of the logistic regression, we now have a probabilistic representation of goals:

\[P(x)=\frac{1}{1+e^{-(\beta_0 + \beta_1 x_1 + ...)}}\]

from which we can obtain the inverse logistic function:

\[g(P(x))=\log(\frac{P(x)}{1 - P(x)}) = \beta_0 + \beta_1 x_1 + ...\]

and:

\[\frac{P(x)}{1 - P(x)}=e^{\beta_0 + \beta_1 x_1 + ...}\]

where the left term is interpretable as the odds of a goal. The odds ratio for a given identifier variable $x_n$ is then $e^{\beta_n}$, meaning the odds of a goal are multiplied by $e^{\beta_n}$ when the identifier is equal to 1.

Using the average goal odds for unblocked shots, we can derive a new shooting percentage for a player’s shots and goals added from multiplying shots taken by it:

shooting_coefs %>%
  filter(grepl("event_player_1\\.|goalie\\.", var) == TRUE) %>%
  mutate(player = gsub("event_player_1\\.|goalie\\.", "", var),
         is_shooter = 1*(grepl("event_player_1", var) == TRUE)
         ) %>%
  data.frame() ->
  shooting_coefs

shooting_coefs %>%
  group_by(player) %>%
  summarise(SF = sum((is_shooter == 1)*coeff)
            ) %>%
  mutate(iFF = skater_stats$iFF[match(player, skater_stats$player)],
         WAR_Shooting = OR2dProb(SF, baseline_rates$total_fsh)*iFF
         ) %>%
  data.frame() ->
  shooting_sum

Note that this new shooting percentage differs from the actual observed shooting percentage as it is only affected by the measured partial player impact after controlling for game state, goaltender and shot quality.

Unlike the approach used in K, where penalties were modelled in similar fashion to shot rates using Cox proportional harzards, WAR relies on a simplified approach based on the assumption of penalties satisfying the conditions for a Poisson distribution. Moreover, only skaters responsible for taking or drawing the penalty in question will receive or be deducted the associated wins added. Previously, this was shared between all on-ice actors.

Each skater’s Time On Ice was fragmented into observations corresponding to distinct game states, then the Time On Ice and total penalties taken and penalties drawn were summed within each sample. The game state factors under consideration here are: skater advantage, score advantage, venue, and the cumulative imbalance in penalties awarded within each game. This latter feature aims to counter the bias in officiating that has been observed by Michael Lopez⁸ and others whereby intragame penalty history influences subsequent calls (often by seeking to balance itself). Using a Poisson regression, we obtain coefficients measuring the effect of skater identities on the mean of the Poisson distribution for penalties taken and drawn.

The form of the Poisson regression can be simplified to:

\[E(Y\mid x)=e^{\beta_0 + \beta_1x_1+...}\]

giving the mean expected observations $Y$ per unit of time as a function of some linear combination of predictors. It is then concluded that the odds ratio $e^{\beta_n}$ is a multiplier on the mean rate of penalties for the nth indicator variable. For the Poisson regression, an exposure value must be supplied for each observation as to contextualize the number of penalties within a known time frame. The necessary data are organized thusly:

## Penalties
pbp %>%
  group_by(game_id, game_seconds) %>%
  mutate(pens = sum(event_type == "PENL" & grepl("ps \\-|match|fighting|major", tolower(event_detail)) == FALSE),
         home_score_adv = home_score - away_score
         ) %>%
  group_by(game_id) %>%
  arrange(event_index) %>%
  mutate(pref = cumsum(event_type == "PENL" & grepl("ps \\-|match|fighting|major", tolower(event_detail)) == FALSE & pens < 2)) %>%
  group_by(game_id, pref) %>%
  mutate(last_penalty = ifelse(is.na(first(event_team)) == TRUE,
                               NA,
                               first(event_team)
                               ),
         is_PP = 2 - cumsum({event_type == "GOAL" & event_team != last_penalty} |
                            event_type %in% c("PENL", "GEND")
                            ) +
                     1*({event_type == "GOAL" & event_team != last_penalty} |
                        event_type %in% c("PENL", "GEND")
                        )
         ) %>%
  group_by(game_id) %>%
  arrange(event_index) %>%
  mutate(streak_change = cumsum(event_type == "PENL" &
                                grepl("ps \\-|match|fighting|major", tolower(event_detail)) == FALSE &
                                pens < 2 &
                                event_team != lag(last_penalty, 1)
                                )
         ) %>%
  group_by(game_id, streak_change) %>%
  arrange(event_index) %>%
  mutate(streak = cumsum(event_type == "PENL" &
                         grepl("ps \\-|match|fighting|major", tolower(event_detail)) == FALSE &
                         pens < 2
                         )
         ) %>%
  ungroup() %>%
  arrange(game_id, event_index) %>%
  data.frame() ->
  pens_pbp

pens_pbp$streak <- (lag(pens_pbp$last_penalty == pens_pbp$home_team))*lag(pens_pbp$streak, 1) -
                   (lag(pens_pbp$last_penalty == pens_pbp$away_team))*lag(pens_pbp$streak, 1)

# Compile counts
bind_rows(
  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_1) %>%
    rename(player = home_on_1) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_2) %>%
    rename(player = home_on_2) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_3) %>%
    rename(player = home_on_3) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_4) %>%
    rename(player = home_on_4) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_5) %>%
    rename(player = home_on_5) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, home_on_6) %>%
    rename(player = home_on_6) %>%
    summarise(venue = "home",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_1) %>%
    rename(player = away_on_1) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_2) %>%
    rename(player = away_on_2) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_3) %>%
    rename(player = away_on_3) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_4) %>%
    rename(player = away_on_4) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_5) %>%
    rename(player = away_on_5) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              ),

  pens_pbp %>%
    filter(!{game_period > 4 & session == "R"},
         game_strength_state %in% st.strength_states
         ) %>%
    group_by(game_id, game_strength_state, home_score_adv, streak, away_on_6) %>%
    rename(player = away_on_6) %>%
    summarise(venue = "away",
              TOI = sum(na.omit(event_length))/60,
              PENT = sum(na.omit(1*(event_type == "PENL" & event_player_1 == player) +
                                 1*(event_type == "PENL" & event_player_1 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_1 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 )),
              PEND = sum(na.omit(1*(event_type == "PENL" & event_player_2 == player) +
                                 1*(event_type == "PENL" & event_player_2 == player & grepl("double minor", tolower(event_description)) == TRUE) -
                                 1*(event_type == "PENL" & event_player_2 == player & {{grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE} | {pens > 1}})
                                 ))
              )
) %>%
  data.frame() %>%
  filter(player %in% player_df$player) %>%
  group_by(game_id, player, venue, game_strength_state, home_score_adv, streak) %>%
  summarise(TOI = sum(TOI),
            PENT = sum(PENT),
            PEND = sum(PEND)
            ) %>%
  ungroup() %>%
  filter(TOI > 0) %>%
  data.frame() ->
  pens_df

# Build design matrix
pens_df$strength_state <- ifelse(pens_df$venue == "home",
                                         pens_df$game_strength_state,
                                         str_rev(pens_df$game_strength_state)
                                         )

pens_df$score_adv <- ifelse(pens_df$venue == "home",
                            pens_df$home_score_adv,
                            -pens_df$home_score_adv
                            )

pens_df$streak <- ifelse(pens_df$venue == "home",
                         -pens_df$streak,
                         pens_df$streak
                         )

pens_df$strength_state <- as.factor(pens_df$strength_state)
pens_df$player <- as.factor(pens_df$player)
pens_df$streak[which(is.na(pens_df$streak) == TRUE)] <- 0

do.call("cbind",
        lapply(as.list(player_df$player),
               player_dummy,
               pens_df,
               type = "pens"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  pens_player_mat

colnames(pens_player_mat) <- paste("player", player_df$player, sep = "_")

design_pens <- dummyVars(PENT ~
                         strength_state +
                         venue +
                         score_adv +
                         streak,
                         data = pens_df,
                         contrasts = TRUE,
                         fullRank = FALSE
                         )

pens_mat <- cbind(predict(design_pens,
                          pens_df
                          ) %>%
                    Matrix(sparse = TRUE),
                  pens_player_mat
                  )

And the baseline penalty value and penalty rate are found:

# Get penalty value
pens_pbp %>%
  mutate(home_last = 1*(last_penalty == home_team)) %>%
  group_by(game_id, pref, home_last) %>%
  filter(is_PP == 1) %>%
  summarise(home_goals = sum(event_type == "GOAL" & event_team == home_team & game_seconds - min(game_seconds) < 120),
            away_goals = sum(event_type == "GOAL" & event_team == away_team & game_seconds - min(game_seconds) < 120)
            ) %>%
  data.frame() %>%
  summarise(gf = sum(na.omit((home_last == 1)*away_goals + (home_last == 0)*home_goals)),
            ga = sum(na.omit((home_last == 0)*away_goals + (home_last == 1)*home_goals)),
            pens = sum(pref > 0)
            ) ->
  sum_pens

penalty_value <- (sum_pens$gf - sum_pens$ga)/sum_pens$pens

# Get baseline rate
pens_df %>%
  summarise(TOI = sum(TOI),
            PENT = sum(PENT)
            ) %>%
  mutate(penalty_rate = PENT/TOI) %>%
  data.frame() ->
  baseline_pens

The regressions are performed using the LASSO penalty for regularization and supplying the exposure terms with the offset parameter:

# registerDoMC(cores = 2)
poiss_taken <- cv.glmnet(x = as.matrix(pens_mat),
                         y = as.numeric(pens_df$PENT),
                         offset = log(as.numeric(pens_df$TOI)),
                         family = "poisson",
                         alpha = 1,
                         nfolds = 8,
                         parallel = FALSE
                         )

taken_coefs <- data.frame(var = c("Intercept", colnames(pens_mat)),
                          coeff = matrix(exp(coef(poiss_taken, s = "lambda.min")))
                          )

# registerDoMC(cores = 2)
poiss_drawn <- cv.glmnet(x = as.matrix(pens_mat),
                         y = as.numeric(pens_df$PEND),
                         offset = log(as.numeric(pens_df$TOI)),
                         family = "poisson",
                         alpha = 1,
                         nfolds = 8,
                         parallel = FALSE
                         )

drawn_coefs <- data.frame(var = c("Intercept", colnames(pens_mat)),
                          coeff = matrix(exp(coef(poiss_drawn, s = "lambda.min")))
                          )

Finally, we calculate goals added:

taken_coefs %>%
  filter(grepl("player_", var) == TRUE) %>%
  mutate(player = gsub("player_", "", var)
         ) %>%
  data.frame() ->
  taken_coefs

drawn_coefs %>%
  filter(grepl("player_", var) == TRUE) %>%
  mutate(player = gsub("player_", "", var)
         ) %>%
  data.frame() ->
  drawn_coefs

merge(taken_coefs %>%
        rename(coef_taken = coeff) %>%
        data.frame(),
      drawn_coefs %>%
        rename(coef_drawn = coeff) %>%
        data.frame(),
      by.x = c("player"),
      by.y = c("player")
      ) %>%
  data.frame() %>%
  mutate(TOI = skater_stats$HTOI[match(player, skater_stats$player)] +
               skater_stats$ATOI[match(player, skater_stats$player)],
         GA_PT = (1 - coef_taken)*baseline_pens$penalty_rate*TOI*penalty_value,
         GA_PD = (coef_drawn - 1)*baseline_pens$penalty_rate*TOI*penalty_value,
         GA_Pens = GA_PT + GA_PD
         ) %>%
  data.frame() ->
  pens_sum

Previously, it was discussed how value can be added by skaters in the form of zone transitions and how this process exists independently from zone starts which have heretofore been utilized as control variables. Henceforth, a shift will refer to the period between two faceoffs. A shift has among its properties a unique start zone and end zone. Because each previous WAR component has been adjusted as to control for the starting zone, the end zone value remains unaccounted for. This will become a response variable in the zone transitions regression.

The 3 end zone values are: defensive, neutral and offensive. A multinomial logistic regression is used to find the partial effects of variables on the probabilities of each of the 3 possible outcomes occurring on a given shift. One can imagine a simplified version of this regression using the identity of the zone start as the only explanatory variable.

The results of this simplistic regression match both our starting assumptions and the observed results:

                 var  coeff_def   coeff_neu   coeff_off
1          Intercept  1.0493206   0.8735047   1.0910045
2 home_zonestart.def  1.0000795   1.0021987   0.9977268
3 home_zonestart.neu  1.0002389   0.9981798   1.0015842
4 home_zonestart.off  0.9996852   0.9996749   1.0006403

where the above coefficients are exponentiated as to obtain the odds ratio for each predictor.

Expanding on this example, we aim to include on-ice skater indicator variables as well as additional contextual information such as score advantage and skater advantage. Within each shift, observations are taken as periods of uninterrupted persistence of independent variables. This means each player substitution has the effect of censoring an observation. Because shift duration can sometimes exceed several minutes and contain numerous substitutions, observations are weighted in the regression by how near they are to the end of a shift. This has the effect of assigning more responsibility for the end zone identity to skaters that were involved in play later in the shift, or closer to its end.

Obtaining goals added from the multinomial regression coefficients is a relatively simple matter. Because the regression is performed using three possible response variable values from the home team’s perspective, we must separate home and away contributions and reverse the latter, satisfying Home OZ = Away DZ and Home DZ = Away OZ. Each skater is represented twice in the indicator matrix, once when acting as a home team skater and once as an away team skater. Thus:

\[Goals\ Added\ (OZ)=Home\ Goals\ Added\ (OZ_{home})+Away\ Goals\ Added\ (DZ_{home})\\ Goals\ Added\ (NZ)=Home\ Goals\ Added\ (NZ_{home})+Away\ Goals\ Added\ (NZ_{home})\\ Goals\ Added\ (DZ)=Home\ Goals\ Added\ (DZ_{home})+Away\ Goals\ Added\ (OZ_{home})\]

and goals added are calculated in each instance by multiplying probabilistic reprentation of the product of the exponentiated regression coefficient and the baseline end zone odds by its equivalent faceoff value and again by the total quantity of observations.

Data preparation was performed as follows:

## Zones
newpbp %>%
  group_by(game_id) %>%
  arrange(event_index) %>%
  mutate(ref = cumsum(event_type == "FAC") - 1*(event_type == "FAC")) %>%
  group_by(game_id, ref) %>%
  arrange(event_index) %>%
  mutate(home_zonestart_next = 1*(last(event_type) == "FAC" & last(home_zone) == "Def") +
                               2*(last(event_type) == "FAC" & last(home_zone) == "Neu") +
                               3*(last(event_type) == "FAC" & last(home_zone) == "Off"),
         shift_end = last(game_seconds),
         seconds_until = shift_end - game_seconds
         ) %>%
  ungroup() %>%
  filter(event_type == "ON",
         home_zonestart_next > 0,
         seconds_until > 0,
         seconds_until <= 120
         ) %>%
  group_by(game_id) %>%
  arrange(event_index) %>%
  mutate(sref = cumsum(event_type == "ON")) %>%
  group_by(game_id,
           ref,
           sref,
           home_on_1,
           home_on_2,
           home_on_3,
           home_on_4,
           home_on_5,
           home_on_6,
           away_on_1,
           away_on_2,
           away_on_3,
           away_on_4,
           away_on_5,
           away_on_6
           ) %>%
  summarise(shift_start = min(game_seconds),
            game_strength_state = first(game_strength_state),
            home_score_adv = first(home_score_adv),
            home_zonestart = first(home_zonestart),
            home_zonestart_next = first(home_zonestart_next),
            seconds_until = min(seconds_until),
            seconds_since = min(seconds_since),
            game_seconds = min(game_seconds)
            ) %>%
  ungroup() %>%
  arrange(game_id, game_seconds) %>%
  data.frame() ->
  zones_pbp

# Zone shift baselines
pbp %>%
  filter(event_type == "FAC") %>%
  mutate(home_zonestart = 1*(home_zone == "Def") +
                          2*(home_zone == "Neu") +
                          3*(home_zone == "Off"),
         total = n()
         ) %>%
  summarise(pct_1 = sum(home_zonestart == 1)/first(total),
            pct_2 = sum(home_zonestart == 2)/first(total),
            pct_3 = sum(home_zonestart == 3)/first(total)
            ) %>%
  data.frame() ->
  baseline_zones

# Zone start value
newpbp %>%
  filter(!is.na(home_zonestart)) %>%
  group_by(home_zonestart) %>%
  summarise(HG = sum(event_type == "GOAL" & event_team == home_team),
            AG = sum(event_type == "GOAL" & event_team == away_team),
            HPENT = sum(na.omit(1*(event_type == "PENL" & event_team == home_team) +
                                1*(event_type == "PENL" & event_team == home_team & grepl("double minor", tolower(event_description)) == TRUE) -
                                1*(event_type == "PENL" & event_team == home_team & grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE)
                                )),
            APENT = sum(na.omit(1*(event_type == "PENL" & event_team == away_team) +
                                1*(event_type == "PENL" & event_team == away_team & grepl("double minor", tolower(event_description)) == TRUE) -
                                1*(event_type == "PENL" & event_team == away_team & grepl("ps \\-|match|fighting|major", tolower(event_description)) == TRUE)
                                )),
            count = sum(event_type == "FAC")
            ) %>%
  mutate(value = (HG - AG + (APENT - HPENT)*penalty_value)/count) %>%
  data.frame() ->
  faceoff_value

# Zone shift stats
bind_rows(
  zones_pbp %>%
    group_by(home_on_1) %>%
    rename(player = home_on_1) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(home_on_2) %>%
    rename(player = home_on_2) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(home_on_3) %>%
    rename(player = home_on_3) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(home_on_4) %>%
    rename(player = home_on_4) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(home_on_5) %>%
    rename(player = home_on_5) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(home_on_6) %>%
    rename(player = home_on_6) %>%
    summarise(venue = "home",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_1) %>%
    rename(player = away_on_1) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_2) %>%
    rename(player = away_on_2) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_3) %>%
    rename(player = away_on_3) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_4) %>%
    rename(player = away_on_4) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_5) %>%
    rename(player = away_on_5) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              ),

  zones_pbp %>%
    group_by(away_on_6) %>%
    rename(player = away_on_6) %>%
    summarise(venue = "away",
              ZF = length(unique(ref))
              )
) %>%
  data.frame() %>%
  group_by(player) %>%
  summarise(HZF = sum(ZF*(venue == "home")),
            AZF = sum(ZF*(venue == "away"))
            ) %>%
  data.frame() ->
  zones_stats


# Build home matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on,
               zones_pbp,
               venue = "Home"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  home_on_mat

colnames(home_on_mat) <- paste(player_df$player, "home", sep = "_")

# Build away matrix
do.call("cbind",
        lapply(as.list(player_df$player),
               is_on,
               zones_pbp,
               venue = "Away"
               )
        ) %>%
  Matrix(sparse = TRUE) ->
  away_on_mat

colnames(away_on_mat) <- paste(player_df$player, "away", sep = "_")

# Build design matrix
zones_pbp$game_strength_state <- as.factor(zones_pbp$game_strength_state)
zones_pbp$home_zonestart <- as.factor(zones_pbp$home_zonestart)
zones_pbp$home_zonestart_next <- as.factor(zones_pbp$home_zonestart_next)

design_zones <- dummyVars(home_zonestart_next ~
                          game_strength_state +
                          home_score_adv*game_seconds +
                          home_zonestart*seconds_since,
                          data = zones_pbp,
                          contrasts = TRUE,
                          fullRank = FALSE
                          )

zones_mat <- cBind(predict(design_zones,
                           zones_pbp
                           ) %>%
                   Matrix(sparse = TRUE),
                   home_on_mat,
                   away_on_mat
                   )

and the regression was performed using the reciprocal logarithm of the time remaining in shift as weights:

index <- unique(which(is.na(zones_pbp) == TRUE, arr.ind = TRUE)[, 1])

# registerDoMC(cores = 2)
glm_zones <- cv.glmnet(x = as.matrix(zones_mat[-index, ]),
                       y = as.factor(zones_pbp$home_zonestart_next[-index]),
                       weights = 1/log(zones_pbp$seconds_until[-index] + 1, base = 10),
                       family = "multinomial",
                       alpha = 1,
                       nfolds = 8,
                       parallel = FALSE
                       )

zones_coefs <- data.frame(var = c("Intercept", colnames(zones_mat)),
                          coeff_1 = matrix(exp(coef(glm_zones, s = "lambda.min")$`1`)),
                          coeff_2 = matrix(exp(coef(glm_zones, s = "lambda.min")$`2`)),
                          coeff_3 = matrix(exp(coef(glm_zones, s = "lambda.min")$`3`))
                          )

Then, goals added were calculated:

zones_coefs %>%
  filter(grepl("_away|_home", var) == TRUE) %>%
  mutate(player = gsub("_away|_home", "", var),
         home = 1*(grepl("_home", var) == TRUE)
         ) %>%
  data.frame() ->
  zones_coefs

zones_coefs %>%
  group_by(player) %>%
  summarise(Z1_home = sum((home == 1)*coeff_1),
            Z1_away = sum((home == 0)*coeff_1),
            Z2_home = sum((home == 1)*coeff_2),
            Z2_away = sum((home == 0)*coeff_2),
            Z3_home = sum((home == 1)*coeff_3),
            Z3_away = sum((home == 0)*coeff_3)
            ) %>%
  mutate(HZF = zones_stats$HZF[match(player, zones_stats$player)],
         AZF = zones_stats$AZF[match(player, zones_stats$player)],
         GA_DZF = OR2dProb(Z1_home, baseline_zones$pct_1)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)] -
                   OR2dProb(Z3_away, baseline_zones$pct_3)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)],
         GA_NZF = OR2dProb(Z2_home, baseline_zones$pct_2)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)] -
                   OR2dProb(Z2_away, baseline_zones$pct_2)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)],
         GA_OZF = OR2dProb(Z3_home, baseline_zones$pct_3)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)] -
                   OR2dProb(Z1_away, baseline_zones$pct_1)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)],
         GA_Zones = GA_DZF + GA_NZF + GA_OZF
         ) %>%
  data.frame() ->
  zones_sum

In order to obtain WAR, the goals added in each category must be subtracted by the replacement goals added, then multiplied by a conversion constant. The former task requires replacement coefficients for each player position from each regression performed. Using a record of players earning league-minimum salary in each season between 2009-2010 and 2016-2017, exponentiated regression coefficients belonging to these replacement-level players were grouped by position and averaged:

load("~/corsica_data/replacement_coefficients.RData")
print(reps)

##     RF_home   RA_away   RF_away  RA_home            QF            QA
## 1 0.9831913 0.9998113 0.9976799 1.000748 -3.308423e-04 -0.0002194591
## 2 0.9813116 0.9925484 0.9970611 1.001045 -7.324746e-05 -0.0003238430
## 3        NA        NA        NA       NA            NA            NA
##          SF coef_taken coef_drawn  Z1_home  Z1_away  Z2_home  Z2_away
## 1 0.9996964   1.060219  1.2568494 1.005503 1.011908 1.004092 0.996562
## 2 1.0435596   1.134660  0.8201853 1.016801 1.008825 1.002413 1.006647
## 3        NA         NA         NA       NA       NA       NA       NA
##     Z3_home  Z3_away position   Goalie
## 1 1.0009156 1.000839        F       NA
## 2 0.9766402 1.012290        D       NA
## 3        NA       NA        G 1.034714

Replacement goals added is player-specific. That is, its value is not constant across players of the same position. Rather, its value depends on player-specific factors like Time On Ice and shots taken. Recall the general formula for shots added from the shot rates component of WAR:

\[Shots\ Added = (e^\beta - 1)\times Baseline\ Rate\ \times TOI\]

Where $TOI$ is a player characteristic. Given a baseline home Fenwick rate of 0.75/minute and away Fenwick rate of 0.71/minute and the corresponding coefficients, a replacement forward would be expected to provide -0.0131 shots added per minute played at home.

WAR is calculated by first evaluating each skater’s equivalent replacement-level goals added in each category, then subtracting these from the player’s goals added and multiplying by the goals-to-wins constant (4.5):

## WAR
merge(rates_sum %>%
        select(player, GA_RF, GA_RA, GA_Rates) %>%
        data.frame(),
      qual_sum %>%
        select(player, GA_QF, GA_QA, GA_Qual) %>%
        data.frame(),
      by.x = c("player"),
      by.y = c("player")
      ) %>%
  merge(shooting_sum %>%
        select(player, GA_Shooting) %>%
        data.frame(),
        by.x = c("player"),
        by.y = c("player")
        ) %>%
  merge(pens_sum %>%
        select(player, GA_PT, GA_PD, GA_Pens) %>%
        data.frame(),
        by.x = c("player"),
        by.y = c("player")
        ) %>%
  merge(zones_sum %>%
        select(player, GA_DZF, GA_NZF, GA_OZF, GA_Zones) %>%
        data.frame(),
        by.x = c("player"),
        by.y = c("player")
        ) %>%
  mutate(season = season,
         OGA = GA_RF + GA_QF + GA_Shooting + GA_PD + GA_OZF,
         DGA = GA_RA + GA_QA + GA_PT + GA_DZF + GA_NZF,
         GA = GA_Rates + GA_Qual + GA_Shooting + GA_Pens + GA_Zones
         ) %>%
  data.frame() ->
  war_df

war_df$position <- as.character(player_df$position[match(war_df$player, player_df$player)])
war_df$position[which(grepl("R|L|C", war_df$position) == TRUE)] <- "F"

bind_cols(
  replacement_coefs %>%
    filter(position == "F") %>%
    rename(f_home_rate_for = RF_home,
           f_home_rate_aga = RA_home,
           f_away_rate_for = RF_away,
           f_away_rate_aga = RA_away,
           f_qual_for = QF,
           f_qual_aga = QA,
           f_shooting = SF,
           f_taken = coef_taken,
           f_drawn = coef_drawn,
           f_home_dz = Z1_home,
           f_home_nz = Z2_home,
           f_home_oz = Z3_home,
           f_away_dz = Z1_away,
           f_away_nz = Z2_away,
           f_away_oz = Z3_away
           ) %>%
    select(-position) %>%
    data.frame(),

  replacement_coefs %>%
    filter(position == "D") %>%
    rename(d_home_rate_for = RF_home,
           d_home_rate_aga = RA_home,
           d_away_rate_for = RF_away,
           d_away_rate_aga = RA_away,
           d_qual_for = QF,
           d_qual_aga = QA,
           d_shooting = SF,
           d_taken = coef_taken,
           d_drawn = coef_drawn,
           d_home_dz = Z1_home,
           d_home_nz = Z2_home,
           d_home_oz = Z3_home,
           d_away_dz = Z1_away,
           d_away_nz = Z2_away,
           d_away_oz = Z3_away
           ) %>%
    select(-position) %>%
    data.frame()
) %>%
  data.frame() ->
  rep_coefs_summary

bind_rows(war_df %>%
            filter(position == "F") %>%
            mutate(HTOI = skater_stats$HTOI[match(player, skater_stats$player)],
                   ATOI = skater_stats$ATOI[match(player, skater_stats$player)],
                   FF = skater_stats$FF[match(player, skater_stats$player)],
                   FA = skater_stats$FA[match(player, skater_stats$player)],
                   iFF = skater_stats$iFF[match(player, skater_stats$player)],
                   TOI = HTOI + ATOI,
                   HZF = zones_stats$HZF[match(player, zones_stats$player)],
                   AZF = zones_stats$AZF[match(player, zones_stats$player)],
                   REP_RF = (rep_coefs_summary$f_home_rate_for - 1)*baseline_rates$home_rate*HTOI*baseline_rates$home_fsh +
                            (rep_coefs_summary$f_away_rate_for - 1)*baseline_rates$away_rate*ATOI*baseline_rates$away_fsh,
                   REP_RA = (1 - rep_coefs_summary$f_home_rate_aga)*baseline_rates$away_rate*HTOI*baseline_rates$away_fsh +
                            (1 - rep_coefs_summary$f_away_rate_aga)*baseline_rates$home_rate*ATOI*baseline_rates$home_fsh,
                   REP_Rates = REP_RF + REP_RA,
                   REP_QF = rep_coefs_summary$f_qual_for*FF,
                   REP_QA = -rep_coefs_summary$f_qual_aga*FA,
                   REP_Qual = REP_QF + REP_QA,
                   REP_Shooting = OR2dProb(rep_coefs_summary$f_shooting, baseline_rates$total_fsh)*iFF,
                   REP_PT = (1 - rep_coefs_summary$f_taken)*baseline_pens$penalty_rate*TOI*penalty_value,
                   REP_PD = (rep_coefs_summary$f_drawn - 1)*baseline_pens$penalty_rate*TOI*penalty_value,
                   REP_Pens = REP_PT + REP_PD,
                   REP_DZF = OR2dProb(rep_coefs_summary$f_home_dz, baseline_zones$pct_1)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)] -
                             OR2dProb(rep_coefs_summary$f_away_dz, baseline_zones$pct_3)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)],
                   REP_NZF = OR2dProb(rep_coefs_summary$f_home_nz, baseline_zones$pct_2)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)] -
                             OR2dProb(rep_coefs_summary$f_away_nz, baseline_zones$pct_2)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)],
                   REP_OZF = OR2dProb(rep_coefs_summary$f_home_oz, baseline_zones$pct_3)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)] -
                             OR2dProb(rep_coefs_summary$f_away_oz, baseline_zones$pct_1)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)],
                   REP_Zones = REP_DZF + REP_NZF + REP_OZF
                   ) %>%
            data.frame(),

          war_df %>%
            filter(position == "D") %>%
            mutate(HTOI = skater_stats$HTOI[match(player, skater_stats$player)],
                   ATOI = skater_stats$ATOI[match(player, skater_stats$player)],
                   FF = skater_stats$FF[match(player, skater_stats$player)],
                   FA = skater_stats$FA[match(player, skater_stats$player)],
                   iFF = skater_stats$iFF[match(player, skater_stats$player)],
                   TOI = HTOI + ATOI,
                   HZF = zones_stats$HZF[match(player, zones_stats$player)],
                   AZF = zones_stats$AZF[match(player, zones_stats$player)],
                   REP_RF = (rep_coefs_summary$d_home_rate_for - 1)*baseline_rates$home_rate*HTOI*baseline_rates$home_fsh +
                            (rep_coefs_summary$d_away_rate_for - 1)*baseline_rates$away_rate*ATOI*baseline_rates$away_fsh,
                   REP_RA = (1 - rep_coefs_summary$d_home_rate_aga)*baseline_rates$away_rate*HTOI*baseline_rates$away_fsh +
                            (1 - rep_coefs_summary$d_away_rate_aga)*baseline_rates$home_rate*ATOI*baseline_rates$home_fsh,
                   REP_Rates = REP_RF + REP_RA,
                   REP_QF = rep_coefs_summary$d_qual_for*FF,
                   REP_QA = -rep_coefs_summary$d_qual_aga*FA,
                   REP_Qual = REP_QF + REP_QA,
                   REP_Shooting = OR2dProb(rep_coefs_summary$d_shooting, baseline_rates$total_fsh)*iFF,
                   REP_PT = (1 - rep_coefs_summary$d_taken)*baseline_pens$penalty_rate*TOI*penalty_value,
                   REP_PD = (rep_coefs_summary$d_drawn - 1)*baseline_pens$penalty_rate*TOI*penalty_value,
                   REP_Pens = REP_PT + REP_PD,
                   REP_DZF = OR2dProb(rep_coefs_summary$d_home_dz, baseline_zones$pct_1)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)] -
                             OR2dProb(rep_coefs_summary$d_away_dz, baseline_zones$pct_3)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)],
                   REP_NZF = OR2dProb(rep_coefs_summary$d_home_nz, baseline_zones$pct_2)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)] -
                             OR2dProb(rep_coefs_summary$d_away_nz, baseline_zones$pct_2)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 2)],
                   REP_OZF = OR2dProb(rep_coefs_summary$d_home_oz, baseline_zones$pct_3)*HZF*faceoff_value$value[which(faceoff_value$home_zonestart == 3)] -
                             OR2dProb(rep_coefs_summary$d_away_oz, baseline_zones$pct_1)*AZF*faceoff_value$value[which(faceoff_value$home_zonestart == 1)],
                   REP_Zones = REP_DZF + REP_NZF + REP_OZF
                   ) %>%
            data.frame()
          ) %>%
  data.frame() %>%
  mutate(WAR_RF = (GA_RF - REP_RF)/4.5,
         WAR_RA = (GA_RA - REP_RA)/4.5,
         WAR_Rates = WAR_RF + WAR_RA,
         WAR_QF = (GA_QF - REP_QF)/4.5,
         WAR_QA = (GA_QA - REP_QA)/4.5,
         WAR_Qual = WAR_QF + WAR_QA,
         WAR_Shooting = (GA_Shooting - REP_Shooting)/4.5,
         WAR_PT = (GA_PT - REP_PT)/4.5,
         WAR_PD = (GA_PD - REP_PD)/4.5,
         WAR_Pens = WAR_PT + WAR_PD,
         WAR_OZF = (GA_OZF - REP_OZF)/4.5,
         WAR_NZF = (GA_NZF - REP_NZF)/4.5,
         WAR_DZF = (GA_DZF - REP_DZF)/4.5,
         WAR_Zones = WAR_OZF + WAR_NZF + WAR_DZF,
         OWAR = WAR_RF + WAR_QF + WAR_Shooting + WAR_PD + WAR_OZF,
         DWAR = WAR_RA + WAR_QA + WAR_PT + WAR_DZF + WAR_NZF,
         WAR = WAR_Rates + WAR_Qual + WAR_Shooting + WAR_Pens + WAR_Zones
         ) %>%
  data.frame() ->
  war_df

The derivation of the goals-to-wins constant will not be covered here.

Of note regarding the WAR densities obtained from regular players at each position is that replacement level defenders appear to perform at a near-average level. We can also plainly observe a long tail on the offensive side of the forward distribution traced by elite offensive skaters distancing themselves from the median.

Since 2007-2008, the best individual defensive seasons by forwards are:

load("~/corsica_data/war_full.RData")
war %>%
  filter(position == "F") %>%
  arrange(desc(DWAR)) %>%
  select(player, season, DWAR) %>%
  head(20)

##               player   season     DWAR
## 1      PAVEL.DATSYUK 20072008 3.179738
## 2    NIKOLAI.ZHERDEV 20072008 2.531520
## 3    DAYMOND.LANGKOW 20092010 2.433706
## 4        MIKKO.KOIVU 20172018 2.314690
## 5      PATRICK.SHARP 20092010 1.998245
## 6      PAVEL.DATSYUK 20112012 1.933166
## 7       PATRIK.ELIAS 20102011 1.715952
## 8       TROY.BROUWER 20112012 1.691218
## 9      JERE.LEHTINEN 20072008 1.688180
## 10   MIKAEL.GRANLUND 20162017 1.636564
## 11    JONATHAN.TOEWS 20082009 1.608893
## 12       JIRI.HUDLER 20072008 1.528053
## 13   DAYMOND.LANGKOW 20072008 1.507370
## 14 NICKLAS.BACKSTROM 20072008 1.474989
## 15     PAVEL.DATSYUK 20082009 1.465737
## 16       BOYD.GORDON 20072008 1.459910
## 17       MIKKO.KOIVU 20152016 1.441790
## 18        MAX.TALBOT 20072008 1.376212
## 19 MIKAEL.SAMUELSSON 20072008 1.346862
## 20       MIKKO.KOIVU 20072008 1.345745

The best individual offensive seasons by forwards are:

load("~/corsica_data/war_full.RData")
war %>%
  filter(position == "F") %>%
  arrange(desc(OWAR)) %>%
  select(player, season, OWAR) %>%
  head(20)

##            player   season     OWAR
## 1   SIDNEY.CROSBY 20092010 9.351082
## 2   ALEX.OVECHKIN 20072008 9.106343
## 3   SIDNEY.CROSBY 20162017 8.951932
## 4  STEVEN.STAMKOS 20112012 8.634886
## 5  CONNOR.MCDAVID 20172018 7.859471
## 6    JOHN.TAVARES 20112012 7.522830
## 7      JAMIE.BENN 20132014 7.136211
## 8   EVGENI.MALKIN 20072008 6.820916
## 9  STEVEN.STAMKOS 20092010 6.732040
## 10  ALEX.OVECHKIN 20082009 6.431848
## 11   JOE.THORNTON 20072008 6.192007
## 12 STEVEN.STAMKOS 20102011 6.184864
## 13   JOE.PAVELSKI 20102011 6.012327
## 14     ERIC.STAAL 20082009 6.006631
## 15   JAROMIR.JAGR 20072008 5.868054
## 16  ALEX.OVECHKIN 20092010 5.776350
## 17 JONATHAN.TOEWS 20132014 5.706941
## 18     BRAD.BOYES 20072008 5.508979
## 19 CONNOR.MCDAVID 20162017 5.483904
## 20  SIDNEY.CROSBY 20072008 5.469366

The best individual defensive seasons by defenders are:

load("~/corsica_data/war_full.RData")
war %>%
  filter(position == "D") %>%
  arrange(desc(DWAR)) %>%
  select(player, season, DWAR) %>%
  head(20)

##                 player   season     DWAR
## 1           RYAN.SUTER 20162017 2.567066
## 2       BRIAN.CAMPBELL 20092010 2.540747
## 3  MARC-EDOUARD.VLASIC 20132014 2.478074
## 4           RYAN.SUTER 20132014 2.377756
## 5  MARC-EDOUARD.VLASIC 20152016 2.300381
## 6            TJ.BRODIE 20132014 2.225767
## 7          ANDY.GREENE 20132014 2.133643
## 8          PAUL.MARTIN 20072008 2.063240
## 9           GREG.ZANON 20092010 1.987817
## 10          ROMAN.JOSI 20142015 1.906835
## 11           TJ.BRODIE 20162017 1.828014
## 12     HAMPUS.LINDHOLM 20152016 1.800479
## 13        JONAS.BRODIN 20142015 1.795230
## 14    ANTON.VOLCHENKOV 20112012 1.698401
## 15         PAUL.MARTIN 20132014 1.638689
## 16         ANDY.GREENE 20122013 1.631745
## 17        GREG.PATERYN 20172018 1.598980
## 18        KIM.JOHNSSON 20082009 1.593834
## 19        ADAM.LARSSON 20152016 1.517664
## 20  NIKLAS.HJALMARSSON 20122013 1.514217

And the best individual offensive seasons by defenders are:

load("~/corsica_data/war_full.RData")
war %>%
  filter(position == "D") %>%
  arrange(desc(OWAR)) %>%
  select(player, season, OWAR) %>%
  head(20)

##                 player   season     OWAR
## 1          BRENT.BURNS 20162017 4.122464
## 2            DAN.BOYLE 20092010 3.509711
## 3           MIKE.GREEN 20082009 3.388679
## 4         JUSTIN.FAULK 20142015 3.188187
## 5           RYAN.ELLIS 20162017 3.016041
## 6            DAN.BOYLE 20082009 2.958891
## 7            ROB.BLAKE 20082009 2.890094
## 8        ERIK.KARLSSON 20142015 2.710302
## 9        ERIK.KARLSSON 20132014 2.595720
## 10       ANDREI.MARKOV 20072008 2.581959
## 11         JAKE.MUZZIN 20152016 2.311009
## 12   LUBOMIR.VISNOVSKY 20102011 2.296664
## 13       ERIK.KARLSSON 20112012 2.277251
## 14 MARC-EDOUARD.VLASIC 20082009 2.275077
## 15      JOHN.KLINGBERG 20152016 2.261843
## 16     DOUGIE.HAMILTON 20162017 2.199894
## 17           DAN.BOYLE 20102011 2.181347
## 18    ALEX.PIETRANGELO 20112012 2.178934
## 19       KIMMO.TIMONEN 20132014 2.162415
## 20       KIMMO.TIMONEN 20092010 2.145446

Finally, the best individual goaltending seasons are:

load("~/corsica_data/war_full.RData")
war %>%
  filter(position == "G") %>%
  arrange(desc(WAR_Goalie)) %>%
  select(player, season, WAR_Goalie) %>%
  head(20)

##              player   season WAR_Goalie
## 1        TIM.THOMAS 20102011  11.839463
## 2  HENRIK.LUNDQVIST 20112012  10.860530
## 3  SERGEI.BOBROVSKY 20162017   9.937583
## 4    JONATHAN.QUICK 20112012   9.439853
## 5        MIKE.SMITH 20112012   9.374637
## 6        TIM.THOMAS 20082009   8.165371
## 7       CAREY.PRICE 20142015   7.967805
## 8       TUUKKA.RASK 20122013   7.866080
## 9  HENRIK.LUNDQVIST 20132014   7.755855
## 10         CAM.WARD 20082009   7.252770
## 11 HENRIK.LUNDQVIST 20082009   6.938354
## 12   ROBERTO.LUONGO 20102011   6.856440
## 13   EVGENI.NABOKOV 20092010   6.717359
## 14 HENRIK.LUNDQVIST 20092010   6.494654
## 15    BRADEN.HOLTBY 20142015   6.484342
## 16         CAM.WARD 20102011   6.476188
## 17 HENRIK.LUNDQVIST 20122013   6.472801
## 18   JAROSLAV.HALAK 20092010   6.363367
## 19     JONAS.HILLER 20082009   6.331477
## 20   JONATHAN.QUICK 20172018   6.207274

Taking total WAR, we find the best individual seasons to be:

load("~/corsica_data/war_full.RData")
war %>%
  mutate(WAR = ifelse(position == "G", WAR_Goalie, WAR)) %>%
  arrange(desc(WAR)) %>%
  select(player, season, WAR) %>%
  head(40)

##               player   season       WAR
## 1         TIM.THOMAS 20102011 11.839463
## 2   HENRIK.LUNDQVIST 20112012 10.860530
## 3   SERGEI.BOBROVSKY 20162017  9.937583
## 4      SIDNEY.CROSBY 20092010  9.722228
## 5      ALEX.OVECHKIN 20072008  9.453252
## 6     JONATHAN.QUICK 20112012  9.439853
## 7         MIKE.SMITH 20112012  9.374637
## 8     STEVEN.STAMKOS 20112012  8.387472
## 9         TIM.THOMAS 20082009  8.165371
## 10       CAREY.PRICE 20142015  7.967805
## 11       TUUKKA.RASK 20122013  7.866080
## 12  HENRIK.LUNDQVIST 20132014  7.755855
## 13      JOHN.TAVARES 20112012  7.626564
## 14     SIDNEY.CROSBY 20162017  7.573488
## 15          CAM.WARD 20082009  7.252770
## 16    CONNOR.MCDAVID 20172018  7.157724
## 17      JOE.THORNTON 20072008  7.043349
## 18  HENRIK.LUNDQVIST 20082009  6.938354
## 19    ROBERTO.LUONGO 20102011  6.856440
## 20     PAVEL.DATSYUK 20072008  6.718918
## 21    EVGENI.NABOKOV 20092010  6.717359
## 22     EVGENI.MALKIN 20072008  6.530675
## 23  HENRIK.LUNDQVIST 20092010  6.494654
## 24     BRADEN.HOLTBY 20142015  6.484342
## 25          CAM.WARD 20102011  6.476188
## 26  HENRIK.LUNDQVIST 20122013  6.472801
## 27        JAMIE.BENN 20132014  6.391783
## 28 HENRIK.ZETTERBERG 20072008  6.378891
## 29    JAROSLAV.HALAK 20092010  6.363367
## 30      JONAS.HILLER 20082009  6.331477
## 31     ALEX.OVECHKIN 20092010  6.294403
## 32      JAROMIR.JAGR 20072008  6.279444
## 33    JONATHAN.QUICK 20172018  6.207274
## 34        ERIC.STAAL 20082009  6.175381
## 35     ALEX.OVECHKIN 20082009  6.173078
## 36      JOE.PAVELSKI 20102011  6.142695
## 37    STEVEN.STAMKOS 20102011  6.117706
## 38       CAREY.PRICE 20102011  6.092893
## 39       RYAN.MILLER 20092010  6.013737
## 40    STEVEN.STAMKOS 20092010  5.886627

of which 22 of the top-40 belong to goaltenders and 18 to forwards. The best recorded season by a defender belongs to Dan Boyle in 2008-2009, ranking 112th overall.

WAR can be aggregated over multiple seasons and prorated as to compare players’ long-term performances. Taking the mean WAR/82 of regular players since the 2015-2016 season:

war %>%
  filter(season %in% c("20152016", "20162017", "20172018")) %>%
  mutate(WAR = ifelse(position == "G", WAR_Goalie, WAR)) %>%
  group_by(player) %>%
  summarise(position = first(position),
            GP = sum(GP),
            WAR = sum(WAR)
            ) %>%
  mutate(WAR82 = WAR/GP*82) %>%
  filter(GP > 82) %>%
  group_by(position) %>%
  top_n(10, wt = WAR82) %>%
  arrange(position, desc(WAR82)) %>%
  data.frame()

##                player position  GP       WAR    WAR82
## 1          RYAN.ELLIS        D 228  5.898355 2.121338
## 2          SHEA.WEBER        D 202  4.897389 1.988049
## 3          RYAN.SUTER        D 253  5.401316 1.750624
## 4     DOUGIE.HAMILTON        D 247  4.623849 1.535043
## 5       MARK.GIORDANO        D 247  4.451406 1.477795
## 6         KRIS.LETANG        D 212  3.610721 1.396600
## 7      JOHN.KLINGBERG        D 249  4.214215 1.387814
## 8     HAMPUS.LINDHOLM        D 237  3.959375 1.369910
## 9       RYAN.MCDONAGH        D 226  3.743225 1.358161
## 10      MICHAL.KEMPNY        D 102  1.609986 1.294302
## 11     CONNOR.MCDAVID        F 220 15.480463 5.769991
## 12    NIKITA.KUCHEROV        F 246 14.475294 4.825098
## 13       PATRIK.LAINE        F 153  8.394060 4.498777
## 14 VLADIMIR.TARASENKO        F 271 13.161185 3.982351
## 15      SIDNEY.CROSBY        F 283 12.864362 3.727483
## 16   PATRICE.BERGERON        F 226  9.635389 3.496026
## 17    AUSTON.MATTHEWS        F 148  6.272348 3.475220
## 18      ALEX.OVECHKIN        F 266 11.266580 3.473156
## 19      BRAD.MARCHAND        F 228  9.432035 3.392223
## 20   NATHAN.MACKINNON        F 226  9.087914 3.297385
## 21   SERGEI.BOBROVSKY        G 169 15.015497 7.285626
## 22     JONATHAN.QUICK        G 135 10.232793 6.215475
## 23       ANTTI.RAANTA        G 104  7.715894 6.083685
## 24        JOHN.GIBSON        G 170 12.005039 5.790666
## 25     COREY.CRAWFORD        G 152  9.082756 4.899908
## 26  FREDERIK.ANDERSEN        G 184  9.589625 4.273637
## 27      BRADEN.HOLTBY        G 207 10.697367 4.237604
## 28       MARTIN.JONES        G 217  9.165061 3.463295
## 29     MATTHEW.MURRAY        G 128  5.291741 3.390021
## 30         CAM.TALBOT        G 207  8.191402 3.244903

WAR can be found on Corsica here: https://corsica.hockey/war/

References

Perry, Emmanuel. June 16, 2016. Corsica. “Composite Tailored Regression Modeling For Evaluative Ratings in Professional Hockey.” [https://corsica.hockey/misc/K_Manuscript.pdf]
Perry, Emmanuel, March 3, 2016. Corsica “Shot Quality and Expected Goals: Part I.” [https://corsica.hockey/blog/2016/03/03/shot-quality-and-expected-goals-part-i/]
Costella, Jennifer. Oct. 29, 2014. Puck Daddy. “Why Score Effects are the key to your fancy stats.” [https://sports.yahoo.com/blogs/puck-daddy/fancy-stats-164244781.html]
MacDonald, Brian. Oct 3, 2012. arXiv. “Adjusted Plus-Minus for NHL Players using Ridge Regression with Goals, Shots, Fenwick, and Corsi.” [https://arxiv.org/pdf/1201.0317.pdf]
Schuckers, Michael and Curro, James. March 1, 2013. Statsportsconsulting. “Total Hockey Rating (THoR): A comprehensive statistical rating of National Hockey League forwards and defensemen based upon all on-ice events.” [http://statsportsconsulting.com/main/wp-content/uploads/Schuckers_Curro_MIT_Sloan_THoR.pdf]
Robert B. Gramacy, Matt Taddy and Sen Tian. Jan 25, 2016. arXiv. “Hockey Player Performance via Regularized Logistic Regression.” [https://arxiv.org/pdf/1510.02172.pdf]
A.C. Thomas, Samuel Ventura, Shane Jensen and Stephen Ma. Mar. 1, 2013. arXiv. “Competing Process Hazard Function Models for Player Ratings in Ice Hockey.” [https://arxiv.org/pdf/1208.0799.pdf]
Lopez, Michael. May 2013. International Journal of Sport Finance. “Biased Impartiality Among National Hockey League Referees.” [https://pdfs.semanticscholar.org/0533/6d1969264bf1212ccae226bf11b21b151bed.pdf]

Wins Above Replacement

Emmanuel Perry

References