About

This page describes the nflfastR Expected Points (EP), Win Probability (WP), and Completion Percentage (CP) models and then shows that they are well calibrated using the procedure introduced by Yurko, Ventura, and Horowitz. Because the 2020 season will mark 22 seasons of nflfastR data, the main motivation for building new EP and WP models was to incorporate era adjustments that facilitate better cross-era comparisons. However, we also discovered that switching to tree-based methods improves model calibration, especially for end-of-half situations with complicated nonlinear interactions between variables. Because we are introducing new models, we compare our calibration results to nflscrapR to show that these new models are somewhat better calibrated. If they weren’t, there would be no point in updating the models!

nflfastR switching from the nflscrapR EP and WP models to its own models should not be read as a criticism of nflscrapR: the improvements are relatively minor, and nflscrapR provided the code base for much of this analysis, breaking new ground in the process.

Model features

The EP, WP, and CP models are trained with xgboost, which builds an ensemble of gradient-boosted decision trees from the training data.

EP model features

  • Seconds remaining in half
  • Yard line
  • Whether possession team is at home
  • Roof type: retractable, dome, or outdoors
  • Down
  • Yards to go
  • Era: 1999-2001 (pre-expansion), 2002-2005 (pre-CPOE), 2006-2013 (pre-LOB rules change), 2014-2017, 2018 and beyond
  • Timeouts remaining for each team
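
These features enter the model as indicator columns (era0-era4, down1-down4, plus roof and home dummies), created by the package's make_model_mutations() helper. Below is a minimal sketch of that kind of encoding, assuming a play-by-play data frame called `pbp` with nflfastR column names; the era-to-column mapping shown is an assumption for illustration, not the helper's exact code.

library(dplyr)
pbp <- pbp %>%
  mutate(
    #era indicators (assumed mapping: era0 = earliest era, era4 = most recent)
    era0 = if_else(season <= 2001, 1, 0),
    era1 = if_else(season >= 2002 & season <= 2005, 1, 0),
    era2 = if_else(season >= 2006 & season <= 2013, 1, 0),
    era3 = if_else(season >= 2014 & season <= 2017, 1, 0),
    era4 = if_else(season >= 2018, 1, 0),
    #down indicators
    down1 = if_else(down == 1, 1, 0),
    down2 = if_else(down == 2, 1, 0),
    down3 = if_else(down == 3, 1, 0),
    down4 = if_else(down == 4, 1, 0)
  )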

WP model features

  • Seconds remaining in half
  • Seconds remaining in game
  • Yard line
  • Expected Points
  • Score differential
  • Ratio of expected score differential (expected points + point differential) to time remaining (feature borrowed from nflscrapR)
  • Down
  • Yards to go
  • Timeouts remaining for each team
  • Whether team will receive 2nd half kickoff
  • Whether possession team is at home
  • [Model with Vegas line only: point spread * log(3600 / (50 + seconds elapsed in game))]
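
To make the two derived features above concrete, here is a rough sketch of how they could be computed from a play-by-play data frame (again called `pbp` for illustration) with nflfastR column names. The package's prepare_wp_data() helper is the authoritative implementation; the `+ 1` in the denominator is an assumption to avoid dividing by zero at the end of regulation, and the sign convention of the spread relative to the possession team is omitted here.

library(dplyr)
pbp <- pbp %>%
  mutate(
    #expected score differential: EP for the possession team plus the current score differential
    ExpScoreDiff = ep + score_differential,
    #ratio of expected score differential to time remaining
    ExpScoreDiff_Time_Ratio = ExpScoreDiff / (game_seconds_remaining + 1),
    #Vegas-line feature: point spread decayed as the game progresses
    spread_time = spread_line * log(3600 / (50 + (3600 - game_seconds_remaining)))
  )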

CP model features

  • Yard line
  • Whether possession team is at home
  • Roof type: retractable, dome, or outdoors
  • Down
  • Yards to go
  • Distance to sticks (air yards - yards to go)
  • Era: 2006-2013, 2014-2017, 2018 and beyond (note that air yards data only go back to 2006, so there is no CP for earlier years)
  • Air yards
  • Whether air yards is 0 (probably unnecessary with tree-based method and a relic from earlier models where it was included because completion percentage is much lower for 0 air yard passes)
  • Pass location (binary: middle or not middle)
  • Whether quarterback was hit on the play

EP Model Calibration Results

The goal of this section is to show that the nflfastR EP model is well calibrated. To measure calibration, we follow Yurko et al. and perform leave-one-season-out (LOSO) calibration: for each of the 20 available seasons (2000-2019), we exclude that season, train the EP model on the other 19 seasons, and then compare the model’s predictions in the holdout season to what actually happened. If the model is well calibrated, we would expect that, for example, among plays with a pre-play touchdown probability of 50 percent, the next score is a touchdown for the possession team about 50 percent of the time.

Let’s start with some setup. The file used here isn’t pushed because it’s large, but its creation can be seen here and the file can be accessed here.

set.seed(2013) #GoHawks
library(tidyverse)
library(xgboost)

#some helper functions are defined in these files
source('https://raw.githubusercontent.com/mrcaseb/nflfastR/master/R/helper_add_nflscrapr_mutations.R')
source('https://raw.githubusercontent.com/mrcaseb/nflfastR/master/R/helper_add_ep_wp.R')
source('https://raw.githubusercontent.com/mrcaseb/nflfastR/master/R/helper_add_cp_cpoe.R')

pbp_data <- readRDS(url('https://github.com/guga31bb/nflfastR-data/blob/master/models/cal_data.rds?raw=true'))
model_data <- pbp_data %>%
  #in 'R/helper_add_nflscrapr_mutations.R'
  make_model_mutations() %>%
  mutate(
    label = case_when(
      Next_Score_Half == "Touchdown" ~ 0,
      Next_Score_Half == "Opp_Touchdown" ~ 1,
      Next_Score_Half == "Field_Goal" ~ 2,
      Next_Score_Half == "Opp_Field_Goal" ~ 3,
      Next_Score_Half == "Safety" ~ 4,
      Next_Score_Half == "Opp_Safety" ~ 5,
      Next_Score_Half == "No_Score" ~ 6
    ),
    label = as.factor(label),
    #use nflscrapR weights
    Drive_Score_Dist = Drive_Score_Half - drive,
    Drive_Score_Dist_W = (max(Drive_Score_Dist) - Drive_Score_Dist) /
      (max(Drive_Score_Dist) - min(Drive_Score_Dist)),
    ScoreDiff_W = (max(abs(score_differential), na.rm=T) - abs(score_differential)) /
      (max(abs(score_differential), na.rm=T) - min(abs(score_differential), na.rm=T)),
    Total_W = Drive_Score_Dist_W + ScoreDiff_W,
    Total_W_Scaled = (Total_W - min(Total_W, na.rm=T)) /
      (max(Total_W, na.rm=T) - min(Total_W, na.rm=T))
  ) %>%
  filter(
    !is.na(defteam_timeouts_remaining), !is.na(posteam_timeouts_remaining),
    !is.na(yardline_100)
  ) %>%
  select(
    label,
    season,
    half_seconds_remaining,
    yardline_100,
    home,
    retractable,
    dome,
    outdoors,
    ydstogo,
    era0, era1, era2, era3, era4,
    down1, down2, down3, down4,
    posteam_timeouts_remaining,
    defteam_timeouts_remaining,
    Total_W_Scaled
  )

#xgboost expects numeric class labels starting at 0, so convert the factor back
model_data <- model_data %>%
  mutate(label = as.numeric(label),
         label = label - 1)

rm(pbp_data)

seasons <- unique(model_data$season)

Set up what we need to fit the model. The parameters were obtained from cross-validation, where each season was forced to be entirely contained in a single CV fold to prevent labels leaking from one fold to another (for example, if a given drive were split up between folds).

nrounds = 500
params <-
  list(
    booster = "gbtree",
    objective = "multi:softprob",
    eval_metric = c("mlogloss"),
    num_class = 7,
    eta = 0.025,
    gamma = 1,
    subsample = 0.8,
    colsample_bytree = 0.8,
    max_depth = 5,
    min_child_weight = 1
  )
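
For reference, season-grouped folds of the kind described above can be constructed by hand and passed to xgb.cv(). The snippet below is only a sketch of that idea (not the original tuning code), reusing model_data, params, and nrounds from above.

#every row of a season lands in the same fold, so labels from a single
#drive can never leak across folds
cv_matrix <- xgboost::xgb.DMatrix(
  model.matrix(~ . + 0, data = model_data %>% select(-label, -Total_W_Scaled, -season)),
  label = model_data$label, weight = model_data$Total_W_Scaled
)
season_folds <- split(seq_len(nrow(model_data)), model_data$season)
#xgb.cv() accepts a list of held-out row indices via `folds`
cv <- xgboost::xgb.cv(params = params, data = cv_matrix, nrounds = nrounds,
                      folds = season_folds, verbose = 0)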

Now do the LOSO model fitting.

cv_results <- map_dfr(seasons, function(x) {

  test_data <- model_data %>%
    filter(season == x) %>%
    select(-season)
  train_data <- model_data %>%
    filter(season != x) %>%
    select(-season)

  full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -Total_W_Scaled)),
                                    label = train_data$label, weight = train_data$Total_W_Scaled)
  ep_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2)

  preds <- as.data.frame(
    matrix(predict(ep_model, as.matrix(test_data %>% select(-label, -Total_W_Scaled))), ncol=7, byrow=TRUE)
  )
  colnames(preds) <- c("Touchdown","Opp_Touchdown","Field_Goal","Opp_Field_Goal",
                       "Safety","Opp_Safety","No_Score")

  cv_data <- bind_cols(test_data, preds) %>% mutate(season = x)
  return(cv_data)

})

#get the BINS for the calibration plot
plot <- cv_results %>%
  select(Touchdown, Opp_Touchdown, Field_Goal, Opp_Field_Goal, Safety, Opp_Safety, No_Score, label) %>%
  pivot_longer(-label, names_to = 'type', values_to = 'pred_prob') %>%
  mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>%
  mutate(outcome = case_when(
    label == 0 ~ "Touchdown",
    label == 1 ~ "Opp_Touchdown",
    label == 2 ~ "Field_Goal",
    label == 3 ~ "Opp_Field_Goal",
    label == 4 ~ "Safety",
    label == 5 ~ "Opp_Safety",
    label == 6 ~ "No_Score"
  )) %>%
  group_by(type, bin_pred_prob) %>%
  mutate(correct = if_else(outcome == type, 1, 0)) %>%
  summarize(n_plays = n(),
            n_outcome = sum(correct),
            bin_actual_prob = n_outcome / n_plays)
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)

Here is the EP calibration plot. Points close to the diagonal dotted line are consistent with a well-calibrated model:

ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25),
                       lab = c("More times\nthan expected", "Fewer times\nthan expected"),
                       type = factor("No Score (0)"))
plot %>%
  #about .75M plays in total
  #filter(n_plays >= 50) %>%
  ungroup() %>%
  mutate(
    type = fct_relevel(
      type,
      "Opp_Safety", "Opp_Field_Goal", "Opp_Touchdown", "No_Score",
      "Safety", "Field_Goal", "Touchdown"
    ),
    type = fct_recode(
      type,
      "-Field Goal (-3)" = "Opp_Field_Goal",
      "-Safety (-2)" = "Opp_Safety",
      "-Touchdown (-7)" = "Opp_Touchdown",
      "Field Goal (3)" = "Field_Goal",
      "No Score (0)" = "No_Score",
      "Touchdown (7)" = "Touchdown",
      "Safety (2)" = "Safety"
    )
  ) %>%
  ggplot() +
  geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) +
  geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") +
  geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of plays",
       x = "Estimated next score probability",
       y = "Observed next score probability") +
  geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        strip.background = element_blank(),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 10, angle = 90),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.position = c(1, .05), legend.justification = c(1, 0)) +
  facet_wrap(~ type, ncol = 4)

There is some weirdness with the opponent safety predictions, but these dots represent an extremely small number of plays (10-50 plays out of about 750,000).

Now let’s get the calibration error using the measure developed in Yurko et al. and compare it to nflscrapR. First we need the nflscrapR predictions, which we saved from the previous version of nflfastR that applied the nflscrapR models.

#calibration error
cv_cal_error <- plot %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(type) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_scoring_event = sum(n_outcome, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)

pbp_data <- readRDS(url('https://github.com/guga31bb/nflfastR-data/blob/master/models/cal_data_nflscrapr.rds?raw=true'))
#nflscrapr calibration error
nflscrapr <- pbp_data %>%
  select(td_prob, opp_td_prob, fg_prob, opp_fg_prob, safety_prob, opp_safety_prob, no_score_prob, Next_Score_Half) %>%
  pivot_longer(-Next_Score_Half, names_to = 'type', values_to = 'pred_prob') %>%
  mutate(bin_pred_prob = round(pred_prob / 0.05) * .05) %>%
  mutate(outcome = Next_Score_Half,
         type = case_when(
           type == "td_prob" ~ 'Touchdown',
           type == 'fg_prob' ~ "Field_Goal",
           type == "opp_td_prob" ~ "Opp_Touchdown",
           type == 'opp_fg_prob' ~ "Opp_Field_Goal",
           type == 'safety_prob' ~ "Safety",
           type == 'opp_safety_prob' ~ "Opp_Safety",
           type == "no_score_prob" ~ "No_Score"
         )) %>%
  group_by(type, bin_pred_prob) %>%
  mutate(correct = if_else(outcome == type, 1, 0)) %>%
  summarize(n_plays = n(),
            n_outcome = sum(correct),
            bin_actual_prob = n_outcome / n_plays) %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(type) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_scoring_event = sum(n_outcome, na.rm = TRUE))
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)
rm(pbp_data)

message(glue::glue(
'
--CALIBRATION ERROR--

nflfastR:
{round(with(cv_cal_error, weighted.mean(weight_cal_error, n_scoring_event)), 4)}

nflscrapR:
{round(with(nflscrapr, weighted.mean(weight_cal_error, n_scoring_event)), 4)}
'
))
#> --CALIBRATION ERROR--
#> 
#> nflfastR:
#> 0.0059
#> 
#> nflscrapR:
#> 0.017

We see that the new EP model is better calibrated. Note that nflscrapR itself reports a calibration error of 0.01309723; the nflscrapR number is higher here because we include additional seasons outside the time period nflscrapR was trained on and because nflscrapR lacks an era adjustment.

WP Model Calibration Results

As with EP, do some initial setup to get the data ready for fitting.

model_data <- readRDS(url('https://github.com/guga31bb/nflfastR-data/blob/master/models/cal_data.rds?raw=true'))
model_data <- model_data %>%
  make_model_mutations() %>%
  prepare_wp_data() %>%
  mutate(label = ifelse(posteam == Winner, 1, 0)) %>%
  filter(!is.na(ep) & !is.na(score_differential) & !is.na(play_type) & !is.na(label)) %>%
  select(
    label,
    receive_2h_ko,
    spread_time,
    half_seconds_remaining,
    game_seconds_remaining,
    ExpScoreDiff_Time_Ratio,
    score_differential,
    ep,
    down,
    ydstogo,
    home,
    posteam_timeouts_remaining,
    defteam_timeouts_remaining,
    season,
    #only needed for the plots here, not used in model
    qtr
  ) %>%
  filter(qtr <= 4)

nrounds = 65
params <-
  list(
    booster = "gbtree",
    objective = "binary:logistic",
    eval_metric = c("logloss"),
    eta = 0.2,
    gamma = 0,
    subsample=0.8,
    colsample_bytree=0.8,
    max_depth = 4,
    min_child_weight = 1
  )

Do the LOSO fitting:

cv_results <- map_dfr(seasons, function(x) {

  test_data <- model_data %>%
    filter(season == x) %>%
    select(-season)
  train_data <- model_data %>%
    filter(season != x) %>%
    select(-season)

  full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr, -spread_time)),
                                    label = train_data$label)
  wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2)

  preds <- as.data.frame(
    matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr, -spread_time))))
  ) %>%
    dplyr::rename(wp = V1)

  cv_data <- bind_cols(test_data, preds) %>% mutate(season = x)
  return(cv_data)

})

#TIME FOR BINNING
wp_cv_loso_calibration_results <- cv_results %>%
  # Create BINS for wp:
  mutate(bin_pred_prob = round(wp / 0.05) * .05) %>%
  # Group by both the qtr and bin_pred_prob:
  group_by(qtr, bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_plays = n(),
            n_wins = length(which(label == 1)),
            bin_actual_prob = n_wins / n_plays)
#> `summarise()` regrouping output by 'qtr' (override with `.groups` argument)

The WP plot. Looks good!

# Create a label data frame for the chart:
ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25),
                       lab = c("More times\nthan expected", "Fewer times\nthan expected"),
                       qtr = factor("1st Quarter"))

# Create the calibration chart:
wp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2",
                          "3rd Quarter" = "3", "4th Quarter" = "4")) %>%
  ggplot() +
  geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) +
  geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") +
  geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of plays",
       x = "Estimated win probability",
       y = "Observed win probability") +
  geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        strip.background = element_blank(),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 10, angle = 90),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.position = "bottom") +
  facet_wrap(~ qtr, ncol = 4)

And get the WP calibration error:

# Calculate the calibration error values:
wp_cv_cal_error <- wp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(qtr) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_wins = sum(n_wins, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)

#get nflscrapR to compare
pbp_data <- readRDS(url('https://github.com/guga31bb/nflfastR-data/blob/master/models/cal_data_nflscrapr.rds?raw=true')) %>%
  mutate(label = ifelse(posteam == Winner, 1, 0)) %>%
  filter(qtr <= 4, !is.na(label), !is.na(posteam), !is.na(wp))

nflscrapR <- pbp_data %>%
  # Create binned probability column:
  mutate(bin_pred_prob = round(wp / 0.05) * .05) %>%
  # Group by both the qtr and bin_pred_prob:
  group_by(qtr, bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_plays = n(),
            n_wins = length(which(label == 1)),
            bin_actual_prob = n_wins / n_plays) %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(qtr) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_wins = sum(n_wins, na.rm = TRUE))
#> `summarise()` regrouping output by 'qtr' (override with `.groups` argument)
#> `summarise()` ungrouping output (override with `.groups` argument)

message(glue::glue(
  '--CALIBRATION ERROR--

nflfastR:
{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)}

nflscrapR:
{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}'
))
#> --CALIBRATION ERROR--
#> 
#> nflfastR:
#> 0.006
#> 
#> nflscrapR:
#> 0.0397

Again, the new WP model represents an improvement.

WP Model Calibration Results: with point spread

nflfastR has a secondary win probability model that also incorporates the pregame spread to more accurately reflect a team’s chances of winning. Below are calibration results for this model.

nrounds = 170
params <-
  list(
    booster = "gbtree",
    objective = "binary:logistic",
    eval_metric = c("logloss"),
    eta = 0.075,
    gamma = 3,
    subsample=0.8,
    colsample_bytree=0.8,
    max_depth = 5,
    min_child_weight = .9
  )

Do the LOSO fitting:

cv_results <- map_dfr(seasons, function(x) {

  test_data <- model_data %>%
    filter(season == x) %>%
    select(-season)
  train_data <- model_data %>%
    filter(season != x) %>%
    select(-season)

  full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-label, -qtr)),
                                    label = train_data$label)
  wp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2)

  preds <- as.data.frame(
    matrix(predict(wp_model, as.matrix(test_data %>% select(-label, -qtr))))
  ) %>%
    dplyr::rename(wp = V1)

  cv_data <- bind_cols(test_data, preds) %>% mutate(season = x)
  return(cv_data)

})

#TIME FOR BINNING
wp_cv_loso_calibration_results <- cv_results %>%
  # Create BINS for wp:
  mutate(bin_pred_prob = round(wp / 0.05) * .05) %>%
  # Group by both the qtr and bin_pred_prob:
  group_by(qtr, bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_plays = n(),
            n_wins = length(which(label == 1)),
            bin_actual_prob = n_wins / n_plays)
#> `summarise()` regrouping output by 'qtr' (override with `.groups` argument)

The WP plot.

# Create a label data frame for the chart:
ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25),
                       lab = c("More times\nthan expected", "Fewer times\nthan expected"),
                       qtr = factor("1st Quarter"))

# Create the calibration chart:
wp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(qtr = fct_recode(factor(qtr), "1st Quarter" = "1", "2nd Quarter" = "2",
                          "3rd Quarter" = "3", "4th Quarter" = "4")) %>%
  ggplot() +
  geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) +
  geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") +
  geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of plays",
       x = "Estimated win probability",
       y = "Observed win probability") +
  geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 2) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        strip.background = element_blank(),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 10, angle = 90),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.position = "bottom") +
  facet_wrap(~ qtr, ncol = 4)

And get the WP calibration error:

# Calculate the calibration error values:
wp_cv_cal_error <- wp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(qtr) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_wins = sum(n_wins, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)
message(glue::glue(
  '--CALIBRATION ERROR--

nflfastR with Vegas line:
{round(with(wp_cv_cal_error, weighted.mean(weight_cal_error, n_wins)), 4)}

nflscrapR:
{round(with(nflscrapR, weighted.mean(weight_cal_error, n_wins)), 4)}'
))
#> --CALIBRATION ERROR--
#> 
#> nflfastR with Vegas line:
#> 0.0062
#> 
#> nflscrapR:
#> 0.0397

Again, the new WP model is better calibrated than nflscrapR. In our testing, incorporating the spread substantially improved model performance as measured by cross-validation classification accuracy (error rate reduced from 27% to 23%) and log loss (reduced from .52 to .45). We include a time-decaying function of the spread rather than the raw spread because including the spread on its own increases the LOSO calibration error, especially in the fourth quarter. We also tried removing the home indicator from the spread model, but doing so worsened the calibration results.
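
For reference, accuracy metrics like those quoted above could be computed from the LOSO predictions along these lines (a sketch using the cv_results object from the spread model above; not the exact evaluation code used).

cv_results %>%
  summarize(
    #share of plays where a 50% threshold picks the wrong side
    error_rate = mean(as.numeric(wp > 0.5) != label),
    #binary log loss of the predicted win probabilities
    log_loss = mean(-(label * log(wp) + (1 - label) * log(1 - wp)))
  )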

CP Model Calibration Results

By now, the process should be familiar.

pbp <- readRDS(url('https://github.com/guga31bb/nflfastR-data/blob/master/models/cal_data.rds?raw=true'))

model_data <- pbp %>%
  filter(season >= 2006) %>%
  make_model_mutations() %>%
  dplyr::mutate(receiver_player_name =
                  stringr::str_extract(desc, "(?<=((to)|(for))\\s[:digit:]{0,2}\\-{0,1})[A-Z][A-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"),
                pass_middle = dplyr::if_else(pass_location == 'middle', 1, 0),
                air_is_zero = dplyr::if_else(air_yards == 0, 1, 0),
                distance_to_sticks = air_yards - ydstogo
  ) %>%
  dplyr::filter(complete_pass == 1 | incomplete_pass == 1 | interception == 1) %>%
  dplyr::filter(!is.na(air_yards) & air_yards >= -15 & air_yards < 70 & !is.na(receiver_player_name) & !is.na(pass_location)) %>%
  dplyr::select(
    season, complete_pass, air_yards, yardline_100, ydstogo,
    down1, down2, down3, down4, air_is_zero, pass_middle,
    era2, era3, era4, qb_hit, home,
    outdoors, retractable, dome, distance_to_sticks
  )
rm(pbp)


nrounds = 560
params <-
  list(
    booster = "gbtree",
    objective = "binary:logistic",
    eval_metric = c("logloss"),
    eta = 0.025,
    gamma = 5,
    subsample=0.8,
    colsample_bytree=0.8,
    max_depth = 4,
    min_child_weight = 6,
    base_score = mean(model_data$complete_pass)
  )

cv_results <- map_dfr(2006:2019, function(x) {

  test_data <- model_data %>%
    filter(season == x) %>%
    select(-season)
  train_data <- model_data %>%
    filter(season != x) %>%
    select(-season)

  full_train = xgboost::xgb.DMatrix(model.matrix(~.+0, data = train_data %>% select(-complete_pass)),
                                    label = train_data$complete_pass)
  cp_model <- xgboost::xgboost(params = params, data = full_train, nrounds = nrounds, verbose = 2)

  preds <- as.data.frame(
    matrix(predict(cp_model, as.matrix(test_data %>% select(-complete_pass))))
  ) %>%
    dplyr::rename(cp = V1)

  cv_data <- bind_cols(test_data, preds) %>% mutate(season = x)
  return(cv_data)

})

#TIME FOR BINNING
cp_cv_loso_calibration_results <- cv_results %>%
  # Create BINS for cp:
  mutate(
    bin_pred_prob = round(cp / 0.05) * .05,
    distance = case_when(
      air_yards < 5 ~ "Short",
      air_yards >= 5 & air_yards < 15 ~ "Intermediate",
      air_yards >= 15 ~ "Deep"
    )
    ) %>%
  # Group by both distance and bin_pred_prob:
  group_by(distance, bin_pred_prob) %>%
  # Calculate the calibration results:
  summarize(n_plays = n(),
            n_complete = length(which(complete_pass == 1)),
            bin_actual_prob = n_complete / n_plays)
#> `summarise()` regrouping output by 'distance' (override with `.groups` argument)

ann_text <- data.frame(x = c(.25, 0.75), y = c(0.75, 0.25),
                       lab = c("More times\nthan expected", "Fewer times\nthan expected")
                       )

Plot the results:

cp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(distance = fct_relevel(distance,
                            "Short", "Intermediate", "Deep")
  ) %>%
  filter(n_plays > 10) %>%
  ggplot() +
  geom_point(aes(x = bin_pred_prob, y = bin_actual_prob, size = n_plays)) +
  geom_smooth(aes(x = bin_pred_prob, y = bin_actual_prob), method = "loess") +
  geom_abline(slope = 1, intercept = 0, color = "black", lty = 2) +
  coord_equal() +
  scale_x_continuous(limits = c(0,1)) +
  scale_y_continuous(limits = c(0,1)) +
  labs(size = "Number of plays",
       x = "Estimated completion percentage",
       y = "Observed completion percentage") +
  geom_text(data = ann_text, aes(x = x, y = y, label = lab), size = 3) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        strip.background = element_blank(),
        strip.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.text.x = element_text(size = 10, angle = 90),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        legend.position = "bottom") +
  facet_wrap(~ distance, ncol = 3)

And get the calibration error:

cp_cv_cal_error <- cp_cv_loso_calibration_results %>%
  ungroup() %>%
  mutate(cal_diff = abs(bin_pred_prob - bin_actual_prob)) %>%
  group_by(distance) %>%
  summarize(weight_cal_error = weighted.mean(cal_diff, n_plays, na.rm = TRUE),
            n_complete = sum(n_complete, na.rm = TRUE))
#> `summarise()` ungrouping output (override with `.groups` argument)

round(with(cp_cv_cal_error, weighted.mean(weight_cal_error, n_complete)), 4)
#> [1] 0.0052