All examples listed below assume that two libraries, nflfastR and tidyverse, are installed and loaded (library(nflfastR) and library(tidyverse)).

If you have trouble understanding the code in the examples, we highly recommend the nflfastR beginner’s guide in vignette("beginners_guide").

Example 1: replicate nflscrapR with fast_scraper

The functionality of nflscrapR can be duplicated by using fast_scraper(). This obtains the same information contained in nflscrapR (plus some extra) but much more quickly. To compare to nflscrapR, we use their data repository as the program no longer functions now that the NFL has taken down the old Gamecenter feed. Note that EP differs from nflscrapR as we use a newer era-adjusted model (more on this in vignette("nflfastR-models")).

This example also uses the built-in function clean_pbp() to create a ‘name’ column for the primary player involved (the QB on a pass play or the ball carrier on a run play).

read_csv(url('https://github.com/ryurko/nflscrapR-data/blob/master/play_by_play_data/regular_season/reg_pbp_2019.csv?raw=true')) %>%
  filter(home_team == 'SF' & away_team == 'SEA') %>%
  select(desc, play_type, ep, epa, home_wp) %>% head(5) %>% 
  knitr::kable(digits = 3)
| desc | play_type | ep | epa | home_wp |
| :--- | :--- | ---: | ---: | ---: |
| J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 0.815 | 0.000 | NA |
| (15:00) T.Coleman left guard to SF 26 for 1 yard (J.Clowney). | run | 0.815 | -0.606 | 0.500 |
| (14:19) T.Coleman right tackle to SF 25 for -1 yards (P.Ford). | run | 0.209 | -1.146 | 0.485 |
| (13:45) (Shotgun) J.Garoppolo pass short middle to K.Bourne to SF 41 for 16 yards (J.Taylor). Caught at SF39. 2-yac | pass | -0.937 | 3.223 | 0.453 |
| (12:58) PENALTY on SEA-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.286 | 0.774 | 0.551 |
fast_scraper('2019_10_SEA_SF') %>%
  clean_pbp() %>%
  select(desc, play_type, ep, epa, home_wp, name) %>% head(6) %>% 
  knitr::kable(digits = 3)
| desc | play_type | ep | epa | home_wp | name |
| :--- | :--- | ---: | ---: | ---: | :--- |
| GAME | NA | NA | NA | NA | NA |
| 5-J.Myers kicks 65 yards from SEA 35 to end zone, Touchback. | kickoff | 1.435 | 0.000 | 0.559 | NA |
| (15:00) 26-T.Coleman left guard to SF 26 for 1 yard (90-J.Clowney). | run | 1.435 | -0.524 | 0.559 | T.Coleman |
| (14:19) 26-T.Coleman right tackle to SF 25 for -1 yards (97-P.Ford). | run | 0.911 | -0.850 | 0.559 | T.Coleman |
| (13:45) (Shotgun) 10-J.Garoppolo pass short middle to 84-K.Bourne to SF 41 for 16 yards (24-J.Taylor). Caught at SF39. 2-yac | pass | 0.061 | 2.417 | 0.518 | J.Garoppolo |
| (12:58) PENALTY on SEA-91-J.Reed, Encroachment, 5 yards, enforced at SF 41 - No Play. | no_play | 2.478 | 0.565 | 0.584 | NA |

Example 2: scrape a batch of games very quickly with fast_scraper and parallel processing

This is a demonstration of nflfastR’s capabilities. While nflfastR can scrape a batch of games very quickly, please be respectful of GitHub’s servers and use the data repository, which hosts all the scraped and cleaned data, whenever possible. The only reason to actually use the scraper is if it’s the middle of the season and we haven’t yet updated the repository with recent games (but we will try to keep it updated).

#get list of some games from 2019
games_2019 <- fast_scraper_schedules(2019) %>% head(10) %>% pull(game_id)

tictoc::tic(glue::glue('{length(games_2019)} games with nflfastR:'))
f <- fast_scraper(games_2019, pp = TRUE)
tictoc::toc()
#> 10 games with nflfastR:: 13.417 sec elapsed

Example 3: completion percentage over expected (CPOE)

Let’s look at CPOE leaders from the 2009 regular season.

As discussed above, nflfastR has a data repository for old seasons, so there’s no need to actually scrape them. Let’s use that here (the below reads .rds files, but .csv and .parquet are also available).

tictoc::tic('loading all games from 2009')
games_2009 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2009.rds')) %>% filter(season_type == 'REG')
tictoc::toc()
#> loading all games from 2009: 2.936 sec elapsed
games_2009 %>% filter(!is.na(cpoe)) %>% group_by(passer_player_name) %>%
  summarize(cpoe = mean(cpoe), Atts=n()) %>%
  filter(Atts > 200) %>%
  arrange(-cpoe) %>%
  head(5) %>% 
  knitr::kable(digits = 1)
| passer_player_name | cpoe | Atts |
| :--- | ---: | ---: |
| D.Brees | 7.5 | 509 |
| P.Rivers | 6.6 | 474 |
| P.Manning | 6.4 | 569 |
| B.Favre | 6.1 | 527 |
| B.Roethlisberger | 5.4 | 503 |
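
To make the metric concrete, here is a minimal sketch on made-up numbers (not real play-by-play). It assumes that CPOE for a single pass is 100 times the difference between the completion indicator and a model's expected completion probability. The column names complete_pass and cp are borrowed from nflfastR, but the passers, the values, and the exact formula are assumptions for illustration only.

```r
# Toy pass attempts: made-up values, not real play-by-play data
passes <- data.frame(
  passer_player_name = c("A.Passer", "A.Passer", "A.Passer", "B.Thrower", "B.Thrower"),
  complete_pass      = c(1, 1, 0, 1, 0),                 # 1 = completed, 0 = incomplete
  cp                 = c(0.60, 0.70, 0.50, 0.80, 0.70),  # assumed expected completion probability
  stringsAsFactors   = FALSE
)

# Assumed definition: per-pass CPOE in percentage points
passes$cpoe <- 100 * (passes$complete_pass - passes$cp)

# Average CPOE and number of attempts per passer (base R stand-in for group_by + summarize)
mean_cpoe <- tapply(passes$cpoe, passes$passer_player_name, mean)
atts      <- tapply(passes$cpoe, passes$passer_player_name, length)

cpoe_by_passer <- data.frame(
  passer_player_name = names(mean_cpoe),
  cpoe               = as.vector(mean_cpoe),
  Atts               = as.vector(atts),
  stringsAsFactors   = FALSE
)

# Sort from best to worst, as arrange(-cpoe) does above
cpoe_by_passer <- cpoe_by_passer[order(-cpoe_by_passer$cpoe), ]
print(cpoe_by_passer)
```

A passer who completes harder throws than expected ends up with a positive average, which is why the attempt filter (Atts > 200) matters: with few attempts, a handful of passes can swing the mean a lot.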

Example 4: using drive information

When working with nflfastR, drive results are automatically included. Let’s look at how much more likely teams were to score starting from 1st & 10 at their own 20 yard line in 2015 (the last year before touchbacks on kickoffs changed to the 25) than in 2000.

games_2000 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2000.rds'))
games_2015 <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2015.rds'))

pbp <- bind_rows(games_2000, games_2015)

pbp %>% filter(season_type == 'REG' & down == 1 & ydstogo == 10 & yardline_100 == 80) %>%
  mutate(drive_score = if_else(drive_end_transition %in% c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL"), 1, 0)) %>%
  group_by(season) %>%
  summarize(drive_score = mean(drive_score)) %>% 
  knitr::kable(digits = 3)
| season | drive_score |
| ---: | ---: |
| 2000 | 0.234 |
| 2015 | 0.305 |

So about 23% of 1st & 10 plays from teams’ own 20 would see the drive end up in a score in 2000, compared to 30% in 2015. This has implications for Expected Points models (see vignette("nflfastR-models")).
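
The calculation itself is just a mean of a 0/1 indicator within each season. Here is a small sketch on made-up rows that can be checked by hand. The four scoring labels are the ones used in the code above; the non-scoring labels ("PUNT", "INTERCEPTION", "DOWNS") are hypothetical placeholders.

```r
# Toy data: one row per qualifying 1st & 10 play (made-up values)
plays <- data.frame(
  season = c(2000, 2000, 2000, 2000, 2015, 2015, 2015, 2015),
  drive_end_transition = c("PUNT", "TOUCHDOWN", "INTERCEPTION", "PUNT",
                           "FIELD_GOAL", "PUNT", "Touchdown", "DOWNS"),
  stringsAsFactors = FALSE
)

# Drive outcomes counted as a score (both spellings, as in the example above)
scoring <- c("Touchdown", "Field Goal", "TOUCHDOWN", "FIELD_GOAL")

# 1 if the drive ended in a score, 0 otherwise
plays$drive_score <- as.numeric(plays$drive_end_transition %in% scoring)

# Share of plays whose drive ended in a score, by season
score_rate <- tapply(plays$drive_score, plays$season, mean)
print(score_rate)
```

In this toy data, one of four drives scores in 2000 and two of four in 2015, so the rates are 0.25 and 0.5.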

Example 5: Plot offensive and defensive EPA per play for a given season

Let’s build the NFL team tiers using offensive and defensive expected points added per play for the 2005 regular season. The URLs of the ESPN logos are included in the teams_colors_logos data frame that ships with the package (see ?teams_colors_logos).

Let’s also use the “rush” and “pass” columns created by the included helper function clean_pbp() (the data in the repository has already been cleaned, so there is no need to call it again here). These columns (a) properly count sacks and scrambles as pass plays and (b) properly include plays with penalties. Using them, we can keep only rush or pass plays.

library(ggimage)
pbp <- readRDS(url('https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2005.rds')) %>%
  filter(season_type == 'REG') %>% filter(!is.na(posteam) & (rush == 1 | pass == 1))
offense <- pbp %>% group_by(posteam) %>% summarise(off_epa = mean(epa, na.rm = TRUE))
defense <- pbp %>% group_by(defteam) %>% summarise(def_epa = mean(epa, na.rm = TRUE))
logos <- teams_colors_logos %>% select(team_abbr, team_logo_espn)

offense %>%
  inner_join(defense, by = c("posteam" = "defteam")) %>%
  inner_join(logos, by = c("posteam" = "team_abbr")) %>%
  ggplot(aes(x = off_epa, y = def_epa)) +
  geom_abline(slope = -1.5, intercept = c(.4, .3, .2, .1, 0, -.1, -.2, -.3), alpha = .2) +
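  # league-average reference lines: horizontal on the y (defense) axis, vertical on the x (offense) axis
  geom_hline(aes(yintercept = mean(def_epa)), color = "red", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(off_epa)), color = "red", linetype = "dashed") +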
  geom_image(aes(image = team_logo_espn), size = 0.05, asp = 16 / 9) +
  labs(
    x = "Offense EPA/play",
    y = "Defense EPA/play",
    caption = "Data: @nflfastR",
    title = "2005 NFL Offensive and Defensive EPA per Play"
  ) +
  theme_bw() +
  theme(
    aspect.ratio = 9 / 16,
    plot.title = element_text(size = 12, hjust = 0.5, face = "bold")
  ) +
  scale_y_reverse()
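
The data preparation behind the plot can be checked on a toy example. The sketch below uses made-up teams ("AAA", "BBB") and EPA values, keeps only rush or pass plays, and then averages EPA once by the team with the ball and once by the defending team. It uses base R rather than dplyr so that it runs without any packages.

```r
# Toy plays: made-up teams and EPA values
plays <- data.frame(
  posteam = c("AAA", "AAA", "BBB", "BBB", "AAA", "BBB"),
  defteam = c("BBB", "BBB", "AAA", "AAA", "BBB", "AAA"),
  rush    = c(1, 0, 0, 1, 0, 0),
  pass    = c(0, 1, 1, 0, 0, 0),
  epa     = c(0.5, -0.2, 1.0, -0.4, 2.0, -3.0),
  stringsAsFactors = FALSE
)

# Keep only rush or pass plays (the last two rows are dropped)
keep <- plays[plays$rush == 1 | plays$pass == 1, ]

# EPA per play when a team has the ball ...
offense <- aggregate(epa ~ posteam, data = keep, FUN = mean)
names(offense) <- c("team", "off_epa")

# ... and EPA per play allowed when that team is on defense
defense <- aggregate(epa ~ defteam, data = keep, FUN = mean)
names(defense) <- c("team", "def_epa")

# One row per team with both numbers, ready for plotting
tiers <- merge(offense, defense, by = "team")
print(tiers)
```

Lower def_epa is better for a defense, which is why the plot reverses the y axis with scale_y_reverse(): the best teams end up in the top right.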

Example 6: Expected Points calculator

We have provided a calculator for working with the Expected Points model. Here is an example of how to use it, looking at how the Expected Points at the start of a drive following a touchback have changed over time.

While I have put in 'SEA' for home_team and posteam, this only matters for determining whether the team with the ball is the home team (there’s no team-specific effect; the result would be the same no matter which team is supplied).

data <- tibble::tibble(
  "season" = 1999:2019,
  'home_team' = 'SEA',
  'posteam' = 'SEA',
  'roof' = 'outdoors',
  'half_seconds_remaining' = 1800,
  'yardline_100' = c(rep(80, 17), rep(75, 4)),
  'down' = 1,
  'ydstogo' = 10,
  'posteam_timeouts_remaining' = 3,
  'defteam_timeouts_remaining' = 3
)

nflfastR::calculate_expected_points(data) %>%
  select(season, yardline_100, td_prob, ep) %>% 
  knitr::kable(digits = 2)
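| season | yardline_100 | td_prob | ep |
| ---: | ---: | ---: | ---: |
| 1999 | 80 | 0.33 | 0.61 |
| 2000 | 80 | 0.33 | 0.61 |
| 2001 | 80 | 0.33 | 0.61 |
| 2002 | 80 | 0.34 | 0.80 |
| 2003 | 80 | 0.34 | 0.80 |
| 2004 | 80 | 0.34 | 0.80 |
| 2005 | 80 | 0.34 | 0.80 |
| 2006 | 80 | 0.34 | 0.81 |
| 2007 | 80 | 0.34 | 0.81 |
| 2008 | 80 | 0.34 | 0.81 |
| 2009 | 80 | 0.34 | 0.81 |
| 2010 | 80 | 0.34 | 0.81 |
| 2011 | 80 | 0.34 | 0.81 |
| 2012 | 80 | 0.34 | 0.81 |
| 2013 | 80 | 0.34 | 0.81 |
| 2014 | 80 | 0.35 | 0.97 |
| 2015 | 80 | 0.35 | 0.97 |
| 2016 | 75 | 0.38 | 1.44 |
| 2017 | 75 | 0.38 | 1.44 |
| 2018 | 75 | 0.40 | 1.43 |
| 2019 | 75 | 0.40 | 1.43 |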

Not surprisingly, offenses have become much more successful over time, with the kickoff touchback moving from the 20 to the 25 in 2016 providing an additional boost. Note that the td_prob in this example is the probability that the next score within the same half will be a touchdown scored by the team with the ball, not the probability that the current drive will end in a touchdown (this is why the numbers are different from Example 4 above).
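
To see how a next-score probability such as td_prob relates to ep, here is a minimal sketch of the usual construction: Expected Points as a probability-weighted sum over the possible next scores. The probabilities below are made up, and the point values (7, 3, 2 and their negatives for the opponent) are an assumption for illustration rather than a statement of exactly what the nflfastR model does.

```r
# Made-up next-score probabilities for one game state (they sum to 1)
next_score_prob <- c(
  td         = 0.340,  # next score is a touchdown by the team with the ball
  fg         = 0.210,
  safety     = 0.004,
  no_score   = 0.100,  # no further score in the half
  opp_td     = 0.200,  # next score is a touchdown by the opponent
  opp_fg     = 0.140,
  opp_safety = 0.006
)

# Assumed point value of each outcome, from the offense's point of view
points <- c(td = 7, fg = 3, safety = 2, no_score = 0,
            opp_td = -7, opp_fg = -3, opp_safety = -2)

# Expected Points = sum over outcomes of probability * point value
ep <- sum(next_score_prob * points[names(next_score_prob)])
print(ep)
```

Under these assumptions, raising td_prob while lowering opp_td raises ep, which matches the direction of the season-by-season changes in the table above.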

We could compare the most recent four years to the expectation for playing in a dome by inputting all the same things and changing the roof input:

data <- tibble::tibble(
  "season" = 2016:2019,
  "week" = 5,
  'home_team' = 'SEA',
  'posteam' = 'SEA',
  'roof' = 'dome',
  'half_seconds_remaining' = 1800,
  'yardline_100' = c(rep(75, 4)),
  'down' = 1,
  'ydstogo' = 10,
  'posteam_timeouts_remaining' = 3,
  'defteam_timeouts_remaining' = 3
)

nflfastR::calculate_expected_points(data) %>%
  select(season, yardline_100, td_prob, ep) %>% 
  knitr::kable(digits = 2)
| season | yardline_100 | td_prob | ep |
| ---: | ---: | ---: | ---: |
| 2016 | 75 | 0.41 | 1.74 |
| 2017 | 75 | 0.41 | 1.74 |
| 2018 | 75 | 0.43 | 1.74 |
| 2019 | 75 | 0.43 | 1.74 |

So for 2018 and 2019, 1st & 10 from a home team’s own 25 yard line had EP of 1.43 outdoors and 1.74 in domes.

Example 7: Win probability calculator

We have also provided a calculator for working with the win probability models. Here is an example of how to use it, looking at how the win probability at the start of the game depends on the pre-game spread.

While I have put in 'SEA' for home_team and posteam, this only matters for determining whether the team with the ball is the home team (there’s no team-specific effect; the result would be the same no matter which team is supplied).

data <- tibble::tibble(
  'receive_2h_ko' = 0,
  'ep' = 1,
  'home_team' = 'SEA',
  'posteam' = 'SEA',
  'score_differential' = 0,
  'half_seconds_remaining' = 1800,
  'game_seconds_remaining' = 3600,
  'spread_line' = c(0, 5, 10, 15),
  'down' = 1,
  'ydstogo' = 10,
  'posteam_timeouts_remaining' = 3,
  'defteam_timeouts_remaining' = 3
)

nflfastR::calculate_win_probability(data) %>%
  select(spread_line, wp, vegas_wp) %>% 
  knitr::kable(digits = 2)
| spread_line | wp | vegas_wp |
| ---: | ---: | ---: |
| 0 | 0.56 | 0.57 |
| 5 | 0.56 | 0.67 |
| 10 | 0.56 | 0.80 |
| 15 | 0.56 | 0.91 |

Not surprisingly, vegas_wp increases with the number of points by which a team was favored coming into the game. Weirdly, the model thinks home teams are more likely to win even when the spread is 0. I’m not sure how much to believe the model on that one, but leaving the home indicator in the model did make it better at out-of-sample predictions, so who knows.

Example 8: Using the built-in database function

If you’re comfortable using dplyr functions to manipulate and tidy data, you’re ready to use a database. Why should you use a database?

  • The provided function in nflfastR makes it extremely easy to build a database and keep it updated
  • Play-by-play data over 20+ seasons takes up a lot of memory: working with a database allows you to only bring into memory what you actually need
  • R makes it extremely easy to work with databases.

Start: install and load packages

To start, we need to install the two packages required for this that aren’t installed automatically with nflfastR: DBI and RSQLite.

As always, you only need to install these once. They don’t need to be loaded to build the database because nflfastR knows how to use them, but we do need them later on when working with the database.
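
The install and load commands themselves are not shown above. A minimal sketch, assuming the packages come from CRAN in the usual way, installs only what is missing and then loads both:

```r
# Packages named in the text above
needed <- c("DBI", "RSQLite")

# Keep only those that are not installed yet
to_install <- needed[!vapply(needed, requireNamespace, logical(1), quietly = TRUE)]

# Install the missing ones (this only needs to happen once per machine)
if (length(to_install) > 0) install.packages(to_install)

# Load them for the database steps later on
library(DBI)
library(RSQLite)
```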

Build database

There’s exactly one function in nflfastR that works with databases: update_db(). Some notes:

  • If you use update_db() with no arguments, it will build a SQLite database called pbp_db in your current working directory, with play-by-play data in a table called nflfastR_pbp.
  • You can specify a different directory with dbdir.
  • You can specify a different filename with dbname.
  • You can specify a different table name with tblname.
  • If you want to rebuild the database from scratch for whatever reason, supply force_rebuild = TRUE. This is primarily intended for the case when we update the play-by-play data in the data repo due to fixing a bug and you want to force the database to be wiped and updated.

Let’s say I just want to dump a database into the current working directory. Here we go!

update_db()
#> Can't find database ./pbp_db. Will try to create it and load the play by play data into the data table "nflfastR_pbp".
#> Starting download of 21 seasons between 1999 and 2019...
#> Checking for missing completed games...
#> You have 5580 games and are missing 0.

This created a database in the current directory called pbp_db.

Wait, that’s it? That’s it! What if it’s partway through the season and you want to make sure all the new games are added to the database? What do you run? update_db()! (just make sure you’re in the directory the database is saved in or you supply the right file path)

update_db()
#> Checking for missing completed games...
#> You have 5580 games and are missing 0.

Connect to database

Now we can make a connection to the database. This is the only part that will look a little bit foreign, but all you need to know is where your database is located. If it’s in your current working directory, this will work:

connection <- dbConnect(SQLite(), "./pbp_db")
connection
#> <SQLiteConnection>
#>   Path: /Users/runner/work/pages_dummy/pages_dummy/vignettes/pbp_db
#>   Extensions: TRUE

It looks like nothing happened, but we now have a connection to the database. Now we’re ready to do stuff. If you aren’t familiar with databases, they’re organized around tables. Here’s how to see which tables are present in our database:

dbListTables(connection)
#> [1] "nflfastR_pbp"

Since we went with the defaults, there’s a table called nflfastR_pbp. Another useful function, dbListFields(), shows the fields (i.e., columns) in a table:

dbListFields(connection, "nflfastR_pbp") %>%
  head(10)
#>  [1] "play_id"       "game_id"       "home_team"     "away_team"    
#>  [5] "season_type"   "week"          "posteam"       "posteam_type" 
#>  [9] "defteam"       "side_of_field"

This is the same list as the list of columns in nflfastR play-by-play. Notice we had to supply the name of the table above ("nflfastR_pbp").

With all that out of the way, there are only a couple more things to learn. The main driver here is tbl(), which creates a reference to a specific table in a database:

pbp_db <- tbl(connection, "nflfastR_pbp")

And now, everything will magically just “work”: you can forget you’re even working with a database!

pbp_db %>%
  group_by(season) %>%
  summarize(n=n())
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.30.1
#> #   [/Users/runner/work/pages_dummy/pages_dummy/vignettes/pbp_db]
#>    season     n
#>     <int> <int>
#>  1   1999 46030
#>  2   2000 45474
#>  3   2001 45433
#>  4   2002 47819
#>  5   2003 47334
#>  6   2004 47201
#>  7   2005 47343
#>  8   2006 46867
#>  9   2007 46788
#> 10   2008 46444
#> # … with more rows

pbp_db %>%
  filter(rush == 1 | pass == 1, down <= 2, !is.na(epa), !is.na(posteam)) %>%
  group_by(pass) %>%
  summarize(mean_epa = mean(epa))
#> Warning: Missing values are always removed in SQL.
#> Use `mean(x, na.rm = TRUE)` to silence this warning
#> This warning is displayed only once per session.
#> # Source:   lazy query [?? x 2]
#> # Database: sqlite 3.30.1
#> #   [/Users/runner/work/pages_dummy/pages_dummy/vignettes/pbp_db]
#>    pass mean_epa
#>   <dbl>    <dbl>
#> 1     0  -0.0955
#> 2     1   0.0734

So far, everything has stayed in the database. If you want to bring a query into memory, just use collect() at the end:

russ <- pbp_db %>%
  filter(name == "R.Wilson" & posteam == "SEA") %>%
  select(desc, epa) %>%
  collect()

russ
#> # A tibble: 5,673 x 2
#>    desc                                                                      epa
#>    <chr>                                                                   <dbl>
#>  1 (14:12) 3-R.Wilson pass short right to 18-S.Rice to SEA 34 for 9 ya…  1.06   
#>  2 (12:53) 3-R.Wilson pass incomplete deep left to 18-S.Rice. PENALTY …  2.68   
#>  3 (11:25) (Shotgun) 3-R.Wilson pass incomplete short right to 18-S.Ri… -1.32   
#>  4 (10:24) (Shotgun) 3-R.Wilson pass short left to 18-S.Rice to ARI 31…  0.919  
#>  5 (9:47) 3-R.Wilson scrambles right end ran ob at ARI 27 for 4 yards … -0.00504
#>  6 (8:35) 3-R.Wilson pass incomplete short right to 18-S.Rice.          -0.425  
#>  7 (7:54) (Shotgun) 3-R.Wilson left end pushed ob at ARI 9 for 4 yards… -1.19   
#>  8 (:27) 3-R.Wilson sacked at SEA 17 for -5 yards (51-P.Lenon). Penalt… -0.971  
#>  9 (14:28) (Shotgun) 3-R.Wilson pass short right to 17-B.Edwards to SE…  1.91   
#> 10 (13:59) 3-R.Wilson pass incomplete deep left to 87-B.Obomanu.        -0.501  
#> # … with 5,663 more rows

So we’ve searched through about 1 million rows of data across 300+ columns and only brought about 5,500 rows and two columns into memory. Pretty neat! This is how I supply the data to the shiny apps on rbsdm.com without running out of memory on the server. Now there’s only one more thing to remember. When you’re finished doing what you need with the database:

dbDisconnect(connection)
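
The whole connect, query, and disconnect cycle can be tried without building the full database. The sketch below is an illustration under stated assumptions: it uses a throwaway in-memory SQLite database, a made-up table called toy_pbp, and a plain SQL query through DBI instead of the dplyr verbs used above. The idea is the same: the filter runs inside the database and only the matching rows and columns come back into R.

```r
library(DBI)

# Throwaway in-memory SQLite database (nothing is written to disk)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up rows standing in for the nflfastR_pbp table
toy_pbp <- data.frame(
  season = c(2018L, 2018L, 2019L, 2019L, 2019L),
  name   = c("R.Wilson", "T.Brady", "R.Wilson", "R.Wilson", "T.Brady"),
  epa    = c(0.5, -0.2, 1.1, -0.4, 0.3),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy_pbp", toy_pbp)

# The filtering happens in the database; only two columns of matching rows reach R
russ_toy <- dbGetQuery(con, "SELECT season, epa FROM toy_pbp WHERE name = 'R.Wilson'")
print(russ_toy)

# Always close the connection when finished
dbDisconnect(con)
```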

For more details on using a database with nflfastR, see Thomas Mock’s life-changing post on the topic.

Example 9: Working with roster and position data

This section used to contain an example of working with roster data. Unfortunately, we have not found a way to obtain roster data that can be joined to the new play by play, so for now, it is empty. We would like to be able to get position data but haven’t yet.

The clean_pbp() function does a lot of work cleaning up player names and IDs for the purpose of joining them to roster data, but we do not have any roster data to join to. Note that player IDs are inconsistent between the old (1999-2010) and new (2011-present) data sources, so use IDs with caution. Unfortunately, there is nothing we can do about this, as the NFL changed its ID system in the underlying data.