Here’s an outline of how to download the Retrosheet play-by-play files for a particular season and how to compute the runs values for all plays in that season.

There is a blog posting on this at https://baseballwithr.wordpress.com/2014/02/10/downloading-retrosheet-data-and-runs-expectancy/

  1. Download the Chadwick files following the advice on the website mentioned in the blog post. The Chadwick tools are what parse the original Retrosheet source files.

  2. I set up the current working directory so that it contains a folder download.folder with a subfolder unzipped; this is the path used in the setwd() call below.
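The folder layout can also be created from R. This is a minimal sketch, assuming the layout is download.folder/unzipped (the path used in the setwd() call later) plus a sibling zipped folder for the raw downloads. The zipped name and the helper make_retrosheet_dirs() are assumptions for illustration and are not part of the gist functions.

```r
# Hypothetical helper: create download.folder/ with its subfolders.
# "unzipped" matches the path used later in setwd(); "zipped" is an assumption.
make_retrosheet_dirs <- function(base = ".") {
  subdirs <- file.path(base, "download.folder", c("zipped", "unzipped"))
  for (d in subdirs) {
    # recursive = TRUE also creates download.folder itself if needed
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Report which folders now exist
  dir.exists(subdirs)
}

# Usage: run once from the working directory that will hold the data
# make_retrosheet_dirs()
```

Calling the function a second time is harmless, since existing folders are left untouched.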

  3. I source two R function files from my gist site.
library(devtools)
source_gist("https://gist.github.com/bayesball/8892981",
            filename="parse.retrosheet2.pbp.R")
## Sourcing https://gist.githubusercontent.com/bayesball/8892981/raw/f98f811f325516e2a825ec27d43deb61abb4c90b/parse.retrosheet2.pbp.R
## SHA-1 hash of file is a66efea0791ba88ea0afd3c8988636019df6d7b0
source_gist("https://gist.github.com/bayesball/8892999",
            filename="compute.runs.expectancy.R")
## Sourcing https://gist.githubusercontent.com/bayesball/8892999/raw/cdf23abdf5336eec1734faf41b8d2e84592f43ed/compute.runs.expectancy.R
## SHA-1 hash of file is ae6911df1212d1bbb65d1f7490f80ca09fb3f0eb
  4. Now I am ready to download the Retrosheet play-by-play data for the 2018 season.
parse.retrosheet2.pbp(2018)

I navigate to the download folder and check that the new files for 2018 (all2018.csv and roster2018.csv) are there.

setwd("download.folder/unzipped")
dir()
##  [1] "all1995.csv"    "all2000.csv"    "all2016.csv"    "all2017.csv"   
##  [5] "all2018.csv"    "roster1995.csv" "roster2000.csv" "roster2016.csv"
##  [9] "roster2017.csv" "roster2018.csv"
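The same check can be done programmatically instead of reading the listing by eye. This is a minimal sketch: check_season_files() is a hypothetical helper, and it assumes only the all<season>.csv and roster<season>.csv naming pattern visible in the dir() output.

```r
# Hypothetical helper: check that the parsed files for a season exist.
# File names follow the pattern seen in the dir() listing.
check_season_files <- function(season, dir = ".") {
  files <- paste0(c("all", "roster"), season, ".csv")
  found <- file.exists(file.path(dir, files))
  names(found) <- files
  found
}

# Usage: after setwd("download.folder/unzipped")
# check_season_files(2018)
```

The result is a named logical vector, so a missing file shows up as FALSE next to its name.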
  5. Last, to compute the runs expectancies, I run this second function, saving the result in d2018.
d2018 <- compute.runs.expectancy(2018)
  6. To check that you are getting sensible values …

Display the starting state, the new state and the runs value for the first few plays:

library(tidyverse)
d2018 %>% 
  select(STATE, NEW.STATE, RUNS.VALUE) %>% 
  head()
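
When judging whether the values look sensible, it helps to recall that the runs value of a play is conventionally the change in runs expectancy plus the runs scored on the play. The sketch below applies that definition to a toy data frame. All numbers are invented for illustration, the state labels are assumed, and the column names RUNS.NEW.STATE and RUNS.SCORED are assumptions about the output of compute.runs.expectancy() (only STATE, NEW.STATE, RUNS.STATE and RUNS.VALUE appear in this document).

```r
# Toy plays: all values are invented for illustration, not 2018 results.
plays <- data.frame(
  STATE          = c("000 0", "000 0", "000 0"),
  NEW.STATE      = c("100 0", "000 1", "000 0"),
  RUNS.STATE     = c(0.50, 0.50, 0.50),   # expectancy before the play
  RUNS.NEW.STATE = c(0.86, 0.27, 0.50),   # expectancy after the play (assumed column)
  RUNS.SCORED    = c(0, 0, 1)             # runs scored on the play (assumed column)
)

# Runs value = change in runs expectancy + runs scored on the play
plays$RUNS.VALUE <- with(plays, RUNS.NEW.STATE - RUNS.STATE + RUNS.SCORED)

plays[, c("STATE", "NEW.STATE", "RUNS.VALUE")]
```

In this toy example the play that puts a runner on first has a positive value, the out has a negative value, and the bases-empty home run is worth exactly one run.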

Display the runs expectancies for all states:

d2018 %>% group_by(STATE) %>% 
  summarize(R = first(RUNS.STATE))
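
The expectancies are often easier to read as a table of base situations by outs. This is a minimal sketch, assuming STATE is a string such as "100 1" (three base indicators, a space, then the number of outs), which gives 8 base situations and 3 out counts. That format is an assumption, and the values below are placeholders rather than real results.

```r
# Placeholder expectancies for the 24 states (not real 2018 values).
bases <- c("000", "001", "010", "011", "100", "101", "110", "111")
states <- expand.grid(BASES = bases, OUTS = 0:2, stringsAsFactors = FALSE)
runs <- data.frame(
  STATE = paste(states$BASES, states$OUTS),
  R     = seq_len(nrow(states)) / 10
)

# Split STATE back into its base and outs parts (assumed "BBB O" format)
runs$BASES <- substr(runs$STATE, 1, 3)
runs$OUTS  <- substr(runs$STATE, 5, 5)

# Rows are base situations, columns are outs
runs_matrix <- tapply(runs$R, list(runs$BASES, runs$OUTS), function(x) x[1])
round(runs_matrix, 2)
```

With real output, the runs data frame would be replaced by the per-state summary computed from d2018.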

Show the runs values for all possible transitions between states:

d2018 %>% group_by(STATE, NEW.STATE) %>% 
  summarize(R = first(RUNS.VALUE))