Here’s an outline of how to download the Retrosheet play-by-play files for a particular season and compute the runs values for all plays.
There is a blog posting on this at https://baseballwithr.wordpress.com/2014/02/10/downloading-retrosheet-data-and-runs-expectancy/
Download the Chadwick files, following the advice on the website mentioned in the blog post. Chadwick is the software that parses the original Retrosheet source files.
I set up the current working directory to have the following file structure.
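The one path used later in this post is download.folder/unzipped. As a minimal sketch, assuming that path is what needs to exist (the parsing script may expect further subfolders that are not shown here), it could be created as follows. The temporary root and the names root and target are hypothetical, chosen so that nothing in a real working directory is touched:

```r
# Hypothetical setup: build the folder under a temporary root
# instead of the real working directory.
root <- tempfile("retrosheet_")
target <- file.path(root, "download.folder", "unzipped")

# recursive = TRUE creates the intermediate folders as well
dir.create(target, recursive = TRUE)

# Confirm that the folder now exists
dir.exists(target)
```

In practice one would replace root with the actual working directory.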
library(devtools)
source_gist("https://gist.github.com/bayesball/8892981",
            filename = "parse.retrosheet2.pbp.R")
## Sourcing https://gist.githubusercontent.com/bayesball/8892981/raw/f98f811f325516e2a825ec27d43deb61abb4c90b/parse.retrosheet2.pbp.R
## SHA-1 hash of file is a66efea0791ba88ea0afd3c8988636019df6d7b0
source_gist("https://gist.github.com/bayesball/8892999",
            filename = "compute.runs.expectancy.R")
## Sourcing https://gist.githubusercontent.com/bayesball/8892999/raw/cdf23abdf5336eec1734faf41b8d2e84592f43ed/compute.runs.expectancy.R
## SHA-1 hash of file is ae6911df1212d1bbb65d1f7490f80ca09fb3f0eb
parse.retrosheet2.pbp(2018)
I navigate to the download folder and check that the parsed files for 2018 are there.
setwd("download.folder/unzipped")
dir()
## [1] "all1995.csv" "all2000.csv" "all2016.csv" "all2017.csv"
## [5] "all2018.csv" "roster1995.csv" "roster2000.csv" "roster2016.csv"
## [9] "roster2017.csv" "roster2018.csv"
d2018 <- compute.runs.expectancy(2018)
Display the starting state, the new state and the runs value for the first few plays:
library(tidyverse)
d2018 %>%
  select(STATE, NEW.STATE, RUNS.VALUE) %>%
  head()
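To help read the RUNS.VALUE column: the usual definition of the runs value of a play is the runs expectancy of the new state, minus the runs expectancy of the starting state, plus any runs scored on the play. Here is a small sketch of that arithmetic on made-up numbers. The columns RUNS.NEW.STATE and RUNS.SCORED, the "runners outs" state labels, and every value in toy are assumptions for illustration, not output from compute.runs.expectancy:

```r
library(dplyr)

# Made-up plays: a single, a double play, and a solo home run.
# The expectancies are illustrative, not 2018 estimates.
toy <- tibble(
  STATE          = c("000 0", "100 0", "000 0"),
  NEW.STATE      = c("100 0", "000 2", "000 0"),
  RUNS.STATE     = c(0.50, 0.85, 0.50),
  RUNS.NEW.STATE = c(0.85, 0.10, 0.50),
  RUNS.SCORED    = c(0, 0, 1)
)

# Runs value = change in runs expectancy + runs scored on the play
toy_values <- toy %>%
  mutate(RUNS.VALUE = RUNS.NEW.STATE - RUNS.STATE + RUNS.SCORED)

toy_values %>%
  select(STATE, NEW.STATE, RUNS.VALUE)
```

A play that leaves the batting team in a worse state without scoring, such as the double play in the second row, gets a negative runs value.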
Display the runs expectancies for all states:
d2018 %>%
  group_by(STATE) %>%
  summarize(R = first(RUNS.STATE))
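Taking first() works here on the assumption that RUNS.STATE has a single value within each STATE, since it describes the state rather than the individual play. One way to confirm that is sketched below on made-up rows. The data in toy and the column name N.VALUES are hypothetical:

```r
library(dplyr)

# Made-up plays: several rows share a state and its runs expectancy
toy <- tibble(
  STATE      = c("000 0", "000 0", "100 1", "100 1", "100 1"),
  RUNS.STATE = c(0.50, 0.50, 0.52, 0.52, 0.52)
)

# For each state, take the first value and count the distinct values
check <- toy %>%
  group_by(STATE) %>%
  summarize(R = first(RUNS.STATE),
            N.VALUES = n_distinct(RUNS.STATE))

# TRUE when every state has exactly one runs expectancy
all(check$N.VALUES == 1)
```

The same check could be run on d2018 by replacing toy with it.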
Show the runs values for all possible transitions between states:
d2018 %>%
  group_by(STATE, NEW.STATE) %>%
  summarize(R = first(RUNS.VALUE))