Here’s an outline of how to download the Retrosheet play files for a particular season.
There is a blog posting on this at https://baseballwithr.wordpress.com/2014/02/10/downloading-retrosheet-data-and-runs-expectancy/
Download the Chadwick files, following the advice from the website mentioned in the blog post. The Chadwick tools parse the original source files.
I set up the current working directory to have the following file structure.
[Figure: file structure of the working directory]
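Since the figure is not reproduced here, the following is a minimal sketch of how such a folder layout could be created from R. The `download.folder/unzipped` path is taken from the `read_csv()` call later in this post; the `download.folder/zipped` folder is an assumption about where the downloaded zip file is kept, so adjust it to match the blog post's instructions.

```r
# Create the folders relative to the current working directory.
# "unzipped" appears in this post's read_csv() call;
# "zipped" is an assumed name for the folder holding the raw download.
folders <- file.path("download.folder", c("zipped", "unzipped"))

for (f in folders) {
  # recursive = TRUE also creates "download.folder" if it is missing;
  # showWarnings = FALSE keeps this quiet when the folder already exists.
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that both folders are now present.
dir.exists(folders)
```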
I source the function parse.retrosheet2.pbp() from my gist site.
library(devtools)
source_gist("https://gist.github.com/bayesball/8892981",
            filename = "parse.retrosheet2.pbp.R")
parse.retrosheet2.pbp(2021)
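Before reading the data, it can help to confirm that the parsing step produced the expected file. This is a small sketch, assuming the output lands at the path used in the `read_csv()` call in this post; the `season` and `csv_path` variables are introduced here only for illustration.

```r
# Build the path where the parsed play-by-play file is expected.
season <- 2021
csv_path <- file.path("download.folder", "unzipped",
                      paste0("all", season, ".csv"))

# Report whether the file is there, without stopping the session.
if (file.exists(csv_path)) {
  message("Found ", csv_path)
} else {
  message("Missing ", csv_path,
          "; check the Chadwick install and the folder structure.")
}
```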
Using the read_csv() function, I read in the play-by-play data.
library(readr)
d2021 <- read_csv("download.folder/unzipped/all2021.csv",
col_names = FALSE)
fields <- read.csv("https://raw.githubusercontent.com/beanumber/baseball_R/master/data/fields.csv")
names(d2021) <- fields[, "Header"]
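The renaming step relies on the header file having one row per column of the data. Here is a toy sketch of the same idea with a guard against a mismatch; `toy` and `toy_fields` are made-up stand-ins for `d2021` and `fields`, not real Retrosheet data.

```r
# A tiny data frame with placeholder column names, as read_csv()
# produces when col_names = FALSE.
toy <- data.frame(X1 = "ANA202104010", X2 = "CHA", X3 = 1,
                  stringsAsFactors = FALSE)

# A stand-in for fields.csv: one header name per column.
toy_fields <- data.frame(Header = c("GAME_ID", "AWAY_TEAM_ID", "INN_CT"),
                         stringsAsFactors = FALSE)

# Guard: the number of headers must equal the number of columns.
stopifnot(ncol(toy) == nrow(toy_fields))

# Assign the descriptive names, as done for d2021 above.
names(toy) <- toy_fields[, "Header"]
names(toy)
```

With the real data, the same guard would read `stopifnot(ncol(d2021) == nrow(fields))`.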
The data frame d2021 is ready to explore. Here is the first row of the file.
d2021[1, ]
## # A tibble: 1 × 97
##   GAME_ID      AWAY_TEAM_ID INN_CT BAT_HOME_ID OUTS_CT BALLS_CT STRIKES_CT
##   <chr>        <chr>         <dbl>       <dbl>   <dbl>    <dbl>      <dbl>
## 1 ANA202104010 CHA               1           0       0        2          2
## # … with 90 more variables: PITCH_SEQ_TX <chr>, AWAY_SCORE_CT <dbl>,
## # HOME_SCORE_CT <dbl>, BAT_ID <chr>, BAT_HAND_CD <chr>, RESP_BAT_ID <chr>,
## # RESP_BAT_HAND_CD <chr>, PIT_ID <chr>, PIT_HAND_CD <chr>, RESP_PIT_ID <chr>,
## # RESP_PIT_HAND_CD <chr>, POS2_FLD_ID <chr>, POS3_FLD_ID <chr>,
## # POS4_FLD_ID <chr>, POS5_FLD_ID <chr>, POS6_FLD_ID <chr>, POS7_FLD_ID <chr>,
## # POS8_FLD_ID <chr>, POS9_FLD_ID <chr>, BASE1_RUN_ID <chr>,
## # BASE2_RUN_ID <chr>, BASE3_RUN_ID <chr>, EVENT_TX <chr>, LEADOFF_FL <lgl>, …
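As one possible first exploration, the ball and strike columns shown above can be cross-tabulated. The sketch below uses a small made-up data frame (`toy`) with the same column names, so the values are illustrative only; with the real data, `d2021` would take the place of `toy`.

```r
# Made-up rows that mimic three of the columns in d2021.
toy <- data.frame(
  GAME_ID    = rep("ANA202104010", 5),
  BALLS_CT   = c(2, 0, 3, 1, 2),
  STRIKES_CT = c(2, 1, 2, 0, 2),
  stringsAsFactors = FALSE
)

# Cross-tabulate balls against strikes.
count_table <- table(balls = toy$BALLS_CT, strikes = toy$STRIKES_CT)
count_table

# Number of rows recorded on a 2-2 count.
n_two_two <- sum(toy$BALLS_CT == 2 & toy$STRIKES_CT == 2)
```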