Downloading Retrosheet Data

Here’s an outline of how to download the Retrosheet play files for a particular season.

There is a blog posting on this at https://baseballwithr.wordpress.com/2014/02/10/downloading-retrosheet-data-and-runs-expectancy/

Download the Chadwick files, following the advice on the website mentioned in the blog post; the Chadwick software is what parses the original Retrosheet source files.
I set up the current working directory to have the following file structure.

[Figure: file structure of the working directory]
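As a minimal sketch of this setup step, the R code below creates the download.folder/unzipped folder that the later steps read from and checks whether a Chadwick command is visible on the system path. The helper names make_download_folder() and has_chadwick() are hypothetical, and the check assumes the Chadwick installation provides the cwevent command. Any other subfolders in the original layout are not covered here.

```r
# Hypothetical helper: create download.folder/unzipped under `root`.
make_download_folder <- function(root = getwd()) {
  target <- file.path(root, "download.folder", "unzipped")
  # recursive = TRUE also creates download.folder itself if it is missing
  dir.create(target, recursive = TRUE, showWarnings = FALSE)
  dir.exists(target)
}

# Hypothetical helper: TRUE if a Chadwick command (assumed here: cwevent)
# can be found on the system PATH.
has_chadwick <- function(tool = "cwevent") {
  unname(nzchar(Sys.which(tool)))
}
```

Calling make_download_folder() from the working directory, followed by has_chadwick(), gives a quick check before any parsing is attempted.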

I download the R function parse.retrosheet2.pbp(), stored in the file parse.retrosheet2.pbp.R, from my gist site.

library(devtools)
source_gist("https://gist.github.com/bayesball/8892981",
            filename="parse.retrosheet2.pbp.R")

Now I am ready to download the Retrosheet play-by-play data for the 2021 season.

parse.retrosheet2.pbp(2021)

Two new files, “all2021.csv” and “roster.csv”, are stored in the “download.folder/unzipped” folder. I read in the play-by-play data with the read_csv() function.

library(readr)
d2021 <- read_csv("download.folder/unzipped/all2021.csv",
                  col_names = FALSE)
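If the parsing step did not run, read_csv() fails with a message that does not say why. One possible guard, sketched here under the assumption that each season's file follows the all<year>.csv naming seen above, is to build the path from the season and check that the file exists first. The helper names season_file() and read_season() are hypothetical.

```r
# Hypothetical helper: build the path to the parsed play-by-play file for a season.
season_file <- function(season,
                        folder = file.path("download.folder", "unzipped")) {
  file.path(folder, sprintf("all%d.csv", as.integer(season)))
}

# Hypothetical helper: read a season's file, stopping early if it is missing.
read_season <- function(season,
                        folder = file.path("download.folder", "unzipped")) {
  path <- season_file(season, folder)
  if (!file.exists(path)) {
    stop("File not found: ", path, ". Has this season been parsed yet?")
  }
  # The parsed file has no header row, so variable names are assigned later.
  readr::read_csv(path, col_names = FALSE)
}
```

With these helpers, d2021 <- read_season(2021) would play the same role as the read_csv() call above.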

Last, I want to add variable names to the data, since the file was read in without a header. I read a header file from our Github site and use it to assign the variable names.

fields <- read.csv("https://raw.githubusercontent.com/beanumber/baseball_R/master/data/fields.csv")
names(d2021) <- fields[, "Header"]
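Assigning names this way relies on the header file listing exactly one name per column. A small sketch of a safer version, using a hypothetical helper set_headers() and a toy data frame with made-up values, checks the counts before assigning:

```r
# Hypothetical helper: assign column names only if the counts match.
set_headers <- function(data, headers) {
  # as.character() guards against headers that were read in as a factor
  headers <- as.character(headers)
  if (length(headers) != ncol(data)) {
    stop("Expected ", ncol(data), " names but got ", length(headers))
  }
  names(data) <- headers
  data
}

# Toy example with made-up values, not real Retrosheet rows
toy <- data.frame(X1 = c("G1", "G1"), X2 = c(1, 2))
toy <- set_headers(toy, c("GAME_ID", "INN_CT"))
names(toy)
```

With the objects from the text, the equivalent call would be d2021 <- set_headers(d2021, fields[, "Header"]).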

The data frame d2021 is ready to explore. Here is the first row of the file.

d2021[1, ]

## # A tibble: 1 × 97
##   GAME_ID      AWAY_TEAM_ID INN_CT BAT_HOME_ID OUTS_CT BALLS_CT STRIKES_CT
##   <chr>        <chr>         <dbl>       <dbl>   <dbl>    <dbl>      <dbl>
## 1 ANA202104010 CHA               1           0       0        2          2
## # … with 90 more variables: PITCH_SEQ_TX <chr>, AWAY_SCORE_CT <dbl>,
## #   HOME_SCORE_CT <dbl>, BAT_ID <chr>, BAT_HAND_CD <chr>, RESP_BAT_ID <chr>,
## #   RESP_BAT_HAND_CD <chr>, PIT_ID <chr>, PIT_HAND_CD <chr>, RESP_PIT_ID <chr>,
## #   RESP_PIT_HAND_CD <chr>, POS2_FLD_ID <chr>, POS3_FLD_ID <chr>,
## #   POS4_FLD_ID <chr>, POS5_FLD_ID <chr>, POS6_FLD_ID <chr>, POS7_FLD_ID <chr>,
## #   POS8_FLD_ID <chr>, POS9_FLD_ID <chr>, BASE1_RUN_ID <chr>,
## #   BASE2_RUN_ID <chr>, BASE3_RUN_ID <chr>, EVENT_TX <chr>, LEADOFF_FL <lgl>, …
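As a first step in exploring, one might count the play records by game or by half-inning. The sketch below uses a toy data frame with made-up values so that it runs on its own; only the column names are taken from the output above, and it assumes that each row is one play record and that BAT_HOME_ID is 0 when the visiting team bats and 1 when the home team bats.

```r
# Toy rows with made-up values; only the column names come from the real data.
toy <- data.frame(
  GAME_ID = c("G1", "G1", "G1", "G2"),
  INN_CT = c(1, 1, 2, 1),
  BAT_HOME_ID = c(0, 1, 0, 0)
)

# Number of play records in each game
plays_per_game <- table(toy$GAME_ID)

# Play records by inning (rows) and batting side (columns)
plays_by_half <- table(toy$INN_CT, toy$BAT_HOME_ID)
```

Replacing toy with d2021 would give the same summaries for the full 2021 season.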

Jim Albert

December 5, 2021