library(dplyr)
library(Lahman)
# Team batting totals for the 2015 season
teams2015 <- filter(Teams, yearID == 2015)
names(teams2015)[18:19] <- c("X2B", "X3B")
teams2015$SF <- as.numeric(teams2015$SF)
teams2015$HBP <- as.numeric(teams2015$HBP)
# Singles, total bases, slugging percentage, and on-base percentage
teams2015 <- mutate(teams2015,
  X1B = H - X2B - X3B - HR,
  TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
  SLG = TB / AB,
  OBP = (H + BB + HBP) /
    (AB + BB + HBP + SF))
R Code for Visualizing Baseball
1 Gentle Introduction to ggplot2
1.1 Introduction
ggplot2 is an R package for graphing data based on the “Grammar of Graphics” framework introduced by Leland Wilkinson. This package is used to construct all of the graphs for the book Visualizing Baseball. The purpose of this document is to introduce ggplot2 using a familiar baseball dataset. In this document, I introduce the basic framework and illustrate the use of ggplot2 to construct graphs for different types of variables.
1.2 Some Baseball Data
I collect hitting data for all teams in the 2015 baseball season. For each team, I compute its slugging percentage SLG and its on-base percentage OBP.
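As a quick arithmetic check of these two statistics, the sketch below applies the usual SLG and OBP formulas to a single made-up stat line. The numbers are hypothetical and do not correspond to any 2015 team.

```r
# Hypothetical season totals for one team (illustrative values only)
H <- 150; X2B <- 30; X3B <- 5; HR <- 20
AB <- 500; BB <- 50; HBP <- 5; SF <- 5

X1B <- H - X2B - X3B - HR                    # singles
TB  <- X1B + 2 * X2B + 3 * X3B + 4 * HR      # total bases
SLG <- TB / AB                               # slugging percentage
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF) # on-base percentage

round(c(SLG = SLG, OBP = OBP), 3)
```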
1.3 Three Basic Components of a ggplot2 Graph
To construct a graph using ggplot2, one needs:
- A data frame that contains the data that you want to graph.
- Aesthetics or roles assigned to particular variables in the data frame.
- A geometric object (or geom for short) which is what you are plotting.
For example, suppose we wish to construct a scatterplot of the on-base percentages and the slugging percentages for all teams in the 2015 season.
The data frame teams2015 contains the data, and OBP and SLG are the variables of interest. To construct a scatterplot, you need a variable on the horizontal axis (x) and a variable on the vertical axis (y). If I want OBP to be the horizontal axis variable and SLG the vertical axis variable, I assign the aesthetics OBP to x and SLG to y.
These two steps are communicated by the command
library(ggplot2)
ggplot(data=teams2015, aes(x=OBP, y=SLG))
The ggplot function only sets up the axes; it does not plot anything. To construct a scatterplot, we need to add a point geometric object, which is the function geom_point. Now we see the scatterplot. There is a clear positive association between a team’s OBP and its SLG.
ggplot(data=teams2015, aes(x=OBP, y=SLG)) +
geom_point()
1.4 Other Aesthetics (color, shape, and size)
There are other roles or aesthetics that you can assign to variables.
For example, the variable lgID gives the league (AL or NL). We can assign lgID to the color aesthetic so that the points are colored by the league variable. This tells us that the team with the highest OBP and highest SLG was from the American League.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, color=lgID)) +
geom_point()
There are other aesthetics like shape and size.
Here I can use different plotting symbols for each league by assigning lgID to the shape aesthetic. Personally, I think the different shapes are harder to distinguish than the different colors.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, shape=lgID)) +
geom_point()
The variable W is the number of team wins. Here I assign the variable W to the size aesthetic. Notice that the team with the highest OBP and SLG values appeared to win a lot of games in the 2015 season.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, size=W)) +
geom_point()
1.5 Facetting
In ggplot2, it is easy to break the plot into several panels defined by a categorical variable; these panels are called facets. Suppose I want to construct panels of scatterplots of OBP by SLG, where the panels are defined by the league variable.
From this graph, it appears that the AL teams generally had higher SLG values than the NL teams.
ggplot(data=teams2015,
aes(x=OBP, y=SLG)) +
geom_point() +
facet_grid(~ lgID)
1.6 Some Plot Geoms
There are many possible geometric objects (geoms) that one can use depending on the number of variables and variable types.
1.6.1 A Single Numeric Variable
Suppose one wants to construct a histogram of the OBP values for the 30 teams. Here the single aesthetic is x and we use geom_histogram. I indicate that we want five bins in the histogram.
ggplot(data=teams2015, aes(x=OBP)) +
geom_histogram(bins=5)
1.6.2 A Single Categorical Variable
A bar chart is a graph of a single categorical variable that we can produce using the geom_bar geom. This graph confirms that there are 15 teams in each league.
ggplot(data=teams2015, aes(x=lgID)) + geom_bar()
1.6.3 One Categorical Variable and One Numeric Variable
We said earlier that it appeared that the slugging percentages were greater for teams in the American League. A more direct way to graphically compare the two groups of SLG values is by parallel boxplots. Here we assign the x aesthetic to lgID, the y aesthetic to SLG, and use the geom_boxplot geom.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_boxplot()
Another geom that one can use in this scenario is geom_jitter, which produces jittered points. The width option controls the horizontal range of the jittering.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_jitter(width=0.1)
1.7 Modifying the Axes
In ggplot2, it is possible to modify all aspects of the graph. I illustrate some basic modifications here. I use the ggtitle function to add a plot title, and the xlab and ylab functions to add x and y axis labels.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_jitter(width=0.1) +
ggtitle("Slugging Percentages in the 2015 Season by League") +
xlab("League") + ylab("Slugging Percentage")
1.8 Learning More About ggplot2
Hopefully this introduction gets you interested in trying out ggplot2 for your own graphs. I encourage you to try out some of the ggplot2 example scripts for the different chapters and then apply ggplot2 to your own problems.
2 Chapter 1 - History of Baseball
2.1 History of Home Run Hitting
We load the packages Lahman, dplyr, and ggplot2.
library(Lahman)
library(dplyr)
library(ggplot2)
2.1.1 The Data
The Lahman package contains season-to-season data for players and teams from the Sean Lahman database. For this history of home runs graph, we want to collect the number of home runs hit (variable HR) and the number of games played (variable G) for all teams for all seasons since 1900.
Here are a few sample rows of our data.
select(sample_n(Teams, 8), yearID, teamID, HR, G)
yearID teamID HR G
1 1974 SFN 93 162
2 1951 PHA 102 154
3 2008 SFN 94 162
4 2018 DET 135 162
5 1881 CHN 12 84
6 1983 CHA 157 162
7 2014 COL 186 162
8 1987 BOS 174 162
2.1.2 Creating the Variables of Interest
I use the summarize function to collect the total number of home runs hit and the total number of games played for each season.
S <- summarize(group_by(Teams, yearID),
  HR = sum(HR),
  G = sum(G))
I use the filter function to select the summary data for only seasons 1900 or later.
S2 <- filter(S, yearID >= 1900)
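To see what the summarize-then-filter pattern does, here is a minimal sketch on a small made-up table. The data frame toy and its values are hypothetical; only group_by, summarize, and filter from dplyr are used.

```r
library(dplyr)

# Hypothetical team-season rows (values are made up for illustration)
toy <- data.frame(
  yearID = c(1899, 1899, 1900, 1900, 1901),
  HR     = c(10, 20, 30, 40, 50),
  G      = c(100, 100, 140, 140, 140)
)

# One row per season holding the totals over all teams
toy_S <- summarize(group_by(toy, yearID),
                   HR = sum(HR),
                   G = sum(G))

# Keep seasons from 1900 on and compute home runs per team game
toy_S2 <- filter(toy_S, yearID >= 1900)
toy_S2$HR / toy_S2$G
```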
2.1.3 Constructing the Graph
Now I can construct the plot. I construct a scatterplot of yearID (horizontal) against the number of home runs hit per team per game (variable HR / G). I add a smoothing loess curve to the plot to see the general pattern.
ggplot(S2, aes(yearID, HR / G)) +
geom_point() +
geom_smooth(method="loess", span=0.3, se=FALSE)
2.2 History of Home Run Leaders
Next suppose we want to graph the leading number of home runs against season for all seasons from 1900 on.
2.2.1 The Data
The Batting data frame in the Lahman package contains the number of home runs hit by each player each season.
2.2.2 Creating the Variables of Interest
Actually, the Batting table contains more than one row for players who play for more than one team in a given season. So we first use the summarize function to find the number of home runs for each player each season.
S <- summarize(group_by(Batting, yearID, playerID),
  HR=sum(HR, na.rm=TRUE))
Next, for each season, we collect the maximum number of home runs that are hit. We also select only the rows of the data frame where the season is at least 1900.
S1 <- summarize(group_by(S, yearID),
  HR=max(HR, na.rm=TRUE))
S2 <- filter(S1, yearID >= 1900)
Here are the first few rows of our table.
head(S2)
# A tibble: 6 × 2
yearID HR
<int> <int>
1 1900 12
2 1901 16
3 1902 16
4 1903 13
5 1904 10
6 1905 9
2.2.3 Constructing the Graph
Now I can construct the plot. I construct a scatterplot of yearID (horizontal) against the maximum number of home runs hit (variable HR). I add a smoothing loess curve to the plot to see the general pattern.
ggplot(S2, aes(yearID, HR)) +
geom_point() +
geom_smooth(method="loess", span=0.3, se=FALSE)
3 Chapter 2 - Career Trajectories
3.1 Plotting a Career Trajectory
Here is a function plot_hr_trajectory() that will graph a specific player’s home run trajectory. It uses four packages: Lahman contains the season-to-season data, dplyr helps with data management, stringr helps with one string operation, and ggplot2 does the graphing.
Here is some insight into how plot_hr_trajectory() works:
- The input is the player’s full name in quotes.
- Using the People data frame in the Lahman package, I find the playerID and birth information for that player.
- From the Batting data frame of hitting data, I collect HR and AB for all seasons of the player’s career.
- I find the Age variable by first finding the player’s birth year, adjusting the birth year depending on the birth month, and then defining Age.
- I use ggplot2 to construct a scatterplot and smoothing curve for the home run rate HR / AB.
plot_hr_trajectory <- function(playername){
  require(Lahman)
  require(dplyr)
  require(stringr)
  require(ggplot2)
  # Split "First Last" into its two parts
  names <- unlist(str_split(playername, " "))
  info <- filter(People, nameLast==names[2],
    nameFirst==names[1])
  # Season-by-season batting rows for this player
  bdata <- filter(Batting, playerID==info$playerID)
  # Players born in July or later are assigned the following birth year
  bdata <- mutate(bdata,
    birthyear = ifelse(info$birthMonth >= 7,
      info$birthYear + 1, info$birthYear),
    Age = yearID - birthyear)
  ggplot(bdata, aes(yearID, HR / AB)) +
    geom_point() +
    geom_smooth(method="loess", se=FALSE)
}
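The age adjustment inside the function can be checked in isolation. The sketch below applies the same rule (a player born in July or later is assigned the following year as the reference birth year) to two hypothetical players. The helper name season_age is an assumption introduced here for illustration and is not part of the original code.

```r
# Hypothetical helper: season age with the birth-month adjustment
season_age <- function(yearID, birthYear, birthMonth) {
  birthyear <- ifelse(birthMonth >= 7, birthYear + 1, birthYear)
  yearID - birthyear
}

# Two made-up players born in the same calendar year
season_age(1956, birthYear = 1931, birthMonth = 10)  # born in October
season_age(1956, birthYear = 1931, birthMonth = 3)   # born in March
```

The two players differ by one year of season age even though they share a birth year, which is the effect of the month adjustment.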
3.2 Plotting Two Trajectories
I illustrate using this function for two players. Note that I am saving the ggplot2 plotting object in a variable. By just typing the variable name, I see the graph.
p1 <- plot_hr_trajectory("Mickey Mantle")
p1
p2 <- plot_hr_trajectory("Mike Schmidt")
p2
3.3 Comparing Trajectories
The ggplot2 object contains the plotting data. So I combine the data from the two earlier plotting objects to construct a graph that compares the two trajectories.
ggplot(rbind(p1$data, p2$data), aes(Age, HR / AB)) +
geom_point() +
geom_smooth(method="loess", se=FALSE) +
facet_wrap(~ playerID, ncol=1)
4 Chapter 3 - Runs Expectancy
This chapter illustrates graphing the famous runs expectancy matrix.
First load some required packages.
library(readr)
library(knitr)
library(ggplot2)
4.1 The Data
To obtain the runs expectancy matrix, one needs the Retrosheet play-by-play data for a particular season. I have computed the runs expectancies using 2015 season data and stored them in a csv file that we read into R and store in the variable RR.
RR <- read_csv("https://bayesball.github.io/VB/data/runs2015.csv")
Use the kable function to display the data frame containing the runs expectancies.
kable(RR)
…1 | STATE | Mean | Outs | Bases | O |
---|---|---|---|---|---|
1 | 000 0 | 0.4738828 | OUTS = 0 | 000 | 0 |
2 | 000 1 | 0.2514400 | OUTS = 1 | 000 | 1 |
3 | 000 2 | 0.0988068 | OUTS = 2 | 000 | 2 |
4 | 001 0 | 1.4011407 | OUTS = 0 | 003 | 0 |
5 | 001 1 | 0.9643617 | OUTS = 1 | 003 | 1 |
6 | 001 2 | 0.3630464 | OUTS = 2 | 003 | 2 |
7 | 010 0 | 1.1109418 | OUTS = 0 | 020 | 0 |
8 | 010 1 | 0.6637977 | OUTS = 1 | 020 | 1 |
9 | 010 2 | 0.3036562 | OUTS = 2 | 020 | 2 |
10 | 011 0 | 2.0450000 | OUTS = 0 | 023 | 0 |
11 | 011 1 | 1.3655761 | OUTS = 1 | 023 | 1 |
12 | 011 2 | 0.5598688 | OUTS = 2 | 023 | 2 |
13 | 100 0 | 0.8577522 | OUTS = 0 | 100 | 0 |
14 | 100 1 | 0.5046115 | OUTS = 1 | 100 | 1 |
15 | 100 2 | 0.2266157 | OUTS = 2 | 100 | 2 |
16 | 101 0 | 1.7113951 | OUTS = 0 | 103 | 0 |
17 | 101 1 | 1.1209412 | OUTS = 1 | 103 | 1 |
18 | 101 2 | 0.4528302 | OUTS = 2 | 103 | 2 |
19 | 110 0 | 1.4727344 | OUTS = 0 | 120 | 0 |
20 | 110 1 | 0.8881782 | OUTS = 1 | 120 | 1 |
21 | 110 2 | 0.4296086 | OUTS = 2 | 120 | 2 |
22 | 111 0 | 2.2865412 | OUTS = 0 | 123 | 0 |
23 | 111 1 | 1.5900901 | OUTS = 1 | 123 | 1 |
24 | 111 2 | 0.7925729 | OUTS = 2 | 123 | 2 |
4.2 Graph of the Matrix
Here I am constructing a scatterplot of the Bases variable against the mean runs variable Mean, where the plotting symbol is the O variable (number of outs).
ggplot(RR, aes(Bases, Mean, label=O)) +
geom_point(size=3) +
geom_label(color="black", size=4,
fontface="bold") +
ylab("Runs Scored in \n Remainder of Inning") +
xlab("Runners on Base") +
theme(axis.text = element_text(size=16),
axis.title = element_text(size=16))
5 Chapter 4 - Count Effects
Load in a few helpful packages.
library(readr)
library(ggplot2)
library(dplyr)
5.1 The Data
Using the Retrosheet play-by-play data for the 2015 season, I found the expected runs in the remainder of the inning for plate appearances that pass through each possible count. I store these expected runs values in the csv file “count2015a.csv”.
I read this file into R (the variable name of the data frame is d) and show the first few lines.
d <- read_csv("https://bayesball.github.io/VB/data/count2015a.csv")
head(d)
# A tibble: 6 × 6
count strikes balls N.Pitches Type Runs
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 0-0 0 0 0 Neutral -0.000798
2 1-0 0 1 1 Batter 0.0339
3 0-1 1 0 1 Pitcher -0.0387
4 2-0 0 2 2 Batter 0.0940
5 1-1 1 1 2 Pitcher -0.0153
6 0-2 2 0 2 Pitcher -0.0894
5.2 The Graph
In this graph, the Pitch Number (variable N.Pitches) is graphed against the Runs Value (variable Runs), using the Count (variable count) as the plotting label.
ggplot(d, aes(N.Pitches, Runs, label=count)) +
geom_point() +
geom_path(data=filter(d, strikes==0),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, strikes==1),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, strikes==2),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, balls==0),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, balls==1),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, balls==2),
aes(N.Pitches, Runs), color="blue") +
geom_path(data=filter(d, balls==3),
aes(N.Pitches, Runs), color="blue") +
xlab("Pitch Number") +
ylab("Runs Value") +
ggtitle("") +
geom_hline(yintercept=0, color="black") +
geom_label()
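The seven geom_path layers above draw one line for each fixed number of strikes and one for each fixed number of balls. A more compact way to get the same kind of lines is to map the group aesthetic, so that a single layer draws one path per group. The sketch below is one possible alternative, not the code used for the book; the data frame toy_counts and its Runs values are made up so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical run values for the 12 counts (illustrative numbers only)
toy_counts <- expand.grid(balls = 0:3, strikes = 0:2)
toy_counts$count     <- paste(toy_counts$balls, toy_counts$strikes, sep = "-")
toy_counts$N.Pitches <- toy_counts$balls + toy_counts$strikes
toy_counts$Runs      <- 0.03 * toy_counts$balls - 0.04 * toy_counts$strikes

p <- ggplot(toy_counts, aes(N.Pitches, Runs, label = count)) +
  geom_path(aes(group = strikes), color = "blue") +  # one path per strike count
  geom_path(aes(group = balls), color = "blue") +    # one path per ball count
  geom_hline(yintercept = 0, color = "black") +
  geom_label() +
  xlab("Pitch Number") + ylab("Runs Value")
p
```

Because geom_path connects points in row order within each group, this relies on the rows being sorted so that pitch number increases within every group, as they are in toy_counts.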
5.3 Data
Above we considered the runs value of plate appearances that pass through each possible count. Here we consider the runs values of balls put in play on each possible count. These runs values are found using 2016 Retrosheet play-by-play data. The data is saved in the csv file “count2015b.csv”. We read in this data and save it in the variable S.
S <- read_csv("https://bayesball.github.io/VB/data/count2015b.csv")
head(S)
# A tibble: 6 × 6
count Runs strikes balls N.Pitches N
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0-0 0.0402 0 0 0 20668
2 0-1 0.0163 1 0 1 16560
3 0-2 0.0162 2 0 2 8374
4 1-0 0.0506 0 1 1 12366
5 1-1 0.0369 1 1 2 15601
6 1-2 0.0184 2 1 3 14508
5.4 The Graph
In this graph, the Pitch Number (variable N.Pitches) is graphed against the Runs Value (variable Runs), using the Count (variable count) as the plotting label. The size of each label is mapped to the variable N.
ggplot(S, aes(N.Pitches, Runs, label=count, size=N)) +
xlab("Number of Pitch") +
ylab("Runs Value") +
geom_hline(yintercept=0, color="black") +
geom_label()
6 Chapter 5 - PITCHf/x Data
6.1 The Data
Using the pitchRx package, I downloaded all of the pitch data for all games in the 2016 season. From this large dataset, I collected the data for 2044 pitches thrown by Clayton Kershaw.
Here I read in the PITCHf/x data and show a few lines.
library(readr)
CK <- read_csv("https://bayesball.github.io/VB/data/kershaw2016.csv")
head(CK)
# A tibble: 6 × 15
pitch_type px pz des num gameday_link start_speed spin_dir
<chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 FF 0.089 2.75 Called Strike 7 gid_2016_04… 90.1 187.
2 FF 0.083 2.72 Swinging Str… 7 gid_2016_04… 92.3 159.
3 FF -2.65 2.69 Ball 7 gid_2016_04… 91.3 206.
4 CU -0.644 -0.231 Ball 7 gid_2016_04… 73.5 354.
5 FF 0.642 4.52 Ball 7 gid_2016_04… 93.1 180.
6 FF -1.41 2.33 Swinging Str… 7 gid_2016_04… 94.3 155.
# ℹ 7 more variables: spin_rate <dbl>, pfx_x <dbl>, pfx_z <dbl>, type <chr>,
# pitcher_name <chr>, event <chr>, stand <chr>
Here are the variables in the data frame CK.
- pitch_type - type of pitch thrown
- px - horizontal location in zone
- pz - vertical location in zone
- des - outcome of pitch
- start_speed - speed of pitch as it leaves the pitcher’s hand
- event - outcome of the plate appearance
- stand - batting side of the hitter
Load several packages.
library(ggplot2)
library(dplyr)
library(stringr)
6.2 Pitch Types Thrown
To get an understanding of what pitch types are thrown, we construct a dotplot of the frequencies of the pitch types (variable pitch_type).
S_CK <- filter(summarize(group_by(CK, pitch_type),
    N=n()),
  pitch_type %in% c("SL", "FF", "CU", "CH"))
ggplot(S_CK, aes(pitch_type, N)) +
geom_point(size=3, color="blue") +
coord_flip() +
ggtitle("Frequencies of Pitch Type of Clayton Kershaw") +
theme(plot.title = element_text(size = 14,
hjust = 0.5))
6.3 Pitch Speeds
These different pitch types are thrown at different speeds. The following display is a boxplot of the speeds (variable start_speed) of the four types of pitches thrown by Kershaw.
ggplot(filter(CK, pitch_type %in%
c("SL", "FF", "CU", "CH")),
aes(pitch_type, start_speed)) +
geom_boxplot() + coord_flip() +
ggtitle("Pitch Speeds") +
theme(plot.title = element_text(size = 14,
hjust = 0.5)) +
ylim(70, 100)
6.4 Pitch Breaks
These pitch types are also distinguished by their movement or break. The variables pfx_x and pfx_z give the horizontal and vertical break amounts. (The perspective is from the catcher behind the plate.) The following graph shows the movements for each type of pitch.
CK <- filter(CK, pitch_type %in% c("CU",
  "FF", "SL"))
ggplot(CK,
aes(pfx_x, pfx_z, shape=pitch_type)) +
geom_point(color="blue", size=2, alpha=0.5) +
ggtitle("Pitch Breaks") +
theme(plot.title = element_text(size = 14,
hjust = 0.5)) +
xlab("Horizontal Break") + ylab("Vertical Break")
6.5 Pitch Locations
The variables px and pz give the horizontal and vertical locations of the pitch viewed from the catcher’s perspective. The zone for an average hitter is added to the plots so we can see which pitches are inside and outside of the zone.
# Boundaries of the strike zone for an average hitter
topKzone <- 3.5
botKzone <- 1.6
inKzone <- -0.85
outKzone <- 0.85
# Corners of the zone, repeated at the end to close the path
kZone <- data.frame(
  x=c(inKzone, inKzone, outKzone, outKzone, inKzone),
  y=c(botKzone, topKzone, topKzone, botKzone, botKzone)
)
ggplot(CK) +
geom_point(data= filter(CK, pitch_type=="CU"),
aes(px, pz), shape=1) +
geom_point(data= filter(CK, pitch_type=="FF"),
aes(px, pz), shape=2) +
geom_point(data= filter(CK, pitch_type=="SL"),
aes(px, pz), shape=3) +
geom_path(aes(x, y), data=kZone, lwd=1, col="blue") +
facet_wrap(~ pitch_type, ncol=2) +
xlim(-2, 2) + ylim(-0.5, 5) +
theme(strip.text = element_text(size = rel(1.5),
hjust=0.5,
color = "black")) +
ggtitle("Pitch Locations") +
theme(plot.title = element_text(size = 14,
hjust = 0.5))
Two-dimensional contour plots (from fitting a two-dimensional density estimate) are helpful for visualizing the locations of the different types of pitches.
ggplot(CK) +
geom_density_2d(aes(px, pz), color="black") +
geom_path(aes(x, y), data=kZone, lwd=1, col="blue") +
facet_wrap(~ pitch_type, ncol=2) +
xlim(-2, 2) + ylim(-0.5, 5) +
theme(strip.text = element_text(size = rel(1.5),
hjust=0.5,
color = "black")) +
ggtitle("Pitch Locations") +
theme(plot.title = element_text(size = 14,
hjust = 0.5))
Warning: Removed 35 rows containing non-finite values (`stat_density2d()`).
6.6 Pitch Outcomes
What are the outcomes of these different types of pitches? We use the variable des, which gives a description of the pitch outcome.
SO <- summarize(group_by(CK, pitch_type, des), N=n())
# Collapse the pitch descriptions into a few outcome categories
SO <- mutate(SO,
  Outcome=ifelse(str_detect(des, "Foul") == TRUE, "Foul",
    ifelse(str_detect(des, "Swing") == TRUE |
      des == "Missed Bunt", "Swing and Miss",
    ifelse(str_detect(des, "Ball") == TRUE, "Ball",
    ifelse(str_detect(des, "In play") == TRUE, "In play",
      des)))))
SOS <- summarize(group_by(SO, pitch_type, Outcome),
  F=sum(N))
SOS1 <- summarize(group_by(SO, pitch_type),
  Total=sum(N))
# Percentage of each outcome within each pitch type
inner_join(SOS, SOS1) %>%
  mutate(Percentage = 100 * F / Total) -> SOS
ggplot(SOS,
aes(Outcome, Percentage)) +
geom_point(size=3, color="blue") +
coord_flip() + facet_wrap(~ pitch_type, ncol=1) +
theme(strip.text = element_text(size = rel(1.5),
hjust=0.5,
color = "black"))
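The nested ifelse calls above assign each pitch description to the first matching category. The same first-match logic can be written with the case_when function from dplyr, which some readers find easier to scan. This is a sketch of an alternative, not the code used for the book, and the des strings below are made-up examples rather than rows from the Kershaw data.

```r
library(dplyr)
library(stringr)

# Hypothetical pitch descriptions (illustrative only)
des <- c("Foul", "Swinging Strike", "Missed Bunt",
         "Ball In Dirt", "In play, out(s)", "Called Strike")

# Conditions are checked in order; the first TRUE one wins
Outcome <- case_when(
  str_detect(des, "Foul")                          ~ "Foul",
  str_detect(des, "Swing") | des == "Missed Bunt"  ~ "Swing and Miss",
  str_detect(des, "Ball")                          ~ "Ball",
  str_detect(des, "In play")                       ~ "In play",
  TRUE                                             ~ des
)
data.frame(des, Outcome)
```

Descriptions that match none of the patterns, such as a called strike, keep their original text, which mirrors the final des fallback in the nested ifelse version.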
6.7 Outcome of a Swing
What if the batter swings at the pitch? We focus on the frequencies of the three outcomes “Foul”, “In play”, and “Miss” for each pitch type.
CK <- mutate(CK,
  Foul = str_detect(des, "Foul"),
  InPlay = str_detect(des, "In play"),
  Miss = str_detect(des, "Swing"),
  Swing = Foul | InPlay | Miss)
CK_swing <- filter(CK, Swing == TRUE)
ggplot(CK_swing, aes(px, pz, color=Miss)) +
geom_point(alpha=0.75) +
facet_wrap(~ pitch_type, ncol=2) +
geom_path(aes(x, y), data=kZone, lwd=1, col="black") +
facet_wrap(~ pitch_type, ncol=2) +
xlim(-2, 2) + ylim(-0.5, 5) +
scale_colour_manual(values = c("gray60", "blue")) +
theme(strip.text = element_text(size = rel(1.5),
hjust=0.5,
color = "black"))
7 Chapter 6 - Batted Balls
Load some necessary packages.
library(dplyr)
library(ggplot2)
library(stringr)
library(Lahman)
library(readr)
7.1 The Data
The ESPN home run tracker http://www.hittrackeronline.com/ contains a number of variables for each home run hit during the current season. I collected this data for five baseball seasons (2012 through 2016) and the csv file homeruns.csv contains data on 24,299 home runs hit during these five seasons.
d <- read_csv("https://bayesball.github.io/VB/data/homeruns.csv")
head(d)
# A tibble: 6 × 16
Date Video Path Hitter H_Team Pitcher P_Team Inning Ballpark `Type/Luck`
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 10/3/12 Video View Longori… TB Arriet… BAL 6 Tropica… JE
2 10/3/12 Video View Johnson… CHW Huff, … CLE 2 Progres… ND
3 10/3/12 Video View Maybin,… SD Kintzl… MIL 6 Miller … PL
4 10/3/12 Video View Cano, R… NYY Morten… BOS 5 Yankee … ND
5 10/3/12 Video View Moore, … WSH Lee, C… PHI 6 Nationa… ND
6 10/3/12 Video View Longori… TB Tillma… BAL 4 Tropica… ND
# ℹ 6 more variables: True_Dist <dbl>, Speed_off_Bat <dbl>,
# Elevation_Angle <dbl>, Horiz_Angle <dbl>, Apex <dbl>, N_Parks <dbl>
7.2 (Figure 6.4) distribution of horizontal angle
In the book, I define the horizontal angle as 180 - Horiz_Angle, where Horiz_Angle is the horizontal angle as defined on the website.
Here is a density plot of the collection of horizontal angles.
ggplot(d, aes(180 - Horiz_Angle)) +
geom_density() +
xlim(30, 180 - 30) +
xlab("Horizontal Angle") +
ylab("Density") +
geom_vline(xintercept=90) +
annotate("text", x=40, y=0.015,
label="Left\nField", size=6) +
annotate("text", x=140, y=0.015,
label="Right\nField", size=6)
7.3 (Figure 6.5) relationship of distance and horizontal angle
Here I graph the horizontal angle against the home run distance and add a smoothing curve to show the general pattern.
ggplot(d, aes(180 - Horiz_Angle, True_Dist)) +
geom_point(alpha=0.1) + geom_smooth() +
ylim(300, 500) + xlim(45, 130) +
xlab("Horizontal Angle") +
ylab("Distance")
7.4 Relationship of direction and handedness of batter
Here I get information about the batting side of each hitter and merge this information with the main dataset.
# Hitter is stored as "Last, First"; split it into two name columns
Names <- str_split(d$Hitter, ",")
one_row <- function(j, k)
  str_trim(Names[[j]][k])
d$LastName <- sapply(1:24299, one_row, 1)
d$FirstName <- sapply(1:24299, one_row, 2)
# Attach the batting side from the Lahman People table
d2 <- inner_join(d,
  select(People, nameLast, nameFirst, bats),
  by=c("LastName"="nameLast",
    "FirstName"="nameFirst"))
d2$Batting <- ifelse(d2$bats=="R",
  "Right-Handed Hitter",
  "Left-Handed Hitter")
Here I look at the right and left batter effects by showing how the distribution of the horizontal angle varies between right-handed and left-handed hitters.
ggplot(filter(d2, bats=="R" | bats=="L"),
aes(180 - Horiz_Angle)) +
geom_density(size=1.0) + xlim(45, 130) +
xlab("Horizontal Angle") +
ylab("Density") +
facet_wrap(~ Batting, ncol=1) +
theme(strip.text = element_text(face="bold", size=16))
7.5 Ballpark effects
Here I look at the proportion of home runs hit to the left side of the field for all parks (Figure 6.7).
S <- summarise(group_by(d2, Ballpark),
  NL=sum(180 - Horiz_Angle < 90),
  NR=sum(180 - Horiz_Angle > 90),
  PL=NL / (NL + NR))
ggplot(filter(S, NL + NR > 200), aes(Ballpark, PL)) +
geom_point() + coord_flip() +
ylab("Proportion of Home Runs to Left") +
geom_hline(yintercept = 0.5)
I focus on 12 extreme parks.
S200 <- filter(S, NL + NR > 200)
S200 <- arrange(S200, desc(PL))
Sextreme <- rbind(slice(S200, 1:8),
  slice(S200, 28:31))
ballparks <- as.character(arrange(Sextreme, PL)$Ballpark)
d2$Ballpark <- factor(d2$Ballpark,
  levels=ballparks)
(Figure 6.8) This shows the distribution of the horizontal angle for each of these extreme parks.
ggplot(filter(d2, bats=="R" | bats=="L",
    Ballpark %in% Sextreme$Ballpark),
  aes(180 - Horiz_Angle)) +
geom_density() +
facet_wrap(~ Ballpark, ncol=4) +
geom_vline(xintercept = 90, color="blue") +
xlab("Horizontal Angle") + ylab("Density")
8 Chapter 7 - Plate Discipline
8.1 Plate Discipline Statistics for Batters
Load several useful packages.
library(tidyverse)
library(ggplot2)
8.2 The Data
Collect several useful tables from FanGraphs. The first dataset contains basic hitting statistics and the second dataset has stats related to plate discipline. We merge the two datasets, creating a single data frame with 146 observations and 33 variables.
d1 <- read_csv("https://bayesball.github.io/VB/data/Dashboard_2016.csv")
d2 <- read_csv("https://bayesball.github.io/VB/data/Plate_Discipline_2016.csv")
d <- inner_join(d1, d2, by="playerid")
vars <- c(14, 25:33)
d_subset <- d[, vars]
names(d_subset) <- c("OBP", "O_Swing", "Z_Swing", "Swing",
"O_Contact", "Z_Contact",
"Contact", "Zone",
"F_Strike", "SwStr")
names(d)[c(14, 25:33)] <- names(d_subset)
8.3 Swing and Contact Rates
The following graph is a scatterplot of the swing and contact rates for all hitters with a smoothing curve added.
ggplot(d, aes(Swing, Contact)) +
geom_point(size=2) +
geom_smooth(se=FALSE) +
xlab("Swing Rate") + ylab("Contact Rate")
8.4 Relationship with Strikeout Rate
We divide the players into “high” and “low” strikeout groups. We use contact and swing rates to predict (by a logistic model) the probability a hitter is in the high strikeout group. A line is added to the scatterplot – points above (below) the line are predicted to be in the low (high) K groups.
d$K_Rate <- with(d, ifelse(K > .1875, "HI", "LO"))
d$y <- ifelse(d$K_Rate=="HI", 1, 0)
glm(y ~ Contact + Swing, data=d, family=binomial) -> F
ggplot(d, aes(Swing, Contact,
color=K_Rate)) +
geom_point(size=3) +
xlab("Swing Rate") + ylab("Contact Rate") +
geom_abline(intercept = coef(F)[1] / (-coef(F)[2]),
slope = coef(F)[3] / (-coef(F)[2])) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("black", "grey60"))
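The intercept and slope passed to geom_abline come from setting the fitted linear predictor to zero, which is where the predicted probability equals 0.5: b0 + b1 Contact + b2 Swing = 0 gives Contact = -b0 / b1 - (b2 / b1) Swing. The sketch below checks this on simulated data; the simulated values and the names fit, a, s, and on_line are assumptions made for illustration and are not the FanGraphs data.

```r
set.seed(1)

# Simulated stand-ins for swing and contact rates (not real data)
n <- 200
Swing   <- runif(n, 0.35, 0.60)
Contact <- runif(n, 0.65, 0.90)
p_true  <- plogis(20 - 30 * Contact + 8 * Swing)
y       <- rbinom(n, 1, p_true)

fit <- glm(y ~ Contact + Swing, family = binomial)
b <- unname(coef(fit))   # b[1] intercept, b[2] Contact, b[3] Swing

# Line where the fitted probability is 0.5, written as Contact = a + s * Swing
a <- b[1] / (-b[2])
s <- b[3] / (-b[2])

# Points on that line should have fitted probability 0.5
on_line <- data.frame(Swing = c(0.40, 0.50))
on_line$Contact <- a + s * on_line$Swing
pr <- predict(fit, newdata = on_line, type = "response")
pr
```

Which side of the line corresponds to the high group depends on the sign of the Contact coefficient, so it is worth checking the fitted coefficients before interpreting the regions.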
8.5 Relationship with Walk Rate
We divide the players into “high” and “low” walk groups. We use contact and swing rates to predict (by a logistic model) the probability a hitter is in the high walk group. A line is added to the scatterplot – points to the left (to the right) of the line are predicted to be in the high (low) walk groups.
d$BB_Cat <- with(d, ifelse(BB > .082, "HI", "LO"))
d$y <- ifelse(d$BB_Cat=="HI", 1, 0)
glm(y ~ Contact + Swing, data=d, family=binomial) -> F
ggplot(d, aes(Swing, Contact,
color=BB_Cat)) +
xlab("Swing Rate") + ylab("Contact Rate") +
geom_point(size=3) +
geom_abline(intercept = coef(F)[1] / (-coef(F)[2]),
slope = coef(F)[3] / (-coef(F)[2])) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("black", "grey60"))
8.6 Contrasting the top and bottom K hitters
We first identify the players who have the smallest (TOP) and largest (BOTTOM) strikeout rates.
d <- mutate(d,
  K_Type=ifelse(K < .12, "TOP",
ifelse(K > .25, "BOTTOM", NA)))
select(filter(d, K_Type == "TOP"),
Name.x, Team.x, K)
# A tibble: 16 × 3
Name.x Team.x K
<chr> <chr> <dbl>
1 Mookie Betts Red Sox 0.11
2 Jose Altuve Astros 0.098
3 Adrian Beltre Rangers 0.103
4 Daniel Murphy Nationals 0.098
5 Dustin Pedroia Red Sox 0.105
6 Jose Ramirez Indians 0.1
7 Buster Posey Giants 0.111
8 Ender Inciarte Braves 0.118
9 Martin Prado Marlins 0.105
10 Yadier Molina Cardinals 0.108
11 Joe Panik Giants 0.089
12 Jose Iglesias Tigers 0.097
13 Yunel Escobar Angels 0.118
14 Melky Cabrera White Sox 0.107
15 Brandon Phillips Reds 0.116
16 Albert Pujols Angels 0.115
select(filter(d, K_Type == "BOTTOM"),
Name.x, Team.x, K)
# A tibble: 16 × 3
Name.x Team.x K
<chr> <chr> <dbl>
1 Jonathan Villar Brewers 0.256
2 Adam Duvall Reds 0.27
3 Chris Davis Orioles 0.329
4 Khris Davis Athletics 0.272
5 Jake Lamb Diamondbacks 0.259
6 Leonys Martin Mariners 0.259
7 Mark Trumbo Orioles 0.255
8 Russell Martin Blue Jays 0.277
9 Danny Espinosa Nationals 0.29
10 Travis Shaw Red Sox 0.251
11 Michael Saunders Blue Jays 0.281
12 Justin Upton Tigers 0.286
13 Alex Gordon Royals 0.292
14 Melvin Upton Jr. - - - 0.288
15 Mike Napoli Indians 0.301
16 Chris Carter Brewers 0.32
Similarly, we identify the players with the largest (TOP) and smallest (BOTTOM) walk rates.
d <- mutate(d,
  BB_Type=ifelse(BB > .13, "TOP",
ifelse(BB < .05, "BOTTOM", NA)))
select(filter(d, BB_Type == "TOP"),
Name.x, Team.x, BB)
# A tibble: 13 × 3
Name.x Team.x BB
<chr> <chr> <dbl>
1 Mike Trout Angels 0.17
2 Josh Donaldson Blue Jays 0.156
3 Joey Votto Reds 0.16
4 Paul Goldschmidt Diamondbacks 0.156
5 Dexter Fowler Cubs 0.143
6 Brandon Belt Giants 0.159
7 Ben Zobrist Cubs 0.152
8 Carlos Santana Indians 0.144
9 Bryce Harper Nationals 0.172
10 Matt Carpenter Cardinals 0.143
11 Chris Davis Orioles 0.132
12 Jose Bautista Blue Jays 0.168
13 Joe Mauer Twins 0.137
select(filter(d, BB_Type == "BOTTOM"),
Name.x, Team.x, BB)
# A tibble: 15 × 3
Name.x Team.x BB
<chr> <chr> <dbl>
1 Starling Marte Pirates 0.043
2 Kevin Pillar Blue Jays 0.041
3 Eduardo Nunez - - - 0.049
4 Didi Gregorius Yankees 0.032
5 Freddy Galvis Phillies 0.04
6 Salvador Perez Royals 0.04
7 Jonathan Schoop Orioles 0.032
8 Rougned Odor Rangers 0.03
9 Josh Harrison Pirates 0.034
10 Starlin Castro Yankees 0.039
11 Brandon Phillips Reds 0.031
12 Adonis Garcia Braves 0.043
13 Alcides Escobar Royals 0.04
14 Marwin Gonzalez Astros 0.042
15 Alexei Ramirez - - - 0.042
8.7 Comparing Top and Bottom Strikeout Hitters
This scatterplot compares the top and bottom K groups with respect to the contact rates in the zone and outside of the zone.
ggplot(filter(d, K_Type %in% c("TOP", "BOTTOM")),
aes(Z_Contact, O_Contact, color=K_Type)) +
geom_point(size=3) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("grey50", "black" ))
8.8 Comparing Top and Bottom Walk Hitters
This scatterplot compares the top and bottom BB groups with respect to the swing rates in the zone and outside of the zone.
ggplot(filter(d, BB_Type %in% c("TOP", "BOTTOM")),
aes(Z_Swing, O_Swing, color=BB_Type)) +
geom_point(size=3) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("grey50", "black" ))
9 Chapter 8 - Probability and Modeling
Load in some necessary packages.
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
9.1 The Data
The FanGraphs page http://www.fangraphs.com/plays.aspx?date=2016-11-02&team=Indians&dh=0 provides a play log for Game 7 of the 2016 World Series. The table on that page was downloaded and stored in a csv file that is read into R.
d <- read_csv("https://bayesball.github.io/VB/data/WSGame7.csv")
d$Play_Number <- 1:dim(d)[1]
d$WE <- as.numeric(str_replace(d$WE, "%", ""))
head(d)
# A tibble: 6 × 13
Pitcher Player Inn. Outs Base Score Play LI RE WE WPA RE24
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C Kluber D Fowler 1 0 ___ 0-1 Dext… 0.87 0.48 39.8 0.102 1
2 C Kluber K Schwa… 1 0 ___ 0-1 Kyle… 0.79 0.48 36.7 0.032 0.37
3 C Kluber K Bryant 1 0 1__ 0-1 Kris… 1.3 0.85 39.6 -0.029 -0.35
4 C Kluber A Rizzo 1 1 1__ 0-1 Anth… 1.05 0.5 42.1 -0.025 -0.28
5 C Kluber K Schwa… 1 2 1__ 0-1 Kyle… 0.72 0.22 41.2 0.009 0.09
6 C Kluber B Zobri… 1 2 _2_ 0-1 Ben … 1.04 0.31 44.1 -0.029 -0.31
# ℹ 1 more variable: Play_Number <int>
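The percent-to-number conversion above can be tried on a few values in isolation. This is a minimal sketch that assumes the raw WE entries look like "39.8%": the three numbers are taken from the printed rows, but the percent signs are an assumption about the file. It uses base R's sub() in place of str_replace() so that it runs without extra packages.

```r
# Assumed raw form of the WE column: percentages stored as text.
we_raw <- c("39.8%", "36.7%", "39.6%")

# Drop the literal percent sign, then convert the text to numbers.
we_num <- as.numeric(sub("%", "", we_raw, fixed = TRUE))

# Divide by 100 to move from a percentage to a probability,
# which is what the plotting code does with WE / 100.
we_prob <- we_num / 100

# seq_along() is an equivalent way to number the plays 1, 2, 3, ...
play_number <- seq_along(we_num)

print(data.frame(play_number, we_raw, we_num, we_prob))
```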
9.2 Plot of Win Probabilities of a Game
The WE column of the data frame gives the Indians' win probability as a percentage. The plot below graphs the win probability against the Play_Number variable. I add text labels indicating the inning of the game.
ggplot(d, aes(Play_Number, WE / 100)) +
  geom_point(size=2) +
  geom_line() +
  ylim(0, 1) +
  ggtitle("") +
  ylab("Probability Indians Win") +
  geom_hline(yintercept = .50, color="blue", size=1.5) +
  # inning labels: x is each inning's starting play plus half its number of plays
  annotate("text", x=cumsum(c(0, 10, 7, 9, 9, 12, 8,
                              8, 10, 8)) +
                     c(10, 7, 9, 9, 12, 8,
                       8, 10, 8, 14) / 2,
           y=0.90,
           label = as.character(1:10), size=5) +
  annotate("text", x=45, y=0.98,
           label="INNING", size=6) +
  xlab("Play Number")
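The x positions of the inning labels come from the hard-coded counts of plays per inning: each label sits at the inning's starting play number plus half the number of plays in that inning. Here is a small sketch of that arithmetic. The names plays_per_inning, inning_start, and inning_mid are introduced here for illustration and do not appear in the original code.

```r
# Number of plays in innings 1 through 10, as hard-coded in the annotate() call.
plays_per_inning <- c(10, 7, 9, 9, 12, 8, 8, 10, 8, 14)

# Plays completed before each inning begins:
# same values as cumsum(c(0, 10, 7, 9, 9, 12, 8, 8, 10, 8)).
inning_start <- cumsum(c(0, head(plays_per_inning, -1)))

# Horizontal midpoint of each inning, used as the label position.
inning_mid <- inning_start + plays_per_inning / 2

print(data.frame(inning = 1:10, inning_start, inning_mid))
```

Passing inning_mid as the x argument of annotate() gives the same label placement as the inline expression, and the counts then need to be edited in only one place.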
9.3 Plot of Leverages
The variable LI is the leverage of the game situation, defined by the score, inning, runners on base, and number of outs. This graph plots the leverage values against the play number.
ggplot(d, aes(Play_Number, LI)) +
  geom_segment(aes(xend = Play_Number, yend = 0),
               size = 2, lineend = "butt") +
  xlab("Play Number") +
  ylab("Leverage") +
  ylim(0, 5.8) +
  # inning labels placed at the midpoint of each inning
  annotate("text", x=cumsum(c(0, 10, 7, 9, 9, 12, 8,
                              8, 10, 8)) +
                     c(10, 7, 9, 9, 12, 8,
                       8, 10, 8, 14) / 2,
           y=5,
           label = as.character(1:10), size=5) +
  annotate("text", x=45, y=5.5, label="INNING", size=6)
9.4 Plot of Win Probability Added
The variable WPA provides the change in the win probability for each play. This graph plots WPA against the play number.
ggplot(d, aes(Play_Number, WPA)) +
  geom_segment(aes(xend = Play_Number, yend = 0),
               size = 2, lineend = "butt") +
  xlab("Play Number") +
  ylab("Win Probability Added") +
  ylim(-0.24, 0.6) +
  # inning labels placed at the midpoint of each inning
  annotate("text", x=cumsum(c(0, 10, 7, 9, 9, 12, 8,
                              8, 10, 8)) +
                     c(10, 7, 9, 9, 12, 8,
                       8, 10, 8, 14) / 2,
           y=0.53,
           label = as.character(1:10), size=5) +
  annotate("text", x=45, y=0.60, label="INNING", size=6) +
  # labels for three notable plays
  annotate('text', x=71, y=0.45, label="Davis\nHR") +
  annotate('text', x=85, y=0.38, label="Zobrist\n2B") +
  annotate('text', x=77, y=-0.22, label="Baez\nSO")
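The link between WPA and the WE column can be checked on the six rows printed by head(d) in Section 9.1. The sketch below rests on two assumptions that the source does not state: the game starts at a 50% win probability, and WE records the Indians' win probability after each play. Under those assumptions, the listed WPA values for these six plays (all with the Cubs batting against Kluber) match the negative of the change in WE / 100 to within rounding. That is consistent with WPA being credited to the batting side, but it is only a reading of these rows, not a documented definition.

```r
# WE (percent) and WPA for the first six plays, copied from the head(d) output.
we  <- c(39.8, 36.7, 39.6, 42.1, 41.2, 44.1)
wpa <- c(0.102, 0.032, -0.029, -0.025, 0.009, -0.029)

# Assumption: the win probability before the first play is 50%.
# Change in the Indians' win probability on each play, as a proportion.
we_change <- diff(c(50, we)) / 100

# If WPA is reported from the batting (Cubs) side, wpa + we_change
# should be zero apart from rounding in the published values.
comparison <- data.frame(play = 1:6,
                         we_change = round(we_change, 3),
                         wpa = wpa,
                         gap = round(wpa + we_change, 3))
print(comparison)
```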
10 Chapter 9 - Streakiness and Clutch Play
10.1 Streakiness Graphs
Load the BayesTestStreak package (available on GitHub). This package will be used to generate the streakiness graphs of this chapter.
Note: One can install the BayesTestStreak package with the install_github() function from the remotes package. (The installation needs to be done only once.)
remotes::install_github("bayesball/BayesTestStreak")
library(BayesTestStreak)
library(gridExtra)
By the way, to see the R code for a function, just type the name of the function. For example, to see the code for the moving average plot function, type moving_average_plot.
moving_average_plot
function (mavg_data)
{
ggplot(mavg_data, aes(x = Index, ymax = Average, ymin = AVG)) +
geom_ribbon(fill = "blue") + theme_minimal()
}
<bytecode: 0x7fb26bb43fe0>
<environment: namespace:BayesTestStreak>
10.2 The Data
First I use the find_id function in the package to find the Retrosheet ids for the two hitters, Neil Walker and Nori Aoki.
walker_id <- find_id("Neil Walker")
aoki_id <- find_id("Nori Aoki")
Collect the hit/out sequences for both players.
walker <- streak_data(walker_id, pbp2016, "H", AB=TRUE)
aoki <- streak_data(aoki_id, pbp2016, "H", AB=TRUE)
10.3 “Rug Plots”
Here are simple lines showing the at-bats (AB) at which each player's hits occurred during the season.
plot_streak_data(walker) + theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5)) + ggtitle("Walker")
plot_streak_data(aoki) + theme(plot.title = element_text(colour = "blue", size = 18,
hjust = 0.5)) + ggtitle("Aoki")
10.4 Moving average plots
The function moving_average computes the moving averages, and moving_average_plot constructs the moving average plot.
walker_s_data <- moving_average(walker, 50)
moving_average_plot(walker_s_data) +
  theme(plot.title = element_text(colour = "blue",
                                  size = 18,
                                  hjust = 0.5)) + ggtitle("Walker")
aoki_s_data <- moving_average(aoki, 50)
moving_average_plot(aoki_s_data) +
  theme(plot.title = element_text(colour = "blue", size = 18,
                                  hjust = 0.5)) + ggtitle("Aoki")
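To make the idea of a moving average concrete, here is a base R sketch on simulated data. It is not the BayesTestStreak implementation: the simulated hits vector, the trailing-window convention, and the use of each window's last at-bat as Index are all assumptions made for illustration. The column names Index, Average, and AVG mirror the aesthetics used by moving_average_plot as printed in Section 10.1.

```r
set.seed(2016)                  # arbitrary seed, for reproducibility only
hits  <- rbinom(300, 1, 0.28)   # simulated sequence: 1 = hit, 0 = out
width <- 50                     # window width, as in moving_average(walker, 50)
n     <- length(hits)

# Running totals let each window's hit count be computed as a difference.
running     <- cumsum(c(0, hits))
window_hits <- running[(width + 1):(n + 1)] - running[1:(n - width + 1)]

mavg <- data.frame(
  Index   = width:n,              # last at-bat in each window (assumed convention)
  Average = window_hits / width,  # batting average within the window
  AVG     = mean(hits)            # overall average, constant across rows
)

print(head(mavg))
```

A streaky hitter would show long stretches where Average stays well above or well below AVG, which is the region that the ribbon in moving_average_plot fills.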
For comparison, it is better to put the two moving average plots on the same scale:
p1 <- moving_average_plot(walker_s_data) +
  ylim(.1, .5) +
  annotate("text", x=200, y=0.45,
           label="Neil Walker", size=7) +
  ylab("Moving Average") + xlab("")
p2 <- moving_average_plot(aoki_s_data) +
  ylim(.1, .5) +
  annotate("text", x=200, y=0.45,
           label="Nori Aoki", size=7) +
  ylab("Moving Average") + xlab("At Bat Number")
grid.arrange(p1, p2)
10.5 Geometric Plots
sp <- find_spacings(walker)
geometric_plot(sp) +
  theme(plot.title = element_text(colour = "blue", size = 18,
                                  hjust = 0.5)) + ggtitle("Walker")
sp <- find_spacings(aoki)
geometric_plot(sp) +
  theme(plot.title = element_text(colour = "blue",
                                  size = 18,
                                  hjust = 0.5)) + ggtitle("Aoki")