Load several useful packages.
library(tidyverse)
library(ggplot2)
Collect several useful tables from Fangraphs. The first dataset contains basic hitting statistics and the second dataset has stats related to plate discipline. We merge the two datasets, creating a single data frame, 146 observations and 33 variables.
d1 <- read_csv("https://bayesball.github.io/VB/data/Dashboard_2016.csv")
d2 <- read_csv("https://bayesball.github.io/VB/data/Plate_Discipline_2016.csv")
d <- inner_join(d1, d2, by="playerid")
vars <- c(14, 25:33)
d_subset <- d[, vars]
names(d_subset) <- c("OBP", "O_Swing", "Z_Swing", "Swing",
"O_Contact", "Z_Contact",
"Contact", "Zone",
"F_Strike", "SwStr")
names(d)[c(14, 25:33)] <- names(d_subset)
Following graph constructs a scatterplot of the swing and contact rates for all hitters with a smoothing curve added.
ggplot(d, aes(Swing, Contact)) +
geom_point(size=2) +
geom_smooth(se=FALSE) +
xlab("Swing Rate") + ylab("Contact Rate")
We divide the players into “high” and “low” strikeout groups. We use contact and swing rates to predict (by a logistic model) the probability a hitter is in the high strikeout group. A line is added to the scatterplot – points above (below) the line are predicted to be in the low (high) K groups.
d$K_Rate <- with(d, ifelse(K > .1875, "HI", "LO"))
d$y <- ifelse(d$K_Rate=="HI", 1, 0)
glm(y ~ Contact + Swing, data=d, family=binomial) -> F
ggplot(d, aes(Swing, Contact,
color=K_Rate)) +
geom_point(size=3) +
xlab("Swing Rate") + ylab("Contact Rate") +
geom_abline(intercept = coef(F)[1] / (-coef(F)[2]),
slope = coef(F)[3] / (-coef(F)[2])) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("black", "grey60"))
We divide the players into “high” and “low” walk groups. We use contact and swing rates to predict (by a logistic model) the probability a hitter is in the high walk group. A line is added to the scatterplot – points to the left (to the right) of the line are predicted to be in the high (low) walk groups.
d$BB_Cat <- with(d, ifelse(BB > .082, "HI", "LO"))
d$y <- ifelse(d$BB_Cat=="HI", 1, 0)
glm(y ~ Contact + Swing, data=d, family=binomial) -> F
ggplot(d, aes(Swing, Contact,
color=BB_Cat)) +
xlab("Swing Rate") + ylab("Contact Rate") +
geom_point(size=3) +
geom_abline(intercept = coef(F)[1] / (-coef(F)[2]),
slope = coef(F)[3] / (-coef(F)[2])) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("black", "grey60"))
We first identify the players who have the smallest (TOP) and largest (BOTTOM) strikeout rates.
d <- mutate(d,
K_Type=ifelse(K < .12, "TOP",
ifelse(K > .25, "BOTTOM", NA)))
select(filter(d, K_Type == "TOP"),
Name.x, Team.x, K)
## # A tibble: 16 x 3
## Name.x Team.x K
## <chr> <chr> <dbl>
## 1 Mookie Betts Red Sox 0.110
## 2 Jose Altuve Astros 0.098
## 3 Adrian Beltre Rangers 0.103
## 4 Daniel Murphy Nationals 0.098
## 5 Dustin Pedroia Red Sox 0.105
## 6 Jose Ramirez Indians 0.100
## 7 Buster Posey Giants 0.111
## 8 Ender Inciarte Braves 0.118
## 9 Martin Prado Marlins 0.105
## 10 Yadier Molina Cardinals 0.108
## 11 Joe Panik Giants 0.089
## 12 Jose Iglesias Tigers 0.097
## 13 Yunel Escobar Angels 0.118
## 14 Melky Cabrera White Sox 0.107
## 15 Brandon Phillips Reds 0.116
## 16 Albert Pujols Angels 0.115
select(filter(d, K_Type == "BOTTOM"),
Name.x, Team.x, K)
## # A tibble: 16 x 3
## Name.x Team.x K
## <chr> <chr> <dbl>
## 1 Jonathan Villar Brewers 0.256
## 2 Adam Duvall Reds 0.270
## 3 Chris Davis Orioles 0.329
## 4 Khris Davis Athletics 0.272
## 5 Jake Lamb Diamondbacks 0.259
## 6 Leonys Martin Mariners 0.259
## 7 Mark Trumbo Orioles 0.255
## 8 Russell Martin Blue Jays 0.277
## 9 Danny Espinosa Nationals 0.290
## 10 Travis Shaw Red Sox 0.251
## 11 Michael Saunders Blue Jays 0.281
## 12 Justin Upton Tigers 0.286
## 13 Alex Gordon Royals 0.292
## 14 Melvin Upton Jr. - - - 0.288
## 15 Mike Napoli Indians 0.301
## 16 Chris Carter Brewers 0.320
Similarly we identify the players with the largest (TOP) and smallest (BOTTOM) walk rates
d <- mutate(d,
BB_Type=ifelse(BB > .13, "TOP",
ifelse(BB < .05, "BOTTOM", NA)))
select(filter(d, BB_Type == "TOP"),
Name.x, Team.x, BB)
## # A tibble: 13 x 3
## Name.x Team.x BB
## <chr> <chr> <dbl>
## 1 Mike Trout Angels 0.170
## 2 Josh Donaldson Blue Jays 0.156
## 3 Joey Votto Reds 0.160
## 4 Paul Goldschmidt Diamondbacks 0.156
## 5 Dexter Fowler Cubs 0.143
## 6 Brandon Belt Giants 0.159
## 7 Ben Zobrist Cubs 0.152
## 8 Carlos Santana Indians 0.144
## 9 Bryce Harper Nationals 0.172
## 10 Matt Carpenter Cardinals 0.143
## 11 Chris Davis Orioles 0.132
## 12 Jose Bautista Blue Jays 0.168
## 13 Joe Mauer Twins 0.137
select(filter(d, BB_Type == "BOTTOM"),
Name.x, Team.x, BB)
## # A tibble: 15 x 3
## Name.x Team.x BB
## <chr> <chr> <dbl>
## 1 Starling Marte Pirates 0.043
## 2 Kevin Pillar Blue Jays 0.041
## 3 Eduardo Nunez - - - 0.049
## 4 Didi Gregorius Yankees 0.032
## 5 Freddy Galvis Phillies 0.040
## 6 Salvador Perez Royals 0.040
## 7 Jonathan Schoop Orioles 0.032
## 8 Rougned Odor Rangers 0.030
## 9 Josh Harrison Pirates 0.034
## 10 Starlin Castro Yankees 0.039
## 11 Brandon Phillips Reds 0.031
## 12 Adonis Garcia Braves 0.043
## 13 Alcides Escobar Royals 0.040
## 14 Marwin Gonzalez Astros 0.042
## 15 Alexei Ramirez - - - 0.042
This scatterplot compares the top and bottom K groups with respect to the contact rates in the zone and outside of the zone.
ggplot(filter(d, K_Type %in% c("TOP", "BOTTOM")),
aes(Z_Contact, O_Contact, color=K_Type)) +
geom_point(size=3) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("grey50", "black" ))
This scatterplot compares the top and bottom BB groups with respect to the swing rates in the zone and outside of the zone.
ggplot(filter(d, BB_Type %in% c("TOP", "BOTTOM")),
aes(Z_Swing, O_Swing, color=BB_Type)) +
geom_point(size=3) +
scale_shape(solid = FALSE) +
scale_colour_manual(values = c("grey50", "black" ))