ggplot2 is an R package for graphing data based on “The Grammar of Graphics” framework introduced by Leland Wilkinson. This package is used to construct all of the graphs for the book Visualizing Baseball. The purpose of this document is to introduce ggplot2 using a familiar baseball dataset. I introduce the basic framework and illustrate the use of ggplot2 to construct graphs for different types of variables.
First, I collect hitting data for all teams in the 2015 baseball season. For each team, I compute its slugging percentage SLG and its on-base percentage OBP.
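library(dplyr)
library(Lahman)
# Keep only the 2015 team records
teams2015 <- filter(Teams, yearID == 2015)
# Name the doubles and triples columns (positions 18 and 19) X2B and X3B
names(teams2015)[18:19] <- c("X2B", "X3B")
# Make sure SF and HBP are numeric before doing arithmetic with them
teams2015$SF <- as.numeric(teams2015$SF)
teams2015$HBP <- as.numeric(teams2015$HBP)
teams2015 <- mutate(teams2015,
                    X1B = H - X2B - X3B - HR,               # singles
                    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,  # total bases
                    SLG = TB / AB,
                    OBP = (H + BB + HBP) /
                      (AB + BB + HBP + SF))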
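To make the two formulas concrete, here is a small worked example in base R. The season totals below are hypothetical numbers chosen for easy arithmetic, not the statistics of any real team; the variable names simply mirror the Lahman column names used above.

```r
# Hypothetical season totals for one team (made-up values)
AB  <- 5500   # at-bats
H   <- 1400   # hits
X2B <- 280    # doubles
X3B <- 30     # triples
HR  <- 170    # home runs
BB  <- 500    # walks
HBP <- 50     # hit by pitch
SF  <- 40     # sacrifice flies

# Singles are the hits that are not doubles, triples, or home runs
X1B <- H - X2B - X3B - HR

# Total bases weight each hit by the number of bases gained
TB <- X1B + 2 * X2B + 3 * X3B + 4 * HR

# Slugging percentage: total bases per at-bat
SLG <- TB / AB

# On-base percentage: times on base per relevant plate appearance
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)

round(c(SLG = SLG, OBP = OBP), 3)
```

With these inputs there are 920 singles and 2250 total bases, so SLG is 2250 / 5500 and OBP is 1950 / 6090.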
To construct a graph using ggplot2, one needs three things:
1. A data frame that contains the data that you want to graph.
2. Aesthetics, or roles, assigned to particular variables in the data frame.
3. A geometric object (or geom for short), which is what you are plotting.
For example, suppose we wish to construct a scatterplot of the on-base percentage and the slugging percentages for all teams in the 2015 season.
The data frame teams2015 contains the data, and OBP and SLG are the variables of interest.
To construct a scatterplot, you need a variable on the horizontal axis (x) and a variable on the vertical axis (y). If I want OBP to be the horizontal axis variable and SLG the vertical axis variable, I would assign OBP to the x aesthetic and SLG to the y aesthetic.
Steps 1 and 2 are communicated by the command
library(ggplot2)
ggplot(data=teams2015, aes(x=OBP, y=SLG))
The ggplot function only sets up the axes; it does not plot anything. To construct a scatterplot, we need to add a point geometric object with the function geom_point. Now we see the scatterplot. There is a clear positive association between a team’s OBP and its SLG.
ggplot(data=teams2015, aes(x=OBP, y=SLG)) +
geom_point()
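The layering idea can be checked directly: a ggplot object carries a list of layers that stays empty until a geom is added. The sketch below uses a small made-up data frame, toy, as a stand-in for teams2015 (its values are invented for illustration) and inspects the layers component of the plot object.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  OBP = c(0.305, 0.318, 0.325, 0.340),
  SLG = c(0.380, 0.400, 0.415, 0.450)
)

# Data and aesthetics only: the axes are set up, nothing is drawn
p_axes <- ggplot(data = toy, aes(x = OBP, y = SLG))
length(p_axes$layers)

# Adding a geom appends one layer; printing p_points draws the points
p_points <- p_axes + geom_point()
length(p_points$layers)
```

Saving the plot in a variable like this also makes it easy to try different geoms on the same data and aesthetics.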
There are other roles or aesthetics that you can assign to variables.
For example, the variable lgID gives the league (AL or NL). We can assign lgID to the color aesthetic so that the points are colored by league. This tells us that the team with the highest OBP and highest SLG was from the American League.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, color=lgID)) +
geom_point()
There are other aesthetics like shape and size.
Here I can use different plotting symbols for each league by assigning lgID to the shape aesthetic. Personally, I think the different shapes are harder to distinguish than the different colors.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, shape=lgID)) +
geom_point()
The variable W is the number of team wins. Here I assign W to the size aesthetic. Notice that the team with the highest OBP and SLG values appears to have won a lot of games in the 2015 season.
ggplot(data=teams2015,
aes(x=OBP, y=SLG, size=W)) +
geom_point()
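Aesthetics can also be combined in a single call. As a sketch, the code below maps the league to both color and shape, so the two groups are encoded redundantly, while the point size is set to a fixed value outside aes. The data frame toy is a hypothetical stand-in for teams2015 with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  lgID = c("AL", "AL", "AL", "NL", "NL", "NL"),
  OBP  = c(0.310, 0.325, 0.340, 0.305, 0.318, 0.330),
  SLG  = c(0.390, 0.415, 0.450, 0.380, 0.400, 0.420)
)

# color and shape both depend on lgID (inside aes);
# size = 3 is a constant applied to every point (outside aes)
p_multi <- ggplot(data = toy,
                  aes(x = OBP, y = SLG, color = lgID, shape = lgID)) +
  geom_point(size = 3)
p_multi
```

Anything placed inside aes is driven by a variable in the data frame; anything placed outside it is a fixed setting.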
In ggplot2, it is easy to break the plot into several panels defined by a categorical variable; these panels are called facets. Suppose I want to construct panels of scatterplots of SLG against OBP, where the panels are defined by the league variable.
From this graph, it appears that the AL teams generally had higher SLG values than the NL teams.
ggplot(data=teams2015,
aes(x=OBP, y=SLG)) +
geom_point() +
facet_grid(~ lgID)
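An alternative to facet_grid is facet_wrap, which also takes a formula and lets you choose the panel layout. The sketch below stacks the two league panels vertically with ncol = 1, which can make it easier to compare OBP values across leagues on a common horizontal axis. The data frame toy is a hypothetical stand-in for teams2015 with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  lgID = c("AL", "AL", "AL", "NL", "NL", "NL"),
  OBP  = c(0.310, 0.325, 0.340, 0.305, 0.318, 0.330),
  SLG  = c(0.390, 0.415, 0.450, 0.380, 0.400, 0.420)
)

# One panel per league, arranged in a single column
p_facet <- ggplot(data = toy, aes(x = OBP, y = SLG)) +
  geom_point() +
  facet_wrap(~ lgID, ncol = 1)
p_facet
```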
There are many possible geometric objects (geoms) that one can use depending on the number of variables and variable types.
Suppose one wants to construct a histogram of the OBP values for the 30 teams. Here the single aesthetic is x and we use geom_histogram. I indicate that we want to use five bins in the histogram.
ggplot(data=teams2015, aes(x=OBP)) +
geom_histogram(bins=5)
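To see which bins the histogram actually used, one option is the layer_data function, which returns the data frame that ggplot2 computed for a layer. The sketch below pulls out the bin limits and counts; toy is a hypothetical stand-in for teams2015 with made-up OBP values.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  OBP = c(0.305, 0.310, 0.318, 0.322, 0.325, 0.330, 0.334, 0.340)
)

p_hist <- ggplot(data = toy, aes(x = OBP)) +
  geom_histogram(bins = 5)

# Bin limits and counts computed by the histogram layer
bin_table <- layer_data(p_hist)[, c("xmin", "xmax", "count")]
bin_table

# Every observation falls in exactly one bin
sum(bin_table$count)
```

Looking at this table is a quick way to judge whether a different number of bins would be a better choice.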
A bar chart is a graph of a single categorical variable that we can produce using the geom_bar geom. This graph confirms that there are 15 teams in each league.
ggplot(data=teams2015, aes(x=lgID)) + geom_bar()
We said earlier that it appeared that the slugging percentages were greater for teams in the American League. A more direct way to graphically compare the two groups of SLG values is by parallel boxplots. Here we assign lgID to the x aesthetic and SLG to the y aesthetic, and use the geom_boxplot geom.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_boxplot()
Another geom that one can use in this scenario is geom_jitter, which produces jittered points. The width option controls the horizontal range of the jittering.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_jitter(width=0.1)
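Because geoms are added as layers, the boxplot and the jittered points can be shown together. In the sketch below, outlier.shape = NA hides the boxplot's own outlier symbols so those teams are not drawn twice. The data frame toy is a hypothetical stand-in for teams2015 with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  lgID = c("AL", "AL", "AL", "AL", "NL", "NL", "NL", "NL"),
  SLG  = c(0.390, 0.415, 0.430, 0.450, 0.380, 0.395, 0.400, 0.420)
)

# Layer 1: boxplots without outlier symbols
# Layer 2: the individual teams, jittered horizontally
p_both <- ggplot(data = toy, aes(x = lgID, y = SLG)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1)
p_both
```

With only 15 teams per league, showing the individual points alongside the summary can be more informative than either display alone.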
In ggplot2, it is possible to modify all aspects of the graph. I illustrate some basic modifications here. I use the ggtitle function to add a plot title, and use the xlab and ylab functions to add x and y axis labels.
ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
geom_jitter(width=0.1) +
ggtitle("Slugging Percentages in the 2015 Season by League") +
xlab("League") + ylab("Slugging Percentage")
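The title and both axis labels can also be set in one call with the labs function, which some people find easier to read. This is a sketch of an equivalent labeling, again using a hypothetical data frame toy with made-up values in place of teams2015.

```r
library(ggplot2)

# Hypothetical stand-in for teams2015; the numbers are made up
toy <- data.frame(
  lgID = c("AL", "AL", "AL", "NL", "NL", "NL"),
  SLG  = c(0.390, 0.415, 0.450, 0.380, 0.400, 0.420)
)

# labs() sets the title and the axis labels together
p_labeled <- ggplot(data = toy, aes(x = lgID, y = SLG)) +
  geom_jitter(width = 0.1) +
  labs(title = "Slugging Percentages in the 2015 Season by League",
       x = "League",
       y = "Slugging Percentage")
p_labeled
```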
Hopefully this introduction gets you interested in trying out ggplot2 for your own graphs. I encourage you to try out some of the ggplot2 example scripts for the different chapters and then apply ggplot2 to your own problems.