Introduction

ggplot2 is an R package for graphing data based on the “Grammar of Graphics” framework introduced by Leland Wilkinson. This package is used to construct all of the graphs for the book Visualizing Baseball. The purpose of this document is to introduce ggplot2 using a familiar baseball dataset. I describe the basic framework and illustrate the use of ggplot2 to construct graphs for different types of variables.

Some Baseball Data

I collect hitting data for all teams in the 2015 baseball season from the Teams table in the Lahman package. For each team, I compute its slugging percentage SLG and its on-base percentage OBP.

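library(dplyr)
library(Lahman)
# Keep only the 2015 season from the Lahman Teams table
teams2015 <- filter(Teams, yearID == 2015)
# Name the doubles and triples columns (assumed to be columns 18 and 19)
names(teams2015)[18:19] <- c("X2B", "X3B")
# Make sure SF and HBP are numeric before doing arithmetic with them
teams2015$SF <- as.numeric(teams2015$SF)
teams2015$HBP <- as.numeric(teams2015$HBP)
# Compute singles, total bases, slugging percentage, and on-base percentage
teams2015 <- mutate(teams2015,
                    X1B = H - X2B - X3B - HR,
                    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
                    SLG = TB / AB,
                    OBP = (H + BB + HBP) /
                      (AB + BB + HBP + SF))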
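
To make the two formulas concrete, here is a small worked example in base R. The season totals are made-up numbers for a hypothetical team, not values from the Lahman data; only the formulas are taken from the code above.

```r
# Made-up season totals for one hypothetical team (not real data)
H <- 1400; X2B <- 280; X3B <- 30; HR <- 170
AB <- 5500; BB <- 500; HBP <- 50; SF <- 40

# Singles are the hits that are not doubles, triples, or home runs
X1B <- H - X2B - X3B - HR                   # 920

# Total bases weight each hit by the number of bases gained
TB <- X1B + 2 * X2B + 3 * X3B + 4 * HR      # 2250

# Slugging percentage: total bases per at-bat
SLG <- TB / AB

# On-base percentage: times on base per plate appearance counted here
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)

round(c(SLG = SLG, OBP = OBP), 3)           # SLG 0.409, OBP 0.320
```

The mutate call above applies the same arithmetic to every row of teams2015 at once.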

Three Basic Components of a ggplot2 Graph

To construct a graph using ggplot2, one needs …

  1. A data frame that contains the data that you want to graph.

  2. Aesthetics or roles assigned to particular variables in the data frame.

  3. A geometric object (or geom for short) which is what you are plotting.

For example, suppose we wish to construct a scatterplot of the on-base percentages and slugging percentages for all teams in the 2015 season.

  1. The data frame teams2015 contains the data and OBP and SLG are the variables of interest.

  2. To construct a scatterplot, you need a variable on the horizontal axis (x) and a variable on the vertical axis (y). Since I want OBP on the horizontal axis and SLG on the vertical axis, I assign OBP to the x aesthetic and SLG to the y aesthetic.

Steps 1 and 2 are communicated by the command

library(ggplot2)
ggplot(data=teams2015, aes(x=OBP, y=SLG))

The ggplot function only sets up the axes; it does not plot anything. To construct a scatterplot, we need to add a point geometric object, which is done with the function geom_point. With the geom added, we see the scatterplot. There is a clear positive association between a team’s OBP and its SLG.

ggplot(data=teams2015, aes(x=OBP, y=SLG)) +
  geom_point()
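
One way to see the split between setting up the plot and adding a geom is to store the plot in a variable and count its layers. This is only a sketch: the data frame toy holds made-up values for three hypothetical teams, and inspecting the layers element of the plot object is a convenience for illustration, not something you need in everyday use.

```r
library(ggplot2)

# Made-up values for three hypothetical teams (not real data)
toy <- data.frame(OBP = c(0.310, 0.320, 0.335),
                  SLG = c(0.390, 0.405, 0.440))

# Steps 1 and 2: data frame and aesthetics, but no geom yet
p <- ggplot(data = toy, aes(x = OBP, y = SLG))
length(p$layers)     # no layers, so only the axes would be drawn

# Step 3: adding a geom adds one layer to the stored plot
p2 <- p + geom_point()
length(p2$layers)    # one layer, the points

# print() draws the plot; typing p2 at the console does the same
print(p2)
```

Storing the plot this way is also handy when you want to try several geoms on the same data and aesthetics.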

Other Aesthetics (color, shape, and size)

There are other roles or aesthetics that you can assign to variables.

For example, the variable lgID gives the league (AL or NL). We can assign lgID to the color aesthetic – so the points are colored by the league variable. This tells us that the team with the highest OBP and highest SLG was from the American League.

ggplot(data=teams2015, 
       aes(x=OBP, y=SLG, color=lgID)) +
  geom_point()

There are other aesthetics like shape and size.

Here I can use different plotting symbols for each league by assigning lgID to the shape aesthetic. Personally, I think the different shapes are harder to distinguish than the different colors.

ggplot(data=teams2015, 
       aes(x=OBP, y=SLG, shape=lgID)) +
  geom_point()

The variable W is the number of team wins. Here I assign the variable W to the size aesthetic. Notice that the team with the highest OBP and SLG values appeared to win a lot of games in the 2015 season.

ggplot(data=teams2015, 
       aes(x=OBP, y=SLG, size=W)) +
  geom_point()
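
One point the examples above leave implicit is the difference between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below contrasts the two using a made-up data frame toy with four hypothetical teams. It uses ggplot_build, which returns the computed data behind a plot, only to show which colors end up on the points.

```r
library(ggplot2)

# Made-up values for four hypothetical teams (not real data)
toy <- data.frame(OBP  = c(0.310, 0.320, 0.335, 0.315),
                  SLG  = c(0.390, 0.405, 0.440, 0.400),
                  lgID = c("AL", "NL", "AL", "NL"))

# Mapped: the color depends on lgID, and a legend is drawn
p_mapped <- ggplot(data = toy, aes(x = OBP, y = SLG, color = lgID)) +
  geom_point()

# Set: every point is drawn in blue, and there is no legend
p_fixed <- ggplot(data = toy, aes(x = OBP, y = SLG)) +
  geom_point(color = "blue", size = 3)

# Colors actually assigned to the points in each plot
unique(ggplot_build(p_mapped)$data[[1]]$colour)   # one color per league
unique(ggplot_build(p_fixed)$data[[1]]$colour)    # just "blue"
```

A fixed value goes in the geom function because it describes how the points look, not a role played by a variable.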

Faceting

In ggplot2, it is easy to break the plot into several panels defined by a categorical variable. These panels are called facets. Suppose I want to construct panels of scatterplots of SLG against OBP, where the panels are defined by the league variable.

From this graph, it appears that the AL teams generally had higher SLG values than the NL teams.

ggplot(data=teams2015, 
       aes(x=OBP, y=SLG)) +
  geom_point() +
  facet_grid(~ lgID)
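
The formula given to facet_grid decides how the panels are arranged: a variable to the right of the ~ gives side-by-side panels, and a variable to the left gives stacked panels. The sketch below shows both arrangements on a made-up data frame toy (hypothetical values, not the 2015 data).

```r
library(ggplot2)

# Made-up values for four hypothetical teams (not real data)
toy <- data.frame(OBP  = c(0.310, 0.320, 0.335, 0.315),
                  SLG  = c(0.390, 0.405, 0.440, 0.400),
                  lgID = c("AL", "NL", "AL", "NL"))

p <- ggplot(data = toy, aes(x = OBP, y = SLG)) +
  geom_point()

# One column of panels per league, as in the example above
p_cols <- p + facet_grid(~ lgID)

# One row of panels per league, stacked vertically
p_rows <- p + facet_grid(lgID ~ .)
```

Side-by-side panels share the vertical axis, which makes the SLG values easy to compare; stacked panels share the horizontal axis, which favors comparing OBP.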

Some Plot Geoms

There are many possible geometric objects (geoms) that one can use depending on the number of variables and variable types.

A Single Numeric Variable

Suppose one wants to construct a histogram of the OBPs for the 30 teams. Here the single aesthetic is x and we use geom_histogram. I indicate that we want to use five bins in the histogram.

ggplot(data=teams2015, aes(x=OBP)) + 
  geom_histogram(bins=5)
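
Instead of choosing the number of bins, one can choose the width of each bin with the binwidth argument. The sketch below shows both options on ten made-up OBP values (not real team data); the bin width of 0.010 is an arbitrary choice for illustration.

```r
library(ggplot2)

# Ten made-up OBP values (not real data)
toy <- data.frame(OBP = c(0.301, 0.307, 0.312, 0.315, 0.318,
                          0.321, 0.322, 0.326, 0.331, 0.340))

# Option 1: fix the number of bins
p_bins <- ggplot(data = toy, aes(x = OBP)) +
  geom_histogram(bins = 5)

# Option 2: fix the width of each bin, in OBP units
p_width <- ggplot(data = toy, aes(x = OBP)) +
  geom_histogram(binwidth = 0.010)

# Either way, the bin counts add up to the number of observations
sum(ggplot_build(p_bins)$data[[1]]$count)
sum(ggplot_build(p_width)$data[[1]]$count)
```

Choosing binwidth can be easier to explain to a reader, since each bar then covers a stated range of OBP values.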

A Single Categorical Variable

A bar chart is a graph of a single categorical variable that we can produce using the geom_bar geom. This graph confirms that there are 15 teams in each league.

ggplot(data=teams2015, aes(x=lgID)) + geom_bar()

One Categorical Variable and One Numeric Variable

I noted earlier that the slugging percentages appeared to be greater for teams in the American League. A more direct way to graphically compare the two groups of SLG values is by parallel boxplots. Here we assign lgID to the x aesthetic, SLG to the y aesthetic, and use the geom_boxplot geom.

ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
  geom_boxplot()

Another geom that one can use in this scenario is geom_jitter, which produces jittered points. The width option controls the amount of horizontal jittering.

ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
  geom_jitter(width=0.1)
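
Since geoms are added as layers, the boxplot and the jittered points can also be combined in one graph. The sketch below is one way to do this, using made-up SLG values for ten hypothetical teams. Two choices here are my own assumptions about what looks good: outlier.shape = NA hides the boxplot's outlier points so they are not drawn twice, and height = 0 turns off the small vertical jitter that geom_jitter applies by default, so each point sits at its exact SLG value.

```r
library(ggplot2)

# Made-up SLG values for ten hypothetical teams (not real data)
toy <- data.frame(lgID = rep(c("AL", "NL"), each = 5),
                  SLG  = c(0.380, 0.395, 0.410, 0.425, 0.450,
                           0.370, 0.385, 0.400, 0.405, 0.420))

p <- ggplot(data = toy, aes(x = lgID, y = SLG)) +
  geom_boxplot(outlier.shape = NA) +       # layer 1: boxplots without outlier points
  geom_jitter(width = 0.1, height = 0)     # layer 2: points jittered horizontally only
```

Layers are drawn in the order they are added, so the points appear on top of the boxes.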

Modifying the Axes

In ggplot2, it is possible to modify virtually every aspect of the graph. I illustrate some basic modifications here. I use the ggtitle function to add a plot title, and the xlab and ylab functions to add x and y axis labels.

ggplot(data=teams2015, aes(x=lgID, y=SLG)) +
  geom_jitter(width=0.1) +
  ggtitle("Slugging Percentages in the 2015 Season by League") +
  xlab("League") + ylab("Slugging Percentage")
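
The same title and axis labels can also be set in a single call to the labs function. The sketch below shows this alternative on a made-up data frame toy (hypothetical values, not the 2015 data).

```r
library(ggplot2)

# Made-up SLG values for six hypothetical teams (not real data)
toy <- data.frame(lgID = rep(c("AL", "NL"), each = 3),
                  SLG  = c(0.395, 0.410, 0.450, 0.385, 0.400, 0.420))

p <- ggplot(data = toy, aes(x = lgID, y = SLG)) +
  geom_jitter(width = 0.1) +
  labs(title = "Slugging Percentages by League",   # same role as ggtitle()
       x = "League",                               # same role as xlab()
       y = "Slugging Percentage")                  # same role as ylab()
```

Which form to use is a matter of taste; labs keeps all of the text for a graph in one place.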

Learning More About ggplot2

I hope this introduction gets you interested in trying out ggplot2 for your own graphs. I encourage you to try out some of the ggplot2 example scripts for the different chapters and then apply ggplot2 to your own problems.