We load the packages Lahman, dplyr, and ggplot2.
library(Lahman)
library(dplyr)
library(ggplot2)
The Lahman package contains season-by-season data for players and teams from the Sean Lahman database. For this history of home runs graph, we want to collect the number of home runs hit (variable HR) and the number of games played (variable G) for all teams for all seasons since 1900.
Here are a few sample rows of our data.
select(sample_n(Teams, 8), yearID, teamID, HR, G)
## yearID teamID HR G
## 2625 2009 WAS 156 162
## 1664 1975 BOS 134 160
## 2252 1997 MIN 132 162
## 491 1907 PHI 12 149
## 2291 1998 SFN 161 163
## 1180 1949 PIT 126 154
## 1924 1985 CIN 114 162
## 1895 1984 CAL 150 162
I use the summarize function to collect the total number of home runs hit and the total number of games played for each season.
S <- summarize(group_by(Teams, yearID),
               HR = sum(HR),
               G = sum(G))
I use the filter function to select the summary data for only seasons 1900 or later.
S2 <- filter(S, yearID >= 1900)
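The two steps above can also be chained with the pipe operator that dplyr exports. Here is a minimal sketch on a small made-up table, so it runs without the Lahman package; the name teams_toy and all of its values are invented for illustration and are not real team data.

```r
library(dplyr)

# Toy stand-in for the Teams table (made-up values, hypothetical teams)
teams_toy <- tibble(
  yearID = c(1899, 1899, 1900, 1900, 1901, 1901),
  teamID = c("AAA", "BBB", "AAA", "BBB", "AAA", "BBB"),
  HR     = c(10, 20, 30, 40, 50, 60),
  G      = c(150, 150, 140, 140, 140, 140)
)

# Total HR and G per season, then keep seasons 1900 or later
season_totals <- teams_toy %>%
  group_by(yearID) %>%
  summarize(HR = sum(HR), G = sum(G)) %>%
  filter(yearID >= 1900)

print(season_totals)
# 1900: HR = 70,  G = 280
# 1901: HR = 110, G = 280
```

The 1899 rows are summed like any other season and then dropped by the filter, which mirrors how S2 is built from S.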
Now I can construct the plot. I construct a scatterplot of the number of home runs hit per team per game (variable HR / G) against yearID (horizontal). I add a loess smoothing curve to the plot to see the general pattern.
ggplot(S2, aes(yearID, HR / G)) +
  geom_point() +
  geom_smooth(method="loess", span=0.3, se=FALSE)
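It may help to see why HR / G is a rate per team per game rather than per game. Each game appears in the G count of both teams that played it, so summing G over teams counts every game twice. The small sketch below uses a hypothetical two-team league with made-up numbers to show the difference.

```r
# Hypothetical league: two teams play each other 3 times (made-up numbers)
hr <- c(4, 5)   # home runs hit by each team
g  <- c(3, 3)   # games played by each team; the same 3 games appear twice

# Rate plotted above: home runs per team per game
per_team_game <- sum(hr) / sum(g)

# Home runs per actual game (both teams combined)
per_game <- sum(hr) / (sum(g) / 2)

print(c(per_team_game = per_team_game, per_game = per_game))
# per_team_game = 1.5, per_game = 3
```

Under this assumption the per-game rate is simply twice the plotted rate, so the shape of the trend is unchanged.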
Next, suppose we want to graph the season-leading home run total against season for all seasons from 1900 onward.
The Batting data frame in the Lahman package contains the number of home runs hit by each player each season.
Actually, the Batting table contains more than one row for players who play for more than one team in a given season. So we first use the summarize function to find the total number of home runs for each player in each season.
S <- summarize(group_by(Batting, yearID, playerID),
               HR=sum(HR, na.rm=TRUE))
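To illustrate what this step does, here is a minimal sketch on a made-up table in which one player has two rows in the same season and another has a missing home run count. The name batting_toy, the player IDs, and the values are hypothetical; the .groups = "drop" argument assumes dplyr 1.0.0 or later and only controls the grouping of the result.

```r
library(dplyr)

# Toy stand-in for the Batting table (made-up values)
batting_toy <- tibble(
  playerID = c("p1", "p1", "p2", "p3"),
  yearID   = c(2000, 2000, 2000, 2000),
  stint    = c(1, 2, 1, 1),       # p1 played for two teams in 2000
  HR       = c(10, 5, 12, NA)     # p3 has a missing value
)

# One row per player per season; na.rm = TRUE ignores missing values
player_season <- batting_toy %>%
  group_by(yearID, playerID) %>%
  summarize(HR = sum(HR, na.rm = TRUE), .groups = "drop")

print(player_season)
# p1: 15, p2: 12, p3: 0
```

Note that a player whose only value is missing ends up with a total of 0, since sum() over no remaining values is 0.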
Next, for each season, we collect the maximum number of home runs hit by any one player. We also select only the rows of the data frame where the season is at least 1900.
S1 <- summarize(group_by(S, yearID),
                HR=max(HR, na.rm=TRUE))
S2 <- filter(S1, yearID >= 1900)
Here are the first few rows of our table.
head(S2)
## # A tibble: 6 x 2
## yearID HR
## <int> <int>
## 1 1900 12
## 2 1901 16
## 3 1902 16
## 4 1903 13
## 5 1904 10
## 6 1905 9
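The table above keeps only the leading total, not who hit it. If the leader's identity is also of interest, one possible approach (not the one used in this post) is dplyr's slice_max, which keeps the top rows within each group. The sketch below assumes dplyr 1.0.0 or later and uses a made-up per-player table with hypothetical player IDs.

```r
library(dplyr)

# Toy per-player season totals (made-up values)
player_season <- tibble(
  yearID   = c(1900, 1900, 1900, 1901, 1901),
  playerID = c("a", "b", "c", "a", "b"),
  HR       = c(12, 7, 12, 16, 9)
)

# Keep the row(s) with the largest HR in each season, including ties
leaders <- player_season %>%
  group_by(yearID) %>%
  slice_max(HR, n = 1, with_ties = TRUE) %>%
  ungroup()

print(leaders)
```

With with_ties = TRUE, a season with co-leaders contributes more than one row, so the result is not guaranteed to have exactly one row per season.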
Now I can construct the plot. I construct a scatterplot of the maximum number of home runs hit (variable HR) against yearID (horizontal). I add a loess smoothing curve to the plot to see the general pattern.
ggplot(S2, aes(yearID, HR)) +
  geom_point() +
  geom_smooth(method="loess", span=0.3, se=FALSE)