5 Boxplots
5.1 The Data:
In this topic, we start discussing how to compare batches of data effectively. Our dataset is taken from the 2001 Boston Marathon race. On the www.bostonmarathon.org
website, one can obtain results for participants of different genders, ages, and home countries. Here we focus on the time-to-completion for woman runners. We take a sample of women of ages 20, 30, 40, 50, and 60. In the display below, we show the data and then construct parallel stemplots of the times (in minutes) for all the runners in our study. The unit in our stemplot is one, so the shortest time among all 20-year women in our sample had a finish time of 150 minutes, which is equivalent to 2 1/2 hours.
Official times (minutes) of some women runners in the 2001 Boston Marathon.
Age=20
244 213 274 240 225 269 214 223 271 237
232 229 209 272 230 229 203 236 222 239
233 150
age=30
194 207 259 287 319 252 237 330 236 210
226 213 241 235 194 216 272 227 278 211
219 259 237 234 205
age=40
286 256 247 166 275 284 239 235 163 214
227 346 210 223 238 221 271 224 248 231
314 224 258 244 262
age=50
281 287 222 251 253 302 235 231 254 253
262 231 230 284 326 349 269 327 258 270
260 279 263 245 271
age=60
219 338 278 315 278 258 274 233 280 270
271
PARALLEL STEMPLOTS
One unit = 1 minute.
AGE=20 AGE=30 AGE=40 AGE=50 AGE=60
15 0 15 15 15 15
16 16 16 36 16 16
17 17 17 17 17
18 18 18 18 18
19 19 44 19 19 19
20 39 20 57 20 20 20
21 34 21 01369 21 04 21 21 9
22 23599 22 67 22 13447 22 2 22
23 023679 23 45677 23 1589 23 0115 23 3
24 04 24 1 24 478 24 5 24
25 25 299 25 68 25 13348 25 8
26 9 26 26 2 26 0239 26
27 124 27 28 27 15 27 019 27 01488
28 28 7 28 46 28 147 28 0
29 29 29 29 29
30 30 30 30 2 30
31 31 9 31 4 31 31 5
32 32 32 32 67 32
33 33 0 33 33 33 8
34 34 6 34 34 9 34
We are interested in graphically comparing the batches of times from the five age groups. An effective display is based on a boxplot, which is a graph of a five-number summary with outliers indicated.
5.2 Constructing A Single Boxplot
Let’s first illustrate the construction of a single boxplot for the times of the 20-year old women. There are \(n\) = 22 runners. So the location of the median is (22+1)/2 = 11 1/2, and the location of the fourths is (11+1)/2 = 6. From the stemplot, we find
\[ LO = 150, F_L = 222, M = 231, F_U = 240, HI = 274 . \]
Do we have any outliers? Here the fourth spread is \(d_F = 240 - 222 = 18\) and a step is 1.5 (18) = 27. The inner fences are at \[ 222 - 27 = 195 \,\, {\rm and} \,\, 240 + 27 = 267 \] Looking at the stemplot, we see one time (150) beyond the lower fence and four times (269, 271, 272, 274) beyond the upper fence. Certainly the low outlier is interesting since that corresponds to a very fast marathon runner.
To draw a boxplot:
- Draw a number line with tic marks covering the range of the data.
- Draw a box where the lines of the box correspond to the locations of the fourths and the median. (See diagram.)
- Indicate the outliers using separate plotting points.
- To complete the box, draw lines out from the box to the most extreme values that are not outliers. (These points are called adjacent values.)
Of course, we don’t need our labels and so our boxplot would look like
Here is a software generated boxplot display using the ggplot2
package in R.
library(LearnEDAfunctions)
library(tidyverse)
ggplot(filter(boston.marathon, age == 20),
aes(x = 1, y = time)) + xlim(0, 2) +
geom_boxplot() + coord_flip() +
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
5.3 Interpreting a Boxplot
Before we use boxplots to compare batches, let us spend some time interpreting a boxplot for a single batch. The figure below shows the histogram and corresponding boxplot for two datasets. The first dataset (left side) is symmetric with long tails on both sides.
If we look at the corresponding boxplot of this symmetric dataset, we see
- the location of the median (red line) is roughly half-way across the box (the location of the fourths)
- the lengths of the right and left whiskers (the lines extending from the box) are about the same – this means that the width of the lower quarter of the data is equal to the width of the upper quarter
Let’s contrast this boxplot of the symmetric batch with the boxplot of the batch on the right. From the histogram, we see that this data is skewed right – most of the data is in the 0-4 range and the values tail off towards large values. If we look at the corresponding boxplot, we see
- the length of the box from the median to the upper fourth is longer than the length from the lower fourth to the median – this indicates skewness in the middle half of the data
- the length of the right whisker is significantly longer than the length of the left whisker – this shows right skewness in the tail portion of the data
After some practice looking at boxplots, you’ll see that a boxplot is pretty informative about the shape of a batch.
5.4 Boxplots to Compare Batches
Now we are ready to use boxplots to compare the batches of running times for the different age groups. For each batch, we compute (1) the five-number summary, (2) the fences, and (3) indicate any outliers. Below, we have summarized our calculations for the five age groups, and then we use the calculations to construct boxplots for the batches. We display all of the boxplots on a single plot using one scale.
Age = 20
Depth Lower Upper
N= 22
M 11.5 231.000
F 6.0 222.000 240.000
STEP = 27
FENCES = 195, 258
OUTLIERS: 150, 269, 271, 272, 274
Age = 30
Depth Lower Upper
N= 25
M 13.0 235.000
F 7.0 213.000 259.000
STEP = 69
FENCES: 144, 328
OUTLIERS: 330, 346
Age = 40
Depth Lower Upper
N= 25
M 13.0 239.000
F 7.0 224.000 262.000
STEP = 57
FENCES = 167, 319
OUTLIERS: 163, 166
Age = 50
Depth Lower Upper
N= 25
M 13.0 262.000
H 7.0 251.000 281.000
STEP = 45
FENCES: 206, 326
OUTLIERS: 327, 349
Age = 60
Depth Lower Upper
N= 11
M 6.0 274.000
H 3.5 264.000 279.000
STEP = 22.5
FENCES: 241.5, 301.5
OUTLIERS: 219, 233, 315, 338
ggplot(boston.marathon, aes(x = factor(age), y = time)) +
geom_boxplot() + coord_flip() +
xlab("Age") + ylab("Time")
What do we see in this display of boxplots?
It is easier to interpret this display when the boxplots are sorted by the medians of the groups. Here this sorting occurs naturally, since the 20 year-olds generally have smaller times than the 30 year-olds, and the 30 year-olds have smaller times for the 40 year-olds, and so on.
We notice a number of outlying points. In each age group, there are one or two unusually large times. Since we give special recognition to short times, we notice the 20-year woman who ran the race in 150 minutes.
If we focus on the three middle age groups, we notice that each group has about the same spread. (The spreads of the times for the 20 year-olds and the 60 year-olds are a bit smaller.) The lengths of the boxes for the three groups are about the same, indicating they have similar fourth spreads.
5.5 Comparisons using Medians
When batches have similar spreads, it is easy to make comparisons. Let’s illustrate this for the three middle age groups that have similar spreads. The medians and fourth spreads for these batches are
Median Fourth-Spread
age30 235 min 46 min
age40 239 min 38 min
age50 262 min 30 min
Since the times for the 30-year-old and 40-year-old groups have approximately the same spread, the batch of 40-year-old times can be obtained by adding 4 minutes (the difference in medians) to the batch of 30-year-old times. In other words,
\[ age40 = age30 + 4 \] which means that the 40-year-olds run, on average, 4 minutes longer than the 30-year-older runners.
Similarly, comparing the two older groups, we can say that \[ age50 = age40 + 23 \] which means that the batch of 50-year-old times can be found by adding 23 minutes to the 40-year-old times.
Do older women runners run slower than younger women in the Boston Marathon? Looking back at our boxplot display and comparing medians of the five groups, we see that women of ages 20, 30, and 40 have (approximately) the same median completion time. The median time for the 50 year-old runners seems significantly higher that the times for the 20-40 year-olds, and the runners of age 60 have a significantly higher median than the 50-year-olds. So it appears that the best times for women marathoners are in a broad range between 20 and 40 years, and the times don’t appear to deteriorate until after age 40.
This is a nice illustration, since the batches of data had similar spreads and this facilitated comparisons by comparing medians. We will see in our next example that batches can have varying spreads and this motivates a reexpression or change in the scale of the data so that the reexpressed batches have similar spreads