4 Single Batch: Summaries

4.1 Meet the Data

Data: Percentage change in population 2000-2009 for each state.

Source: The 2010 New York Times Almanac, page 277, and the U.S. Census Bureau website http://www.census.gov.

This data (some of which is displayed in the following table) shows the change in population, measured as a percentage, for each state in the United States between the years 2000 and 2009 (roughly between the 2000 and 2010 censuses). The data are interesting because they tell us which regions of the U.S. are growing quickly and which are growing slowly. Specifically, we might want to know

  • what is a typical growth rate for a state in the last 9 years?
  • are there states whose growth rates differ significantly from the typical rate?
  • do the states with large population growths correspond to particular regions of the U.S.?

In this topic, we’ll discuss simple ways of summarizing a dataset. These summaries and associated displays will help in answering some of these questions.

library(LearnEDAfunctions)
library(tidyverse)
select(pop.change, State, Pct.change) %>% head()
##        State Pct.change
## 1    Alabama        5.9
## 2     Alaska       11.3
## 3    Arizona       28.6
## 4   Arkansas        8.1
## 5 California        9.1
## 6   Colorado       16.8

We begin by constructing a stemplot of the growth percentages. We break between the tens and ones places and split each stem into five lines, so that each line holds two possible leaf values. We have one unusual value – Nevada at the high end – that we show on a separate HI line.

aplpack::stem.leaf(pop.change$Pct.change, depth=FALSE)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 51
##    0* | 000001
##     t | 222333333
##     f | 4445555
##     s | 66677777
##    0. | 889
##    1* | 000111
##     t | 233
##     f | 
##     s | 666
##    1. | 89
##    2* | 0
##     t | 
##     f | 4
##     s | 
##    2. | 8
## HI: 32.3

4.2 Ranks and Depths

To describe our summaries, which we will call letter values, we have to first define a few terms. The rank of an observation is its order when the data are arranged from lowest to highest. For example, if we have the following six test scores \[ 40, 43, 65, 66, 77, 100 \] then 40 has rank 1, 43 has rank 2, 77 has rank 5, and so on.

We can distinguish between two ranks – a downward rank (abbreviated drank) is the rank of an observation when the data are arranged from HI to LO. In contrast, the upward rank (abbreviated urank) of an observation is its rank when data are arranged from LO to HI.

In our test score example,

          43 has upward rank 2 and downward rank 5.

If \(n\) is the number of data values, it should be clear that

          drank + urank = n+1

The depth of an observation is the smaller of the two ranks. That is,

          depth = minimum{drank, urank}.

The extreme observations, the smallest and the largest, will each have a depth of 1. The table below gives the downward ranks, the upward ranks, and the depths for our test scores:

DATA  40  43  65  66  77 100
-----------------------------
URANK  1   2   3   4   5   6
DRANK  6   5   4   3   2   1
DEPTH  1   2   3   3   2   1
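These definitions translate directly into R. Here is a minimal sketch, using only base R functions, that reproduces the table above and checks the identity drank + urank = n + 1:

scores <- c(40, 43, 65, 66, 77, 100)
urank <- rank(scores)          # ranks counting from LO to HI
drank <- rank(-scores)         # ranks counting from HI to LO
depth <- pmin(urank, drank)    # depth is the smaller of the two ranks
rbind(scores, urank, drank, depth)
all(drank + urank == length(scores) + 1)   # TRUE: drank + urank = n + 1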

4.3 Letter Values: A Set of Summary Values

We define our summaries, called letter values, using depths. The first letter value, the median (denoted by \(M\)), is the value that divides the data into a lower half and an upper half. The depth of the median is \((n+1)/2\), where \(n\) is the number of items in our batch.

          Depth of median = (n + 1) / 2

The median divides the data into halves. We can continue by dividing each half (the lower half and the upper half) into halves. These summaries are called fourths (denoted by the letter \(F\)). We find them by computing their depths. The depth of a fourth is found by taking the integer part of the depth of the median, adding 1, and then dividing by 2:

          Depth of fourth = ([Depth of median] + 1) / 2

Let’s compute the median and the fourths for the state growth percentages. Here

          n = 51

and so

    depth(M) = (51 + 1) / 2 = 26 and depth(F) = (26 + 1) / 2 = 13 1/2.

So the median \(M\) is the 26th smallest (or largest) observation. The fourths, called the lower fourth and the upper fourth, are the observations that have depth 13 1/2. When we say a depth of 13 1/2, we mean that we wish to average the observations that have depths of 13 and 14.
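As a quick check, these two depth rules can be evaluated directly in R:

n <- 51
d_M <- (n + 1) / 2            # depth of the median: 26
d_F <- (floor(d_M) + 1) / 2   # depth of the fourths: 13.5
c(depth_M = d_M, depth_F = d_F)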

4.4 Counting In

To find the median and fourths for our example, it is useful to add some extra numbers to our display. On each line of the stemplot, we write (on the left) the number of observations found on that line and on the more extreme lines. We see that there are 6 observations on the first line, and 15 observations on the first two lines. Looking from the bottom, there are 2 observations on the bottom line and below (counting the HI value), 3 observations on the line labeled f and below, and so on. We call this

          counting in

We count in from both ends until we reach half of the data. We stop counting in at 22 at the top, since one additional line of 8 observations would put us over 50% of the data; likewise, we stop counting in at 21 from the bottom, since one additional line would include more than half the data. The (8) on the middle line is not a counting-in value – it just tells us that there are 8 observations in this middle row.

aplpack::stem.leaf(pop.change$Pct.change, depth=TRUE)
## 1 | 2: represents 12
##  leaf unit: 1
##             n: 51
##    6    0* | 000001
##   15     t | 222333333
##   22     f | 4445555
##   (8)    s | 66677777
##   21    0. | 889
##   18    1* | 000111
##   12     t | 233
##          f | 
##    9     s | 666
##    6    1. | 89
##    4    2* | 0
##          t | 
##    3     f | 4
##          s | 
##    2    2. | 8
## HI: 32.3

Let’s find the median and fourths from the stemplot. The median has depth(\(M\)) = 26, and we see that this corresponds to \(M\) = 7. Recall that depth(\(F\)) = 13 1/2. Counting from the lowest observation, the observations with depths of 13 and 14 are both 3, so the lower fourth is \(F_L\) = (3 + 3)/2 = 3. Counting from the largest observation, we see that the data values with depths 13 and 14 are both 11, so the upper fourth is \(F_U\) = (11 + 11)/2 = 11. (The stemplot truncates each observation to the leaf unit, so these stemplot values are slightly smaller than the fourths computed from the raw data below.)
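We can mirror this counting-in on the sorted raw data; here is a short sketch whose results agree with the fivenum output in the next section:

x <- sort(pop.change$Pct.change)
n <- length(x)                     # 51
M <- x[26]                         # depth 26 from either end
FL <- mean(x[c(13, 14)])           # average of depths 13 and 14 from the bottom
FU <- mean(x[n - c(13, 14) + 1])   # the same depths counted from the top
c(FL = FL, M = M, FU = FU)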

4.5 Five-number Summary

We can summarize our batch of data using five numbers: the smallest observation (\(LO\)), the lower fourth \(F_L\), the median \(M\), the upper fourth \(F_U\), and the largest observation (\(HI\)). Collectively, these numbers are called the five-number summary. Here the five-number summary is

fivenum(pop.change$Pct.change)
## [1]  0.30  3.65  7.00 11.60 32.30

What have we learned? A typical growth percentage for a state is 7 percent; approximately half of the states have growth percentages smaller than 7% and half have larger growth percentages. Moreover, since 3, 7, and 11 divide the data into quarters, one quarter of the states have growth percentages smaller than 3%, one quarter have growth percentages between 3% and 7%, one quarter have growth percentages between 7% and 11%, and one quarter have growth percentages between 11% and 32%. The extreme value is interesting: looking back at the data table, we see that Nevada gained 32% in population.

4.6 Other Letter Values

Sometimes we will find it useful to compute other letter values that divide the tail regions of the data into smaller regions. Suppose we divide the lower quarter and the upper quarter of the data into halves – the dividing points are called eighths. The depth of an eighth is given by the formula

          Depth of eighth = ([Depth of fourth] + 1) / 2

In our example, we found depth(\(F\)) = 13 1/2, so

          Depth of eighth = ([13 1/2] + 1) / 2 = (13 + 1) / 2 = 7.

The lower eighth and upper eighth have depths equal to 7. We return to our stemplot and find the 7th smallest and 7th largest values, which are 2 and 16. Approximately one eighth of the percentage increases in growth are smaller than 2%, and one eighth of the increases are larger than 16%.
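On the raw data, the eighths can be read off the sorted values directly; the lval output below gives 2.10 and 16.80, which the stemplot truncates to 2 and 16:

x <- sort(pop.change$Pct.change)
c(E_L = x[7], E_U = x[length(x) - 6])   # 7th smallest and 7th largest values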

For larger datasets, we will continue to divide the tail region to get other letter values as shown in the following table. Note that the depth of a letter value is found by using the depth of the previous letter value.

Letter Value  Name                          Depth
---------------------------------------------------------------
\(M\)         Median                        ([\(n\)] + 1) / 2
\(F\)         Fourth                        ([depth(\(M\))] + 1) / 2
\(E\)         Eighth                        ([depth(\(F\))] + 1) / 2
\(D\)         Sixteenth                     ([depth(\(E\))] + 1) / 2
\(C\)         Thirty-second                 ([depth(\(D\))] + 1) / 2
\(B\)         Sixty-fourth                  ([depth(\(C\))] + 1) / 2
\(A\)         One hundred twenty-eighth     ([depth(\(B\))] + 1) / 2

We will find these letter values useful in assessing the symmetry of a batch of data.
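The recursion in the table is easy to code. The following sketch (a hypothetical helper, not part of LearnEDAfunctions) lists the depths for n = 51; note that lval below stops once the depths reach the extremes, reporting the minimum and maximum (depth 1) as its final row:

letter_depths <- function(n, labels = c("M", "F", "E", "D", "C", "B", "A")) {
  d <- (n + 1) / 2                 # depth of the median
  out <- numeric(length(labels))
  out[1] <- d
  for (i in 2:length(labels)) {
    d <- (floor(d) + 1) / 2        # rule: ([previous depth] + 1) / 2
    out[i] <- d
  }
  setNames(out, labels)
}
letter_depths(51)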

The lval function computes the set of letter values along with the mids and spreads. (In the output, the fourths are labeled H, for hinges.)

lval(pop.change$Pct.change)
##   depth   lo    hi   mids spreads
## M  26.0 7.00  7.00  7.000    0.00
## H  13.5 3.65 11.60  7.625    7.95
## E   7.0 2.10 16.80  9.450   14.70
## D   4.0 0.70 20.10 10.400   19.40
## C   2.5 0.50 26.65 13.575   26.15
## B   1.0 0.30 32.30 16.300   32.00

4.7 Measures of Center

Now that we have defined letter values, what is a good measurement of the center of a batch? A common measure is the mean, denoted by \(\bar x\), obtained by summing up the values and dividing by the number of observations. For exploratory work, we prefer the use of the median \(M\).

Why is the median preferable to the mean?

  • The median has a simpler interpretation than the mean — \(M\) divides the data into a lower half and an upper half.

  • Unlike the mean, the median \(M\) is resistant to extreme values. You are probably aware that a single large observation can have a significant impact on the value of \(\bar x\). (Think of computing the mean salary for a company with 100 hourly workers and a president with a relatively large salary. The president’s salary will have a large impact on the mean salary.)

One criticism of the median is that it depends only on one or two middle values in the batch. An alternative resistant measure of center is the trimean, which is a weighted average of the median and the two fourths:

\[ trimean = \frac{F_L + 2 M + F_U}{4}. \]

The trimean is resistant (like the median \(M\)), since it cannot be distorted by a few large or small extreme values. But, by combining the fourths and the median, the trimean can reflect a lack of symmetry in the middle half of the data.
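As a quick illustration, the trimean for the growth percentages can be built from the fivenum summary:

fn <- fivenum(pop.change$Pct.change)   # LO, F_L, M, F_U, HI
(fn[2] + 2 * fn[3] + fn[4]) / 4        # (3.65 + 2(7) + 11.6)/4 = 7.3125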

4.8 Measures of Spread

The usual measure of spread is the standard deviation \(s\) that is based on computing deviations from the mean. It suffers from the same lack-of-resistance problem as the mean – a single large value can distort the value of \(s\). So the standard deviation is not suitable for exploratory work.

For similar reasons, the range \(R = HI - LO\) is a poor measure of spread since it is based on only the two extreme values, and these two values may not reflect the general dispersion in the batch.

A better resistant measure of spread is the fourth-spread, denoted \(d_F\), defined as the distance between the lower and upper fourths:

\[ d_F = F_U - F_L. \]

The fourth-spread has a simple interpretation – it’s the width of the middle 50% of the data.

4.9 Identifying Outliers

John Tukey devised a rule-of-thumb for identifying extreme observations in a batch. This rule-of-thumb is not designed to formally label particular data items as outliers. Rather, this method sets apart a few unusual observations that may deserve further study.

The idea here is to set lower and upper fences in the data. If any of the observations fall beyond the fences, they are designated as possible outliers.

We first define a step which is equal to 1 1/2 times the fourth-spread: \[ STEP = 1.5 \times (F_U - F_L). \]

Then the lower fence is defined as one step smaller than the lower fourth, and the upper fence is defined as one step larger than the upper fourth:

\[ fence_{lower} = F_L - STEP, \, \, fence_{upper} = F_U + STEP. \] Any observations that fall beyond the fences are called “outside”.

Tukey thought it was useful to have two sets of fences. The fences defined above can be called inner fences. To obtain outer fences, we go out two steps from the fourths:

\[ FENCE_{lower} = F_L - 2 \times STEP, \, \, FENCE_{upper} = F_U + 2 \times STEP. \] (We will call these outer fences FENCES.) Observations that fall beyond the outer fences can be regarded as “really out”.
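Putting the pieces together, here is a sketch of the fence computation for the growth percentages:

x <- pop.change$Pct.change
fn <- fivenum(x)
d_F <- fn[4] - fn[2]                  # fourth-spread: 11.60 - 3.65 = 7.95
step <- 1.5 * d_F
inner <- c(fn[2] - step, fn[4] + step)
outer <- c(fn[2] - 2 * step, fn[4] + 2 * step)
x[x < inner[1] | x > inner[2]]        # the "outside" observations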

4.10 A New Example

For a second example, our almanac (The World Almanac 2001, page 237) gives the average gestation (in days) for 43 species of animals. Here’s part of the data and associated stemplot:

head(gestation.periods)
##         Animal Period
## 1          Ass    365
## 2       Baboon    187
## 3   Bear_black    219
## 4 Bear_grizzly    225
## 5   Bear_polar    240
## 6       Beaver    105
aplpack::stem.leaf(gestation.periods$Period)
## 1 | 2: represents 120
##  leaf unit: 10
##             n: 43
##    7    0* | 1123334
##   14    0. | 5666699
##   18    1* | 0001
##   (4)   1. | 5568
##   21    2* | 0123344
##   14    2. | 5588
##   10    3* | 3
##    9    3. | 566
##    6    4* | 0
##    5    4. | 558
## HI: 645 660

Here the dataset looks somewhat right-skewed. There are a large number of animals (the smaller species) with short gestation periods under 100 days. Also, we see a cluster of periods in the 200-240 day range. We note the two large values – each exceeding 600 days. We’re not surprised that these correspond to the two elephants in the table.

Let’s compute some letter values. (We will read values off the stemplot, whose leaf unit is 10, so the numbers below are in units of 10 days; for example, \(M\) = 18 corresponds to 180 days.)

  1. There are \(n\) = 43 values, so the depth of the median is \(d(M)\) = (43+1)/2 = 22. Looking at the stemplot, we see that the 22nd value is 18, so \(M\) = 18.

  2. To find fourths, we compute the depth: \(d(F)\) = (22+1)/2 = 11 1/2. The lower and upper fourths are found by averaging the 11th and 12th values at each end. Looking at the stemplot, we find

\[ F_L = (6 + 6)/2 = 6, \, \, F_U = (28+28)/2 = 28 . \]

  3. We can keep going to find additional letter values. The depth of the eighth is \(d(E) = (11+1)/2 = 6\). Looking at the stemplot, these values are

\[ E_L = 3, E_U = 40 \]

  4. We set our fences to look for outliers. The fourth-spread is

\[ d_F = 28 - 6 = 22 \] and so a step is \[ STEP = 1.5 (22) = 33. \]

The inner fences are located at \[ F_L - STEP = 6 - 33 = -27, \, \, F_U + STEP = 28 + 33 = 61 \] and the outer fences at \[ F_L - 2 \times STEP = 6 - 2(33) = -60, \, \, F_U + 2 \times STEP = 28 + 2(33) = 94. \]

Do we have any outliers? Yes, the two elephant gestation periods are beyond the inner fence but within the outer fence at the high end. I think we would all agree that elephants are unusually large animals which likely goes together with their long gestation periods.
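As a check, we can redo the fence calculation on the raw gestation periods (in days rather than stemplot units); a minimal sketch:

y <- gestation.periods$Period
fn <- fivenum(y)
step <- 1.5 * (fn[4] - fn[2])
y[y > fn[4] + step]   # should flag the two elephant periods (645 and 660 days)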

4.11 Relationship with Normal Data

In introductory statistics, we spend a lot of time talking about the normal distribution. If we have a batch of normally distributed data, what do the fourths look like? Also, should we expect to find any outliers?

Consider the normal curve with mean \(\mu\) and standard deviation \(\sigma\) that represents a population of normal measurements. It is easy to check that 50% of the probability content of a normal curve falls between \(\mu - 0.6745 \sigma\) and \(\mu + 0.6745 \sigma\). So for normal measurements, \(F_L = \mu - 0.6745 \sigma\) and \(F_U = \mu + 0.6745 \sigma\), and the fourth-spread is \(d_F = 2 (0.6745) \sigma = 1.349 \sigma\).

As an aside, this relationship gives us an alternative way of estimating the standard deviation. Solving \(d_F = 1.349 \sigma\) for \(\sigma\) gives the relationship

\[ \sigma = d_F / 1.349. \]

So a simple way of estimating a standard deviation is to divide the fourth-spread by 1.349. This estimate is called the F pseudosigma. Why is this better than the usual estimate \(s\)? Because, unlike \(s\), the F pseudosigma is resistant to extreme observations.
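For the growth percentages, the F pseudosigma is easy to compute and compare against the ordinary sd; a sketch:

fn <- fivenum(pop.change$Pct.change)
pseudosigma <- (fn[4] - fn[2]) / 1.349   # 7.95 / 1.349, about 5.9
c(pseudosigma = pseudosigma, s = sd(pop.change$Pct.change))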

Continuing our discussion, how many outliers should we find for normal data? For normal data,

\[ STEP = 1.5 (1.349 \sigma ) = 2.0235 \sigma \] and the inner fences will be \[ F_L - STEP = \mu - 0.6745 \sigma - 2.0235 \sigma = \mu - 2.6980 \sigma \] \[ F_U + STEP = \mu + 0.6745 \sigma + 2.0235\sigma = \mu + 2.6980 \sigma. \]

The probability of being outside \(( \mu - 2.6980\sigma , \mu + 2.6980 \sigma )\) for a normal curve is 0.007. This means that only 0.7% of normally distributed data will be classified as outliers. So it is pretty rare to see outliers for normal data.
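This figure can be verified with a one-line call to the normal cdf:

2 * pnorm(-2.6980)   # probability beyond the inner fences, about 0.00698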

COMMENT: There is a slight flaw in the above argument. The normal curve represents the distribution for a large sample of normal data and 0.7% of this large sample will be outlying. If we take a small sample, then we will generally see a higher fraction of outliers. In fact, it has been established that the fraction of outliers for a normal sample of size \(n\) is approximately

          .00698 + .4 / n

For example, if we take a sample of size \(n\) = 20, then the proportion of outliers will be

          .00698 + .4/20 = .027

If we take repeated samples of size 20, then approximately 2.7% of all these observations will be outlying.

I checked this result in a simulation. I took repeated samples of size 20 from a normal distribution. In 1000 samples, I found a total of 327 outliers. The fraction of outliers was 327/20000 = 0.016, which is a bit smaller than the result above. But this fraction is larger than the fraction 0.00698 from a “large” normal sample.
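For readers who want to try this experiment themselves, here is a sketch of how such a simulation might be coded (the seed is hypothetical, so exact counts will differ):

set.seed(1)
n_out <- replicate(1000, {
  x <- rnorm(20)
  q <- fivenum(x)
  step <- 1.5 * (q[4] - q[2])
  sum(x < q[2] - step | x > q[4] + step)   # outliers in this sample
})
sum(n_out) / (1000 * 20)                   # overall fraction of outliers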