17 Median Polish
17.1 Meet the data
In the table below, we have displayed the normal daily high temperatures (in degrees Fahrenheit) for four months for five selected cities. This is called a two-way table. There is some measurement, here Temperature, that is classified with respect to two categorical variables, City and Month. We would like to understand how the temperature depends on the month and the city. For example, we know that it is generally colder in the winter months and warmer during the summer months. Also some cities, like Atlanta, that are located in the south tend to be warmer than northern cities like Minneapolis. We want to use a simple method to describe this relationship.
library(LearnEDAfunctions)
temperatures
## City January April July October
## 1 Atlanta 50 73 88 73
## 2 Detroit 30 58 83 62
## 3 Kansas_City 35 65 89 68
## 4 Minneapolis 21 57 84 59
## 5 Philadelphia 38 63 86 66
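If the LearnEDAfunctions package is not available, the same table can be typed in by hand. The sketch below assumes only that the data frame should have the column names and values printed above; it is not how the package itself stores the data.

```r
# Rebuild the two-way table from the printed output above.
temperatures <- data.frame(
  City    = c("Atlanta", "Detroit", "Kansas_City", "Minneapolis", "Philadelphia"),
  January = c(50, 30, 35, 21, 38),
  April   = c(73, 58, 65, 57, 63),
  July    = c(88, 83, 89, 84, 86),
  October = c(73, 62, 68, 59, 66)
)
temperatures
```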
17.2 An additive model
A simple description of these data is an additive model of the form
\[ FIT = {\rm Overall \, temperature} + {\rm City \, effect} + {\rm Month \, effect}. \]
This additive model says that different cities tend to have different temperatures, and one city, say Atlanta, will be so many degrees warmer than another city, say Detroit. Under this model, the difference in temperatures between these two cities is the same for all months. Likewise, one month, say July, will tend to be a particular number of degrees warmer than another month, say January, and this difference in monthly temperatures is the same for all cities.
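To see what "the same difference in every month" means in practice, here is a small sketch that builds a table of fitted values from an overall value, city effects, and month effects. The numbers are the ones the median polish reports later in this chapter; the variable names are only illustrative.

```r
# Effects taken from the median polish output shown later in the chapter.
common <- 64.5
city_effect <- c(Atlanta = 8.5, Detroit = -4.25, Kansas_City = 2,
                 Minneapolis = -6.5, Philadelphia = 0)
month_effect <- c(January = -30.25, April = -1.5, July = 22.5, October = 1.5)

# FIT = common + city effect + month effect, for every city/month pair.
fit <- common + outer(city_effect, month_effect, "+")

# Under an additive fit, Atlanta minus Detroit is identical in every month,
# and July minus January is identical in every city.
atl_minus_det <- fit["Atlanta", ] - fit["Detroit", ]
jul_minus_jan <- fit[, "July"] - fit[, "January"]
print(atl_minus_det)
print(jul_minus_jan)
```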
17.3 Median polish
We describe a resistant method, called median polish, for fitting an additive model. Because the method is based on medians, it is not unduly affected by extreme values in the table. We will later compare this method with the better-known least-squares method of fitting an additive model.
To begin median polish, we take the median of each row of the table. We place the row medians in a column REFF that stands for Row Effect.
temps <- temperatures[, -1]
dimnames(temps)[[1]] <- temperatures[, 1]
REFF <- apply(temps, 1, median)
cbind(temps, REFF)
## January April July October REFF
## Atlanta 50 73 88 73 73.0
## Detroit 30 58 83 62 60.0
## Kansas_City 35 65 89 68 66.5
## Minneapolis 21 57 84 59 58.0
## Philadelphia 38 63 86 66 64.5
Next, we subtract out the row medians. For each temperature, we subtract the corresponding row median. For example, the temperature in Atlanta in January is 50 degrees – we subtract the Atlanta median 73, getting a difference of -23. If we do this operation for all temperatures in all rows, we get the following table:
Residual <- sweep(temps, 1, REFF)
RowSweep <- cbind(Residual, REFF)
RowSweep
## January April July October REFF
## Atlanta -23.0 0.0 15.0 0.0 73.0
## Detroit -30.0 -2.0 23.0 2.0 60.0
## Kansas_City -31.5 -1.5 22.5 1.5 66.5
## Minneapolis -37.0 -1.0 26.0 1.0 58.0
## Philadelphia -26.5 -1.5 21.5 1.5 64.5
We call the two steps {find row medians, subtract out the medians} a row sweep of the table.
Next, we take the median of each column of the table (including the row effect column). We put these column medians in a row called CEFF (for column effects).
CEFF <- apply(RowSweep, 2, median)
rbind(RowSweep, CEFF)
## January April July October REFF
## Atlanta -23.0 0.0 15.0 0.0 73.0
## Detroit -30.0 -2.0 23.0 2.0 60.0
## Kansas_City -31.5 -1.5 22.5 1.5 66.5
## Minneapolis -37.0 -1.0 26.0 1.0 58.0
## Philadelphia -26.5 -1.5 21.5 1.5 64.5
## 6 -30.0 -1.5 22.5 1.5 64.5
We subtract out the column medians, similar to above. For each entry in the table, we subtract the corresponding column median. The steps of taking column medians and subtracting them out is called a column sweep of the table.
Residual <- sweep(RowSweep, 2, CEFF)
ColSweep <- rbind(Residual, CEFF)
dimnames(ColSweep)[[1]][6] <- "CEFF"
ColSweep
## January April July October REFF
## Atlanta 7.0 1.5 -7.5 -1.5 8.5
## Detroit 0.0 -0.5 0.5 0.5 -4.5
## Kansas_City -1.5 0.0 0.0 0.0 2.0
## Minneapolis -7.0 0.5 3.5 -0.5 -6.5
## Philadelphia 3.5 0.0 -1.0 0.0 0.0
## CEFF -30.0 -1.5 22.5 1.5 64.5
At this point, we have performed one row sweep and one column sweep in the table. We continue by taking medians of each row. We place these row medians in a column called Rmed.
Resid <- ColSweep[, -5]
REFF <- ColSweep[, 5]
Rmed <- apply(Resid, 1, median)
cbind(Resid, Rmed, REFF)
## January April July October Rmed REFF
## Atlanta 7.0 1.5 -7.5 -1.5 0.00 8.5
## Detroit 0.0 -0.5 0.5 0.5 0.25 -4.5
## Kansas_City -1.5 0.0 0.0 0.0 0.00 2.0
## Minneapolis -7.0 0.5 3.5 -0.5 0.00 -6.5
## Philadelphia 3.5 0.0 -1.0 0.0 0.00 0.0
## CEFF -30.0 -1.5 22.5 1.5 0.00 64.5
To adjust the values in the middle of the table and the row effects, we
- Add the row medians (Rmed) to the row effects (REFF).
- Subtract the row medians (Rmed) from the values in the middle.
After we do this, we get the following table:
REFF <- REFF + Rmed
Resid <- sweep(Resid, 1, Rmed)
RowSweep <- cbind(Resid, REFF)
RowSweep
## January April July October REFF
## Atlanta 7.00 1.50 -7.50 -1.50 8.50
## Detroit -0.25 -0.75 0.25 0.25 -4.25
## Kansas_City -1.50 0.00 0.00 0.00 2.00
## Minneapolis -7.00 0.50 3.50 -0.50 -6.50
## Philadelphia 3.50 0.00 -1.00 0.00 0.00
## CEFF -30.00 -1.50 22.50 1.50 64.50
Finally, we take the medians of each column and put the values in a new row called ceff.
Resid <- RowSweep[-6, ]
CEFF <- RowSweep[6, ]
ceff <- apply(Resid, 2, median)
rbind(Resid, ceff, CEFF)
## January April July October REFF
## Atlanta 7.00 1.50 -7.50 -1.50 8.50
## Detroit -0.25 -0.75 0.25 0.25 -4.25
## Kansas_City -1.50 0.00 0.00 0.00 2.00
## Minneapolis -7.00 0.50 3.50 -0.50 -6.50
## Philadelphia 3.50 0.00 -1.00 0.00 0.00
## 6 -0.25 0.00 0.00 0.00 0.00
## CEFF -30.00 -1.50 22.50 1.50 64.50
To adjust the remaining values, we
- Add the ceff values to the column effects in CEFF.
- Subtract the ceff values from the values in the middle.
CEFF <- CEFF + ceff
Resid <- sweep(Resid, 2, ceff)
ColSweep <- rbind(Resid, CEFF)
ColSweep
## January April July October REFF
## Atlanta 7.25 1.50 -7.50 -1.50 8.50
## Detroit 0.00 -0.75 0.25 0.25 -4.25
## Kansas_City -1.25 0.00 0.00 0.00 2.00
## Minneapolis -6.75 0.50 3.50 -0.50 -6.50
## Philadelphia 3.75 0.00 -1.00 0.00 0.00
## CEFF -30.25 -1.50 22.50 1.50 64.50
We could continue this procedure (take out row medians, then take out column medians) many more times. When working by hand, it is usually sufficient to do four sweeps: a row sweep, a column sweep, a row sweep, and a column sweep.
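The four sweeps above can be collected into a short loop. This is only a sketch: the function name polish_by_hand and its n_iter argument are hypothetical, and it tracks the common value in a separate variable rather than in the corner cell of the table, which should give the same bookkeeping as the steps shown above.

```r
# The temperature table from this chapter, typed in as a matrix.
temps <- matrix(
  c(50, 73, 88, 73,
    30, 58, 83, 62,
    35, 65, 89, 68,
    21, 57, 84, 59,
    38, 63, 86, 66),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Atlanta", "Detroit", "Kansas_City", "Minneapolis", "Philadelphia"),
    c("January", "April", "July", "October")))

# Hypothetical helper: n_iter rounds of (row sweep, column sweep).
polish_by_hand <- function(x, n_iter = 2) {
  resid <- as.matrix(x)
  overall <- 0
  row_eff <- setNames(numeric(nrow(resid)), rownames(resid))
  col_eff <- setNames(numeric(ncol(resid)), colnames(resid))
  for (i in seq_len(n_iter)) {
    # Row sweep: move row medians out of the residuals into the row effects.
    rmed <- apply(resid, 1, median)
    resid <- sweep(resid, 1, rmed)
    row_eff <- row_eff + rmed
    # Move the median of the column effects into the common value.
    delta <- median(col_eff)
    col_eff <- col_eff - delta
    overall <- overall + delta
    # Column sweep: move column medians into the column effects.
    cmed <- apply(resid, 2, median)
    resid <- sweep(resid, 2, cmed)
    col_eff <- col_eff + cmed
    # Move the median of the row effects into the common value.
    delta <- median(row_eff)
    row_eff <- row_eff - delta
    overall <- overall + delta
  }
  list(overall = overall, row = row_eff, col = col_eff, residuals = resid)
}

hand_fit <- polish_by_hand(temps, n_iter = 2)
hand_fit$overall
hand_fit$row
hand_fit$col
```

With n_iter = 2 this performs the same row, column, row, column sequence as the hand calculation.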
17.4 Interpreting the additive model
What we have done is fit an additive model to this table. Let’s interpret what this fitted model is telling us.
Atlanta’s January temperature is 50. We can represent this temperature as
Atlanta's temp in January =
(Common) + (Atlanta Row effect) + (January Col effect) + (Residual).
We can pick up these terms on the right hand side of the equation from the output of the median polish. We see the common value is 64.5, the Atlanta effect is 8.5, the January effect is -30.25 and the residual is 7.25.
So Atlanta’s January temp is \[ 50 = 64.5 + 8.5 - 30.25 + 7.25 . \] Likewise, we can express all of the temperatures of the table as the sum of four different components \[ COMMON + ROW \, EFFECT + COL \, EFFECT + RESIDUAL. \]
The function medpolish repeats this process for a maximum of 10 iterations by default. We illustrate using this function on the temperature data and display the components.
additive.fit <- medpolish(temps)
## 1: 36.5
## Final: 36.25
additive.fit
##
## Median Polish Results (Dataset: "temps")
##
## Overall: 64.5
##
## Row Effects:
## Atlanta Detroit Kansas_City Minneapolis
## 8.50 -4.25 2.00 -6.50
## Philadelphia
## 0.00
##
## Column Effects:
## January April July October
## -30.25 -1.50 22.50 1.50
##
## Residuals:
## January April July October
## Atlanta 7.25 1.50 -7.50 -1.50
## Detroit 0.00 -0.75 0.25 0.25
## Kansas_City -1.25 0.00 0.00 0.00
## Minneapolis -6.75 0.50 3.50 -0.50
## Philadelphia 3.75 0.00 -1.00 0.00
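As a check on the decomposition, the four components returned by medpolish can be added back together to recover the original table. The sketch below types the data in as a matrix so that it runs on its own; the names fitted_part, recon, and atl_jan are illustrative only.

```r
temps <- matrix(
  c(50, 73, 88, 73,
    30, 58, 83, 62,
    35, 65, 89, 68,
    21, 57, 84, 59,
    38, 63, 86, 66),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Atlanta", "Detroit", "Kansas_City", "Minneapolis", "Philadelphia"),
    c("January", "April", "July", "October")))

# trace.iter = FALSE suppresses the per-iteration progress lines.
fit <- medpolish(temps, trace.iter = FALSE)

# COMMON + ROW EFFECT + COL EFFECT for every cell, then add the residuals.
fitted_part <- fit$overall + outer(fit$row, fit$col, "+")
recon <- fitted_part + fit$residuals
max_err <- max(abs(recon - temps))

# The four pieces for Atlanta in January; their sum is the observed value.
atl_jan <- c(common   = fit$overall,
             row      = fit$row[["Atlanta"]],
             col      = fit$col[["January"]],
             residual = fit$residuals["Atlanta", "January"])
atl_jan
sum(atl_jan)
```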
17.4.1 Interpreting the row and column effects
Let’s try to make sense of the additive model produced by the medpolish function. Each temperature is decomposed as
\[
COMMON + ROW \, EFFECT + COL \, EFFECT + RESIDUAL,
\]
and the common, row effects, and column effects are shown below.
| | January | April | July | October | REFF |
|---|---|---|---|---|---|
| Atlanta | | | | | 8.5 |
| Detroit | | | | | -4.25 |
| Kansas City | | | | | 2 |
| Minneapolis | | | | | -6.5 |
| Philadelphia | | | | | 0 |
| CEFF | -30.25 | -1.5 | 22.5 | 1.5 | 64.5 |
If we wish to focus on the row effects (cities), we can combine the common and row effects to get the fit
\[ FIT = [COMMON + ROW \, EFFECT] + COL \, EFFECT \] \[ = ROW \, FIT + COL \, EFFECT \]
| | January | April | July | October | ROW FIT |
|---|---|---|---|---|---|
| Atlanta | | | | | 73 |
| Detroit | | | | | 60.25 |
| Kansas City | | | | | 66.5 |
| Minneapolis | | | | | 58 |
| Philadelphia | | | | | 64.5 |
| CEFF | -30.25 | -1.5 | 22.5 | 1.5 | |
Looking at this table, specifically the row fits, we see that
- the average Atlanta temperature is 73 degrees
- generally, Atlanta is 73 - 60.25 = 12.75 degrees warmer than Detroit
- Philadelphia tends to be 6.5 degrees warmer than Minneapolis
If we were interested in the month effects (columns), we can combine the common and column effects to get the representation \[ FIT = [COMMON + COL \, EFFECT] + ROW \, EFFECT \] \[ = COL \, FIT + ROW \, EFFECT \] which is displayed below.
| | January | April | July | October | REFF |
|---|---|---|---|---|---|
| Atlanta | | | | | 8.5 |
| Detroit | | | | | -4.25 |
| Kansas City | | | | | 2 |
| Minneapolis | | | | | -6.5 |
| Philadelphia | | | | | 0 |
| COL FIT | 34.25 | 63 | 87 | 66 | |
We see
- the temperature in April is on average 63 degrees.
- it tends to be 63 - 34.25 = 28.75 degrees warmer in April than January
- October and April have similar temps on average; October is 3 degrees warmer
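The row fits and column fits discussed in this section can be computed directly from the components that medpolish returns. This is a minimal sketch with the data typed in as a matrix so that it runs on its own; row_fit and col_fit are illustrative names.

```r
temps <- matrix(
  c(50, 73, 88, 73,
    30, 58, 83, 62,
    35, 65, 89, 68,
    21, 57, 84, 59,
    38, 63, 86, 66),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Atlanta", "Detroit", "Kansas_City", "Minneapolis", "Philadelphia"),
    c("January", "April", "July", "October")))

fit <- medpolish(temps, trace.iter = FALSE)

# ROW FIT = COMMON + ROW EFFECT (one value per city).
row_fit <- fit$overall + fit$row
# COL FIT = COMMON + COL EFFECT (one value per month).
col_fit <- fit$overall + fit$col

row_fit
col_fit

# Comparisons between cities or months are differences of these fits.
row_fit[["Atlanta"]] - row_fit[["Detroit"]]
col_fit[["April"]] - col_fit[["January"]]
```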
17.5 Looking at the residuals
After we gain some general understanding about the fit (which cities and months tend to be warm or cold), we can look at the residuals, shown below.
additive.fit$residuals
## January April July October
## Atlanta 7.25 1.50 -7.50 -1.50
## Detroit 0.00 -0.75 0.25 0.25
## Kansas_City -1.25 0.00 0.00 0.00
## Minneapolis -6.75 0.50 3.50 -0.50
## Philadelphia 3.75 0.00 -1.00 0.00
- Are the residuals “large”? More precisely, are the residuals large relative to the row and column effects? Above we saw that the row effects (corresponding to cities) ranged from -6.5 to 8.5 and the column effects (months) ranged from -30.25 to 22.5. We see a few large residuals:
Atlanta in January: 7.25
Minneapolis in January: -6.75
Atlanta in July: -7.5
These are large relative to the city effects but small relative to the month effects.
- Do we see any pattern in the residuals? We’ll talk later about one specific pattern we might see in the residuals. Here, do we see any pattern in the large residuals? If we define large as at least 3 in absolute value, then we see five large residuals, all in the months of January and July. These two months are the most extreme ones (in terms of temperature), and the city temps in these months might show more variability, which would contribute to these large residuals.
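One way to list the large residuals, rather than scanning the table by eye, is sketched below. The cutoff of 3 is the one used in the text; the data are typed in as a matrix so the block runs on its own, and the names big and large are illustrative.

```r
temps <- matrix(
  c(50, 73, 88, 73,
    30, 58, 83, 62,
    35, 65, 89, 68,
    21, 57, 84, 59,
    38, 63, 86, 66),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Atlanta", "Detroit", "Kansas_City", "Minneapolis", "Philadelphia"),
    c("January", "April", "July", "October")))

fit <- medpolish(temps, trace.iter = FALSE)
res <- fit$residuals

# Row/column positions of residuals that are at least 3 in absolute value.
big <- which(abs(res) >= 3, arr.ind = TRUE)

# Collect city, month, and residual for each large cell.
large <- data.frame(
  city     = rownames(res)[big[, "row"]],
  month    = colnames(res)[big[, "col"]],
  residual = res[big]
)
large
```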