11 Transformations Summary
11.1 Why do we reexpress data?
We’ve focused here on two reasons to reexpress data:
- to make a batch symmetric
- to stabilize spread across batches
Why are these desirable objectives?
A symmetric batch is easier to view and summarize. Think of a normal curve (the best-known symmetric curve): once you know the mean and standard deviation, you know that about 2/3 of the data fall within one standard deviation of the mean.
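The "about 2/3" figure can be checked directly. The following is a minimal sketch using `NormalDist` from Python's standard library; the choice of a standard normal (mean 0, standard deviation 1) is only for convenience, since the fraction is the same for any normal curve.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of the curve lying within one standard deviation of the mean.
within_one_sd = std_normal.cdf(1.0) - std_normal.cdf(-1.0)

print(f"{within_one_sd:.4f}")  # 0.6827
```

The result, roughly 68%, is the value that "about 2/3" rounds.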
Actually, the second objective is probably more important. We often wish to compare batches, and this comparison is much easier when the spreads of the batches are approximately equal.
11.2 Serendipitous effects of transformation
After you work with data awhile, you’ll discover that a reexpression that improves the batches with respect to one of these two objectives is likely to improve the batches with respect to the other objective. For example, if you are reexpressing to get equal spreads, you’ll find that the reexpressed batches are more symmetric than the original batches.
Why? It has to do with the type of data we typically encounter.
When data are COUNTS or AMOUNTS, increasing spread with increasing level AND right skewness often occur together.
Think of a data set that contains some type of count. This data set is bounded below by 0. This will result in
- right skewness. Small counts are constrained by 0 and large counts are not constrained, which results in a pileup of data near 0.
- unequal spreads across batches. Due to the lower bound of 0, data sets that have small counts will have small spread. Data sets with larger counts don’t have this bounding effect and will show more spread.
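The link between level and spread for counts can be illustrated with a small simulation. This is only a sketch under stated assumptions: the counts are modeled as Poisson draws (a convenient stand-in, not something the text specifies), the three batch means are arbitrary, spread is measured by the distance between the quartiles, and the square root is used as one common reexpression for counts. The helper names `poisson_draw` and `fourth_spread` are made up for this example.

```python
import math
import random
import statistics


def poisson_draw(rng, mean):
    """Draw one Poisson count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def fourth_spread(batch):
    """Distance between the upper and lower quartiles of a batch."""
    lower, _median, upper = statistics.quantiles(batch, n=4)
    return upper - lower


rng = random.Random(2024)   # fixed seed so the sketch is repeatable
batch_means = [2, 10, 50]   # assumed levels, chosen only for illustration
batches = [[poisson_draw(rng, m) for _ in range(2000)] for m in batch_means]

# Spread of each batch on the raw scale and on the square-root scale.
raw_spreads = [fourth_spread(b) for b in batches]
root_spreads = [fourth_spread([math.sqrt(x) for x in b]) for b in batches]

for m, raw, root in zip(batch_means, raw_spreads, root_spreads):
    print(f"mean {m:>2}: raw spread {raw:5.2f}, root spread {root:5.2f}")

# Ratio of the largest spread to the smallest, on each scale.
raw_ratio = max(raw_spreads) / min(raw_spreads)
root_ratio = max(root_spreads) / min(root_spreads)
print(f"largest/smallest spread: raw {raw_ratio:.2f}, roots {root_ratio:.2f}")
```

On the raw scale the spread should grow with the level of the batch, while on the root scale the three spreads should be much closer to one another.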
11.3 When is it worthwhile to transform?
There is one reason not to transform: reexpression puts the data on a less familiar scale, which makes it harder to communicate the data to consumers.
But reexpression may be worthwhile when
- the range of the batch is large.
What is large? We judge the range by the ratio hi/lo of the largest value to the smallest, and we want this ratio to be large. I don’t want to set a precise definition of “large”, but a ratio of hi/lo = 20 is large (reexpression may be helpful), and a ratio of hi/lo = 2 is not (reexpression won’t help).
- there is a clear message from a transformation plot (such as the spread vs. level plot)
- there is some pattern in the residuals (we will discuss this later)
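The hi/lo rule of thumb in the first condition is easy to compute. The sketch below assumes hi and lo are the largest and smallest values in a batch of strictly positive data; the function name `hi_lo_ratio` and both example batches are hypothetical, invented only to show one large and one small ratio.

```python
def hi_lo_ratio(batch):
    """Ratio of the largest value to the smallest value in a batch."""
    lo, hi = min(batch), max(batch)
    if lo <= 0:
        # The ratio is only meaningful for strictly positive data.
        raise ValueError("hi/lo ratio needs strictly positive data")
    return hi / lo


# Hypothetical batches, made up for illustration.
wide_batch = [12, 30, 45, 90, 400]     # values spanning a wide range
narrow_batch = [55, 60, 68, 72, 80]    # values spanning a narrow range

print(f"{hi_lo_ratio(wide_batch):.1f}")    # 33.3, reexpression may help
print(f"{hi_lo_ratio(narrow_batch):.1f}")  # 1.5, reexpression unlikely to help
```

The ratio is a quick screen rather than a test: values between the two benchmarks of 2 and 20 still call for judgment and a look at the plots.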