A Course in Exploratory Data Analysis
1.1 An EDA Course
This book contains the lecture notes for a course on Exploratory Data Analysis that I taught for many years at Bowling Green State University. I started teaching this course using John Tukey’s EDA book, but there were several issues. First, it seemed students had difficulties with Tukey’s particular writing style. Second, the book does not use any technology. So the plan was to write up lecture notes covering many of the ideas from Tukey’s text and to supplement the course with software. Originally, I illustrated the methods using EDA commands from Minitab, but later I focused on using functions from the statistical system R.
I have written a short package, LearnEDA, that contains all of the course datasets and functions for performing some of the EDA methods. This package is available on my GitHub site.
This site provides an overview of the datasets and EDA functions.
This book provides overviews of many of the topics from Tukey’s text. Here is a list of the major topics and corresponding chapters in this book:
|Topic|Chapters|
|---|---|
|Working with a Single Batch|3, 4|
|Comparing Batches|5, 6, 7|
|Transformations|8, 9, 10, 11|
|Plotting|12, 13, 14, 15, 16|
|Two-Way Tables|17, 18, 19, 20|
|Binned Data|21, 22|
1.3 Preface Excerpt from EDA by John W. Tukey (Addison-Wesley, 1977)
This book is based on an important principle:
It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it.
Learning first what you can do will help you to work more easily and effectively.
This book is about exploratory data analysis, about looking at data to see what it seems to say. It concentrates on simple arithmetic and easy-to-draw pictures. It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights. Its concern is with appearance, not with confirmation.
1.3.1 Examples, NOT case histories
The book does not exist to make the case that exploratory data analysis is useful. Rather it exists to expose its readers and users to a considerable variety of techniques for looking more effectively at one’s data. The examples are not intended to be complete case histories. Rather they show isolated techniques in action on real data. The emphasis is on general techniques, rather than specific problems.
A basic problem about any body of data is to make it more easily and effectively handleable by minds – our minds, her mind, his mind. To this general end:
- anything that makes a simpler description possible makes the description more easily handleable.
- anything that looks below the previously described surface makes the description more effective.
So we shall always be glad (a) to simplify description and (b) to describe one layer deeper.
- to be able to say that we looked one layer deeper, and found nothing, is a definite step forward – though not as far as to be able to say that we looked deeper and found thus-and-such.
- to be able to say that “if we change our point of view in the following way … things are simpler” is always a gain – though not quite as much as to be able to say “if we don’t bother to change our point of view (some other) things are equally simple”.
Thus, for example, we regard learning that log pressure is almost a straight line in the negative reciprocal of absolute temperature as a real gain, as compared to saying that pressure increases with temperature at an ever-growing rate. Equally, we regard being able to say that a batch of values is roughly symmetrically distributed on a log scale as much better than to say that the raw values have a very skew distribution.
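The second comparison, symmetry on a log scale versus skewness on the raw scale, can be sketched numerically. The batch below is hypothetical (made up so that each value doubles the previous one), and the sketch is written in Python purely for illustration, although the course software is R. The helper name `spreads` is an illustrative choice, not a function from any course package. It measures how far each quartile lies from the median.

```python
import math
import statistics

# Hypothetical batch, invented for illustration: each value doubles the last.
batch = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def spreads(values):
    """Return (median - lower quartile, upper quartile - median)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return med - q1, q3 - med


# Raw scale: the upper spread is much larger than the lower one (right skew).
raw_low, raw_high = spreads(batch)

# Log scale (base 2): the two spreads agree, so the batch looks symmetric.
log_low, log_high = spreads([math.log2(v) for v in batch])

print("raw spreads:", raw_low, raw_high)
print("log spreads:", log_low, log_high)
```

On the raw scale the upper spread is four times the lower one, while on the log scale the two spreads are equal, so "symmetric on a log scale" is the simpler description of this batch.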
In rating ease of description, after almost any reasonable change of point of view, as very important, we are essentially asserting a belief in quantitative knowledge – a belief that most of the key questions in our work sooner or later demand answers to “by how much?” rather than merely to “in which direction?”
Consistent with this view, we believe, is a clear demand that pictures based on exploration of data should force their messages upon us. Pictures that emphasize what we already know – “security blankets” to reassure us – are frequently not worth the space they take. Pictures that have to be gone over with a reading glass to see the main point are wasteful of time and inadequate of effect. The greatest value of a picture is when it forces us to notice what we never expected to see.
We shall not try to say why specific techniques are the ones to use. Besides pressures of space and time, there are specific reasons for this. Many of the techniques are less than ten years old in their present form – some will improve noticeably. And where a technique is very good, it is not at all certain that we yet know why it is.
We have tried to use consistent techniques wherever this seemed reasonable, and have not worried where it didn’t. Apparent consistency speeds learning and remembering, but ought not be allowed to outweigh noticeable differences in performance.
In summary, then, we:
- leave most interpretations of results to those who are experts in the subject-matter field involved.
- present techniques, not case histories.
- regard simple descriptions as good in themselves.
- feel free to ask for changes in point of view in order to gain such simplicity.
- demand impact from our pictures.
- regard every description (always incomplete!) as something to be lifted off and looked under (mainly by using residuals).
- regard consistency from one technique to another as desirable, not essential.
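The idea of lifting off a description and looking under it with residuals can be sketched in a few lines. This is a minimal illustration under stated assumptions: the batch is hypothetical, the description used is simply the batch median, and the code is in Python for illustration only (the course software is R).

```python
import statistics

# Hypothetical batch, invented for illustration.
batch = [12, 15, 11, 14, 40, 13, 16]

# A very simple description of the batch: its median.
fit = statistics.median(batch)

# Lift the description off: each value = fit + residual.
residuals = [value - fit for value in batch]

# Look under the description: which value is least well described?
largest = max(residuals, key=abs)
least_well_described = batch[residuals.index(largest)]

print("fit:", fit)
print("residuals:", residuals)
print("value with largest residual:", least_well_described)
```

Here the median summarizes six of the seven values to within a few units, and the residuals draw attention to the one value, 40, that the description does not capture.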