Chapter 5 Continuous Distributions

5.1 Introduction: A Baseball Spinner Game

The baseball board game All-Star Baseball has been honored as one of the fifty most influential board games of all time according to the Wikipedia Encyclopedia (http://en.wikipedia.org). This game is based on a collection of spinner cards, where one card represents the possible batting accomplishments for a single player. The game is played by placing a card on a spinner and a spin determines the batting result for that player.

A spinner card is constructed by use of the statistics collected for a player during a particular season. To illustrate this process, the table below shows the batting statistics for the famous player Mickey Mantle for the 1956 baseball season. When Mantle comes to bat, that is called a plate appearance (PA) – we see from the table that he had 632 plate appearances that season. There are several different events possible when Mantle came to bat – he could get a single (1B), a double (2B), a triple (3B), or a home run (HR). Also he could walk (BB), strike out (SO), or make some other type of out.

PA	1B	2B	3B	HR	BB	SO	Other OUTS
632	109	22	5	52	99	112	233

The probability of each type of event can be found by dividing each count by the number of plate appearances. Each probability is converted to an angle on the spinner by multiplying each probability by the total number of degrees (360). From these degree measurements, a spinner is constructed, displayed in Figure 5.1, where the area of each wedge of the circle is proportional to the probability of that event occurring. A single plate appearance of Mickey Mantle can be simulated by spinning the spinner and observing the batting event.

	PA	1B	2B	3B	HR	BB	SO	Other OUTS
Count	632	109	22	5	52	99	112	233
Probability		0.172	0.035	0.008	0.082	0.157	0.177	0.369
Degrees in spinner		62	13	3	30	57	64	133
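The arithmetic behind the table can be reproduced in a few lines of R. This is only an illustrative sketch: the object names (counts, pa, prob, degrees) are our own, and the use of sample() to mimic a single spin is an assumption about how one might simulate the spinner, not a description of how the game or Figure 5.1 was produced.

```r
# Mantle's 1956 event counts, copied from the table
counts <- c("1B" = 109, "2B" = 22, "3B" = 5, "HR" = 52,
            "BB" = 99, "SO" = 112, "OUT" = 233)

pa <- sum(counts)        # total plate appearances
prob <- counts / pa      # probability of each event
degrees <- 360 * prob    # size of each wedge on the spinner

round(prob, 3)
round(degrees)

# Simulate one plate appearance by drawing an event with these probabilities
sample(names(counts), size = 1, prob = prob)
```

Working from the unrounded probabilities can move a wedge by a degree relative to the table (the BB wedge is one such case), since the table's degrees appear to have been computed from the rounded probabilities.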

Figure 5.1: Spinner constructed based on Mantle’s statistics.

The Binomial random variable described in Chapter 4 is an example of a discrete random variable, one that takes on only values in a list, such as $\{0, 1, ..., 10\}$. How can one think about probabilities where the random variable is not discrete? As a simple example, consider the experiment of spinning the spinner in Figure 5.2 where the random variable $X$ is the recorded location. Here $X$ is a continuous random variable that can take on any value between 0 and 100.

Figure 5.2: A spinner with continuous random outcomes.

In this chapter, probabilities for a continuous random variable will be shown to be represented by means of a smooth curve where the probability that $X$ falls in a given interval is equal to an area under the curve. Through a series of examples, we will illustrate probability calculations for this type of random variable.

5.2 The Uniform Distribution

Consider the spinner experiment described in Section 5.1 where the location of the spinner $X$ can be any number between 0 and 100. Our computer simulated spinning this spinner 20 times with the following results (rounded to the nearest tenth):

95.0	23.1	60.7	48.6	89.1	76.2	45.6	1.9	93.5	91.7
82.1	44.5	61.5	79.2	92.2	73.8	17.6	40.6	41.0	89.4

A histogram of these values of $X$ is shown in Figure 5.3.

Figure 5.3: Histogram of 20 simulated values of a spinner.

Although one thinks that any spin between 0 and 100 is equally likely to occur, there does not appear to be any obvious shape of this histogram. But the spinner was only spun 20 times. Let’s try spinning 1000 times – a histogram of the spins is shown in Figure 5.4.

Figure 5.4: Histogram of 1000 simulated values of the spinner.

Note that since there is a large sample of values, a small interval width was chosen for each bin in the histogram. Now a clearer shape in the histogram can be seen – although there is variation in the bar heights, the general shape of the histogram seems to be pretty flat or Uniform over the entire interval of possible values of $X$ between 0 and 100.

Suppose one was able to spin the spinner a large number of times. If one does this, then the shape of the histogram looks close to the Uniform density shown in Figure 5.5.

Figure 5.5: Shape of the histogram for a large number of simulated values of the spinner.

When the random variable $X$ is continuous, such as the case of the spinner result here, then one represents probabilities by means of a smooth curve that is called a density curve; more formally, a probability density curve. How does one find probabilities? When $X$ is continuous, then probabilities are represented by areas under the density curve.

As a simple example, what is the chance that the spinner result falls between 0 and 100? Since the scale of the spinner is from 0 to 100, one knows that all spins must fall in this interval, so the probability of $X$ landing in (0, 100) is 1. This probability is represented by the total area under the flat line between 0 and 100. Since the area of this rectangle is given by height times base, and the base is equal to 100, the height of this density curve must be 1/100 = 0.01. This is the value that should replace the “?” in Figure 5.5. In this case, one says that the spinner result has a Uniform distribution and the curve is a Uniform density.

By means of similar area computations, one finds other probabilities about the spinner location $X$.

What is the probability the spin falls between 20 and 60? That is, what is \[ P(20 < X < 60)? \] This probability is equal to the shaded area under the Uniform density between 20 and 60, shown in Figure 5.6. Using again the formula for the area of a rectangle, the base is $60 - 20 = 40$ and the height is 0.01, so \[ P(20 < X < 60) = 40 (0.01) = 0.4. \]

Figure 5.6: Illustration of finding the probability of $P(20 < X < 60)$.

What is the probability the spin is greater than 80? That is, what is $P(X > 80)$? Figure 5.7 shows the area that needs to be computed to find this probability. Note that the area under the curve only between the values 80 and 100 is shaded, since $X$ cannot be larger than 100. Again by finding the area of the shaded rectangle, we see that $P(X > 80)$ = 20 (0.01) = 0.2.

Figure 5.7: Illustration of finding the probability of $P(X > 80)$.
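These rectangle areas can also be obtained in R. The sketch below assumes the base R functions dunif(), which returns the height of a Uniform density, and punif(), which returns the area under a Uniform density to the left of a value; the object names are our own.

```r
# Spinner location X is Uniform on (0, 100)
lo <- 0
hi <- 100

# Height of the density at any point inside the interval
height <- dunif(50, min = lo, max = hi)

# P(20 < X < 60): area to the left of 60 minus area to the left of 20
p_between <- punif(60, min = lo, max = hi) - punif(20, min = lo, max = hi)

# P(X > 80): one minus the area to the left of 80
p_above <- 1 - punif(80, min = lo, max = hi)

c(height = height, p_between = p_between, p_above = p_above)
```

The three values agree with the hand computations: a height of 0.01 and probabilities of 0.4 and 0.2.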

Simulating from a Uniform Density

The R function runif() is helpful for simulating from a Uniform density. The arguments are the number of simulations and the minimum and maximum value of the support of the density. Below, 50 values of a random spinner are simulated that fall uniformly on the interval from 0 to 50. The histogram in Figure 5.8 graphs these simulated spins with the Uniform density drawn on top.

spins <- runif(50, min = 0, max = 50)

Figure 5.8: Histogram of 50 simulated Uniform values.

5.3 Probability Density: Waiting for a Bus

Consider a random experiment where a continuous random variable $X$ is observed such as the location of the spinner in Section 5.2. Define the support of $X$ to be the set of possible values for $X$. For example, the support of $X$ for the spinner example is the interval (0, 100). To describe probabilities about $X$, a density function denoted by $f(x)$ is defined. Not just any function $f$ will work – one requires that $f$ satisfy two properties:

Property 1. The probability density $f$ must be nonnegative which means that \[\begin{equation} f(x) \ge 0, {\rm for\, all\,} x. \tag{5.1} \end{equation}\]

Property 2. The total area under the probability density curve $f$ must be equal to 1. Mathematically, \[\begin{equation} \int_{-\infty}^\infty f(x) dx = 1. \tag{5.2} \end{equation}\]

To illustrate a probability density, suppose that a professor has a class that meets three times a week. To get to class, the professor walks to a bus stop and waits for a bus to go to school. From past experience, the professor knows that she can wait any time between 0 and 10 minutes for the bus, and she knows that each waiting time between 0 and 10 minutes is equally likely.

For a given week, what’s the chance that her longest wait will be under 7 minutes?

Let $W$ denote her longest waiting time for the week. One can show that the density for $W$ is given by \[ f(w) = \frac{3w^2}{1000}, 0 < w < 10. \] This density for this longest waiting time is shown in Figure 5.9.

Figure 5.9: Density curve for the longest waiting time $W$.

Before going any further, one should check that this is indeed a legitimate probability density:

First, note from the graph that the density does not take on negative values, so the first property is satisfied.
Second, for it to be a probability density, the entire area under the curve must be equal to 1. One can check this by finding the integral of the density between 0 and 10 (the region where the density is positive): \[ \int_0^{10} \frac{3w^2}{1000} dw = \frac{w^3}{1000}\Big|^{10}_0 = \frac{10^3}{1000} - \frac{0^3}{1000} = 1. \]
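Both properties can also be checked numerically. The following is a minimal sketch, assuming the base R function integrate() for numerical integration; the function name f and the grid of test points are our own choices.

```r
# Density of the longest waiting time W, set to zero outside (0, 10)
f <- function(w) ifelse(w > 0 & w < 10, 3 * w^2 / 1000, 0)

# Property 1: the density is nonnegative at every point of a test grid
grid <- seq(-5, 15, by = 0.01)
nonnegative <- all(f(grid) >= 0)

# Property 2: the total area under the curve is 1
total_area <- integrate(f, lower = 0, upper = 10)$value

c(nonnegative = nonnegative, total_area = total_area)
```

A grid check of this kind is only suggestive for Property 1, since it inspects finitely many points; here the formula $3w^2/1000$ makes nonnegativity evident.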

The entire area under the curve is indeed equal to 1, so $f$ is a legitimate probability density. Now that $f$ is known to be a probability density, one can use it to find probabilities. To find the probability that this longest waiting time is less than 7 minutes, $P(W < 7)$, one wishes to compute the area under the density curve between 0 and 7, as shown in Figure 5.10.

Figure 5.10: Density curve for the longest waiting time $W$, and $P(W < 7)$.

This is equivalent to the integral \[ \int_0^{7} \frac{3w^2}{1000} dw \] and, by evaluating this, one obtains the probability \[ \int_0^{7} \frac{3w^2}{1000} dw = \frac{w^3}{1000}\Big|^{7}_0 = \frac{7^3}{1000} - \frac{0^3}{1000} = 0.343. \]

Suppose one is interested in the probability that the longest waiting time is between 6 and 8 minutes. This is represented by the shaded area in Figure 5.11.

Figure 5.11: Density curve for the longest waiting time $W$, and $P(6 < W < 8)$.

To compute this area, one finds the integral of the density between 6 and 8: \[ \int_6^{8} \frac{3w^2}{1000} dw = \frac{w^3}{1000}\Big|^{8}_6 = \frac{8^3}{1000} - \frac{6^3}{1000} = 0.296. \]
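The two integrals above can be verified numerically. This sketch again assumes integrate() from base R; the object names are illustrative.

```r
f <- function(w) 3 * w^2 / 1000   # density of W on (0, 10)

p_under_7 <- integrate(f, lower = 0, upper = 7)$value   # P(W < 7)
p_6_to_8  <- integrate(f, lower = 6, upper = 8)$value   # P(6 < W < 8)

round(c(p_under_7, p_6_to_8), 3)
```

The results match the values 0.343 and 0.296 found by hand.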

Simulating Waiting Times

Recall that the waiting time variable $W$ was defined as the longest waiting time for the week where each of the separate waiting times has a Uniform distribution from 0 to 10 minutes. By simulating the process, one simulates values of $W$. By use of three applications of runif(), one simulates 1000 waiting times for each of Monday, Wednesday, and Friday. The pmax() function then finds the longest of the three waiting times in each simulated week.

wait_monday <- runif(1000, min = 0, max = 10)
wait_wednesday <- runif(1000, min = 0, max = 10)
wait_friday <- runif(1000, min = 0, max = 10)
longest_wait <- pmax(wait_monday, 
                     wait_wednesday,
                     wait_friday)

Figure 5.12 shows 1000 simulated values of $W$ and the density function $3w^2 / 1000$ is drawn on top. It appears that the histogram is a good match to the actual density function.

Figure 5.12: Histogram of 1000 simulated values of $W$ with the density function drawn on top.
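A display like Figure 5.12 might be drawn along the following lines. This is a sketch rather than the code used for the figure: the seed, axis label, and use of base graphics (hist() and curve()) are our own assumptions.

```r
set.seed(1)   # arbitrary seed, used only to make the sketch reproducible

# Longest of three Uniform(0, 10) waits, repeated 1000 times
longest_wait <- pmax(runif(1000, min = 0, max = 10),
                     runif(1000, min = 0, max = 10),
                     runif(1000, min = 0, max = 10))

# freq = FALSE puts the histogram on the density scale (total bar area is 1)
hist(longest_wait, freq = FALSE,
     xlab = "Longest wait (minutes)", main = "")

# Overlay the density f(w) = 3 w^2 / 1000 on the same axes
curve(3 * x^2 / 1000, from = 0, to = 10, add = TRUE)
```

Setting freq = FALSE matters here: with the default count scale, the bars and the density curve would not be comparable.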

5.4 The Cumulative Distribution Function

To find any probability about the maximum waiting time, one computes an area under the curve that is equivalent to integrating the density curve over a region. But there is a basic function that can be computed at the beginning that will simplify these probability computations.

Choose an arbitrary point $x$ – the cumulative distribution function at $x$, or cdf for short, is the probability that $W$ is less than or equal to $x$: \[\begin{equation} F(x) = P(W \le x) = \int_{-\infty}^x f(w) dw. \tag{5.3} \end{equation}\] Here suppose one chooses a value of $x$ in the interval (0, 10). Then $F(x)$ would be the area under the density curve between 0 and $x$ shown in Figure 5.13.

Figure 5.13: Illustration of the cumulative distribution function.

Writing this area as an integral, one computes $F(x)$ as \[ F(x) = P(W \le x) = \int_0^x \frac{3w^2}{1000} dw = \frac{w^3}{1000}\Big|^{x}_0 = \frac{x^3}{1000}. \] This formula is valid for any value of x in the interval (0, 10).

In fact, $F(x)$ is defined for all values of $x$ on the real line.

If $x$ is a value smaller than or equal to 0, then we see from the figure that the probability that $W$ is smaller than $x$ is equal to 0. So $F(x) = 0$ for $x \le 0$.
On the other hand, if $x$ is greater than or equal to 10, then the probability that $W$ is smaller than $x$ is 1. So $F(x) = 1$ for $x \ge 10$.

Putting it all together, one sees that the cdf $F$ is given by \[ F(x) = \begin{cases} 0, & x \le 0 \\ x^3 / 1000, & 0 < x < 10 \\ 1, & x \ge 10, \end{cases} \] illustrated in Figure 5.14.

Figure 5.14: The cumulative distribution function, $F(x)$, of the bus waiting example.

5.4.1 Finding probabilities using the CDF

Once we have computed the cdf function $F$, probabilities are found simply by evaluating $F$ at different points. Fortunately, no additional integration is needed.

For example, to find the probability that the maximum waiting time $W$ is less than or equal to 6 minutes, one just computes $F(6) = P(W \le 6) = 6^3 / 1000 = 0.216$ which is shown in Figure 5.15.

Figure 5.15: The cumulative distribution function $F(x)$ and evaluation of $F(6) = P(W \le 6)$.

To compute the probability that the maximum waiting time exceeds 8 minutes, first note that “exceeding 8 minutes” is the complement event to “less than or equal to 8 minutes”, and so \[ P(W > 8) = 1 - P(W \le 8) = 1 - F(8) = 1 - \frac{8^3}{1000} = 0.488. \] Likewise, if one is interested in the chance that the waiting time $W$ falls between 2 and 4, represent the probability as the difference of two “less-than” probabilities, and then subtract the two values of F. \[ P(2 < W < 4) = P(W \le 4) - P(W \le 2) = F(4) - F(2) = \frac{4^3}{1000} - \frac{2^3}{1000} = 0.056. \]
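The piecewise cdf translates directly into a small R function. In this sketch the name cdf_wait is our own, and clamping $x$ to the interval [0, 10] is one convenient way (among others) to handle the two flat pieces of $F$.

```r
# cdf of the longest waiting time W; accepts a vector of x values
cdf_wait <- function(x) {
  x <- pmin(pmax(x, 0), 10)   # clamp to [0, 10] so F is 0 below 0 and 1 above 10
  x^3 / 1000
}

p_le_6   <- cdf_wait(6)                 # P(W <= 6)
p_gt_8   <- 1 - cdf_wait(8)             # P(W > 8)
p_2_to_4 <- cdf_wait(4) - cdf_wait(2)   # P(2 < W < 4)

c(p_le_6, p_gt_8, p_2_to_4)
```

The three results agree with the values 0.216, 0.488, and 0.056 computed above, with no integration required once cdf_wait is available.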

Computing Probabilities by Simulation

For the waiting for a bus example, the variable longest_wait contains 1000 simulated values of our longest waiting time. This sample is used to compute approximate probabilities. To illustrate, to find the probability that the longest wait exceeds 8 minutes, one finds the proportion of simulated values of $W$ that exceed 8.

mean(longest_wait > 8)

## [1] 0.492

In a similar fashion one approximates the probability a longest waiting time falls between 6 and 10 minutes.

mean(longest_wait > 6 & longest_wait < 10)

## [1] 0.797
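For comparison, the exact values follow from the cdf of Section 5.4. The first probability was found there to be $P(W > 8) = 0.488$, and the second is \[ P(6 < W < 10) = F(10) - F(6) = 1 - \frac{6^3}{1000} = 0.784. \] The simulated proportions 0.492 and 0.797 are close to these exact values; discrepancies of this size are to be expected with only 1000 simulated values.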

5.5 Summarizing a Continuous Random Variable

Mean and standard deviation

One is interested in summarizing a continuous random variable. Natural summaries are given by the mean $\mu$ and the standard deviation $\sigma$, where these quantities are defined in a similar manner as for a discrete random variable, with the exception that summations are replaced by integrals.

The mean $\mu$, or equivalently the expected value of $X$, is given by \[\begin{equation} \mu = E(X) = \int_{-\infty}^\infty x f(x) dx. \tag{5.4} \end{equation}\] Just as in the discrete random variable case, there is an attractive interpretation of $\mu$. If one is able to observe a large number of values of $X$, then $\mu$ will be approximately equal to the sample mean $\bar X$ of these random values of $X$.

To define the spread of the values of $X$, one first computes the average squared deviation about the mean, the variance, \[\begin{equation} \sigma^2 = Var(X) = E(X - \mu)^2 = \int_{-\infty}^\infty (x - \mu)^2 f(x) dx. \tag{5.5} \end{equation}\] The standard deviation of $X$, $\sigma$, is defined to be the square root of the variance.

Let’s illustrate the computation of the mean and standard deviation for the bus waiting time problem. Using the definition of $f$, one gets that the mean is equal to \[ \mu = \int_0^{10} x \left(\frac{3x^2}{1000}\right) dx. \] Performing the integration, one gets \[ \mu = \int_0^{10} x \left(\frac{3x^2}{1000}\right) dx = \frac{3 x^4}{4000} \Big|^{10}_0 = \frac{3 (10)^4}{4000} = 7.5. \] On average, one expects the longest wait in a week to be 7.5 minutes.

The computation of the variance is a bit more tedious, but straightforward.
\[ \sigma^2 = \int_0^{10} (x -\mu)^2 \left(\frac{3x^2}{1000}\right) dx = 3.75. \] So the standard deviation of $X$ is $\sigma = \sqrt{3.75} = 1.94$.
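One way to carry out this computation uses the standard identity $Var(X) = E(X^2) - \mu^2$, which is not stated in the text above but follows by expanding $(x - \mu)^2$ inside the integral. Under that identity, \[ E(X^2) = \int_0^{10} x^2 \left(\frac{3x^2}{1000}\right) dx = \frac{3 x^5}{5000} \Big|^{10}_0 = \frac{3 (10)^5}{5000} = 60, \] and so \[ \sigma^2 = E(X^2) - \mu^2 = 60 - (7.5)^2 = 60 - 56.25 = 3.75. \]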

Computing the Mean and Standard Deviation by Simulation

Earlier, we demonstrated simulating 1000 values of the longest waiting time $W$. To check the computations of the mean $\mu$ and standard deviation $\sigma$, one computes the sample mean and standard deviation of the simulated values.

mean(longest_wait)

## [1] 7.576458

sd(longest_wait)

## [1] 1.883679

One sees that these empirical values are close approximations to the exact values $\mu = 7.5$ and $\sigma = 1.94$.

Percentiles

Another useful summary of a continuous random variable is a percentile. The 70th percentile, for example, is the value of $X$, call it $x_{70}$, such that 70% of the probability is to its left, as shown in Figure 5.16. That is, $x_{70}$ satisfies the equation \[ P(X \le x_{70}) = 0.70. \]

Figure 5.16: Illustration of the 70th percentile.

Since one recognizes the left hand side of the equation as equivalent to the cdf $F$ (which already has been computed as $x^3/1000$), the equation is written as \[ F(x_{70}) = 0.70, \] that is, \[ \frac{x_{70}^3}{1000} = 0.70. \] To find the 70th percentile, the above equation is solved for $x_{70}$ – after some algebra, we get \[ x_{70} = \sqrt[3]{700} = 8.88. \] This means that if one waits many weeks for this bus, approximately 70% of the longest waiting times will be shorter than 8.88 minutes.
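The same algebra gives any percentile of $W$: solving $x^3/1000 = p$ for $x$ yields $x = \sqrt[3]{1000 p}$. A minimal sketch in R follows; the function name quantile_wait is our own.

```r
# Inverse of the cdf F(x) = x^3 / 1000 on (0, 10): solves F(x) = p for x
quantile_wait <- function(p) (1000 * p)^(1/3)

x70 <- quantile_wait(0.70)     # 70th percentile
x70

quantile_wait(c(0.10, 0.90))   # exact 10th and 90th percentiles
```

Here x70 rounds to 8.88, and plugging it back into $x^3/1000$ returns 0.70.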

Computing Percentiles by Simulation

For the waiting for a bus example, the variable longest_wait contains 1000 simulated values of our longest waiting time. This sample is used to compute approximate percentiles by computing sample percentiles of the simulated values. For example, by use of the quantile() function, one finds that the 10th and 90th percentiles of $W$ are approximately 4.86 and 9.70 minutes.

quantile(longest_wait, c(0.1, 0.9))

##      10%      90% 
## 4.856374 9.701844

The probability that a longest waiting time is between 4.86 and 9.70 minutes is approximately 0.80.

5.6 Normal Distribution

Normal probability curve

One of the most popular races in the United States is the marathon, a grueling 26.2-mile run. Most people are familiar with the Boston Marathon that is held in Boston, Massachusetts every April. But other cities in the U.S. hold yearly marathons. Here we look at data collected from Grandma’s Marathon that is held in Duluth, Minnesota every June.

In the year 2003, there were 2515 women who completed Grandma’s Marathon. The completion times in minutes for all of these women can be downloaded from the marathon’s website. A histogram of these times, measured in minutes, is shown in Figure 5.17.

Figure 5.17: Histogram of women’s completion times in the Grandma’s Marathon.

Note that these measured times have a bell shape. Figure 5.18 superimposes a Normal curve on top of this histogram. Note that this curve is a pretty good match to the histogram. In fact, data like this marathon time data that are measurements are often well approximated by a Normal curve.

Figure 5.18: Histogram of women’s completion times in the Grandma’s Marathon, with a Normal curve on top.

A Normal density curve has the general form \[\begin{equation} f(x) = \frac{1}{\sqrt{2 \pi} \sigma} \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\}, -\infty < x < \infty. \tag{5.6} \end{equation}\] This density curve is described by two parameters – the mean $\mu$ and the standard deviation $\sigma$. The mean $\mu$ is the center of the curve. Looking at the Normal curve above, one sees that the curve is centered about 270 minutes – actually the mean of the Normal curve is $\mu$ = 274. The number $\sigma$, the standard deviation, describes the spread of the curve. Here the Normal curve standard deviation is $\sigma$ = 43. If one knows the mean and standard deviation of the Normal curve, one can make reasonable predictions where the majority of times of the women runners will fall.
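Equation (5.6) can be typed into R directly and compared with the built-in dnorm() function. In this sketch the function name normal_density and the particular times evaluated are our own choices.

```r
mu <- 274     # mean of the marathon times (minutes)
sigma <- 43   # standard deviation (minutes)

# Equation (5.6) written out term by term
normal_density <- function(x) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

times <- c(200, 240, 274, 300, 350)
cbind(by_hand = normal_density(times),
      dnorm   = dnorm(times, mean = mu, sd = sigma))
```

The two columns agree, and the largest density value occurs at the mean, 274 minutes.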

The famous Normal curve was independently discovered by several scientists. Abraham De Moivre in the 18th century showed that a Binomial probability for a large number of trials $n$ could be approximated by a Normal curve. Pierre Simon Laplace and Carl Friedrich Gauss also made important discoveries about this curve. By the 19th century, it was believed by some scientists such as Adolphe Quetelet that the Normal curve would represent the distribution of any group of homogeneous measurements. To illustrate his thinking, Quetelet considered the frequency distribution of the chest circumference measurements (in inches) for 5738 Scottish soldiers taken from the Edinburgh Medical and Surgical Journal (1817). A histogram of the chest measurements is shown in Figure 5.19. Quetelet’s beliefs were a bit incorrect – not every group of measurements will be Normal-shaped. However, it is generally true that a distribution of physical measurements from a homogeneous group, say heights of American women or foot lengths of Chinese men, will have this bell shape.

Figure 5.19: Histogram of chest circumference measurements of Scottish soldiers.

In the previous sections of this chapter, the notion of a continuous random variable was introduced. Here the Normal curve is introduced as a popular model for representing the distribution of a measurement random variable. It will also be seen that the Normal curve is helpful for computing Binomial probabilities and for representing the distributions of means taken from a random sample.

Suppose that the Normal density with $\mu$ = 274 minutes and $\sigma$ = 43 minutes represents the distribution of women's racing times. Say one is interested in the probability that a runner completes the race in less than 4 hours, or 240 minutes. One computes this probability by finding an area under the Normal curve. Specifically, as indicated in Figure 5.20, this probability is the area under the curve for all times less than 240 minutes.

Figure 5.20: Normal density with mean 274 and standard deviation 43, with illustration of the area under the curve less than 240 (minutes).

Normal Probability Calculations

One expresses this area as the integral \[ P(X \le 240) = \int_{-\infty}^{240} \frac{1}{\sqrt{2 \pi} \sigma} \exp\left\{-\frac{(x - \mu)^2}{2 \sigma^2}\right\} dx \] but unfortunately one cannot integrate this function analytically (as was done for the Uniform density) to find the probability. Instead one finds this area by use of the pnorm() function in R. This function is used for three examples, illustrating the computation of three types of areas.

Returning to our example, recall the distribution of marathon times is approximately Normally distributed with mean $\mu$ = 274 and standard deviation $\sigma$ = 43.

Finding a “less than” area.

Suppose one is interested in the probability that a woman marathon runner completes the race in under 240 minutes.
That is, one wishes to find $P(X < 240)$ which is the area under the Normal curve to the left of 240. The function value pnorm(x, m, s) gives the value of the cdf of a Normal random variable with mean $\mu = m$ and standard deviation $\sigma = s$ evaluated at the value $x$. For our example, the mean and standard deviation are given by 274 and 43, respectively, so the desired probability is given by

pnorm(240, 274, 43)

## [1] 0.2145602
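Although the integral has no closed form, it can be approximated numerically, which gives a check on the pnorm() value. The sketch below assumes the base R function integrate(); replacing the lower limit $-\infty$ by $\mu - 10\sigma$ is our own simplification, chosen because the area below that point is negligible.

```r
mu <- 274
sigma <- 43

# Numerically integrate the Normal density up to 240 minutes.
# The lower limit mu - 10 * sigma stands in for -Inf (an assumption of this
# sketch); the area to the left of it is negligible.
area <- integrate(dnorm, lower = mu - 10 * sigma, upper = 240,
                  mean = mu, sd = sigma)$value

c(numerical = area, pnorm = pnorm(240, mu, sigma))
```

The two numbers agree to several decimal places, which illustrates that pnorm() is simply reporting an area under the density curve.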

Finding a “between two values” area.

Suppose one is interested in computing the probability that a marathon runner completes a race between two values, such as $P(230 < X < 280)$, shown in Figure 5.21.

Figure 5.21: Normal density with mean 274 and standard deviation 43, with illustration of the area under the curve between 230 and 280 (minutes).

One writes this probability as the difference of two “less than” probabilities: \[\begin{align*} P(230 < X < 280) &= P(X < 280) - P(X < 230) \\ &= F(280) - F(230), \end{align*}\] where $F(x)$ is the cdf of a Normal(274, 43) random variable evaluated at $x$. Therefore, by use of the pnorm() function, this probability is equal to

pnorm(280, 274, 43) - pnorm(230, 274, 43)

## [1] 0.4023928

Finding a “greater than” area.

Last, sometimes one will be interested in the probability that $X$ is greater than some value, such as $P(X > 300)$, the probability a runner takes more than 300 minutes to complete the race, shown in Figure 5.22.

Figure 5.22: Normal density with mean 274 and standard deviation 43, with illustration of the area under the curve greater than 300 (minutes).

This probability is found by the complement property of probability, that \[\begin{align*} P(X > 300) &= 1 - P(X \le 300) \\ &= 1 - F(300). \end{align*}\] Therefore, one uses the pnorm() function to compute the probability that $X$ is at most 300, and then subtracts the answer from 1.

1 - pnorm(300, 274, 43)

## [1] 0.2727054

Computing Normal percentiles

In the marathon completion times example, we were interested in computing a probability that was equivalent to finding an area under the Normal curve. A different problem is to compute a percentile of the distribution. In the marathon example, suppose that t-shirts will be given away to the runners who get the 25% fastest times. How fast does a runner need to run the race to get a t-shirt?

Here one wishes to compute the 25th percentile of the distribution of times. This is a time, call it $x_{25}$, such that 25 percent of all times are smaller than $x_{25}$. This is shown graphically in Figure 5.23.

Equivalently, we wish to find the value $x_{25}$ such that \[ P(X \le x_{25}) = F(x_{25}) = 0.25. \]

Figure 5.23: Normal density with mean 274 and standard deviation 43, with illustration of the 25th percentile.

Calculating Normal Percentiles

Percentiles of a Normal curve are conveniently computed in R by use of the qnorm() function. Specifically, qnorm(p, m, s) gives the percentile of a Normal($m, s$) curve corresponding to a “left area” of p. In our example, the value of p is 0.25, and so the 25th percentile of the running times (with mean 274 minutes and standard deviation 43 minutes) is computed to be

qnorm(0.25, 274, 43)

## [1] 244.9969

This means one needs to finish the race in under 245.0 minutes to get a t-shirt in this competition.

Suppose one needs to complete the race faster than 10% of the runners to be invited to run in the race the following year. How fast does one need to run? If one wishes to have 10% of the times be larger than one’s time, this means that 90% of the times will be smaller than one’s time. That is, one wishes to find the 90th percentile, $x_{90}$, of the Normal distribution, shown in Figure 5.24.

Figure 5.24: Normal density with mean 274 and standard deviation 43, with illustration of the 90th percentile.

qnorm(0.90, 274, 43)

## [1] 329.1067

So 329 minutes is the time to beat if one wishes to be invited to participate in next year’s race.
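Since a percentile is defined through the cdf, qnorm() and pnorm() undo one another. The short sketch below illustrates this for the two percentiles just computed; the object names x25 and x90 are our own.

```r
mu <- 274
sigma <- 43

x25 <- qnorm(0.25, mu, sigma)   # 25th percentile of the times
x90 <- qnorm(0.90, mu, sigma)   # 90th percentile of the times

# Feeding each percentile back into pnorm() recovers the left area
left_areas <- pnorm(c(x25, x90), mu, sigma)
left_areas
```

The recovered left areas are 0.25 and 0.90, which is a quick way to confirm that a percentile has been computed for the intended tail.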

5.7 Binomial Probabilities and the Normal Curve

The Normal curve is useful for modeling batches of data, especially when one is collecting measurements of some process. But the Normal curve actually has a more important justification. We will explore several important results about the pattern of Binomial probabilities and sample means and we will find these results useful in our introduction to statistical inference.

First, consider different shapes of Binomial distributions. Suppose that half of one’s student body is female and one plans on taking a sample survey of $n$ students to learn if they are interested in using a new recreational sports complex that is proposed. Let $X$ denote the number of females in the sample. Assuming a random sample is chosen, it is known that $X$ will be distributed Binomial with parameters $n$ and $p=1/2$. What is the shape of the Binomial probabilities? Figure 5.25 displays the Binomial probabilities for sample sizes $n$ = 10, 20, 50, and 100.

Figure 5.25: Binomial probabilities for sample sizes $n$ = 10, 20, 50, and 100, and success probability $p = 0.5$.

Figure 5.26: Binomial probabilities for sample sizes $n$ = 10, 20, 50, and 100, and success probability $p = 0.1$.

What does one notice about these probability graphs? First, note that each distribution is symmetric about the mean $\mu = n p$. But, more interesting, the shape of the distribution seems to resemble a Normal curve as the number of trials $n$ increases.

Perhaps this pattern happens since one started with a Binomial distribution with $p$ = 0.5 and one would not see this behavior if a different value of $p$ was used. Suppose that only 10% of all students would use the new facility and let $X$ denote the number of students in your sample who say they would use the facility. The random variable $X$ would be distributed Binomial with parameters $n$ and $p$ = 0.1. Figure 5.26 shows the probability distributions again for the sample sizes $n$ = 10, 20, 50, and 100. As one might expect, the shape of the probabilities for $n$ = 10 is not very Normal-shaped – the distribution is skewed right. But note that as $n$ increases, the probabilities become more Normal-shaped and the Normal curve seems to be a good match for $n=100$.

Figures 5.25 and 5.26 illustrate a basic result: if one has a Binomial random variable $X$ with $n$ trials and probability of success $p$, then, as the number of trials $n$ approaches infinity, the distribution of the standardized score \[\begin{equation} Z = \frac{X - n p}{\sqrt{n p (1 - p)}} \tag{5.7} \end{equation}\] approaches a standard Normal random variable, that is, a Normal distribution with mean 0 and standard deviation 1. This is a very useful result. It means that for a large number of trials, one can approximate a Binomial random variable $X$ by a Normal random variable with mean and standard deviation \[\begin{equation} \mu = n p, \, \, \, \sigma = \sqrt{n p (1 - p)}. \tag{5.8} \end{equation}\]
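The quality of this approximation can be examined numerically by placing the Binomial probabilities next to the heights of the matching Normal curve. The following sketch uses $n = 100$ and $p = 0.1$ as an illustration; the range 0 to 25 and the object names are our own choices.

```r
n <- 100
p <- 0.1

mu <- n * p                      # Normal mean from (5.8)
sigma <- sqrt(n * p * (1 - p))   # Normal standard deviation from (5.8)

x <- 0:25
binom_prob  <- dbinom(x, size = n, prob = p)      # exact Binomial probabilities
normal_dens <- dnorm(x, mean = mu, sd = sigma)    # Normal curve at the same points

# Largest gap between the Binomial probabilities and the Normal curve
max_gap <- max(abs(binom_prob - normal_dens))
max_gap
```

Repeating the comparison with a smaller n, such as 10, should show a noticeably larger gap, in line with the skewness visible in Figure 5.26.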

This approximation result can be illustrated with our student survey example. Suppose that 10% of the student body would use the new recreational sports complex. One takes a random sample of 100 students — what’s the probability that 5 or fewer students in the sample would use the new facility?

The random variable $X$ in this problem is the number of students in the sample that would use the facility. This random variable has a Binomial distribution with $n = 100$ and $p = 0.1$ that is pictured as a histogram in Figure 5.27. By the approximation result, this distribution is approximated by a Normal curve with $\mu = 100 (0.1) = 10$ and $\sigma = \sqrt{100 (0.1) (0.9)} = 3$. This Normal curve is placed on top of the probability histogram in Figure 5.27 – note that it is a pretty good fit to the histogram.

Figure 5.27: Histogram of Binomial probabilities, with the approximated Normal curve on top.

Binomial Computations Using a Normal Curve

One is interested in the probability that at most 5 students use the facility, that is, $P(X \le 5)$. This probability is approximated by the area under a Normal(10, 3) curve between $X=0$ and $X=5$. Using the R pnorm() function, we compute this Normal curve area to be

pnorm(5, 10, 3) - pnorm(0, 10, 3)

## [1] 0.04736129

In this case, one can also find this probability exactly by a calculator or computer program that computes Binomial probabilities. Using the pbinom() function, we find the probability that $X$ is at most 5 is

pbinom(5, size = 100, prob = 0.10)

## [1] 0.05757689

Here one sees that the Normal approximation gives a similar answer to the exact Binomial computation.
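The two computations can be placed side by side. This is only a sketch: the helper compare_binomial_normal() is a hypothetical name introduced here, and it simply wraps the same pnorm() and pbinom() calls used in this section so that other values of $n$, $p$, and the cutoff can be tried.

```r
# Hypothetical helper: P(X <= k) for X ~ Binomial(n, p), computed two ways.
compare_binomial_normal <- function(k, n, p) {
  mu <- n * p
  sigma <- sqrt(n * p * (1 - p))
  c(
    normal = pnorm(k, mu, sigma) - pnorm(0, mu, sigma),  # Normal area between 0 and k
    exact  = pbinom(k, size = n, prob = p)               # exact Binomial probability
  )
}

result <- compare_binomial_normal(k = 5, n = 100, p = 0.1)
result
unname(result["exact"] - result["normal"])  # size of the approximation error
```

For the survey example the two answers differ by about 0.01, so the approximation is close but not exact.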

5.8 Sampling Distribution of the Mean

We have seen that Binomial probabilities are well-approximated by a Normal curve when the number of trials is large. There is a more general result about the shape of sample means that are taken from any population.

To begin our discussion about the sampling behavior of means, suppose one has a jar filled with a variety of candies of different weights. One is interested in learning about the mean weight of a candy in the jar. One could obtain the mean weight by measuring the weight for every single candy in the jar, and then finding the mean of these measurements. But that could be a lot of work. Instead of weighing all of the candies, suppose one selects a random sample of 10 candies from the jar and finds the mean of the weights of these 10 candies. What has one learned about the mean weight of all candies from this sample information?

To answer this type of question, one

assumes that one knows the weights of all candies in the jar;
looks at the pattern of means that one obtains when one takes random samples from the jar.

The group of items (here, candies) of interest is called the population. Assume first that one knows the population – that is, we know exactly the weights of all candies in the jar. There are five types of candies – Table 5.2 gives the weight of each type of candy (in grams) and the proportion of candies of that type.

Table 5.2 Population of candies in a jar.

	Weight	Proportion
fruity square	2	0.15
milk maid	5	0.35
jelly nougat	8	0.20
caramel	14	0.15
candy bars	18	0.15

Let $X$ denote the weight of a randomly selected candy from the jar. Note that $X$ is a discrete random variable with the probability distribution given in Table 5.2. This distribution is summarized by computing a mean $\mu$ and a standard deviation $\sigma$. The reader will verify in the end-of-chapter exercises that $\mu$ = 8.4500 and $\sigma$ = 5.3617. So if one was really able to weigh each candy in the jar, one would find the mean weight to be $\mu$ = 8.4500 grams.
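As a quick check of these two summaries, one can apply the usual formulas for a discrete random variable, $\mu = \sum x P(X = x)$ and $\sigma = \sqrt{\sum (x - \mu)^2 P(X = x)}$, to the values in Table 5.2. The sketch below is one possible way to do this in R; the vector names match the ones used later in this section.

```r
weights <- c(2, 5, 8, 14, 18)
proportion <- c(.15, .35, .2, .15, .15)

mu <- sum(weights * proportion)                     # mean of X
sigma <- sqrt(sum((weights - mu)^2 * proportion))   # standard deviation of X
c(mu = mu, sigma = sigma)
```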

Suppose a random sample of 10 candies is selected with replacement from the jar and the mean is computed. Note that this is called the sample mean $\bar X$ to distinguish it from the population mean $\mu$.

Sampling Candies

This sampling can be simulated using the following R code. The distribution of candies is stored in the vectors weights and proportion. By use of the sample() function, one obtains the following candy weights:

weights <- c(2, 5, 8, 14, 18)
proportion <- c(.15, .35, .2, .15, .15)
sample(weights, size = 10, prob = proportion, replace = TRUE)

##  [1] 14  8 14  5  2 14  8  5  2  2

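Since the sampling is random, each run of sample() gives a different set of weights. For a sample with the weights 5, 8, 5, 14, 5, 18, 8, 18, 5, 8, one computes the sample mean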
\[ \bar X = (5+8+5+14+5+18+8+18+5+8)/10 = 9.4 \, {\rm gm}. \]

Suppose this process is repeated two more times – in the second sample, one obtains $\bar X$= 6.9 gm and in the third sample, one obtains $\bar X$= 8.8 gm. The three sample mean values are plotted in Figure 5.28.

Figure 5.28: Graph of 3 sample means from 10 randomly selected candies.

Suppose that one continues to take random samples of 10 candies from the jar and plot the values of the sample means on a graph – one obtains the sampling distribution of the mean $\bar X$, shown in Figure 5.29.

Figure 5.29: Histogram of the sampling distribution of the mean $\bar X$.

Note that there is an interesting pattern of these sample means – they appear to have a Normal shape. This motivates an amazing result, called the Central Limit Theorem, about the pattern of sample means. If one takes sample means from any population with mean $\mu$ and standard deviation $\sigma$, then the sampling distribution of the means (for large enough sample size) will be approximately Normally distributed with mean and standard deviation \[\begin{equation} E(\bar X) = \mu, \, \, \, SD(\bar X) = \frac{\sigma}{\sqrt{n}}. \tag{5.9} \end{equation}\]

Let’s illustrate this result for our candy example. Recall that the population of candy weights had a mean and standard deviation given by $\mu$ = 8.45 and $\sigma$ = 5.36, respectively. If one takes samples of size $n = 10$, then, by this result, the sample mean $\bar X$ will be approximately Normally distributed where \[ E(\bar X) = 8.45, \, \, \, SD(\bar X) = \frac{5.36}{\sqrt{10}} = 1.69. \]

This Normal curve is drawn on top of the histogram of sample means, shown in Figure 5.30.

Figure 5.30: Histogram of the sampling distribution of the mean $\bar X$, with approximated Normal curve on top.

There are two important points to mention about this result.

First, the expected value of the sample means, $E(\bar X)$, is equal to the population mean $\mu$. When one takes a random sample, it is possible that the sample mean $\bar X$ is far away from the population mean $\mu$. But, if one takes many random samples, then, on average, the sample mean will be close to the population mean.
Second, note that the spread of the sample means, as measured by the standard deviation, is equal to $\sigma / \sqrt{n}$. Since the spread of the population is $\sigma$, note that the spread of the sample means will be smaller than the spread of the population. Moreover, if one takes random samples of a larger size, then the spread of the sample means will decrease.
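Both points can be checked by simulation for the candy jar. The following is a minimal sketch, not the code used to produce the figures: the helper one_sample_mean(), the seed, and the choice of 5000 repetitions are assumptions made here for illustration.

```r
set.seed(123)  # arbitrary seed, used only to make the run reproducible

weights <- c(2, 5, 8, 14, 18)
proportion <- c(.15, .35, .2, .15, .15)

# hypothetical helper: mean weight of one random sample of n candies
one_sample_mean <- function(n) {
  mean(sample(weights, size = n, prob = proportion, replace = TRUE))
}

xbars <- replicate(5000, one_sample_mean(10))

# simulated center and spread of the sample means ...
c(mean_of_xbars = mean(xbars), sd_of_xbars = sd(xbars))
# ... next to the values given by the Central Limit Theorem
c(mu = 8.45, sigma_over_root_n = 5.36 / sqrt(10))
```

The simulated mean and standard deviation of the sample means should land close to $\mu$ and $\sigma / \sqrt{n}$, with small differences due to simulation error.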

The second point can be illustrated in the context of our candy example. Above, one selected random samples of size $n$ = 10 and computed the sample means. Suppose instead one selected repeated samples of size $n = 25$ from the candy jar – how does the sampling distribution of means change?

Using R, one can simulate the process of taking samples of size 25 – histograms of the sample means are shown in Figure 5.31. By the Central Limit Theorem, the sample means will be approximately Normal-shaped with mean and standard deviation \[ E(\bar X) = 8.45, \, \, \, SD(\bar X) = \frac{5.36}{\sqrt{25}} = 1.07. \]
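The effect of the sample size on the spread follows directly from the formula $\sigma / \sqrt{n}$. As a small worked sketch, using the rounded value $\sigma = 5.36$ from above and adding $n = 100$ as an extra illustrative case:

```r
sigma <- 5.36
n <- c(10, 25, 100)

sd_xbar <- sigma / sqrt(n)   # standard deviation of the sample mean
data.frame(n = n, sd_xbar = round(sd_xbar, 2))
```

Note that quadrupling the sample size from 25 to 100 only cuts the spread in half, since the spread depends on $\sqrt{n}$ rather than $n$.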

Figure 5.31: Histogram of the sampling distribution of the mean $\bar X$, with sample sizes $n = 10$ and $n = 25$.

Comparing the $n$ = 10 sample means with the $n$ = 25 sample means in Figure 5.31, what’s the difference? Both sets of sample means are approximately Normally distributed with an average equal to the population mean. But the $n$ = 25 sample means have a smaller spread – this means that as you take bigger samples, the sample mean $\bar X$ is more likely to be close to the population mean $\mu$. The simulation is left as an end-of-chapter exercise.

The Central Limit Theorem works for any population

We illustrate the Central Limit Theorem for a second example where the population has a distinctive non-Normal shape. At one university, many of the students’ hometowns are within 40 miles of the school. There also are a large number of students whose homes are between 80 and 120 miles from the university. Given the population of “distances of home” of all students, it is interesting to see what happens when we take random samples from this population.

If we let $X$ denote “distance from home”, imagine that the population of distances is described by the continuous density curve in Figure 5.32. Two humps can be seen in this density – these correspond to the large number of students whose homes are in the ranges 0 to 40 miles and 70 to 130 miles. Suppose the mean and standard deviation of this population are given by $\mu = 60$ miles and $\sigma = 41.6$ miles, respectively.

Figure 5.32: Density curve of the population of distances.

Now imagine that one takes a random sample of $n$ students from this population and computes the sample mean from this sample. For example, suppose one takes a random sample of 20 students and collects the distances from home from these students – once one has collected the 20 distances, one computes the sample mean $\bar X$. Here are two samples and the values of $\bar X$:

Sample 1: 102 22 23 24 114 102 114 102 22 19 88 31 30 100 111 105 105 17 100 21
xbar = 67.6 mi.

Sample 2: 12 127 33 34 73 19 111 99 16 20 22 16 24 62 22 76 91 115 117 93
xbar = 59.1 mi.

If this sampling process is repeated many times, what will the distribution of sample means look like? Also, what is the effect of the sample size $n$? To answer these questions, one can let the computer simulate repeated samples of sizes $n = 1, n = 2, n = 5$, and $n = 20$. The histograms in Figure 5.33 show the distributions of sample means for the four sample sizes.

Figure 5.33: Histograms of random samples of distances, with sample sizes of $n = 1, n = 2, n = 5$, and $n = 20$.

As one might expect, if samples of size 1 are selected, our sample means look just like the original population. If samples of size 2 are selected, then the sample means have a funny three-hump distribution. But, note as one takes samples of larger sizes, the sampling distribution of means looks more like a Normal curve. This is what one expects from the Central Limit Theorem result – no matter what the population shape, the distribution of the sample means will be approximately Normal if the sample size is large enough.
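This behavior can be explored with a simulation. Since the exact population curve is not given, the sketch below uses a hypothetical stand-in that is an assumption of this example: half of the students live a Uniform(0, 40) distance away and the other half a Uniform(80, 120) distance away. This two-humped population has mean 60 and standard deviation close to 41.6 miles, but it is not the density drawn in the figure. The function draw_distances() and the seed are likewise introduced here only for illustration.

```r
set.seed(2024)  # arbitrary seed, used only to make the run reproducible

# Hypothetical two-humped population (an assumption, not the book's curve):
# Uniform(0, 40) with probability 1/2, Uniform(80, 120) with probability 1/2.
draw_distances <- function(n) {
  from_near_hump <- runif(n) < 0.5
  ifelse(from_near_hump, runif(n, 0, 40), runif(n, 80, 120))
}

sample_sizes <- c(1, 2, 5, 20)
xbars <- lapply(sample_sizes, function(n) {
  replicate(5000, mean(draw_distances(n)))
})
names(xbars) <- paste0("n", sample_sizes)

# simulated spread of the sample means next to sigma / sqrt(n)
spread <- data.frame(n = sample_sizes,
                     sd_simulated = sapply(xbars, sd),
                     sd_predicted = 41.6 / sqrt(sample_sizes))
spread

# for n = 2 the means fall in three groups: both near, one of each, both far
three_groups <- prop.table(table(cut(xbars$n2, breaks = c(0, 40, 80, 120))))
three_groups
```

Calling hist() on any element of xbars, for example hist(xbars$n20), draws the corresponding histogram of sample means.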

What is the distribution of the sample means when we take samples of size $n$ = 20? One just applies the Central Limit Theorem result. The sample means will be approximately Normal with mean and standard deviation \[\begin{equation} E(\bar X) = \mu, \, \, \, SD(\bar X) = \frac{\sigma}{\sqrt{n}}. \tag{5.10} \end{equation}\] Since one knows the mean and standard deviation of the population and the sample size, one just substitutes these quantities and obtains \[ E(\bar X) = 60, \, \, \, SD(\bar X) = \frac{41.6}{\sqrt{20}} = 9.3. \] These results can be used to answer some questions.

What is the probability that a student’s distance from home is between 40 and 60 miles?

Actually this is a difficult question to answer exactly, since one does not know the exact shape of the population. But, looking at the graph of the population, one sees that the curve takes on very small values between 40 and 60 miles. So this probability is close to zero – very few students live between 40 and 60 miles from our school.

What is the probability that, if one takes a sample of 20 students, the mean distance from home for these twenty students is between 40 and 60 miles?

This is a different question than the first one. This question is asking about the chance that the sample mean falls between 40 and 60 miles. Since the sampling distribution of $\bar X$ is approximately Normal with mean 60 and standard deviation 9.3, one can compute this by using R. Using the pnorm() function, one obtains

pnorm(60, 60, 9.3) - pnorm(40, 60, 9.3)

## [1] 0.4842436

It is interesting to note that although it is unlikely for students to live between 40 and 60 miles from the school, it is pretty likely for the sample mean for a group of 20 students to fall between 40 and 60 miles.
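The contrast between a single student and a sample mean can also be seen by simulation. The sketch below relies on an assumed stand-in for the population, since the exact curve is not given: an equal mix of Uniform(0, 40) and Uniform(80, 120) distances, which has mean 60 and standard deviation close to 41.6 miles. Under this stand-in no student lives between 40 and 60 miles away, a cruder version of the "close to zero" probability read from the graph. The function draw_distances() and the seed are hypothetical choices made for this example.

```r
set.seed(99)  # arbitrary seed, used only to make the run reproducible

# Hypothetical two-humped population (an assumption, not the book's curve)
draw_distances <- function(n) {
  from_near_hump <- runif(n) < 0.5
  ifelse(from_near_hump, runif(n, 0, 40), runif(n, 80, 120))
}

single <- draw_distances(10000)                       # one student at a time
xbars <- replicate(10000, mean(draw_distances(20)))   # means of 20 students

p_single <- mean(single > 40 & single < 60)
p_mean <- mean(xbars > 40 & xbars < 60)

c(single_student = p_single,
  mean_of_20 = p_mean,
  normal_approx = pnorm(60, 60, 9.3) - pnorm(40, 60, 9.3))
```

The simulated proportion for the mean of 20 students should be close to the Normal approximation, while the proportion for a single student is zero under the assumed population.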

What is the probability that the mean distance exceeds 100 miles?

Here one wants to find the probability that $\bar X$ is greater than 100, that is $P(\bar X > 100)$. Using R, one computes

1 - pnorm(100, 60, 9.3)

## [1] 8.498565e-06

This probability is essentially zero, which means that it is highly unlikely that a sample mean of 20 student distances will exceed 100 miles.

5.9 Exercises

Waiting at an ATM Machine

You are waiting at your local ATM machine and as usual, you are waiting in a line. Suppose you know that your waiting time can be between 0 and 5 minutes and any value between 0 and 5 minutes is equally likely.

The graph below shows the density function for $X$, the waiting time. What is the height of this function?

Find the probability you wait more than 2 minutes.
Find the probability you wait between 2 and 3 minutes.

Morning Wake-Up

Suppose you wake up at a random time in the morning between 6 am and 12 pm.

Find the probability you wake up before 11 am.
Find the probability you wake up between 8 and 10 am.
What is an “average” or typical time you will wake up? Explain how you computed this number.
Find the standard deviation of the time.

The Median Waiting Time

In the “waiting for a bus” example, suppose that you record the median time $T$ (in minutes) that you wait for the bus on the three days. The density function for this median time is given by \[ f(t) = \frac{6 t (10 - t)}{1000}, \, \, 0 < t < 10. \]

Draw a graph of this density function.
Find the probability that the median time is between 5 and 7 minutes.
Find the cdf $F(t)$ for all values of $t$.
Using the cdf you found in part c, find the probability the median time is over 6 minutes.
Find the 75th percentile of your median waiting time.

The Sum of Two Spins

Suppose you spin two spinners, where the location of the arrow for each spinner is equally likely to fall between 0 and 10.

If you let $S$ be the sum of the two spins, it can be shown that the density function of $S$ is given by \[ f(s) = \begin{cases} s/100, & 0 < s \le 10 \\ (20-s) / 100, & 10 < s \le 20 \end{cases} \] and shown by the figure below.

Check that this function satisfies the two properties of a probability density function.
Find the probability the sum of the two spins is smaller than 5.
Find the cdf function $F$.
Using the cdf function, find the probability the sum of spins falls between 8 and 12.
Using the cdf function, find the probability the sum of spins exceeds 12.

Salaries for Professional Basketball Players

Let $X$ denote the salary (in millions of dollars) of a professional basketball player. A reasonable density function for $X$ is given by \[ f(x) = \frac{0.15}{x^{1.3}}, \, \, x \ge 0.1 \] shown by the figure below.

What proportion of basketball players earn more than 1 million dollars?
What proportion of players earn between 1 and 2 million dollars?
Find the cdf function.
Using the cdf function, find the probability a player earns less than one-half a million dollars.
Find the “average” salary of an NBA player.

Grading on a Curve

Suppose the grades on a math test are distributed according to the density curve \[ f(x) = \frac{x}{5000}, \, \, 0 < x < 100. \]

Draw a graph of this density curve.
Find the mean grade on this test.
What proportion of students who take this test get a grade of 90 or higher?
What proportion of students get a $C$ grade, where $C$ is defined to be between 70 and 80?
Is this test harder or easier than the test grades in your statistics class? Explain.

Time to Clean Your Room

Suppose the time that it takes you to clean your room (in hours) is a random variable $X$ with the cdf function given below. A graph of the cdf is also shown. \[ F(x) = \begin{cases} 0, & x < 0 \\ 0.75 (2 x ^ 3 / 3 - x ^ 4 / 4), & 0 \le x \le 2 \\ 1, & x > 2 \end{cases} \]

Find the probability you can clean your room in under one hour.
Find the probability it takes you over one and a half hours to clean your room.
Using the graph, find a value $M$ such that it is equally likely that $X$ is smaller than $M$ and $X$ is larger than $M$. [Hint: $M$ is the 50th percentile of $X$.]

Time to Complete a Race

Suppose a group of children are running a race. The times (in minutes) that the children complete the race can be described by the density function \[ f(x) = \frac{4 + (x - 3)^2}{21}, 3 < x < 6. \]

Graph this density function.
Looking at your graph, is it more common to have a slow time (near 6 minutes) or a fast time (near 3 minutes)?
Find the probability a child completes the race in under 4 minutes.
Find the probability that a child’s time exceeds 5 1/2 minutes.
Find the median running time.

Spinning a Random Spinner

Suppose you flip a coin. If the coin lands heads, you spin a spinner that is equally likely to fall at any point in the interval (0, 4). If the coin lands tails, you spin a different spinner that lands at any point in the interval (2, 6). If $X$ denotes your spin, the density function for $X$ is graphed below.

Check that this graphed function is indeed a probability density.
Find the probability that $X$ is greater than 5.
Find the probability that $X$ falls between 1 and 3.

Lifetimes of Light Bulbs

Suppose that a company is interested in the amount of time that a particular type of light bulb will last until it burns out. After sampling the lifetimes for a large group of light bulbs, it is decided that the lifetime $X$ (in hours) is well-described by the exponential distribution of the form \[ f(x) = \frac{1}{100} e^{-x / 100}, x > 0. \] The cdf for $X$ is drawn below.

In addition, the cdf is computed for some values of $X$ in the following table.

$x$	$F(x)$	$x$	$F(x)$
0	0	180	0.8347
30	0.2592	210	0.8775
60	0.4512	240	0.9093
90	0.5934	270	0.9328
120	0.6988	300	0.9502
150	0.7769

Find the probability that a lifetime of a bulb will be less than 90 hours.
Find the probability the lifetime is between 120 and 180 hours.
From the table, approximate the median lifetime.
Approximate the 95th percentile.

Locations of Dart Throws

Suppose you throw a dart at a circular target such that the dart is equally likely to land in any location on the target. The locations for a large number of dart throws are shown in the figure below.

Let $X$ denote the distance of a throw from the bulls eye. It can be shown that the density function of $X$ has the form
\[ f(x) = \frac{x}{2}, 0 < x < 2. \]

Find the probability your throw lands within a distance of 1 unit from the target.
Find the probability your throw lands between .5 and 1.5 units from the target.
If you threw the dart many times at the target, find your average distance from the target.

Heights of Men

Suppose heights of American men are approximately Normally distributed with mean 70 inches and standard deviation 4 inches.

What proportion of men is between 68 and 74 inches?
What proportion of men is taller than 6 feet?
Find the 90th percentile of heights.

Test Scores

Test scores in a precalculus test are approximately Normally distributed with mean 75 and standard deviation 10. If you choose a student at random from this class

What is the probability he or she gets an $A$ (over 90)?
What is the probability he or she gets a $C$ (between 70 and 80)?
What is the letter grade of the lower quartile of the scores?

Body Temperatures

The normal body temperature was measured for 130 subjects in an article published in the Journal of the American Medical Association. These body temperatures are approximately Normally distributed with mean $\mu$ = 98.2 degrees and standard deviation $\sigma$= 0.73.

Most people believe that the mean body temperature of healthy individuals is 98.6 degrees, but actually the mean body temperature is smaller than 98.6. What proportion of healthy individuals have body temperatures smaller than 98.6?
Suppose a person has a body temperature of 96 degrees. What is the probability of having a temperature less than or equal to 96 degrees? Based on this computation, would you say that a temperature of 96 degrees is unusual? Why?
Suppose that a doctor diagnoses a person as sick if his or her body temperature is above the 95th percentile of the temperature of “healthy” individuals. Find this body temperature that will give a sick diagnosis.

Baseball Batting Averages

Batting averages of baseball players can be well approximated by a Normal curve. The figure below displays the batting averages of players during the 2003 baseball season with at least 300 at-bats (opportunities to hit). The mean and standard deviation of the matching Normal curve shown in the figure are $\mu$ = 0.274 and $\sigma$ = 0.027, respectively.

If you choose a baseball player at random, find the probability his batting average is over 0.300. (This is a useful benchmark for a “good” batting average.)
Find the probability this player has a batting average between 0.200 and 0.250.
A baseball player is said to hit below the Mendoza line (named for weak-hitting baseball player Mario Mendoza) if his batting average is under 0.200. Given our model, find the probability that a player hits below the Mendoza line.
Suppose that a player has an incentive clause in his contract that states that he will earn an additional $1 million if his batting average is in the top 15%. How well does the player have to hit to get this additional salary?

Emergency Calls

Suppose that the AAA reports that the average time it takes to respond to an emergency call on the highway is 25 minutes. Assume that the times to respond to emergency calls are approximately Normally distributed with mean 25 minutes and standard deviation 4 minutes.

If your car gets stuck on a highway and you call the AAA for help, find the probability that it will take longer than 30 minutes to get help.
Find the probability that you’ll wait between 20 and 30 minutes for help.
Find a time such that you are 90% sure that the wait will be smaller than this number.

Buying a Battery for your iPod

Suppose you need to buy a new battery for your iPod. Brand $A$ lasts an average of 11 hours and Brand $B$ lasts an average of 12 hours. You plan on using your iPod for eight hours on a trip and you want to choose the battery that is most likely to last 8 hours (that is, have a life that is at least as long as 8 hours).

Based on this information, can you decide which battery to purchase? Why or why not?
Suppose that the battery lives for Brand $A$ are Normally distributed with mean 11 hours and standard deviation 1.5 hours, and the battery lives for Brand $B$ are Normally distributed with mean 12 hours and standard deviation 2 hours. Compute the probability that each battery will last at least 8 hours.
On the basis of this calculation in part (b), which battery should you purchase?

Lengths of Pregnancies

It is known that the lengths of completed pregnancies are approximately Normally distributed with mean 266 days and standard deviation 16 days.

What is the probability a pregnancy will last more than 270 days?
Find an interval that will contain the middle 50% of the pregnancy lengths.
Suppose a doctor wishes to tell a mother that he is 90% confident that the pregnancy will be shorter than $x$ days. Find the value of $x$.

Attendances at Baseball Games

Attendance for home games of the Cleveland Indians for a recent baseball season can be approximated by a Normal curve with mean $\mu$ = 24,667 and standard deviation $\sigma$ = 6144.

Consider the attendance for one randomly selected game during the 2006 season.

Find the probability the attendance exceeds 30,000.
Find the probability the attendance is between 20,000 and 30,000.
Suppose that the attendance at one game in the following season is 12,000. Based on the Normal curve, compute the probability that the attendance is at most 12,000. Based on this computation, is this attendance unusual? Why?

Coin Flipping

Suppose you flip a fair coin 1000 times.

How many heads do you expect to get?
Find the probability that the number of heads is between 480 and 520.
Suppose your friend gets 550 heads. What is the probability of getting at least 550 heads? Do you believe that your friend’s coin really was fair? Explain.

Use of Online Banking Services

Suppose that a newspaper article claims that 80% of adults currently use online banking services. You wonder if the proportion of adults who use online banking services in your community, $p$, is actually this large. You take a sample of 100 adults and 70 tell you they use online banking.

If the newspaper article is accurate, find the probability that 70 or fewer of your sample would use on-line banking.
Based on your computation, is there sufficient evidence to suggest that less than 80% of your community use online banking services? Explain.

Time to Complete a Race

Suppose a group of children are running a race. The times (in minutes) that the children complete the race can be described by the density function \[ f(x) = \frac{4+ ( x-3)^2}{21}, 3 < x < 6. \] A graph of this density is shown below. The mean and standard deviation of this density are given by 4.83 and 0.84 minutes, respectively.

Suppose 25 students run this race and you find the mean completion time. Find the probability that the mean time exceeds 5 minutes.
Find an interval that you are 90% confident contains the mean completion time for the 25 students.

Snowfall Accumulation

Your local meteorologist has collected data on snowfall for the past 100 years. Based on these data, you are told that the amount of snowfall in January is approximately Normally distributed with mean 15 inches and standard deviation 4 inches.

Find the probability you get more than 20 inches of snow this year.
In the next ten years, find the probability that the average snowfall (for these ten years) will exceed 20 inches.

Total Waiting Time at a Bank

You are waiting to be served at your bank. From past experience, you know that your time to be served has a Uniform distribution between 0 and 10 minutes.

Find the mean and standard deviation of your waiting time.
The Central Limit Theorem can also be stated in terms of the sum of random variables. If the random variables $X_1, ..., X_n$ represent a random sample drawn from a population with mean $\mu$ and standard deviation $\sigma$, then the sum of random variables $S = \sum_{i=1}^n X_i$, for large sample size $n$, will be approximately Normally distributed with mean $n \mu$ and standard deviation $\sqrt{n} \sigma$. Suppose you wait at the bank for 30 days. Use this version of the Central Limit Theorem to find the probability that your total waiting time will exceed three hours.

Total Errors in Check Recording

Suppose you record the amount of a written check to the nearest dollar. It is reasonable to assume that the error between the actual check amount and the written amount has a Uniform distribution between $-0.50$ and $+0.50$.

Find the mean and standard deviation of one error.
Suppose you write 100 checks in a single month and $S$ denotes the total error in recording these checks. Find the probability that $S$ is smaller than 5 dollars. (Use the version of the Central Limit Theorem described in the previous exercise, Total Waiting Time at a Bank.)
Find an interval of the form $(-c, c)$ so that $P(-c < S < c) = 0.95$.

Distribution of Measurements

Suppose that a group of measurements is approximately Normally distributed with mean $\mu$ and standard deviation $\sigma$.

Find the probability that a measurement falls within one standard deviation of the mean.
Is it likely that you collect a measurement that is larger than $\mu +3 \sigma$ ? Explain.
Find an interval that contains the middle 50% of the measurements.

Salaries of Professional Football Players

Suppose you learn that the mean salary of all professional football players this season is 7 million dollars with a standard deviation of 2 million dollars.

Do you believe that the distribution of salaries is approximately Normally distributed? If your answer is no, sketch a plausible distribution for the salaries.
From your graph, find an approximate probability that a salary is smaller than $6 million.
Suppose you take a random sample of 30 salaries. Find the probability that the mean salary for this sample is smaller than $6 million.

Weights of Candies

In the candy bowl example, the probability distribution of the candy weight $X$ is given in the following table.

Table 5.1. Weight and proportion of 5 types of candies.

	$x$	$P(X = x)$
fruity square	2	0.15
milk maid	5	0.35
jelly nougat	8	0.20
caramel	14	0.15
candy bars	18	0.15

Verify by calculation that the mean and standard deviation of $X$ are given by $\mu$ = 8.4500 and $\sigma$ = 5.3617, respectively.

Sleeping Times

Suppose sleeping times of college students are approximately Normally distributed. You are told that 25% of students sleep less than 6.5 hours and 25% of students sleep longer than 8 hours. Given this information, determine the mean and standard deviation of the Normal distribution.

R Exercises

A Continuous Spinner

Suppose you spin a spinner where all values from 0 to 100 are equally likely.

Write down the density function for $X$, one spin from this spinner.
Use the following command to simulate 1000 values from this Uniform distribution and store the values in the vector spinner:

spinner <- runif(1000, min = 0, max = 100)

Construct a histogram of the simulated spins.
Use the simulated spins to approximate the probability $P(X > 70)$.

Simulating a Normal Distribution

Suppose monthly snowfalls in Rochester, New York are Normally distributed with mean 25 inches and standard deviation 10 inches.

Using the rnorm() function, simulate snowfalls for 1000 hypothetical months in Rochester.
Construct a graph of these snowfall amounts.
Approximate from the simulated values the probability that a snowfall falls in the interval (20, 30). Compare your answer with the exact probability found using the pnorm() function.
From the simulated values, find an interval that contains the middle 80% of the snowfalls. Compare your answer with the exact interval found using the qnorm() function.

Waiting for a Bus

In the “waiting for a bus” example, the amount of time that one waits for a bus has a Uniform distribution from 0 to 10 minutes. One waits for a bus on Monday, Wednesday, and Friday and records the minimum of the three waiting times.

Write a program to simulate 1000 values of this minimum waiting time.
One can show that the minimum waiting time $Y$ has density given by \[ f(y) = \frac{3}{1000} (10 - y) ^ 2, \, \, 0 < y < 10. \] Compare a histogram of simulated values from (a) with this density function to confirm that you have indeed simulated from the correct distribution.

Weights of Candies (continued)

Suppose one takes a sample of 10 candies from the distribution of candy weights shown in Exercise 28.

Write a function to take a random sample of 10 candies from the bowl and return the sample mean $\bar X$.
Use the replicate() function to repeat this process for 1000 iterations – store the sample means in the vector xbars.
Construct a histogram of the sample means and comment on its shape. Also find the mean and standard deviation of the sample means.
Repeat this exercise using samples of size $n = 25$. Are there any changes in the mean and standard deviation of the sample means?

Spins and the Central Limit Theorem

Suppose you are spinning a spinner with equally likely outcomes 1, 2, 3, 4, 5. $X$ represents a single spin from this spinner.

Find the mean $\mu$ and standard deviation $\sigma$ of $X$.
Write a function to simulate 10 spins from this spinner and compute the sample mean $\bar X$.
Simulate 1000 samples of 10 spins, obtaining a vector of sample means.
Construct a histogram of the sample means and comment on its shape. Also find the mean and standard deviation of the sample means.
Check your calculations in part (d) by finding the exact mean and standard deviation of the sample mean $\bar X$.