4  Learning About a Proportion

4.1 Introduction

Suppose data \(y\) is observed from a sampling distribution \(f(y | \theta)\) that depends on an unknown parameter \(\theta\). We assume that one has beliefs about \(\theta\) before sampling that are expressed through a prior density \(g(\theta)\). Once a value of \(y\) is observed, one’s updated beliefs about the parameter \(\theta\) are reflected in the posterior density, the conditional density of \(\theta\) given \(y\): \[ g(\theta | y) = \frac{f(y | \theta)g(\theta) }{f(y)}, \] where \(f(y)\) is the marginal density of \(y\): \[ f(y) = \int f(y | \theta) g(\theta) d\theta . \]

In the computation of the posterior density, note that the only terms involving the unknown parameter \(\theta\) are the likelihood function \(L(\theta) = f(y | \theta)\) and the prior density \(g(\theta)\). Bayes’ rule says that the posterior density is proportional to the product of the likelihood and the prior, or \[ g(\theta | y) \propto L(\theta) g(\theta). \]

In a Bayesian analysis, both the posterior density and the marginal density play important roles. The posterior density combines all of the information about the parameter from both the prior density and the data. One performs different types of inference by computing relevant summaries of the posterior density. The marginal density \(f(y)\) reflects the distribution of the data \(y\) before any data are observed. This density is often called the predictive density, since \(f(y)\) is used to make predictions about future data values.

4.2 An Example on Learning About a Proportion

In this chapter, we discuss the basic elements of a Bayesian analysis through the problem of learning about a population proportion \(p\). We take a random sample from the population of size \(n\) and observe \(y\) successes – for a given value of \(p\), the probability of \(y\) is given by the binomial formula \[ f(y | p) = {n \choose y} p^y (1-p)^{n - y}. \]

As an example, suppose that the coordinator of developmental math courses at a particular university is concerned about the proportion of students in these courses who have math anxiety, where “math anxiety” is defined by obtaining a particular score on an anxiety rating instrument. A sample of 30 students takes the instrument and 10 have math anxiety. What can be said about the proportion of all developmental math course students who have math anxiety?

The standard estimate of \(p\) is the proportion of successes in the sample, \(\hat p = y/n\), and the traditional Wald “large-sample” confidence interval for \(p\) is given by \[ \left(\hat p - z_{\alpha/2} \sqrt{\frac{\hat p (1- \hat p)}{n}}, \hat p + z_{\alpha/2} \sqrt{\frac{\hat p (1- \hat p)}{n}}\right), \] where \(z_{\alpha/2}\) is the \(1-\alpha/2\) quantile of the standard normal distribution.

For large samples, this interval will cover the unknown proportion in repeated sampling with probability \(1 - \alpha\). However, this interval estimate has questionable value for samples with very few observed successes or failures. Suppose that no students in our sample have math anxiety. Then \(y = 0\), \(\hat p = 0/30 = 0\) and the confidence interval will be degenerate at zero. (Similarly, if all the students have math anxiety, then \(\hat p = 30/30 = 1\) and the confidence interval will be degenerate at one.) Since one certainly believes that the proportion is larger than zero, this degenerate interval at zero doesn’t make any sense.

One ad-hoc solution to the “zero successes” problem is to initially add two artificial successes and two artificial failures to the data, and then apply the Wald interval to this adjusted data. This approach is recommended in the literature, and the resulting confidence interval has good coverage properties in repeated sampling. We will see that this ad-hoc procedure has a natural correspondence with a Bayesian interval that incorporates prior information about the proportion.
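A small R sketch contrasting the two intervals for the math anxiety data (the function name prop_intervals is ours):

```r
# Wald interval and the "add two successes and two failures" adjustment
prop_intervals <- function(y, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  # Wald interval based on the raw sample proportion
  p_hat <- y / n
  wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)
  # Adjusted interval: add 2 artificial successes and 2 artificial failures
  p_adj <- (y + 2) / (n + 4)
  adjusted <- p_adj + c(-1, 1) * z * sqrt(p_adj * (1 - p_adj) / (n + 4))
  list(wald = wald, adjusted = adjusted)
}

prop_intervals(y = 10, n = 30)   # the math anxiety sample
prop_intervals(y = 0, n = 30)    # Wald interval degenerate at zero; adjusted interval is not
```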

4.3 Using a Discrete Prior

One simple way of incorporating prior information about \(p\) is by use of a discrete prior. One makes a list of plausible values \(p_1, ..., p_k\) for the proportion and then assigns probabilities \(P(p_1), ..., P(p_k)\) to these values. It may be difficult to directly assess the individual prior probabilities, but it may be easier to think about the probability of one proportion value relative to the probabilities of other values. One might first assign a large integer value, say 10, to the value of \(p\) that is believed most likely, and then assess the probabilities of the remaining values relative to the probability of the most likely value. Once the relative probabilities are determined, then the probabilities are normalized to obtain the prior probabilities.

In the example, suppose one lists the possible values for the proportion of mathematics students with math anxiety displayed in the following table.

| \(p\)  | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 |
|--------|------|------|------|------|------|------|------|------|------|------|
| Prior  |      |      |      |      |      |      |      |      |      |      |

Suppose one’s best guess at the proportion of students with math anxiety is \(p = 0.20\) so this value is assigned a “prior weight” of 10.
The values \(p = 0.15\) and \(p = 0.25\) are believed to be half as likely as \(p = 0.20\), so each value is assigned a prior weight of 5. The value \(p = 0.30\) is thought to be only 30% as likely as \(p = 0.20\), so this proportion value is assigned a weight of 3. Continuing in this fashion, one obtains the table of prior weights for \(p\) as shown in Table \(\ref{table:priortable}\). One converts these prior weights to probabilities by dividing each weight by the sum of the weights. Since the sum of the prior weights is 31, the prior probability of \(p = 0.05\) is equal to \(P(0.05) = 1/31 = 0.032\). The third row of the table displays the prior probabilities.

| \(p\)        | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 |
|--------------|------|------|------|------|------|------|------|------|------|------|
| Prior Weight | 1    | 2    | 5    | 10   | 5    | 3    | 2    | 1    | 1    | 1    |
| Prior        | .032 | .065 | .161 | .323 | .161 | .097 | .065 | .032 | .032 | .032 |

Once this prior distribution is assigned, one can compute the posterior probabilities by use of Bayes’ rule. One observes \(y\) successes in \(n\) trials. The likelihood of \(p = p_i\) given this result is (up to the binomial coefficient, which does not depend on \(p\)) \[ L(p_i) = p_i^y (1- p_i)^{n-y}, \] and the posterior probability of \(p_i\) is given (up to a proportionality constant) by multiplying the prior probability by the likelihood: \[ P(p_i | {\rm data}) \propto P(p_i) L(p_i) = P(p_i) p_i^y (1- p_i)^{n-y}. \] The following table displays the posterior distribution calculations in the familiar table format. The columns of the table include the values of the proportion, the values of the prior, the likelihoods, and the products of the prior and the likelihood. One normalizes the probabilities by first computing the sum of the products (denoted by SUM in the table), and then dividing each product by this sum.

| \(p\)     | Prior      | Likelihood               | Product                             | Posterior                                 |
|-----------|------------|--------------------------|-------------------------------------|-------------------------------------------|
| \(p_1\)   | \(P(p_1)\) | \(p_1^y(1-p_1)^{n-y}\)   | \(P(p_1)\, p_1^y(1-p_1)^{n-y}\)     | \(P(p_1)\, p_1^y(1-p_1)^{n-y}/SUM\)       |
| \(p_2\)   | \(P(p_2)\) | \(p_2^y(1-p_2)^{n-y}\)   | \(P(p_2)\, p_2^y(1-p_2)^{n-y}\)     | \(P(p_2)\, p_2^y(1-p_2)^{n-y}/SUM\)       |
| ...       | ...        | ...                      | ...                                 | ...                                       |
| \(p_k\)   | \(P(p_k)\) | \(p_k^y(1-p_k)^{n-y}\)   | \(P(p_k)\, p_k^y(1-p_k)^{n-y}\)     | \(P(p_k)\, p_k^y(1-p_k)^{n-y}/SUM\)       |
|           |            |                          | SUM                                 |                                           |

The Bayes’ rule calculations are illustrated in the following table for our math anxiety example. We observed \(y = 10\) students with math anxiety in a sample of \(n = 30\), so the likelihood is \(p^{10} (1-p)^{20}\). The computed values of the likelihood are very small, so they have been multiplied by \(10^{12}\) in the table to obtain integer values.

| \(p\) | Prior | Likelihood | Product | Posterior |
|-------|-------|------------|---------|-----------|
| 0.05  | 0.032 | 0          | 0       | 0.000     |
| 0.10  | 0.065 | 12         | 1       | 0.000     |
| 0.15  | 0.161 | 224        | 36      | 0.019     |
| 0.20  | 0.323 | 1181       | 381     | 0.200     |
| 0.25  | 0.161 | 3024       | 487     | 0.255     |
| 0.30  | 0.097 | 4712       | 457     | 0.239     |
| 0.35  | 0.065 | 5000       | 325     | 0.170     |
| 0.40  | 0.032 | 3834       | 123     | 0.064     |
| 0.45  | 0.032 | 2185       | 70      | 0.037     |
| 0.50  | 0.032 | 931        | 30      | 0.016     |
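The calculations in the table above can be reproduced with a few lines of R; here the likelihood values are left unscaled, since only the normalized posterior probabilities matter:

```r
# Discrete-prior Bayes' rule for a proportion: posterior is proportional to prior x likelihood
p <- c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50)
prior_weight <- c(1, 2, 5, 10, 5, 3, 2, 1, 1, 1)
prior <- prior_weight / sum(prior_weight)    # normalize the weights to probabilities

y <- 10; n <- 30
likelihood <- p^y * (1 - p)^(n - y)          # binomial kernel for y successes in n trials
product <- prior * likelihood
posterior <- product / sum(product)          # divide each product by SUM

round(cbind(p, prior, posterior), 3)
sum(posterior[4:7])                          # P(0.20 <= p <= 0.35 | data), about 0.864
```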

To interpret the posterior probabilities, remember that initially we believed that the proportion of math anxious students was about 0.20, although we were unsure about its true value and the prior was relatively diffuse about \(p = 0.20\). The most likely value of \(p\) from the posterior distribution is \(p = 0.25\). The observed proportion of math anxious students in the sample is \(y/n = 10/30 = 0.33\), and the posterior estimate is a compromise between the sample proportion and the prior mode. We can use the posterior distribution to find an interval estimate for the proportion. Note from the table that the most likely values of \(p\) are \[ p = 0.20, 0.25, 0.30, 0.35 \] with total probability \[ 0.200 + 0.255 + 0.239 + 0.170 = 0.864. \] So the interval (0.20, 0.35) is an 86.4% interval estimate for \(p\) – the posterior probability \[ P(0.20 \le p \le 0.35| {\rm data}) = 0.864. \]

4.4 Using a Noninformative Prior

There are some advantages to using a discrete prior for a proportion. It provides a starting point for finding a prior distribution that reflects one’s knowledge, before sampling, about the location of the proportion. Also it is easy to summarize a discrete posterior distribution. But since the proportion \(p\) is a continuous parameter, one’s prior should be a continuous distribution on the interval from 0 to 1.

First, suppose one has little knowledge about the location of the proportion. In our example, suppose that one has little information about the proportion of students in the class who have math anxiety. How can one construct a prior distribution that reflects little or imprecise knowledge about the location of the parameter? This type of distribution is called a noninformative prior or ignorance prior. Using this type of prior, the posterior distribution will typically be more influenced by the data than the prior information.

One possible choice for a noninformative prior assumes that \(p\) has a uniform distribution \[ g(p) = 1, 0 < p < 1. \] This distribution implies that every subinterval of values of \(p\) of a given length has the same probability.

If we observe \(y\) successes in \(n\) trials, we wish to find the posterior density of \(p\), the density of the proportion conditional on \(y\). By Bayes’ rule, this density is given by \[ g(p | y) = \frac{f(y | p) g(p)}{\int_0^1 f(y | p) g(p) dp} \propto f(y|p) g(p), \] which gives the familiar POSTERIOR \(\propto\) LIKELIHOOD \(\times\) PRIOR recipe.

If we use a uniform prior for \(p\), then the posterior density is given by \[ g(p | y) \propto p^y (1-p)^{n-y}, \, 0 < p < 1. \] If we view this function as a function of the proportion \(p\) where \(y\) and \(n\) are fixed, then we recognize this density as a beta density of the form \[ g(p | y) = \frac{1}{B(a^*, b^*)} p^{a^* - 1} (1-p)^{b^*-1}, \, 0 < p < 1, \] where \(a^* = y+1\) and \(b^* = n - y + 1\).
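For the math anxiety data (\(y = 10\), \(n = 30\)) this posterior is a beta(11, 21) curve; a quick R sketch:

```r
# Posterior of p under a uniform prior with y = 10 successes in n = 30 trials
y <- 10; n <- 30
curve(dbeta(x, y + 1, n - y + 1), from = 0, to = 1,
      xlab = "p", ylab = "posterior density")
```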

4.5 Using a Conjugate Prior

In many situations, the use of noninformative priors is appropriate since the user does not have any knowledge about the parameter from previous experience. But in other situations such as the math anxiety example, the user does have knowledge about the unknown proportion before sampling and one wishes to construct a continuous prior on the unit interval that represents this prior knowledge.

One convenient family of prior distributions is the beta family with shape parameters \(a\) and \(b\): \[ g(p) = \frac{1}{B(a, b)} p^{a - 1} (1-p)^{b-1}, \, 0 < p < 1. \] As demonstrated by the graphs in Figure ???, the beta family can take many shapes and can reflect a variety of information about the proportion \(p\). In practice, one chooses the parameters \(a\) and \(b\) that match one’s beliefs about the proportion.

One way of assessing values of \(a\) and \(b\) is to guess at the values of the prior mean and variance of \(p\). Suppose these guesses are \(M\) and \(V\), respectively. The prior mean and variance of a beta(\(a, b\)) distribution are \(a/(a+b)\) and \(ab/[(a+b)^2(a+b+1)]\). Then by solving the equations

\[ M = \frac{a}{a+b}, \, \, V = \frac{a b}{(a+b)^2 (a+b+1)} \]

for \(a\) and \(b\), one obtains the beta prior distribution. The problem with this method is that it may be difficult for a user to specify the prior moments of the distribution since moments can be affected by the shape or tail behavior of the distribution which may be unknown.
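For reference, the two equations can be solved in closed form: \(a + b = M(1-M)/V - 1\), with \(a = M(a+b)\) and \(b = (1-M)(a+b)\). A minimal R helper illustrating the match (the function name and the example guesses are ours):

```r
# Solve M = a/(a+b) and V = ab/((a+b)^2 (a+b+1)) for the beta parameters a and b
beta_from_moments <- function(M, V) {
  s <- M * (1 - M) / V - 1   # s = a + b
  c(a = M * s, b = (1 - M) * s)
}

beta_from_moments(M = 0.25, V = 0.01)   # hypothetical prior guesses for the mean and variance
```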

An alternative approach is to assess the parameters \(a\) and \(b\) indirectly through the specification of prior quantiles. In our example, suppose that the user believes that the median of the prior for the proportion of students is \(q_{0.5} = 0.23\). This means that he/she believes that the proportion is equally likely to be smaller or larger than 0.23. The user then expresses the sureness of this guess at the median by specifying a second quantile. Suppose the user says that he/she is 90% confident that the proportion \(p\) is less than 0.38. So the prior information is given by \[ P(p < 0.23) = 0.50, \, \, P(p < 0.38) = 0.90. \] By use of a program such as the function beta.select() in the LearnBayes package, one matches these prior quantiles with the beta parameters \(a = 4.0, b = 12.5\).
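A sketch of this assessment, assuming the LearnBayes package is installed; beta.select() takes each quantile as a list with a probability p and a value x:

```r
# Match the two prior quantiles to a beta(a, b) prior
library(LearnBayes)

quantile1 <- list(p = 0.5, x = 0.23)   # median: P(p < 0.23) = 0.50
quantile2 <- list(p = 0.9, x = 0.38)   # 90th percentile: P(p < 0.38) = 0.90
beta.select(quantile1, quantile2)      # returns the shape parameters, here (4.0, 12.5)
```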

Once one assesses the values of the beta parameters, it is easy to compute the posterior distribution. By multiplying the prior and the likelihood, one obtains that the posterior density of \(p\) is proportional to \[ g(p | y) \propto L(p) g(p) \] \[ = p^y (1-p)^{n-y} \times p^{a-1} (1-p)^{b-1} \] \[ = p^{a + y -1} (1-p)^{b + n - y -1}, \]

which we recognize as a beta density with updated parameters \(a^* = a + y\) and \(b^* = b + n - y\). We say that the beta density is a conjugate prior density since the prior and posterior have the same functional form.

In our example, if our prior is beta(4.0, 12.5) and we have \(y = 10\) math anxious students in a sample of \(n = 30\), then the posterior distribution is beta(4.0 + 10, 12.5 + 20) or beta( 14.0, 32.5).

4.6 Inference

After one observes data, then all knowledge about the parameter is contained in the posterior distribution. It is common to simply display the posterior density and the reader can learn about the location and spread by simply looking at this curve. To obtain different types of statistical inferences, one summarizes the posterior distribution in various ways. We illustrate using the posterior distribution to obtain point and interval estimates of the parameter.

4.6.1 Point Inference

A suitable point estimate of a parameter is a single-number summary of the posterior density. The posterior mean is the mean of the posterior distribution given by the integral \[ E(p | y) = \int p \, g(p | y) dp. \] The posterior median is the median of the posterior distribution, the value \(p_{0.5}\) such that the proportion is equally likely to be smaller or larger than \(p_{0.5}\): \[ P(p < p_{0.5}) = 0.5. \] The posterior mode is the value \(\hat p\) where the posterior density is maximized: \[ g(\hat p | y) = \max_p g(p | y). \]

In the case where a beta(\(a, b\)) prior is assigned to a proportion \(p\), the posterior distribution is also in the beta family with updated parameters \(a^* = a + y\) and \(b^* = b + n - y\). The posterior mean of \(p\) is the mean of the beta density \[ E(p | y) = \frac{a^*}{a^*+b^*} = \frac{y + a}{n + a + b}. \] The posterior median \(p_M\) is the 0.5 fractile of the beta curve. It is not expressible in closed form, but is easily available by use of software. The posterior mode is found by finding the value of \(p\) that maximizes the density \(p^{a^*-1} (1-p)^{b^*-1}\). A straightforward calculation shows the posterior mode is \[ \hat p = \frac{a^*-1}{a^*+b^*-2}. \] For our example, our posterior density is beta(14.0, 32.5). The posterior mean is given by \(E(p | y) = 14.0/(14.0 + 32.5) = 0.301\). By use of the R command qbeta, the posterior median is found to be \(p_M = 0.298\), and the posterior mode is \(\hat p = (14.0 -1 )/(14.0 + 32.5 - 2) = 0.292\).
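These summaries are easily computed in R:

```r
# Point summaries of the beta(14, 32.5) posterior
a_star <- 14.0; b_star <- 32.5

a_star / (a_star + b_star)              # posterior mean, about 0.301
qbeta(0.5, a_star, b_star)              # posterior median, about 0.298
(a_star - 1) / (a_star + b_star - 2)    # posterior mode, about 0.292
```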

In the case where the posterior density is approximately symmetric, as in this example, the posterior mean, posterior median, and posterior mode will be approximately equal. In other situations where the posterior density is right or left skewed, these summary values can be different. One nice feature of the posterior median is its clear interpretation as the value that divides the posterior probability in half.

4.6.2 Interval Estimation

Typically, a point estimate such as a posterior median is insufficient for understanding the location of a parameter. A Bayesian interval estimate or credible interval is an interval that contains the parameter with a given probability. Specifically, a \(100 (1-\gamma)\) percent credible interval is any interval \((a, b)\) such that \[ P( a < p < b) = 1 - \gamma. \] There are many intervals that contain \(100 (1-\gamma)\) percent of the posterior probability. A convenient choice is the equal-tail interval whose endpoints are the \(\gamma/2\) and \(1-\gamma/2\) quantiles of the posterior distribution: \[ (p_{\gamma/2}, p_{1-\gamma/2}). \] An alternative is the highest posterior density interval or HPD interval, which is the shortest interval that contains this probability content.

In our example, the posterior for \(p\) is beta(14.0, 32.5). If we wish to construct a 90% interval estimate, one possible interval would be \((0, p_{.90}) = (0, 0.389)\) and another would be \((p_{.10}, 1) = (0.217, 1)\). Both are undesirable since they are unnecessarily wide. The equal-tail interval is formed from the 5th and 95th percentiles and is equal to (0.197, 0.415). Using the function hpd in the TeachingDemos package, one computes the HPD interval (0.191, 0.409). Since the posterior density is approximately symmetric, the equal-tail and HPD intervals are approximately equal.
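A sketch of both computations in R (the HPD calculation assumes the TeachingDemos package is installed):

```r
# 90% equal-tail and HPD interval estimates for the beta(14, 32.5) posterior
qbeta(c(0.05, 0.95), 14, 32.5)    # equal-tail interval, about (0.197, 0.415)

library(TeachingDemos)
hpd(qbeta, shape1 = 14, shape2 = 32.5, conf = 0.90)   # HPD interval, about (0.191, 0.409)
```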

4.6.3 Estimation of Probabilities

One attractive feature of the Bayesian approach is that one can see if the parameter falls in different regions by simply computing the posterior probabilities of these regions. In the math anxiety example, suppose we are interested in the plausibility that the proportion falls in the intervals (0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1). The posterior distribution for the proportion of math anxious students is beta(14.0, 32.5) and, by use of the R pbeta command, we can compute the probabilities of these regions; they are displayed in Table \(\ref{table:postprobs}\). Is it likely that the proportion of math anxious students is larger than 0.4? The answer would be no, since the posterior probability that \(p > 0.4\) is only 0.08. We see from this table that it is very likely that the proportion falls between 0.2 and 0.4.

| Interval   | Posterior Probability |
|------------|-----------------------|
| (0, 0.2)   | 0.06                  |
| (0.2, 0.4) | 0.87                  |
| (0.4, 0.6) | 0.08                  |
| (0.6, 0.8) | 0.00                  |
| (0.8, 1.0) | 0.00                  |
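These probabilities can be reproduced directly with pbeta():

```r
# Posterior probabilities of the regions under the beta(14, 32.5) posterior
breaks <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
round(diff(pbeta(breaks, 14, 32.5)), 2)   # probabilities of (0, 0.2), (0.2, 0.4), ...
1 - pbeta(0.4, 14, 32.5)                  # P(p > 0.4 | data), about 0.08
```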

4.7 Using Alternative Priors

The choice of a beta prior is made for convenience. With a beta prior, the posterior has the same functional (beta) form and it is easy to summarize the posterior distribution. But Bayes’ rule can be applied with any continuous prior density for \(p\) on the unit interval. We illustrate this point by using an alternative density for the proportion based on prior beliefs about the logit of the proportion.

In some situations, one may have prior beliefs about the logit of \(p\) defined by \[ \theta = \log \frac{p}{1-p}. \] Suppose that one believes, before sampling, that \(\theta\) is normally distributed with mean \(\mu = -1.21\) and standard deviation \(\tau = 0.55\). By transforming the logit \(\theta\) to \(p\) by \[ p = \frac{\exp(\theta)}{1+\exp(\theta)}, \] one can show that the induced prior on \(p\) is given by \[ g(p) = \phi(\log \frac{p}{1-p}; \mu, \tau) \frac{1}{p(1-p)}, \, \, 0 < p < 1, \] where \(\phi(x; \mu, \tau)\) is the normal density with mean \(\mu\) and standard deviation \(\tau\).

As before, the likelihood function is \(L(p) = p^y (1-p)^{n-y}\), where \(n = 30\) and \(y = 10\). By using the “prior times likelihood” recipe, the posterior density of \(p\) is given by \[ g(p | y) \propto L(p) g(p) = \left(p^y (1-p)^{n-y}\right) \times \left( \phi(\log \frac{p}{1-p}; \mu, \tau) \frac{1}{p(1-p)} \right). \]

In this situation, we no longer have a conjugate analysis, since the prior and posterior densities have different functional forms. Moreover, the posterior has a functional form that we do not recognize as a member of a familiar family such as the beta. However, this just means that we will need alternative tools to summarize the posterior distribution to perform inferences.
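One simple alternative is a grid approximation: evaluate the unnormalized posterior on a fine grid of proportion values, normalize over the grid, and summarize. A rough R sketch, assuming the logit-normal prior above:

```r
# Grid approximation of the posterior under the logit-normal prior
y <- 10; n <- 30
mu <- -1.21; tau <- 0.55

p <- seq(0.001, 0.999, length.out = 1000)                  # grid of proportion values
prior <- dnorm(log(p / (1 - p)), mu, tau) / (p * (1 - p))  # induced prior density on p
like  <- p^y * (1 - p)^(n - y)                             # binomial likelihood
post  <- prior * like
post  <- post / sum(post)                                  # normalize over the grid

sum(p * post)                       # approximate posterior mean
p[which.max(post)]                  # approximate posterior mode
```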

4.8 Prediction

In this chapter, we have focused on the use of the posterior distribution to make inferences about the proportion \(p\). It is also possible to learn about the plausibility of future outcomes by inspection of the predictive distribution. In our math anxiety example, suppose we administer the instrument to a new sample of 30 students. How many students in the new sample will be math anxious?

Let \(y^*\) denote the number of math anxious students in a future sample of size \(n^*\). Conditional on \(p\), the distribution of \(y^*\) will be binomial(\(n^*, p\)). If our current beliefs about the proportion are represented by the density \(g(p)\), then the predictive density of \(y^*\) will be given by the integral

\[\begin{eqnarray*} f(y^*) = \int_0^1 f(y^*|p) g(p) dp \nonumber \\ = \int_0^1 {n^* \choose y^*} p^{y^*} (1-p)^{n^*-y^*} g(p) dp. \nonumber \\ \end{eqnarray*}\]

Suppose we assign \(p\) a uniform prior; that is, \(g(p) = 1\). If we substitute this prior for \(g(p)\), then the predictive density is given by \[\begin{eqnarray*} f(y^*) = \int_0^1 {n^* \choose y^*} p^{y^*} (1-p)^{n^*-y^*} dp \nonumber \\ = {n^* \choose y^*} B(y^*+1, n^* - y^*+1) \nonumber \\ = \frac{1}{n^*+1}. \nonumber \\ \end{eqnarray*}\] If we use a uniform prior, then each of the \(n^*+1\) possible values of \(y^*\) is equally likely.

Suppose our current knowledge about the proportion is contained in a beta(\(a, b\)) density. Then the predictive density is given by \[\begin{eqnarray*} f(y^*) = \int_0^1 {n^* \choose y^*} p^{y^*} (1-p)^{n^*-y^*} \frac{1}{B(a, b)} p^{a-1}(1-p)^{b-1} dp \nonumber \\ = {n^* \choose y^*} \frac{B(a + y^*, b + n^* - y^*)}{B(a, b)}, y^* = 0, ..., n^*. \nonumber \\ \end{eqnarray*}\] This is called a beta-binomial density since it is a mixture of binomial densities, where the proportion \(p\) follows a beta density.

In our example, after observing the sample, the beliefs about the proportion of math anxious students are represented by a beta(14.0, 32.5) distribution. By use of the R function pbetap() in the LearnBayes package, one can compute the predictive density for the number of math anxious students in a future sample of \(n^* = 30\). The figure shows that there is sizable variation in \(y^*\); a 90% prediction interval for \(y^*\) is given by {4, 5, …, 13, 15}. Why is the prediction interval so wide? There are two sources of variability in prediction. First, there is uncertainty about the proportion of math anxious students \(p\), as reflected in the posterior density \(g\). Second, there is uncertainty in the number of anxious students \(y^*\) for a fixed value of \(p\), as reflected in the sampling density \(f\). The predictive distribution incorporates both types of uncertainty and therefore results in a relatively wide prediction interval.
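A self-contained sketch of this computation using the beta-binomial formula directly (an alternative to calling pbetap()):

```r
# Beta-binomial predictive distribution for y* out of n* = 30 future students,
# using the beta(14, 32.5) posterior as the current beliefs about p
a <- 14.0; b <- 32.5
n_star <- 30
y_star <- 0:n_star

pred <- choose(n_star, y_star) * beta(a + y_star, b + n_star - y_star) / beta(a, b)
sum(pred)                                  # should equal 1 (a proper probability distribution)
round(cbind(y_star, pred), 3)[1:16, ]      # predictive probabilities for y* = 0, ..., 15
```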