6 Prior Distributions
6.1 Informative Prior
6.1.1 Specifying a Beta Prior
In a Bayesian analysis, one needs to specify a density \(g(\theta)\) that reflects one’s prior beliefs about the location of the parameter \(\theta\). The problem is that one typically knows only a typical value of \(\theta\) and has some sense of the sureness of this guess. The question is: “How can one construct a prior density that represents this imprecise prior information?”
In this section, we will talk about constructing a prior for a proportion, although the discussion will extend to any single parameter.
To use a concrete example, suppose I’m interested in the proportion of students \(p\) at my university who send or receive text messages while driving. I wish to construct a prior for \(p\) that reflects my beliefs about the size of this proportion.
A standard approach for constructing a prior assumes that \(p\) has a density belonging to a familiar functional family, and one chooses the parameters of the density to match one’s beliefs. For a proportion, the usual choice of density is a beta curve with parameters \(a\) and \(b\). This simplifies the prior assessment task considerably: instead of constructing an entire density function, one needs only to assess two parameter values. The implicit assumption is that the beta family of distributions is sufficiently flexible to represent different beliefs about the proportion value.
To choose the parameters \(a\) and \(b\), one matches these parameter values with statements about the location and spread of the density. We typically measure location by the mean and spread by the standard deviation – these moments have simple expressions for the beta family: \[ E(p) = \frac{a}{a+b}, \, \, SD(p) = \sqrt{\frac{a b}{(a+b)^2 (a + b + 1)}}. \] One can guess at the mean and standard deviation of \(p\), then use the expressions to find values of the matching parameters \(a\) and \(b\).
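As a quick sketch, the moment equations can be solved in R. Rearranging them gives \(a + b = m(1-m)/s^2 - 1\), with \(a = m(a+b)\) and \(b = (1-m)(a+b)\). The helper function beta_moments() below is hypothetical, written just for illustration:

```r
# Hypothetical helper: solve the mean/SD equations for the beta parameters.
# From E(p) = a/(a+b) = m and SD(p)^2 = m(1-m)/(a+b+1) = s^2,
# we get a + b = m(1-m)/s^2 - 1, a = m(a+b), b = (1-m)(a+b).
beta_moments <- function(m, s) {
  t <- m * (1 - m) / s^2 - 1   # the "prior sample size" a + b
  c(a = m * t, b = (1 - m) * t)
}
beta_moments(0.15, 0.10)   # e.g., prior guess: mean 0.15, SD 0.10
```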
Unfortunately, there are problems with this method. It is generally hard to specify the mean and standard deviation of a parameter. Moreover, the moments of a prior can be significantly affected by the tails of the distribution, about which one typically has little information. It is typically easier to state beliefs about location and spread in terms of percentiles of the distribution.
In our example, it seems easy to first think about the median of my prior. I think of a value \(p_{0.5}\) such that the proportion of text-messaging drivers is equally likely to be smaller or larger than that value. After some reflection, I decide that \(p_{0.5} = 0.10\). Next, I express my belief about spread by thinking about a second percentile, say the 90th. I specify a proportion value \(p_{0.9}\) such that it is unlikely (with 10% probability) that \(p\) will be larger than that value. This is a harder value to specify. After some thought, I decide on \(p_{0.9} = 0.25\). I then match these two percentiles with a beta curve. Using the beta.select() function, I find that the beta curve with \(a = 1.41\) and \(b = 10.15\) matches my beliefs.
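In R, this matching looks like the following; beta.select() from the LearnBayes package takes each assessed percentile as a list with components p (the probability) and x (the value):

```r
library(LearnBayes)
# match the two assessed percentiles to a beta(a, b) curve
quantile1 <- list(p = 0.5, x = 0.10)   # assessed median
quantile2 <- list(p = 0.9, x = 0.25)   # assessed 90th percentile
beta.select(quantile1, quantile2)
# [1]  1.41 10.15
```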
Is the beta(1.41, 10.15) density a good representation of my prior beliefs? To check, one can assess other percentiles of the prior and see if they match up with the beta density. In our example, suppose that I also believe that the 10th percentile of my prior is \(p_{0.1} = 0.05\). Since the 10th percentile of the beta(1.41, 10.15) prior is 0.024, there is some incompatibility of my prior information with the fitted beta curve. Through several of these checks, I can adjust the values of the beta shape parameters so that the curve is a better representation of my beliefs.
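One convenient way to run such checks in R is to read percentiles of the fitted prior off qbeta():

```r
# percentiles of the fitted beta(1.41, 10.15) prior
qbeta(c(0.1, 0.5, 0.9), 1.41, 10.15)
# roughly 0.024, 0.10, 0.25 -- the median and 90th percentile match my
# assessments, but the 10th percentile sits well below my assessed 0.05
```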
6.1.2 Predictive Assessment
One difficulty in specifying a prior is that the parameter \(p\) is relatively abstract. In this example, \(p\) represents the proportion of all students at my university who text while driving. It may be easier to think about the proportion of a sample of students who text while driving.
Suppose you have a random sample of \(n = 20\) students. Assuming a beta(\(a, b\)) prior, the number \(y\) of students who text while driving has a beta-binomial predictive distribution of the form \[\begin{eqnarray*} f(y | a, b) &=& \int_0^1 {n \choose y} p^y (1-p)^{n-y} \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1} \, dp \\ &=& {n \choose y} \frac{B(a+y, b+n-y)}{B(a, b)}, \quad y = 0, \ldots, n. \end{eqnarray*}\]
Suppose our initial assessment for \(p\) resulted in a beta(1.41, 10.15) prior. Using the beta-binomial distribution, we can use this prior to predict the number of text-messaging students in the sample of \(n = 20\) students. By using the pbetap() function in the LearnBayes package, we find that \(P(y \le 4) = 0.827\) and \(P(y \le 5) = 0.934\). If those probabilities don’t reflect your beliefs about the plausibility of the events \(y \le 4\) and \(y \le 5\), then some adjustment needs to be made to the values of the beta shape parameters.
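As a sketch, these predictive probabilities can be computed directly from the beta-binomial formula above, or with LearnBayes’s pbetap(); the cumulative sums should reproduce the two probabilities just quoted:

```r
library(LearnBayes)
ab <- c(1.41, 10.15)   # beta prior parameters
n  <- 20               # size of the future sample
s  <- 0:n              # possible numbers of texting students
# beta-binomial predictive probabilities, directly from the formula ...
pred <- choose(n, s) * beta(ab[1] + s, ab[2] + n - s) / beta(ab[1], ab[2])
# ... or equivalently via LearnBayes
pred <- pbetap(ab, n, s)
sum(pred[s <= 4])   # P(y <= 4), about 0.827
sum(pred[s <= 5])   # P(y <= 5), about 0.934
```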
6.2 Noninformative Prior
6.2.1 Uniform prior
In the event that little or no information exists about a parameter, one can assign a vague or noninformative prior. When Thomas Bayes wrote his famous paper, he placed a uniform prior on an unknown proportion. This certainly seems like a reasonable choice, since it reflects the belief that any two intervals of \(p\) values of the same length have the same probability.
Suppose instead of the proportion \(p\), we focus on the parameter \(p^2\) that in our example represents the probability (conditional on \(p\)) that two consecutive sampled students text while driving. Since \(p^2\) is an unknown parameter on the unit interval, it certainly is reasonable to assign \(p^2\) a uniform prior. But if \(p\) has a uniform prior, it is straightforward to show using a transformation argument that \(\theta = p^2\) has the density \[ g(\theta) = \frac{1}{2 \sqrt{\theta}}, \, \, 0 < \theta < 1. \] So if the proportion has a uniform prior, then the proportion squared has a nonuniform prior that favors values of \(p^2\) near zero.
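A quick simulation sketch makes this concrete: squaring uniform draws piles the mass near zero, in agreement with the density \(g(\theta)\) above.

```r
# if p ~ uniform(0, 1), then theta = p^2 concentrates near zero
set.seed(1)
theta <- runif(100000)^2
hist(theta, breaks = 50, freq = FALSE, xlab = "theta", main = "")
curve(1 / (2 * sqrt(x)), from = 0.001, to = 1, add = TRUE, lwd = 2)
```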
A uniform prior is not transformation invariant. This means that the belief in uniformity of a parameter will change under a nonlinear transformation. Since it is unclear which parameter (in our example, \(p\) or \(p^2\)) should be uniform, this notion of uniformity will not lead to a unique choice of prior.
6.2.2 Improper prior
Sometimes improper priors, that is, prior densities that don’t integrate to one, are chosen as noninformative priors. These types of priors can seem appropriate in particular applications. However, they should be used with some caution, since the choice of an improper prior may lead to an improper posterior distribution.
Consider again the family of beta\((a, b)\) priors for the proportion \(p\). The parameters \(a\) and \(b\) can be viewed as the respective numbers of successes and failures in a preliminary experiment. The total amount of information in the experiment is measured by the “preliminary sample size” \(a + b\). If we have little information about the proportion, it is reasonable to let both \(a\) and \(b\) approach zero, resulting in the improper prior \[ g(p) \propto \frac{1}{p(1-p)}, \, \, 0 < p < 1. \] The corresponding posterior density, given \(y\) successes in \(n\) trials, is equal to \[ g(p | y) \propto p^{y-1} (1-p)^{n-y-1}. \] This will be a proper posterior density only if the number of successes is in the interval from 1 to \(n-1\). If one observes no successes (\(y = 0\)) or all successes (\(y = n\)), the posterior density will be improper.
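One can see the failure numerically: the normalizing constant of this posterior is the beta function \(B(y, n-y)\), which R reports as infinite at the endpoints.

```r
# normalizing constant B(y, n - y) of the posterior under the improper prior:
# finite (a proper posterior) only for y = 1, ..., n - 1
n <- 20
y <- c(0, 1, 10, 19, 20)
beta(y, n - y)   # Inf at y = 0 and y = 20
```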
In the binomial situation, it is best to avoid these awkward situations and use a prior where both \(a\) and \(b\) are positive. In the next section, we will derive one type of “optimal” prior which does not result in an improper posterior.
6.2.3 Jeffreys prior
One popular way of defining a noninformative prior was suggested by Harold Jeffreys. To define this prior, we review the concept of information. If a single observation \(y\) has sampling density \(f(y | \theta)\), then we define the Fisher information as \[ I(\theta) = - E\left[ \frac{\partial^2}{\partial \theta^2} \log f(y | \theta) \right], \] where the expectation is taken over the distribution of \(y\). As an example, suppose the binary observation \(y\) has the density \[ f(y | p) = p^y (1-p)^{1-y}, \quad y = 0, 1. \] An easy calculation shows that the information is given by \[ I(p) = \frac{1}{p(1-p)}. \] If we have independent observations \(y_1, \ldots, y_n\) from \(f(y|\theta)\) and \(I^*\) denotes the information in the sample, then it can be shown that \(I^*(\theta) = n I(\theta)\), where \(I\) is the information for a single observation. Applying this result, if we have a sample of Bernoulli(\(p\)) observations \(y_1, \ldots, y_n\), then the information based on this sample is \[ I^*(p) = \frac{n}{p(1-p)}. \]
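A small Monte Carlo sketch can verify the Bernoulli calculation: averaging the negative second derivative of the log density over simulated draws recovers \(1/(p(1-p))\).

```r
# check I(p) = -E[d^2/dp^2 log f(y|p)] for a Bernoulli observation:
# log f = y log(p) + (1 - y) log(1 - p), so the second derivative
# with respect to p is -y/p^2 - (1 - y)/(1 - p)^2
set.seed(1)
p <- 0.3
y <- rbinom(100000, 1, p)
mean(y / p^2 + (1 - y) / (1 - p)^2)   # Monte Carlo estimate of I(p)
1 / (p * (1 - p))                     # exact value, about 4.76
```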
Jeffreys suggests that a suitable noninformative prior is the square root of the information \[ g(\theta) = \sqrt{I(\theta)}. \] The reasoning for this prior is based on the fact that it is invariant under transformation. Suppose \(\theta\) is assigned this prior and one transforms \(\theta\) to a new parameter \(\eta = h(\theta)\). Since the information transforms as \(I(\eta) = I(\theta) \left( \frac{d\theta}{d\eta} \right)^2\) and the change-of-variables formula gives \(g_1(\eta) = g(\theta) \left| \frac{d\theta}{d\eta} \right|\), the prior on \(\eta\) has the same functional form \[ g_1(\eta) = \sqrt{I(\eta)}. \]
In the Bernoulli case, we have already shown that the information \(I(p) = 1/(p(1-p))\). So the Jeffreys prior is given by \[ g(p) = \sqrt{I(p)} = \frac{1}{\sqrt{p(1-p)}} = p^{1/2-1} (1-p)^{1/2-1}. \]
At this point, we have talked about three possible priors for a proportion that are all special or limiting cases of a beta density. The choice \(a = b = 1\) leads to the uniform prior used by Bayes, \(a = b = 1/2\) leads to the Jeffreys prior, and the limiting case where \(a\) and \(b\) approach zero results in the improper prior proportional to \((p(1-p))^{-1}\).
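As an illustrative sketch, the three priors can be drawn on one plot; the improper prior is shown unnormalized, since it has no normalizing constant.

```r
# uniform (Bayes), Jeffreys, and the (unnormalized) improper prior
curve(dbeta(x, 1, 1), from = 0.01, to = 0.99, ylim = c(0, 5),
      xlab = "p", ylab = "prior density", lty = 1)
curve(dbeta(x, 0.5, 0.5), add = TRUE, lty = 2)
curve(1 / (x * (1 - x)), add = TRUE, lty = 3)   # improper, unnormalized
legend("top", legend = c("uniform", "Jeffreys", "improper"), lty = 1:3)
```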