5  Single Parameter Inference

5.1 Introduction

In this chapter, we introduce Bayesian thinking for several one-parameter problems. We first consider normally distributed data and discuss learning about the normal mean given a known value of the variance and learning about the normal variance given a known mean. These situations are artificial, but we will learn particular forms of posterior distributions that will be used in the situation where both parameters of the normal population are unknown. We continue by illustrating Bayesian inference for a Poisson mean and an exponential threshold parameter. In all examples, the parameter is assigned a conjugate prior, and the posterior density has a convenient functional form and is easy to summarize. One way of generalizing the class of conjugate priors is by the use of mixtures, and we illustrate the use of a simple mixture in estimating a Poisson mean.

5.2 Learning about a Normal Mean with Known Variance

5.2.1 A single observation

A basic problem in statistics is to learn about the mean of a normal population. Suppose we observe a single observation \(y\) from a normal population with mean \(\theta\) and known variance \(\sigma^2\). Here the likelihood function is the sampling density of \(y\) viewed as a function of the parameter \(\theta\). \[ L(\theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y-\theta)^2}{2 \sigma^2}\right). \] As an example, suppose Joe takes an IQ test and his score is \(y\). Joe’s ``true IQ” is \(\theta\) – this would represent Joe’s average IQ test score if he were able to take the test an infinite number of times. We assume that his test score \(y\) is normally distributed with mean equal to his true IQ and standard deviation \(\sigma\). Here \(\sigma\) represents the measurement error of the IQ test and we know from published reports that the standard deviation \(\sigma = 10\).

Suppose one has some prior beliefs about the location of the mean and one represents these beliefs by a normal curve with mean \(\mu\) and standard deviation \(\tau\). That is, \[ g(\theta) = \frac{1}{\sqrt{2 \pi \tau^2}} \exp\left(-\frac{(\theta-\mu)^2}{2 \tau^2}\right). \] Here \(\mu\) represents the person’s best guess at the value of the normal mean, and \(\tau\) reflects the sureness of this guess. In our IQ example, suppose one believes that Joe has average intelligence, so one sets the prior mean \(\mu = 100\). Moreover, one is pretty confident (with probability 0.90) that \(\theta\) falls in the interval (75, 125). This information can be approximately matched with the standard deviation \(\tau = 15\).

After we observe the observation \(y\), we wish to find the posterior density of the mean \(\theta\). By Bayes’ rule, the posterior density is proportional to the product of the prior and the likelihood. \[ g(\theta | y) \propto g(\theta) L(\theta). \] To find the functional form of this posterior, we combine the terms in the exponent. \[\begin{eqnarray*} g(\theta | y) \propto \exp\left(-\frac{(\theta-\mu)^2}{2 \tau^2}\right) \times \exp\left(-\frac{(y-\theta)^2}{2 \sigma^2}\right) \\ = \exp\left(-\frac{(\theta-\mu)^2}{2 \tau^2}-\frac{(y-\theta)^2}{2 \sigma^2}\right). \nonumber \\ \end{eqnarray*}\] By completing the square, one can show that the exponent is equal, up to an additive constant not involving \(\theta\), to \[ -\frac{1}{2}\left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right) \left[\theta^2 - 2 \left(\frac{\mu/\tau^2 + y/\sigma^2}{1/\tau^2+1/\sigma^2}\right) \theta + \left(\frac{\mu/\tau^2 + y/\sigma^2}{1/\tau^2+1/\sigma^2}\right)^2 \right] \] \[ = -\frac{1}{2}\left(\frac{1}{\tau^2}+\frac{1}{\sigma^2}\right) \left(\theta - \frac{\mu/\tau^2 + y/\sigma^2}{1/\tau^2+1/\sigma^2} \right)^2 . \] We see then that, up to a proportionality constant, the posterior density has the normal functional form \[ g(\theta | y) \propto \exp\left(-\frac{1}{2 \tau_1^2} (\theta - \mu_1)^2 \right), \] where the posterior mean and posterior variance have the form \[ \mu_1 = \frac{\mu/\tau^2 + y/\sigma^2}{1/\tau^2+1/\sigma^2}, \,\, \tau_1^2 = \frac{1}{1/\tau^2+1/\sigma^2}. \]

One can see how the posterior combines the information in the prior and the data by using the notion of precision. The precision is defined to be the reciprocal of the variance. The prior precision, denoted \(P_0\), is the reciprocal of the prior variance \[ P_0 = 1/\tau^2. \] In a similar fashion, the data precision, denoted by \(P_D\), is the reciprocal of the data variance \[ P_D = 1/\sigma^2. \] The posterior precision, denoted by \(P_1 = 1/\tau_1^2\), is found simply by adding the prior precision and the data precision: \[ P_1 = P_0 + P_D. \] The posterior mean \(\mu_1\) is the weighted average of the prior mean \(\mu\) and the observation \(y\), where the weights are proportional to the precisions: \[ \mu_1 = \frac{P_0}{P_0 + P_D} \mu + \frac{P_D}{P_0 + P_D} y . \]

Returning to our example, suppose Joe’s score on the IQ test is \(y = 120\). What have we learned about his true IQ \(\theta\)? Table 5.1 illustrates the calculations of the posterior distribution.

  1. We first compute the precision \(P_0 = 1/\tau^2\) of the prior and the precision \(P_D = 1/\sigma^2\) of the data.
  2. The posterior precision \(P_1\) is the sum of the prior precision and the data precision. \[ P_1 = P_0 + P_D = 0.0044 + 0.0100 = 0.0144 . \]
  3. The variance of the posterior \(\tau_1^2\) is the reciprocal of the posterior precision.
  4. We compute the posterior mean \(\mu_1\), which is a weighted average of the prior mean and the observation: \[ \mu_1 = \frac{0.0044}{0.0144} \times 100 + \frac{0.0100}{0.0144} \times 120 = 113.8 . \] These calculations are sketched in R following the table.
            Estimate   Variance   Precision
Prior         100        225        0.0044
Data          120        100        0.0100
Posterior     113.8       69.23     0.0144
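
To make the update concrete, here is a minimal R sketch of the precision calculations summarized in Table 5.1 (the variable names are ours, chosen for this illustration):

  # Prior: theta ~ normal(mu = 100, tau^2 = 225); data: y = 120, sigma = 10
  mu <- 100; tau <- 15             # prior mean and standard deviation
  y <- 120; sigma <- 10            # observed score and known measurement SD
  P0 <- 1 / tau^2                  # prior precision, about 0.0044
  PD <- 1 / sigma^2                # data precision, 0.0100
  P1 <- P0 + PD                    # posterior precision, about 0.0144
  mu1 <- (P0 * mu + PD * y) / P1   # posterior mean, about 113.8
  tau1sq <- 1 / P1                 # posterior variance, about 69.23
  c(mu1, tau1sq)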

Before sampling, one believed that Joe was of average intelligence and the prior mean was \(\mu = 100\). After the IQ test score \(y = 120\) is observed, one’s opinion about Joe’s intelligence has changed and the posterior mean is now \(\mu_1 = 113.8\). The posterior mean, as expected, falls between the prior mean and the observed test score. The posterior mean is closer to the test score than to the prior mean, because there is more information (measured by the precision) in the data value than in the prior. Also note that the posterior variance is smaller than both the prior variance and the data variance. Since the posterior precision is the sum of the prior precision and the data precision, one gains information by combining the information from the two sources, and this results in a smaller posterior variance.

Although we have focused on learning about Joe’s true IQ \(\theta\), one may be interested in predicting Joe’s test score on a future test. Denote a future test score by \(\tilde y\). We are interested in obtaining the predictive density of \(\tilde y\) denoted by \(f(\tilde y)\). To obtain this density, we apply some results from normal sampling theory. By adding and subtracting \(\theta\) we write the random variable \(\tilde y\) as \[ \tilde y = W + \theta, \] where \(W = \tilde y - \theta\). Suppose the current beliefs about \(\theta\) are represented by a normal density with mean \(\mu\) and variance \(\tau^2\). For a given value of \(\theta\), the future observation \(\tilde y\) is normal with mean \(\theta\) and variance \(\sigma^2\). Equivalently, the distribution of \(W = \tilde y - \theta\) is normal with mean 0 and variance \(\sigma^2\). Since this distribution of W doesn’t depend on \(\theta\), this also represents the unconditional distribution of W. It can be shown that the random variables \(W\) and \(\theta\) are independent. Since \(\tilde y\) is the sum of two independent normal variates, it follows that the distribution for \(\tilde y\) will also be normal with mean \(\mu\) and variance \(\sigma^2 + \tau^2\).

We have already observed Joe’s test score of 120, and the posterior distribution of his true IQ \(\theta\) is normal with mean 113.8 and variance 69.23. We wish to construct a 90% prediction interval for the score of a future test \(\tilde y\). Applying the above result, the future test score will be normal with mean \(\mu_1 = 113.8\) and variance \(\sigma^2 + \tau_1^2 = 100 + 69.23 = 169.23\). So the interval \[ (113.8 - 1.645 \sqrt{169.23}, \, 113.8 + 1.645 \sqrt{169.23}) = (92.4, 135.2) \] will cover 90% of the predictive distribution of the future test score.
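
Continuing in R, the prediction interval is a routine normal-quantile calculation; this sketch simply reuses the posterior quantities computed above:

  # Predictive distribution of a future score: normal with mean mu1
  # and variance sigma^2 + tau1^2
  pred_mean <- 113.8
  pred_sd <- sqrt(10^2 + 69.23)                     # sqrt(169.23)
  pred_mean + c(-1, 1) * qnorm(0.95) * pred_sd      # approximately (92.4, 135.2)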

5.2.2 Multiple observations

In our discussion, we observed only a single observation. Suppose now that we observe a random sample \(y_1, ..., y_n\) from a normal distribution with mean \(\theta\) and variance \(\sigma^2\). In this case, the likelihood function for \(\theta\) is proportional to the product of the sampling densities \[ L(\theta) \propto \prod_{i=1}^n \exp\left(-\frac{(y_i-\theta)^2}{2 \sigma^2}\right). \] Suppose we subtract and add the sample mean \(\bar y = \frac{1}{n}\sum_{i=1}^n y_i\) to the term in parentheses in the exponent: \[ L(\theta) \propto \prod_{i=1}^n \exp\left(-\frac{(y_i - \bar y + \bar y -\theta)^2}{2 \sigma^2}\right). \] If we expand the quadratic term and simplify, ignoring multiplicative constants not depending on \(\theta\), one can show that \[ L(\theta) \propto \exp\left(-\frac{n (\bar y - \theta)^2}{2 \sigma^2}\right). \] What we have shown is that \(\bar y\) is a sufficient statistic for \(\theta\). It also shows that the multiple observation situation is really equivalent to observing the single data value \(\bar y\), which is normally distributed with mean \(\theta\) and variance \(\sigma^2/n\).

Suppose a normal(\(\mu, \tau^2\)) prior is chosen for \(\theta\). Then, by applying the results of the previous section, the posterior distribution will be normal(\(\mu_1, \tau_1^2)\) where the mean and variance are given by \[ \mu_1 = \frac{\mu/\tau^2 + \bar y n/\sigma^2}{1/\tau^2+n/\sigma^2}, \,\, \tau_1^2 = \frac{1}{1/\tau^2+n/\sigma^2}. \]

5.2.3 Inference with a noninformative prior

Suppose that one has little prior information about the location of the normal mean. This means one believes that every possible IQ value (within reasonable limits) is equally likely. This lack of knowledge can be represented by a uniform curve \[ g(\theta) = c, \, \, -\infty < \theta < \infty . \] One might object to this choice of prior since \(g\) is not a proper probability density. However, this choice results in a posterior density for \(\theta\) that is indeed proper. In the case where a random sample of size \(n\) is taken, the posterior density will be given by \[\begin{eqnarray*} g(\theta | y) \propto L(\theta) g(\theta) \nonumber \\ \propto \exp\left(-\frac{n (\bar y - \theta)^2}{2 \sigma^2}\right). \end{eqnarray*}\] We recognize this posterior as a normal density with mean \(\bar y\) and variance \(\sigma^2/n\).

If we use this uniform prior, then Bayesian inference will mimic the usual frequentist inference about a mean with known variance. For example, suppose we want to construct a 95% Bayesian interval estimate. The ``equal-tails” interval \[ \left(\bar y - 1.96 \frac{\sigma}{\sqrt{n}}, \, \bar y + 1.96 \frac{\sigma}{\sqrt{n}}\right) \] covers the central 95% of the posterior distribution of \(\theta\). This interval also has a frequentist 95% probability of coverage in repeated sampling.

5.2.4 Inference with large samples

Suppose one’s prior beliefs about a mean are modeled by a normal distribution and one observes a very large sample. What is the impact of this large sample size on the posterior distribution? As an example, suppose I’m interested in learning about the mean height \(\theta\) of undergraduate female students at my university. Based on my knowledge, my ``best guess” at \(\theta\) is 65.5 inches and I am 90% confident that the mean is within 2 inches of my best guess. I can represent this information by a normal curve with mean \(\mu = 65.5\) and standard deviation \(\tau = 1.2\). A large sample of \(n = 427\) female students is asked about their heights and we observe a sample mean of \(\bar y = 64.71\) inches. What have we learned about the population mean height \(\theta\)? For this example, we will assume that we know the standard deviation of the population of female heights is \(\sigma = 3\) inches.

Table 5.2 shows the posterior calculations for this example. The first row contains the mean, standard deviation, variance, and precision for our prior, and the second row contains the same quantities for the data. Here the standard error of the mean is \(\sigma/\sqrt{n} = 3/\sqrt{427} = 0.145\), the data variance is \(\sigma^2/n = 0.021\), and the data precision is \(n/\sigma^2 = 47.56\). The posterior mean is given by \[ E(\theta | y) = \frac{0.69}{0.69 + 47.56} \times 65.5 + \frac{47.56}{0.69 + 47.56} \times 64.71 = 64.72. \]

            Estimate   Standard Deviation   Variance   Precision
Prior        65.5            1.2              1.44       0.69
Data         64.71           0.145            0.021     47.56
Posterior    64.72           0.144            0.0207    48.25
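
A short R sketch of this calculation, using the n-observation formulas of Section 5.2.2 (small differences from Table 5.2 are due to rounding in the table):

  # Prior: theta ~ normal(65.5, 1.2^2); data: n = 427, ybar = 64.71, sigma = 3
  mu <- 65.5; tau <- 1.2
  n <- 427; ybar <- 64.71; sigma <- 3
  P0 <- 1 / tau^2                            # prior precision
  PD <- n / sigma^2                          # data precision
  mu1 <- (P0 * mu + PD * ybar) / (P0 + PD)   # posterior mean, about 64.72
  tau1 <- sqrt(1 / (P0 + PD))                # posterior SD, about 0.144
  c(mu1, tau1)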

Recall that the precision is a measure of the amount of information, and it is clear in this example that there is much more information about \(\theta\) contained in the sample of \(427\) students than in the prior. As a result, the likelihood function is much more precise than the prior and the posterior is essentially controlled by the data. In other words, since the data is so precise relative to the prior, the prior effectively acts like a flat noninformative prior. The posterior mean and posterior standard deviation are approximately equal to the sample mean and the classical standard error, respectively.

In this particular example, there was some conflict between the information in the prior and the data. My prior beliefs about the average female height were approximately one inch too high. But since a large sample was collected, my ``wrong” prior information has little impact on the posterior inference. In other words, the posterior inference in this case is robust, or insensitive, to the choice of prior distribution.

5.3 Learning About a Normal Variance with Known Mean

5.3.1 Inference using an informative prior

Suppose we observe \(y_1, ..., y_n\) from a normal population with known mean \(\theta\) and unknown variance \(\sigma^2\). The likelihood function for \(\sigma^2\) is given by \[\begin{eqnarray*} L(\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{1}{2 \sigma^2}(y_i - \theta)^2\right) \nonumber \\ \propto \frac{1}{(\sigma^2)^{n/2}} \exp\left(-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \theta)^2\right). \nonumber \\ \end{eqnarray*}\]

A conjugate prior for a variance is the inverse gamma distribution. Suppose \(X\) has a gamma distribution with shape \(\alpha\) and rate \(\beta\) with density \[ f(x) = \frac{ x^{\alpha-1} \exp(-\beta x) \beta^\alpha}{\Gamma(\alpha)}, \, \, x > 0. \] If we let \(W = 1/X\), then \(W\) has an inverse gamma distribution with parameters \(\alpha\) and \(\beta\) with density \[ f(w) = \frac{\exp(-\beta/w) \beta^\alpha}{w^{\alpha+1} \Gamma(\alpha)}, \, \, w > 0. \]

Now we assume that one’s prior beliefs about \(\sigma^2\) can be represented by an inverse gamma(\(\alpha, \beta)\) density \[ g(\sigma^2) = \frac{\exp(-\beta/\sigma^2) \beta^\alpha}{(\sigma^2)^{\alpha+1} \Gamma(\alpha)}. \] Then by combining prior and likelihood, the posterior of \(\sigma^2\) is equal to \[\begin{eqnarray*} g(\sigma^2 | y) \propto L(\sigma^2) \times g(\sigma^2) \\ \propto \frac{1}{(\sigma^2)^{n/2}} \exp\left(-\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \theta)^2\right) \times \frac{\exp(-\beta/\sigma^2) }{(\sigma^2)^{\alpha+1} } \nonumber \\ = \frac{1}{(\sigma^2)^{\alpha+n/2+1}} \exp\left(-\frac{1}{ \sigma^2}\left[\beta+ \frac{1}{2}\sum_{i=1}^n (y_i - \theta)^2\right]\right),\nonumber \\ \end{eqnarray*}\] which we recognize as an inverse gamma density with updated parameters \(\alpha_1 = \alpha + n/2\) and \(\beta_1 = \beta+ \frac{1}{2} \sum_{i=1}^n (y_i - \theta)^2\).

It is common to represent the inverse gamma prior and posterior densities in terms of the ``scale times inverse chi-square” density. A chi-square random variable with \(\nu\) degrees of freedom has density \[ f(w) \propto w^{\nu/2-1} \exp(-w/2), \, w > 0. \] The random variable \(V = c / W\) has the density \[ f(v) \propto \frac{1}{v^{\nu/2+1}} \exp\left(-\frac{c}{2 v}\right), \, v > 0. \] Since the density of \(V\) is derived as a constant divided by a chi-squared variable, we write \[ V \sim c \chi^{-2}_\nu . \] Using this notation, our inverse gamma(\(\alpha, \beta\)) prior can be represented as \(2 \beta \chi^{-2}_{2 \alpha}\) and likewise the posterior is represented as \(2 \beta_1 \chi^{-2}_{2 \alpha_1}\).
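
As a quick numerical check of this correspondence, the following R sketch compares the inverse gamma(\(\alpha, \beta\)) density with the density of \(2 \beta \chi^{-2}_{2 \alpha}\) obtained by a change of variables; the values of \(\alpha\) and \(\beta\) are arbitrary choices for the illustration:

  alpha <- 5; beta <- 12
  w <- seq(0.5, 10, by = 0.5)
  # inverse gamma(alpha, beta) density, written out directly
  d_invgamma <- exp(-beta / w) * beta^alpha / (w^(alpha + 1) * gamma(alpha))
  # density of V = 2*beta / W with W ~ chi-square(2*alpha), by change of variables
  d_scaledinv <- dchisq(2 * beta / w, df = 2 * alpha) * 2 * beta / w^2
  all.equal(d_invgamma, d_scaledinv)   # TRUE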

5.3.2 Inference using a noninformative prior

In the case where one has little prior information about a variance, the standard noninformative prior is given by the improper density \[ g(\sigma^2) = \frac{1}{\sigma^2}, \, \, \sigma^2 > 0. \] This can be viewed as a limiting case of the inverse gamma density as the shape parameter \(\alpha\) and the rate parameter \(\beta\) both approach zero. If this prior is combined with the likelihood, one obtains the posterior density \[\begin{eqnarray*} g(\sigma^2 | y) \propto \frac{1}{(\sigma^2)^{n/2+1}} \exp\left(-\frac{1}{ \sigma^2}\left[\frac{1}{2}\sum_{i=1}^n (y_i - \theta)^2\right]\right),\nonumber \\ \end{eqnarray*}\] which is inverse gamma with parameters \(\alpha_1 = n/2\) and \(\beta_1 = \frac{1}{2} \sum_{i=1}^n (y_i - \theta)^2\). Relating the posterior to the inverse chi-squared distribution, we can say that \[ \sigma^2 \sim \left(\sum_{i=1}^n (y_i - \theta)^2\right) \chi^{-2}_{n}. \]

5.3.3 An example

To illustrate learning about a variance, consider the following winning percentages for all Major League Baseball teams in the 1964 season. These can be considered a random sample of percentages from a normal distribution with known mean \(\theta = 50\) and unknown variance \(\sigma^2\). The value of the variance is a measure of the level of competition in the baseball league, where small values of \(\sigma^2\) correspond to teams that are relatively equal in ability.

57.4 56.8 56.8 55.6 54.3 49.4 49.4 46.9 40.7 32.7 
61.1 60.5 59.9 52.5 50.6 48.8 48.8 44.4 38.3 35.2

Suppose one has some prior knowledge about the value of \(\sigma^2\). The sample variance of the team winning percentages during the previous season is equal to 69.2, and one is pretty confident (with probability 0.90) that the variance \(\sigma^2\) is within 20% of 69.2. With some trial and error, one finds that the inverse gamma density with \(\alpha = 70\) and \(\beta = 4500\) approximately matches this prior information.

From the observed data, we compute \[ n = 20, \, \, \sum_{i=1}^n (y_i - 50)^2 = 1321.45. \] So the posterior density will be inverse gamma(\(\alpha_1, \beta_1)\) with \[ \alpha_1 = 70 + 20/2 = 80, \, \, \beta_1 = 4500 + 1321.45/2 = 5160.725 . \]

One can summarize the posterior by computing suitable percentiles. Statistics packages like R typically have functions for computing percentiles of a gamma distribution, and these functions can be used to find percentiles of the inverse gamma distribution. For example, suppose one wishes to find the \(p\)th percentile \(x_p\) of an inverse gamma(\(\alpha, \beta\)) density, which satisfies the relationship \[ P( X < x_p) = p, \] where \(X\) is inverse gamma(\(\alpha, \beta\)). One can rewrite this statement as \[ P(1/X > 1/x_p) = p, \] where \(Y = 1/X\) has a gamma(\(\alpha, \beta\)) density. Using this relationship, it is straightforward to show that the inverse gamma percentile satisfies \[ x_p = \frac{1}{y_{1-p}}, \] where \(y_{1-p}\) is the \((1-p)\) percentile of a gamma(\(\alpha, \beta\)) density.

The 5th, 50th, and 95th percentiles of the inverse gamma posterior distribution are displayed in Table 5.3. The median (50th percentile) of 64.78 is a reasonable point estimate of \(\sigma^2\), and the 5th and 95th percentiles, 54.17 and 78.34, form an ``equal-tails” interval estimate. In the case where prior information is not available for the variance, one can place a noninformative prior with \(\alpha = 0, \beta = 0\) on \(\sigma^2\), and the table also shows the posterior percentiles for this choice of prior. In this situation, the informative prior information has a substantial impact on the posterior inference, and the posterior interval estimate with the informative prior is significantly shorter than the interval estimate with the vague prior.

Prior                      5th Percentile   Median   95th Percentile
Inverse Gamma(70, 4500)        54.17         64.78        78.34
Noninformative                 42.07         68.34       121.78
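
The percentiles in Table 5.3 can be reproduced in R using the relationship \(x_p = 1/y_{1-p}\) described above; the helper function qinvgamma below is ours:

  # Percentiles of an inverse gamma(alpha, beta) density via gamma quantiles
  qinvgamma <- function(p, alpha, beta) 1 / qgamma(1 - p, shape = alpha, rate = beta)
  p <- c(0.05, 0.50, 0.95)
  qinvgamma(p, alpha = 80, beta = 5160.725)   # informative prior: about 54.2, 64.8, 78.3
  qinvgamma(p, alpha = 10, beta = 660.725)    # noninformative prior: about 42.1, 68.3, 121.8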

5.4 Learning About a Poisson Mean

5.4.1 Introduction

The author has a website for one of his books, and Google Analytics records the number of visits to this particular website every day.
The author records the following counts of weekday visits for a 21-day period during May and June of 2009.

 20 30 22 20 20 17 21 26 22 30 36
 15 30 27 22 23 18 24 28 23 12

A common model for count data such as these is the Poisson. Suppose the counts \(y_1, ..., y_n\) represent a random sample from a Poisson distribution with parameter \(\lambda\) with probability mass function \[ p(y | \lambda) = \frac{\exp(-\lambda) \lambda^y}{y!}, \, \, y = 0, 1, 2, ... \] One objective is to learn about the mean parameter \(\lambda\). In the example, \(\lambda\) would represent the average number of hits to this website over a long period of days. We are also interested in predicting the number of website visits on future days.

Suppose that the author has been observing the counts of daily visits to this website, and so he has some beliefs about the location of \(\lambda\) before sampling. In particular, he believes that the quartiles of \(\lambda\) are 21.3 and 26.4. Equivalently, one believes \[ P(\lambda < 21.3) = 0.25, \, \, P(\lambda < 26.4) = 0.75. \] Any prior density \(g(\lambda)\) that matches this prior information would be suitable for use in this example. But we’ll shortly see that it is convenient to choose a gamma density. After some trial and error, the author finds that the gamma prior density \[ g(\lambda) = \frac{\exp(-\beta \lambda) \lambda^{\alpha-1} \beta^\alpha}{\Gamma(\alpha)}, \, \, \lambda > 0, \] where the shape parameter \(\alpha = 40\) and the rate parameter \(\beta = 1.67\), approximately matches these prior quartiles.
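
One can verify the approximate match with the qgamma function in R:

  # Quartiles of the proposed gamma(shape = 40, rate = 1.67) prior for lambda
  qgamma(c(0.25, 0.75), shape = 40, rate = 1.67)   # roughly 21.3 and 26.4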

5.4.2 Learning about the Mean

After counts have been observed, one’s opinion about \(\lambda\) is based on the posterior distribution. First, one finds the likelihood \(L(\lambda)\). By the independence assumption, the joint mass function of the counts \(y_1, ..., y_n\) is given by the product \[ p(y_1, ..., y_n | \lambda) = \prod_{i=1}^n \frac{\exp(-\lambda) \lambda^{y_i}}{y_i!}. \] Once the counts are observed, the likelihood function is this joint mass function, viewed as a function of \(\lambda\). \[\begin{eqnarray*} L(\lambda) = \prod_{i=1}^n \frac{\exp(-\lambda) \lambda^{y_i}}{y_i!} \nonumber \\ = C \exp(-n \lambda) \lambda ^s, \nonumber \\ \end{eqnarray*}\] where \(n\) is the sample size, \(s = \sum_{i=1}^n y_i\) is the sum of the observations, and \(C\) is a constant not depending on the parameter \(\lambda\).

The posterior of \(\lambda\) is found by multiplying the prior and the likelihood: \[\begin{eqnarray*} g(\lambda| y) \propto g(\lambda) \times L(\lambda) \nonumber \\ = \exp(-\beta \lambda) \lambda^{\alpha-1} \times \exp(-n \lambda) \lambda ^s \nonumber \\ = \exp(-(\beta + n) \lambda) \lambda^{\alpha + s -1}. \end{eqnarray*}\] One sees that the posterior density of \(\lambda\) is also of the gamma functional form with updated parameters \[ \alpha_1 = \alpha + s, \, \, \beta_1 = \beta + n . \] For our example, a Gamma(\(\alpha, \beta\)) prior was assigned with \(\alpha = 40\) and \(\beta = 1.67\). From the observed data, there are \(n = 21\) observations and the sum of the counts is \(s = \sum_{i=1}^n y_i = 486\). So the posterior density for \(\lambda\) will be gamma(\(\alpha_1, \beta_1)\) where \[ \alpha_1 = 40 + 486 = 526, \, \, \beta_1 = \beta + n = 1.67 + 21 = 22.67. \]

One can estimate \(\lambda\) by the posterior median, which is 23.19. A 90% interval estimate for \(\lambda\) is formed from the 0.05 and 0.95 quantiles of the gamma(526, 22.67) density, given by 21.56 and 24.89. So the mean number of website visits falls in the interval (21.56, 24.89) with probability 0.90.
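
These summaries are gamma quantiles, computed in R as follows:

  # Median and 90% interval estimate from the gamma(526, 22.67) posterior
  qgamma(c(0.05, 0.50, 0.95), shape = 526, rate = 22.67)   # about 21.56, 23.19, 24.89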

5.4.3 Prediction

Next, suppose we are interested in predicting the number of website visits \(\tilde y\) on a future weekday. If our current beliefs about the mean \(\lambda\) are given by a Gamma(\(\alpha, \beta\)) distribution, then the predictive density of \(\tilde y\) is given by \[\begin{eqnarray*} f(\tilde y) = \int_0^\infty f(\tilde y | \lambda) g(\lambda) d\lambda\nonumber \\ = \int_0^\infty \frac{\lambda^{\tilde y} \exp(-\lambda)}{\tilde y!} \times \frac{\lambda^{\alpha-1} \exp(-\beta \lambda) \beta^\alpha}{\Gamma(\alpha)} d\lambda. \nonumber \\ \end{eqnarray*}\] One can analytically integrate out \(\lambda\), obtaining the predictive density \[ f(\tilde y) = \frac{\beta^\alpha \Gamma(\alpha + \tilde y)}{(\beta + 1)^{\alpha + \tilde y} \Gamma(\alpha) \tilde y!}, \, \tilde y = 0, 1, 2, ... \]

In our example, after observing the data, the current beliefs about \(\lambda\) are contained in the Gamma(526, 22.67) posterior density. The corresponding predictive density is the distribution of the number of website visits for a future weekday. Using R, we compute values of \(f(\tilde y)\) for values of \(\tilde y\) from 0 to 200. We find that the most likely value of \(\tilde y\), the value with the largest predictive probability, is equal to 23. But there is much uncertainty about this prediction since (1) we are uncertain about the mean number of hits \(\lambda\) and (2) there is Poisson variability in the number of hits \(\tilde y\) given a value of \(\lambda\).
Using R, we find that \[ P( 15 \le \tilde y \le 31) = 0.917, \] so (15, 31) is a 91.7% prediction interval for this future count.
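
One convenient way to carry out this computation in R (not necessarily the route used originally) is to note that the gamma-Poisson predictive density displayed above is a negative binomial density with size \(\alpha\) and success probability \(\beta/(\beta + 1)\), so the standard dnbinom function applies. A sketch:

  # Predictive distribution of a future daily count under the gamma(526, 22.67) posterior
  alpha1 <- 526; beta1 <- 22.67
  ytilde <- 0:200
  pred <- dnbinom(ytilde, size = alpha1, prob = beta1 / (beta1 + 1))
  ytilde[which.max(pred)]                   # most likely future count: 23
  sum(pred[ytilde >= 15 & ytilde <= 31])    # P(15 <= ytilde <= 31), about 0.917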

The predictive density is useful for making predictions about future data. It is also helpful in judging the suitability of our Poisson/gamma model. The basic idea is that our model is reasonable if the actual observed data is consistent with predictions made from the model. We just saw that approximately 90% of the predictions fall between 15 and 31 visits. Looking at our data, we see that only 2 of the 21 observations (namely 12 and 36) fall outside of the interval (15, 31). Since the observed data appears consistent with the posterior predictive distribution, the model with Poisson sampling and a gamma(40, 1.67) prior seems to be a reasonable description of the observed counts.

5.5 Learning About an Exponential Threshold Parameter

In a reliability application, suppose that one observes the times to failure for a sample of refrigerators. We assume that the failure times \(y_1, ..., y_n\) represent a random sample from an exponential distribution with threshold \(\theta\) and known rate \(\beta\). The sampling density is given by \[ f(y | \theta) = \beta \exp(-\beta (y - \theta)), \, \, y \ge \theta. \] In this application, the parameter \(\theta\) can represent the length of a warranty period during which no failures can occur.
To simplify the problem, we are assuming that one does not know the warranty period but knows the parameter \(\beta\) that describes the pattern of failures after the warranty period.

First, suppose that little information exists about the value of the positive parameter \(\theta\), and so we assign this parameter a uniform prior on positive values: \[ g(\theta) = 1, \, \, \theta > 0. \] The likelihood function is given by \[\begin{eqnarray*} L(\theta) = \prod_{i=1}^n f(y_i | \theta) \nonumber \\ = \prod_{i=1}^n \left( \beta \exp(-\beta (y_i - \theta)) I(y_i \ge \theta) \right) \nonumber \\ \propto \exp(n \beta \theta) I(\theta \le \min y_i). \nonumber \\ \end{eqnarray*}\] Combining the likelihood and prior, we see that the posterior density is given, up to a proportionality constant, by \[\begin{eqnarray*} g(\theta | y) \propto L(\theta) g(\theta) \nonumber \\ \propto \exp(n \beta \theta) I(0 < \theta \le \min y_i). \nonumber \\ \end{eqnarray*}\] We see that the posterior density is proportional to the increasing exponential curve \(\exp(n \beta \theta)\) restricted to the interval \(0 < \theta \le \min y_i\).

As an example, suppose that one observes the following failure times (in months) for 15 refrigerators:

 51.8  50.8  20.6  42.4  20.5  18.7  27.9  18.7  
 42.9 123.7  62.7  22.8  89.6  21.4  49.4

From past experience, one knows that \(\beta = 1/24\); this reflects the belief that the average failure time for a refrigerator is 24 months after the warranty time. If one assigns a uniform prior, then the posterior density for the warranty time \(\theta\) is given by \[\begin{eqnarray*} g(\theta | y) \propto \exp(n \beta \theta) I(0 < \theta \le \min y_i) \nonumber \\ \propto \exp\left(\frac{15}{24} \theta\right) I(0 < \theta \le 18.7). \nonumber \\ \end{eqnarray*}\] When one normalizes the density, one finds that the posterior density is equal to \[ g(\theta | y) = \frac{(15/24) \exp\left((15/24)\, \theta\right)}{\exp\left((15/24) \times 18.7\right) - 1}, \, \, 0 < \theta < 18.7. \]

If one graphs this density, one sees that the posterior density for \(\theta\) is strongly left skewed with a mode at 18.7 months.
An HPD interval for the threshold clearly will have the form \((\theta_0, 18.7)\), where \(\theta_0\) is chosen so that the interval has probability content \(\gamma\). If one desires a 90% interval estimate where \(\gamma = 0.90\), then a straightforward calculation shows that the HPD interval for \(\theta\) is given by (15.0, 18.7).
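
A brief R sketch of this calculation; the lower endpoint solves \(P(\theta > \theta_0) = 0.90\) using the posterior cumulative distribution function obtained by integrating \(\exp(n \beta \theta)\) over \((0, 18.7)\):

  # Posterior for theta is proportional to exp(n*beta*theta) on (0, min(y))
  y <- c(51.8, 50.8, 20.6, 42.4, 20.5, 18.7, 27.9, 18.7,
         42.9, 123.7, 62.7, 22.8, 89.6, 21.4, 49.4)
  n <- length(y); beta <- 1 / 24; m <- min(y)     # m = 18.7, n * beta = 0.625
  rate <- n * beta
  # lower endpoint theta_0 of the 90% HPD interval (theta_0, m)
  theta0 <- log(0.10 * (exp(rate * m) - 1) + 1) / rate
  c(theta0, m)                                    # approximately (15.0, 18.7)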

Exercise: Assume \(\theta\) has the prior \[ g(\theta) \propto \exp(a \theta), \, \, \theta < b . \]

  1. Find the posterior density for \(\theta\).

  2. Show that the posterior has the same functional form as the prior with updated parameters \[ a_1 = n \beta + a, \, \, b_1 = \min(\min_i y_i, b). \]

  3. Suppose one believes that the threshold \(\theta\) is definitely smaller than 24 months and is 90% sure that the threshold is larger than 15 months. Find values of the parameters \(a\) and \(b\) that match these prior beliefs.

  4. Using the prior constructed in part 3 and the refrigerator failure data, find a 90% interval estimate for the threshold parameter \(\theta\). Compare this interval estimate with the interval estimate using a noninformative prior.

5.6 Using Mixtures of Conjugate Priors

One way of extending the class of conjugate priors is by the use of discrete mixtures. We illustrate the use of mixtures for the Poisson scenario, although this approach can be applied to any sampling model where there is a conjugate prior.

Suppose we assign the Poisson parameter \(\lambda\) the discrete mixture prior \[ g(\lambda) = p g_1(\lambda) + (1 - p) g_2(\lambda) \] where \(g_1\) is gamma(\(\alpha_1, \beta_1\)), \(g_2\) is gamma(\(\alpha_2, \beta_2)\), and \(p\) is a known mixing probability between 0 and 1.

Suppose we observe (\(y, t\)), where \(y\) is the count in the time interval \(t\). We assume that \(y\) is Poisson(\(t \lambda\)) with density \[ f(y | \lambda) = \frac{(t \lambda)^y \exp(-t \lambda)}{y!}, \, \, y = 0, 1, 2, ... \]

In the following computation, we include all of the terms to motivate a special expression for the posterior density. The posterior density for \(\lambda\) is given by \[\begin{eqnarray*} g(\lambda| y) = \frac{g(\lambda) \times f(y | \lambda)}{f(y)} \nonumber \\ = \frac{\left(p g_1(\lambda) + (1-p) g_2(\lambda)\right)\times f(y | \lambda)}{f(y)} \nonumber \\ = \frac{p g_1(\lambda)f(y | \lambda) + (1-p) g_2(\lambda)f(y | \lambda)}{f(y)}. \nonumber \\ \end{eqnarray*}\] The predictive density \(f(y)\) can be written as \[\begin{eqnarray*} f(y) = \int_0^\infty g(\lambda) f(y|\lambda) d\lambda \nonumber \\ = \int_0^\infty \left( p g_1(\lambda)f(y | \lambda) + (1-p) g_2(\lambda)f(y | \lambda)\right) d\lambda \nonumber \\ = p f_1(y) + (1-p) f_2(y) \nonumber ,\\ \end{eqnarray*}\] where \(f_1\) and \(f_2\) are the predictive densities for \(y\) when \(\lambda\) is assigned the priors \(g_1\) and \(g_2\), respectively. If we substitute this expression into the posterior density expression and rearrange terms, one can see that the posterior density has the representation \[ g(\lambda | y) = p(y) g_1(\lambda | y) + (1- p(y)) g_2 (\lambda | y), \] where \(g_1(\lambda | y)\) and \(g_2(\lambda | y)\) are the posterior densities assuming the priors \(g_1\) and \(g_2\), and \[ p(y) = \frac{ p f_1(y)}{p f_1(y) + (1-p) f_2(y)}. \] When \(g_1\) and \(g_2\) are gamma (conjugate) priors, the posterior density will also be a mixture of gamma distributions, where the mixing weights \(p(y)\) and \(1 - p(y)\) depend on the mixing probability \(p\) and the predictive densities \(f_1(y)\) and \(f_2(y)\) under the two priors.

To illustrate the use of mixtures, suppose I am interested in learning about the home run rate \(\lambda\) for the baseball player Derek Jeter before the start of the 2004 season. (A home run rate is the proportion of official at-bats that are home runs.) Suppose my prior beliefs are that the median of \(\lambda\) is equal to 0.05 and the 90th percentile is equal to 0.081. Here are two priors that match this information. The first is a conjugate Gamma prior and the second is a mixture of two conjugate Gamma priors with mixing probabilities 0.88 and 0.12.

  • Prior 1: Gamma(shape = 6, rate = 113.5)
  • Prior 2: 0.88 x Gamma(shape = 10, rate = 193.5) + 0.12 x Gamma(shape = 0.2, rate = 0.415)

One can check that these two priors match the prior beliefs. If one graphs the two priors, they are similar in location and spread but the mixture prior (Prior 2) has flatter tails.
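
This check is easy to do in R; the mixture's cumulative distribution function is just the weighted sum of the two gamma cdfs (the function name mixcdf is ours):

  # Check that both priors match the stated median (0.05) and 90th percentile (0.081)
  qgamma(c(0.5, 0.9), shape = 6, rate = 113.5)          # Prior 1: about 0.050 and 0.081
  mixcdf <- function(x) 0.88 * pgamma(x, 10, 193.5) + 0.12 * pgamma(x, 0.2, 0.415)
  mixcdf(c(0.05, 0.081))                                # Prior 2: about 0.50 and 0.90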

Now we observe some data: Jeter hit \(y = 23\) home runs in \(t = 643\) at-bats. If one assumes Prior 1, then one can show that the posterior density for \(\lambda\) will be Gamma with shape \(29\) and rate \(756.5\). If one uses the mixture Prior 2, then the posterior density can be shown to be \[ g(\lambda|y) = 0.98 \times Gamma(33.0, 836.5) + 0.02 \times Gamma(23.2, 643.4). \] Does the choice of prior make a difference in this situation? If one plots the two posterior densities on the same scale, they look barely distinguishable, indicating that inference about \(\lambda\) is robust, or insensitive, to the choice of prior.
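
The mixing weight 0.98 can be computed in R from the two predictive densities. For a count \(y\) with exposure \(t\) and a gamma(\(\alpha, \beta\)) prior, the predictive density of \(y\) is negative binomial with size \(\alpha\) and probability \(\beta/(\beta + t)\); the function name pred below is ours:

  # Posterior mixing weight p(y) for the mixture prior (Prior 2)
  pred <- function(y, t, alpha, beta) dnbinom(y, size = alpha, prob = beta / (beta + t))
  y <- 23; t <- 643; p <- 0.88
  f1 <- pred(y, t, alpha = 10, beta = 193.5)
  f2 <- pred(y, t, alpha = 0.2, beta = 0.415)
  p * f1 / (p * f1 + (1 - p) * f2)   # weight on the first component, about 0.98

Rerunning the same lines with \(y = 70\) and \(t = 500\) gives a weight of roughly 0.23, matching the calculation in the next paragraph.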

But this robustness of the inference with respect to the prior depends on the observed data. Suppose that Jeter is a ``steroid slugger” during the 2004 season and hits 70 home runs in 500 at-bats. If we again use the same two priors, the posterior density for \(\lambda\) using Prior 1 will be Gamma(76, 613.5). The posterior using Prior 2 is given by the mixture \[ g(\lambda | y) = 0.23 \times Gamma(80, 693.5) + 0.77 \times Gamma(70.2, 500.4). \] If one draws the two posteriors on the same scale, one sees that the two posterior densities are significantly different, indicating that the inference depends on the choice of prior.