10  Hierarchical Modeling

10.1 A Beta-Binomial Hierarchical Model

10.1.1 Introduction: Learning About OGT Success Rates

The Ohio Graduation Test (OGT) is a test administered to all high school students in the state of Ohio. To graduate from high school, a student must demonstrate proficiency in reading, science, mathematics, social studies, and writing, as measured by his or her performance on the OGT. The test is also used to assess the quality of schools, and the State of Ohio releases summary OGT scores for all public and private schools in the state.

A student receives a grade on each of the five sections of the OGT. The raw score for each section is categorized as Advanced, Accelerated, Proficient, Basic, or Limited. A student needs to have a grade of Proficient, Accelerated, or Advanced on all sections to pass the OGT.

We focus on the performance of the students from the nine public school districts in Wood County. For each school district, we collect the number of students taking the exam, and the number who received an Advanced score (the highest category) on the Writing section of the OGT.

School District              N   Advanced   Proportion
Bowling Green City Sd      259          5        0.019
Eastwood Local Sd          145         10        0.069
Elmwood Local Sd            95          3        0.032
Lake Local Sd              138          2        0.014
North Baltimore Local Sd    53          0        0.000
Northwood Local Sd          82          4        0.049
Otsego Local Sd            140          2        0.014
Perrysburg Ex Vill Sd      355         16        0.045
Rossford Ex Vill Sd        139          4        0.029

Overall, only 3% of the students received an Advanced score on the Writing test for this county, so clearly this was a challenging test. One can compare schools by means of the proportions of Advanced scores given in the last column of the table. Several questions arise when one tries to make sense of these proportions. First, we note that none of the North Baltimore Local students received an Advanced score. Does this mean that the population proportion of students from North Baltimore Local who would receive this score is equal to zero? Although it is unlikely for these students to receive this grade, one would expect at least a few students to get “Advanced” in future years. Second, we note that Eastwood’s proportion of Advanced (0.069) is higher than Perrysburg’s (0.045). Does this mean that Eastwood’s Advanced population proportion is higher than Perrysburg’s? Since the actual success counts are small, it is possible that there is no difference in the quality of the Eastwood and Perrysburg schools and we are simply observing sampling variability.

10.1.2 To Pool or Not to Pool?

In the OGT testing example, the general problem is to estimate the proportions \(p_1, ..., p_9\), where \(p_j\) represents the proportion of students from the \(j\)th school district who score at an Advanced level on the writing section of the OGT. We observe \((y_j, n_j)\), where \(y_j\) is the number of Advanced students in a sample of size \(n_j\) from the \(j\)th district. The typical estimate of the proportion \(p_j\) is the sample proportion \[ \hat{p_j} = \frac{y_j}{n_j} \] and the sample proportions are displayed in the table above. As discussed in the introduction, there are some problems with these estimates. Due to the relatively small sample sizes, it is possible to get a sample proportion \(\hat{p_j}\) equal to zero, although we believe that the corresponding population proportion \(p_j\) is positive. Generally, the variability of the sample proportions can be high, and so it may be problematic to make decisions solely on the sample proportions. For example, it may be difficult to say that Elmwood truly has a higher Advanced rate than Rossford, since the corresponding sample proportions (0.032 and 0.029) have high sampling variability.

How can we improve the individual sample proportion estimates? If we believe that the proportions are equal, that is, \[ p_1 = ... = p_9, \] then we can pool the data to get “improved” estimates. The pooled estimate of the common proportion value is given by \[ \tilde{p} = \frac{\sum_{j=1}^9 y_j}{\sum_{j=1}^9 n_j}. \] Here this estimate is \(\tilde p = 46/1406 = 0.0327\). Certainly this estimate avoids the “zero estimate” problem. But it is based on the strong assumption that the schools have equal ability to get an Advanced score on the writing component of the OGT.
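
These computations are easy to reproduce. Below is a minimal R sketch that computes the separate sample proportions and the pooled estimate from the counts in the table above; the vectors y and n are simply entered by hand.

Code
# Advanced counts and sample sizes for the nine Wood County districts
y <- c(5, 10, 3, 2, 0, 4, 2, 16, 4)
n <- c(259, 145, 95, 138, 53, 82, 140, 355, 139)
p_hat <- y / n               # separate sample proportions
p_pooled <- sum(y) / sum(n)  # pooled estimate: 46 / 1406 = 0.0327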

From a frequentist perspective, the usual procedure is to (1) fit a model that describes the relationship between the parameters, (2) decide whether or not to accept the model by means of a statistical test, and then (3) estimate the parameters on the basis of the best fitting model. Here one would be interested in testing the hypothesis \(H\) that the proportions are equal: \[ H: p_1 = ... = p_9. \] A standard test of this hypothesis is the chi-square procedure that tests whether the school district is independent of the score (Advanced or not Advanced) on the OGT writing test. On the basis of this procedure, we decide either to accept or reject the hypothesis. If we accept the hypothesis \(H\), then we would use the pooled estimate \(\tilde{p} = \sum_{j=1}^9 y_j / \sum_{j=1}^9 n_j\). If instead we decide to reject \(H\), then we would use the sample proportion estimates \(\hat{p_j} = \frac{y_j}{n_j}\). This procedure is sometimes called a testimator, since we are deciding on the appropriate estimate by using a statistical test.

If one applies this test to these data, one obtains a Pearson chi-squared test statistic of 14.69. The p-value is the probability that a chi-square variate with 8 degrees of freedom exceeds 14.69; for these data the p-value is 0.065. If one uses the typical 0.05 cutoff to decide on the significance of this test, then one would conclude that there is insufficient evidence that the proportions are different and would use the pooled estimate. But this p-value is close to 0.05; if a p-value of 10% or smaller were deemed “significant”, then we would conclude the proportions are different and use the sample proportion estimates.
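
As a sketch, this test can be carried out with the built-in chisq.test function applied to the 9 × 2 table of Advanced and not-Advanced counts, reusing the y and n vectors defined above. It should reproduce the Pearson statistic of 14.69 reported here; note that R may warn that some expected cell counts are small.

Code
# Pearson chi-square test that the Advanced proportions are equal
counts <- cbind(Advanced = y, NotAdvanced = n - y)
chisq.test(counts)   # X-squared = 14.69, df = 8, p-value = 0.065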

Is there anything else that could be done? It appears that both the sample proportions and the pooled estimate are undesirable, so it may be more appropriate to use a compromise estimate. One possibility is to estimate \(p_j\) by a weighted average of the two extreme estimates: \[ w \hat{p_j} + ( 1- w) \tilde{p}, \] where \(w\) is a weight between 0 and 1 that determines the relative importance of the two extreme estimates. Intuitively, the value of the weight should be a function of the test statistic for the hypothesis \(H\). If the p-value for the test is large, then one would like the value of \(w\) to be small, placing greater weight on the pooled estimate. Alternatively, if the p-value is very small, then the constant \(p\) model would seem to be inappropriate, and the value of \(w\) should be close to 1, placing more weight on the sample proportions. It will be seen that a compromise estimate of this form arises as a by-product of placing a hierarchical prior distribution on the proportions.

10.1.3 An Exchangeable Model for Proportions

What prior beliefs are available when one is estimating a set of proportions? In Chapter ???, we considered prior beliefs about a single proportion \(p\). In the case when one has prior information about the location and spread, we discussed the use of a beta(\(a, b\)) prior to approximate these beliefs. But when one is estimating many proportions, it may be difficult to think about locations and spreads for all of the parameters.

Suppose one believes that the proportions \(p_1, ..., p_9\) represent a random sample from a single beta(\(a, b\)) distribution where the parameters \(a\) and \(b\) are unknown. If historical data are available, one could use them to estimate the parameters of the beta density. In our example, suppose one had the school district sizes and the numbers of students receiving Advanced on the writing component for a previous year’s administration of the OGT. Then one could use this previous year’s data to estimate \(a\) and \(b\). For example, one could estimate the prior mean \(a/(a+b)\) by the mean success rate for the schools in the previous year. Also, one could use the variability of the success rates to estimate the variance of the beta density.

Is it appropriate to use historical data in this case? It would be appropriate if one had some prior beliefs about the similarity of the success proportions for the past year and the success proportions for the current year. For example, if {\(q_j\)} represent the proportions of students receiving the Advanced grade for schools in a previous year, then one may believe that the sizes of the {\(q_j\)} were similar to the sizes of the proportions {\(p_j\)} of the nine schools in the current year.

One way of constructing a prior distribution that reflects a belief in similarity is based on the notion of exchangeability. We say that a set of random variables \(X = (X_1, ..., X_n)\) is exchangeable if the distribution of \(X\) does not change when we change the order of the subscripts. That is, if \(s(1), ..., s(n)\) represents a permutation of the subscripts, then \((X_1, ..., X_n)\) has the same distribution as \((X_{s(1)}, ..., X_{s(n)})\).

In our example, suppose we believe that the proportions \(p_1, ..., p_9\) are exchangeable. If \(p\) is the vector of proportions, a belief of exchangeability means that our belief about \(p\) is the same if we change the order of the subscripts of the components of \(p\). A belief in exchangeability also implies that our beliefs about any two different proportions are the same; that is, the prior for \(p_i\) will be the same as the prior for \(p_j\) for \(i \neq j\). This belief also implies that the joint prior for any two proportions, say \((p_i, p_j)\) for \(i \neq j\), does not change if we replace \(i\) and \(j\) by two other subscripts.

The famous probabilist de Finetti showed that if one believes that a set of random variables is exchangeable, then the corresponding distribution can be constructed by means of a hierarchical structure. In our situation, this means that if \(p_1, ..., p_9\) is exchangeable, then one can construct the prior in a two-stage process as follows.

  • STAGE I: The proportions \(p_1, ..., p_9\) are independently distributed from a prior \(g_1(p | \phi)\) depending on an unknown parameter vector \(\phi\).
  • STAGE II: The unknown parameter vector \(\phi\) has a completely specified distribution \(g_2(\phi)\).

This two-stage structure implies that the proportion vector has the mixture prior \[ g(p_1, ..., p_9) = \int \left(\prod_{j=1}^9 g_1(p_j | \phi)\right) g_2(\phi) d\phi. \]

Many possible distributions can be chosen for \(g_1\) and \(g_2\). For computational convenience, we will let \(g_1\) be a beta density with parameters \(a\) and \(b\). Then the unknown parameter vector \(\phi = (a, b)\) and \(g_2\) will be a completely specified distribution on the beta parameters. In this case, the proportions will have the prior \[ g(p_1, ..., p_9) = \int\int \left(\prod_{j=1}^9 \frac{1}{B(a, b)} p_j^{a-1} (1-p_j)^{b-1}\right) g_2(a, b) da \, db. \]

How should we choose the second-stage prior \(g_2\)? First, it is useful to reparameterize the beta parameters \(a\) and \(b\) in terms of the prior mean \(\eta\) and the precision parameter \(K\), where \[ \eta = \frac{a}{a+b}, \, \, K = a + b . \] Then we assume \(\eta\) and \(K\) are independent, so \[ g_2(\eta, K) = g_2(\eta) g_2(K). \] We will assign \(\eta\) the noninformative prior proportional to \(\eta^{-1}(1-\eta)^{-1}\) and \(K\) the proper, but vague, prior density proportional to \((1+K)^{-2}\). So the joint prior at the second stage is given by \[ g_2(\eta, K) = \frac{1}{\eta(1-\eta)} \frac{1}{(1+K)^2}, \, \, 0 < \eta < 1, \, K > 0. \] Later we will discuss alternative “noninformative” choices for these parameters. The joint prior of the proportions and the hyperparameters is given by \[ g(p_1, ..., p_9, \eta, K) = \left(\prod_{j=1}^9 \frac{1}{B(K \eta, K (1-\eta))} p_j^{K \eta -1} (1-p_j)^{K(1-\eta)-1}\right) \frac{1}{\eta(1-\eta)} \frac{1}{(1+K)^2}, \] where the beta parameters \(a\) and \(b\) are written in terms of the new \((\eta, K)\) parameterization.

10.1.4 Computation of the Posterior Distribution

In the exchangeable model, the unknown parameters are the \(k\) proportions \(p_1, ..., p_k\) and the unknown second-stage hyperparameters \(\eta\) and \(K\). For the \(j\)th sample, we observe \(y_j\) successes in \(n_j\) trials. The likelihood of the proportions is given by \[ L(p_1, ..., p_k) = \prod_{j=1}^k p_j^{y_j} (1-p_j)^{n_j-y_j}. \] By multiplying the joint prior of \((p_1, ..., p_k, \eta, K)\) by the likelihood, we obtain the joint posterior

\[ g(p_1, ..., p_k, \eta, K | {\rm data}) \propto \left(\prod_{j=1}^k \frac{1}{B(K \eta, K (1-\eta))} p_j^{K \eta -1} (1-p_j)^{K(1-\eta)-1}\right) \frac{1}{\eta(1-\eta)} \frac{1}{(1+K)^2} \prod_{j=1}^k p_j^{y_j} (1-p_j)^{n_j-y_j}. \]

To simplify the posterior computation, we decompose this joint posterior into the product \[ g(p_1, ..., p_k,\eta, K | {\rm data}) = g(p_1, ..., p_k|\eta, K, {\rm data}) g(\eta, K | {\rm data}). \]

First consider the first term, the joint posterior of the proportions given the hyperparameters \(\eta\) and \(K\). If we fix values of the hyperparameters, then the proportions have the joint density \[ g(p_1, ..., p_k | \eta, K, {\rm data}) \propto \left(\prod_{j=1}^k p_j^{K \eta -1} (1-p_j)^{K(1-\eta)-1}\right) \prod_{j=1}^k p_j^{y_j} (1-p_j)^{n_j-y_j}. \] Regrouping terms, we see that the proportions \(p_1, ..., p_k\), conditional on the hyperparameters, are independent, with \(p_j\) distributed beta with parameters \(a_1 = K\eta + y_j\) and \(b_1 = K(1-\eta) + n_j - y_j\).
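
This conditional structure is easy to simulate. The following minimal sketch draws the nine proportions for fixed values of \(\eta\) and \(K\), reusing the y and n vectors defined earlier.

Code
# Simulate p_1, ..., p_k from their conditional posterior given (eta, K)
sim_props <- function(eta, K, y, n) {
  rbeta(length(y), K * eta + y, K * (1 - eta) + n - y)
}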

Next consider the marginal posterior density of \((\eta, K)\). We obtain this distribution by integrating the proportions out of the joint posterior density. Recall the identity \[ \int_0^1 p^{a-1} (1- p)^{b-1} dp = B(a, b). \] If we integrate each proportion out using this identity, we obtain the marginal posterior density \[ g(\eta, K | {\rm data}) \propto \left(\prod_{j=1}^k \frac{B(K \eta + y_j, K(1-\eta) + n_j - y_j)}{B(K \eta, K (1-\eta))} \right) \frac{1}{\eta(1-\eta)} \frac{1}{(1+K)^2} . \]
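
For numerical work it is convenient to code the logarithm of this marginal posterior. The sketch below is a direct translation of the formula, using R's lbeta function for numerical stability; it is computed only up to an additive constant.

Code
# Log marginal posterior of (eta, K), up to an additive constant
log_post <- function(eta, K, y, n) {
  sum(lbeta(K * eta + y, K * (1 - eta) + n - y) -
        lbeta(K * eta, K * (1 - eta))) -
    log(eta) - log(1 - eta) - 2 * log(1 + K)
}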

Suppose we are interested in learning about the \(j\)th proportion \(p_j\), which we can summarize by a posterior mean and posterior variance. We can use the posterior representation to obtain simple expressions for these moments. We compute the posterior mean of \(p_j\) by using the conditional expectation formula: \[ E(p_j | {\rm data}) = E \left[ E(p_j | \eta, K, {\rm data})\right], \] where the outer expectation is taken with respect to the posterior distribution of \((\eta, K)\). Since the posterior density of \(p_j\), conditional on \((\eta, K)\), is beta(\(K\eta + y_j, K(1-\eta) + n_j-y_j\)), we have \[ E(p_j | \eta, K, {\rm data}) = \frac{y_j + K\eta}{n_j+K} = \frac{n_j}{n_j + K} \hat{p_j} + \frac{K}{n_j+K} \eta . \] So the posterior mean of \(p_j\) can be expressed as \[ E(p_j | {\rm data}) = E\left(\frac{n_j}{n_j + K}\right) \hat{p_j} + E\left(\frac{K}{n_j+K} \eta\right), \] where the expectations on the right-hand side are taken with respect to the posterior density of \((\eta, K)\).

In a similar fashion, by using the conditioning rule for variances, the posterior variance of \(p_j\) can be expressed as \[ Var(p_j | {\rm data}) = Var \left[ E(p_j | \eta, K, {\rm data})\right] + E \left[ Var(p_j | \eta, K, {\rm data})\right]. \] Conditional on \((\eta, K)\), the variance of \(p_j\) is given by \[ Var(p_j | \eta, K, {\rm data}) = \frac{a_1 b_1}{(a_1 + b_1)^2 (a_1 +b_1 + 1)}, \] where \(a_1 = K\eta + y_j\) and \(b_1 = K(1-\eta) + n_j - y_j\). Substituting in the conditional posterior moments, we obtain \[ Var(p_j | {\rm data}) = Var \left[ \frac{a_1}{a_1+b_1} \right] + E \left[ \frac{a_1 b_1}{(a_1 + b_1)^2 (a_1 +b_1 + 1)}\right], \] where again the variance and expectation on the right-hand side are taken with respect to the posterior distribution of \((\eta, K)\).
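
Given simulated draws of \((\eta, K)\) from their marginal posterior, the conditional expectation formula turns into a simple average. In the sketch below the vectors eta_s and K_s are assumed to be available (say, from an MCMC run on the log posterior coded above); they are not computed here.

Code
# Approximate E(p_j | data) by averaging the conditional posterior means
# over simulated hyperparameter draws (eta_s, K_s)
post_mean_p <- function(j, eta_s, K_s, y, n) {
  mean((y[j] + K_s * eta_s) / (n[j] + K_s))
}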

10.1.5 Estimating OGT Success Rates

Recall that the exchangeable model is a compromise between two models, the “separate proportions” model and the “one proportion” model where one assumes the proportions are equal. Table 6.1 displays summaries of the posterior distributions of logit\((\eta)\) and \(\log K\).

                   2.5%     25%     50%     75%   97.5%
logit \(\eta\)    -3.90   -3.53   -3.35   -3.16   -2.62
\(\log K\)         2.28    3.68    4.36    5.04    6.78

The hyperparameter \(\eta\) represents the common proportion value. In the table, the posterior median of logit(\(\eta\)) is \(-3.35\), so the posterior median of \(\eta\) is \(\exp(-3.35)/(1+\exp(-3.35)) = 0.0339.\) So, overall, about 3.4% of the students received an Advanced score on the writing section of the OGT.

The hyperparameter \(K\) is informative about the degree of shrinkage from the separate proportions model towards the one proportion model. From the table, we see that the posterior median of \(\log K\) is 4.36, which translates to a posterior median of \(K\) of \(\exp(4.36) = 78.3\). If we substitute these estimates for \(\eta\) and \(K\) in the expression for the posterior mean of \(p_j\), we get the approximation \[ E(p_j | {\rm data}) \approx \frac{n_j}{n_j + 78.3} \hat{p_j} + \frac{78.3}{n_j+78.3} (0.0339). \] One can measure the shrinkage of the estimate \(\hat{p_j}\) towards the pooled estimate by the fraction \[ \frac{78.3}{n_j+78.3}. \] For the nine schools, the shrinkage values range between 18% and 60%, reflecting substantial movement towards the pooled estimate. The size of the shrinkage depends on the sample size; the largest shrinkage is for the North Baltimore school district, where only 53 students took the test.
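
A quick R check of these shrinkage fractions, plugging in the posterior median of \(K\) and reusing the vector n of sample sizes:

Code
K_hat <- exp(4.36)              # posterior median of K, about 78.3
round(K_hat / (n + K_hat), 2)
# 0.23 0.35 0.45 0.36 0.60 0.49 0.36 0.18 0.36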

Table 6.2 gives summaries of the posteriors for the proportions of success for the nine schools. To better understand the values, consider the proportion of students from the Bowling Green City school district who were successful. The 90% naive Wald confidence interval for \(p_1\) based on the data \((y_1, n_1) = (5, 259)\) is given by \[ \left(0.019 - 1.645 \sqrt{\frac{0.019 (1-0.019)}{259}}, \; 0.019 + 1.645 \sqrt{\frac{0.019 (1-0.019)}{259}}\right) = (0.0051, 0.0329). \] If we assumed the proportions for the nine schools were equal, then the estimate for \(p_1\) would be the combined estimate 0.0327 with a sample size of 1406, and the 90% Wald interval would have the form \[ \left(0.0327 - 1.645 \sqrt{\frac{0.0327 (1-0.0327)}{1406}}, \; 0.0327 + 1.645 \sqrt{\frac{0.0327 (1-0.0327)}{1406}}\right) = (0.0249, 0.0405). \] From the table, the 90% interval estimate of \(p_1\) (from the 5th and 95th percentiles of the marginal posterior) is given by (0.011, 0.038).
The lengths of the separate proportion, one proportion, and Bayesian intervals are respectively 0.0278, 0.0156, and 0.0270. By “borrowing strength”, the exchangeable model gives more precise estimates than the separate proportion estimates, although not as precise as the one proportion estimate.
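
The two Wald intervals can be reproduced with a short helper function; this is a sketch, and the wald function below is a hypothetical one-liner rather than part of any package.

Code
# 90% Wald interval for a proportion
wald <- function(p, n) p + c(-1, 1) * 1.645 * sqrt(p * (1 - p) / n)
wald(5 / 259, 259)       # separate estimate: about (0.005, 0.033)
wald(46 / 1406, 1406)    # pooled estimate:   about (0.025, 0.041)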

School District            \(E(p)\)   \(SD(p)\)   \(p_{.05}\)   \(p_{.95}\)
Bowling Green City Sd         0.023       0.008         0.011         0.038
Eastwood Local Sd             0.055       0.017         0.031         0.088
Elmwood Local Sd              0.033       0.014         0.013         0.058
Lake Local Sd                 0.022       0.010         0.007         0.040
North Baltimore Local Sd      0.020       0.013         0.002         0.043
Northwood Local Sd            0.041       0.017         0.019         0.073
Otsego Local Sd               0.022       0.010         0.007         0.040
Perrysburg Ex Vill Sd         0.042       0.010         0.028         0.060
Rossford Ex Vill Sd           0.031       0.012         0.014         0.052

Figure 6.1 graphically illustrates the shrinkage behavior of the Bayesian estimates under the exchangeable model. The open circles represent the individual proportion estimates {\(\hat {p_j}\)} and the solid circles represent the Bayesian posterior means. The horizontal line represents the pooled estimate assuming the proportions are equal. The arrows show the shrinkage of the individual estimates towards the pooled estimate. Note from the figure that the shrinkage is largest for the schools with small sample sizes \(n_j\).

Code
library(readr)
library(ggplot2)
library(tidyr)
library(dplyr)

d <- read_csv("https://raw.githubusercontent.com/bayesball/BayesNotes/main/data/writing_OGT.csv",
              show_col_types = FALSE)
# Posterior means of the proportions (from Table 6.2) and short labels
d$Multilevel <- c(0.023, 0.055, 0.033, 0.022, 0.020,
                  0.041, 0.022, 0.042, 0.031)
d$Abbrev <- c("Bowling Green", "Eastwood", "Elmwood",
              "Lake", "N Baltimore", "Northwood",
              "Otsego", "Perrysburg", "Rossford")
# Compute the separate sample proportions and keep the two estimates
d %>% 
  mutate(Individual = Advanced / N) %>% 
  select(Abbrev, Individual, Multilevel) -> d1
d2 <- pivot_longer(d1, 2:3,
                   names_to = "Type",
                   values_to = "Rate")
ggplot(d2, aes(Type, Rate, group = Abbrev)) + 
    geom_line() + 
    geom_point() + 
    annotate("text", 
              x = 0.75, 
              y = d1$Individual, 
              label = d1$Abbrev) + 
    ggtitle("Shrinkage Plot") + 
    theme(plot.title = element_text(colour = "blue", 
          size = 18, hjust = 0.5)) +
  theme(text=element_text(size=18))

10.2 A Normal/Normal Exchangeable Model

In many situations, one is interested in combining normally distributed data from a group of related experiments. Suppose one observes independent random variables \(y_1, ..., y_k\), where \(y_i\) is distributed from a normal population with mean \(\theta_i\) and known variance \(\sigma^2\). (Note that we are assuming the variabilities for the \(k\) experiments are equal.) The prior belief is that \(\theta_1, ..., \theta_k\) are exchangeable. We model this belief by a hierarchical model where, at stage I, we assume that the {\(\theta_i\)} are a random sample from a normal density with mean \(\mu\) and variance \(\tau^2\), and, at stage II, we assign the prior parameters \((\mu, \tau^2)\) a distribution \(g(\mu, \tau^2)\).

In a famous example described by Efron and Morris, one is interested in simultaneously estimating the batting abilities of the 18 baseball players shown in Table 1. For the \(i\)th player, one observes \(x_i\), the number of hits in 45 at-bats (opportunities), and one assumes \(x_i\) is binomial(45, \(p_i\)), where \(p_i\) is the hitting probability. We are interested in estimating the 18 hitting probabilities \(p_1, ..., p_{18}\). To make the data approximately normal, we transform \(x_i\) by the logit transformation \(y_i = \log(x_i/(45-x_i))\). The reexpressed random variable \(y_i\) is approximately N(\(\theta_i, \sigma^2\)), where \(\theta_i = \log(p_i/(1-p_i))\) and \(\sigma^2 = 0.11\).

      Name         \(x_i\)   \(AB\)
 1    Alvardo           12       45
 2    Alvis              7       45
 3    Berry             14       45
 4    Campaneris         9       45
 5    Clemente          18       45
 6    Howard            16       45
 7    Johnstone         15       45
 8    Kessinger         13       45
 9    Munson             8       45
10    Petrocelli        10       45
11    Robinson          17       45
12    Rodriguez         10       45
13    Santo             11       45
14    Scott             10       45
15    Spencer           14       45
16    Swoboda           11       45
17    Unser             10       45
18    Williams          10       45

In this problem, there are \(k+2\) unknowns, the \(k\) normal means and the second-stage prior parameters \((\mu, \tau^2)\). The joint posterior density of these unknowns is given by \[ g(\{\theta_i\}, \mu, \tau^2 | y) \propto \prod_{i=1}^k \left[ \phi(y_i; \theta_i, \sigma^2) \phi(\theta_i; \mu, \tau^2)\right] g(\mu, \tau^2), \] where \(\phi(y; \mu, \sigma^2)\) denotes a normal density with mean \(\mu\) and variance \(\sigma^2\).

We summarize this joint posterior by considering two distributions, the posterior distribution of the normal means conditional on the parameters \((\mu, \tau^2)\), and the marginal distribution of the second-stage parameters.

10.2.1 Conditional posterior distribution of the normal means

If we condition on the parameters \((\mu, \tau^2)\), then the posterior density of the means \(\{\theta_i\}\) has the form \[ g(\{\theta_i\} | \mu, \tau^2, y) \propto \prod_{i=1}^k \left[ \phi(y_i; \theta_i, \sigma^2) \phi(\theta_i; \mu, \tau^2)\right]. \] We see that \(\theta_1, ..., \theta_k\) are independent and the marginal density of \(\theta_i\) has a normal density with parameters given by the familiar normal/normal updating formula: \[ E(\theta_i | \mu, \tau^2, y) = \frac{y_i/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2}, \] and \[ Var(\theta_i | \mu, \tau^2, y) = \frac{1}{1/\sigma^2 + 1/\tau^2}. \]
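
Written as a small R function, this updating formula is a precision-weighted compromise between the observation \(y_i\) and the prior mean \(\mu\). This is a minimal sketch; the default \(\sigma^2 = 0.11\) is taken from the baseball example above.

Code
# Conditional posterior mean and variance of theta_i given (mu, tau2)
cond_post <- function(y_i, mu, tau2, sigma2 = 0.11) {
  w <- (1 / tau2) / (1 / sigma2 + 1 / tau2)   # weight on the prior mean
  list(mean = (1 - w) * y_i + w * mu,
       var  = 1 / (1 / sigma2 + 1 / tau2))
}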

10.2.2 Posterior distribution of the second-stage parameters

To obtain the marginal posterior distribution of \((\mu, \tau^2)\), we need to integrate the means \(\{\theta_i\}\) out of the joint posterior. From our work with the normal sampling/normal prior model, we know that \[ \int \phi(y_i; \theta_i, \sigma^2) \phi(\theta_i; \mu, \tau^2) \, d\theta_i = \phi(y_i; \mu, \sigma^2 + \tau^2). \] This is just the statement that for a normal/normal model, the predictive density of \(y_i\) is normal with mean equal to the prior mean and variance equal to the sum of the sampling variance and the prior variance. When we integrate out all \(k\) means from the joint posterior, we obtain an expression for the posterior of the second-stage parameters: \[ g(\mu, \tau^2 | y) \propto \left(\prod_{i=1}^k \phi(y_i; \mu, \sigma^2 + \tau^2)\right) g(\mu, \tau^2). \]

Suppose we assume \(\mu\) and \(\tau^2\) are independent, with \(\mu\) assigned a uniform prior and \(\tau^2\) assigned the proper density \[ g(\tau^2) = \frac{1}{(1+\tau^2)^2}, \, \, \tau^2 > 0. \] Then the posterior density of \((\mu, \tau^2)\) factors as the product \[ g(\mu, \tau^2 | y) = g(\mu | \tau^2, y) g(\tau^2 | y), \] where \(g(\mu | \tau^2, y)\) is normal with mean \(\bar y\) and variance \((\sigma^2 + \tau^2)/k\), and \(g(\tau^2 | y)\) has the form \[ g(\tau^2 | y) \propto \frac{1}{(\sigma^2 + \tau^2)^{(k-1)/2}} \exp\left(-\frac{1}{2(\sigma^2 + \tau^2)} \sum_{i=1}^k (y_i - \bar y)^2 \right)\frac{1}{(1+\tau^2)^2}. \]
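
This one-dimensional posterior is simple to code. The sketch below returns the log density up to an additive constant, again with \(\sigma^2 = 0.11\) from the baseball example as a default.

Code
# Log marginal posterior of tau2, up to an additive constant
log_post_tau2 <- function(tau2, y, sigma2 = 0.11) {
  k <- length(y)
  V <- sigma2 + tau2
  -(k - 1) / 2 * log(V) - sum((y - mean(y))^2) / (2 * V) -
    2 * log(1 + tau2)
}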

10.2.3 Posterior means

Let’s focus on the posterior mean of the \(i\)th normal mean \(\theta_i\). By the conditional expectation formula, one has \[ E(\theta_i | y) = E^{\mu, \tau^2} \left[ E(\theta_i | y, \mu, \tau^2) \right], \] where the outer expectation is taken over the posterior distribution of \((\mu, \tau^2)\). Since the posterior distribution of \(\theta_i\) conditional on the second-stage parameters is normal, we have \[ E(\theta_i | y) = E^{\mu, \tau^2} \left[ \frac{y_i/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2} \right]. \] Using a similar conditional expectation formula, we can write \[ E(\theta_i | y) = E^{\tau^2} E^{\mu|\tau^2} \left[ \frac{y_i/\sigma^2 + \mu/\tau^2}{1/\sigma^2 + 1/\tau^2} \right] = E^{\tau^2} \left[ \frac{y_i/\sigma^2 + \bar y/\tau^2}{1/\sigma^2 + 1/\tau^2} \right] = (1 - F)y_i + F \bar y, \] where \(F\) is the shrinkage \[ F = E^{\tau^2}\left(\frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2} \right) = \int_0^\infty \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2} g(\tau^2 | y) d\tau^2. \]
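
Because \(\tau^2\) is one-dimensional, the posterior expectation defining \(F\) can be approximated on a grid. The sketch below enters the hit counts from Table 1, reuses the log_post_tau2 function above, and normalizes the posterior on the log scale for stability; it should give a value near the 0.68 reported in the next paragraphs. The grid endpoints are an assumption, chosen wide enough to cover essentially all the posterior mass.

Code
# Hits in 45 at-bats for the 18 players (Table 1), logit-transformed
x <- c(12, 7, 14, 9, 18, 16, 15, 13, 8, 10, 17, 10, 11, 10, 14, 11, 10, 10)
y <- log(x / (45 - x))
# Grid approximation to F = E[ (1/tau2) / (1/tau2 + 1/sigma2) | y ]
tau2 <- seq(0.001, 2, length.out = 2000)
lp <- sapply(tau2, log_post_tau2, y = y)
w <- exp(lp - max(lp)); w <- w / sum(w)   # normalized posterior weights
F_hat <- sum((1 / tau2) / (1 / tau2 + 1 / 0.11) * w)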

To illustrate the posterior calculations, we consider the problem of estimating the collection of hitting probabilities. In Table 2, the column \(x/45\) contains the batting averages, the proportions of hits in the 45 at-bats. The column \(y\) contains the transformed proportions \(y = \log(x/(45-x))\).

We focus on the computation of the posterior means of the normal means \(\{\theta_i\}\). The shrinkage of the posterior mean estimate is controlled by the variance parameter \(\tau^2\), and Figure 1 displays its posterior density. Since the parameter \(\tau^2\) affects the posterior mean through the shrinkage function \[ F(\tau^2) = \frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}, \] one is also interested in the posterior density of \(F(\tau^2)\), displayed in Figure 2. Note that the posterior density of the shrinkage function is concentrated on the interval (0.6, 0.8), indicating substantial shrinkage of the individual estimates towards the combined estimate \(\bar y\). One can compute \[ F = E^{\tau^2} \left[F(\tau^2)\right] = 0.68. \] This means that the individual estimate \(y_i\) is shrunk 68% towards the pooled estimate \(\bar y\). Table 2 displays the posterior means \[ E(\theta_i | y) = (1 - 0.68) y_i + 0.68 \bar y. \] By transforming these logit estimates back to the proportion scale, \[ p = \exp(\theta)/(1+\exp(\theta)), \] one obtains the Bayesian proportion estimates {\(\hat p_i\)}.

      Player        \(x/45\)       \(y\)   \(E(\theta|y)\)   \(\hat p\)
 1    Alvardo          0.267    -1.012            -1.035        0.262
 2    Alvis            0.156    -1.692            -1.251        0.222
 3    Berry            0.311    -0.795            -0.966        0.276
 4    Campaneris       0.200    -1.386            -1.154        0.240
 5    Clemente         0.400    -0.405            -0.842        0.301
 6    Howard           0.356    -0.595            -0.902        0.289
 7    Johnstone        0.333    -0.693            -0.934        0.282
 8    Kessinger        0.289    -0.901            -1.000        0.269
 9    Munson           0.178    -1.531            -1.200        0.231
10    Petrocelli       0.222    -1.253            -1.112        0.248
11    Robinson         0.378    -0.499            -0.872        0.295
12    Rodriguez        0.222    -1.253            -1.112        0.248
13    Santo            0.244    -1.128            -1.072        0.255
14    Scott            0.222    -1.253            -1.112        0.248
15    Spencer          0.311    -0.795            -0.966        0.276
16    Swoboda          0.244    -1.128            -1.072        0.255
17    Unser            0.222    -1.253            -1.112        0.248
18    Williams         0.222    -1.253            -1.112        0.248

10.2.4 Priors

In the specification of priors, we did not say much about the choice of priors on the hyperparameters \((\mu, \tau^2)\). That raises some obvious questions. Can we place any vague prior on these parameters? Does the choice of prior matter?

A traditional choice of prior on a variance parameter is the improper form \[ g(\tau^2) = \frac{1}{\tau^2}, \, \, \tau^2 > 0. \] Unfortunately, this choice may result in an improper posterior distribution for \(\tau^2\). We used a random walk Metropolis algorithm to simulate 10,000 draws, and Figure 4 shows trace and density plots for the parameters \(\mu\) and \(\log \tau^2\). The trace plot for \(\log \tau^2\) shows considerable instability in the simulated draws, reflecting the fact that the posterior density is not well-behaved.

By the above investigation, it is clear that a proper prior needs to be chosen for the variance parameter \(\tau^2\). But how much does the choice of prior matter? Suppose we slightly generalize the prior of \(\tau^2\) to the log-logistic form \[ g(\tau^2) = \frac{M}{(M + \tau^2)^2}, \, \, \tau^2 > 0. \] The parameter \(M\) is the prior median of \(\tau^2\). In the above analysis we chose \(M = 1\), and it is natural to ask about the sensitivity of the posterior analysis to the choice of different values for \(M\).
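
For the sensitivity analysis that follows, the only change to the earlier posterior sketch is the prior term; a hypothetical helper for this log-logistic prior is:

Code
# Log of the log-logistic prior on tau2 with prior median M
log_prior_tau2 <- function(tau2, M = 1) log(M) - 2 * log(M + tau2)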

Suppose we are interested in the sensitivity of the shrinkage parameter \(F\) with respect to the prior median \(M\). In our earlier analysis, we found that the posterior mode of \(F\) was 0.686 when \(M = 1\), that is, when \(\log M = 0\). In Figure 5, we plot the posterior mode of \(F\) as a function of \(\log M\). We see that the posterior shrinkage is relatively insensitive to the prior parameter for values of \(\log M\) of 0 and larger. Indeed, if we choose the very large value \(\log M = 10\), the posterior mode of the shrinkage changes only slightly, to 0.655. However, the posterior shrinkage is sensitive to the choice of small values of \(\log M\).

Since the posterior shrinkage is sensitive to the choice of small \(M\), that raises a new question. If we had knowledge about batting averages, what would be a reasonable prior for the variance parameter \(\tau^2\)? The author is a baseball fan and he believes that most of the hitting probabilities for players fall in the interval (0.240, 0.320). If we equate the range of this interval with \(4 \tau\) (corresponding to 95% confidence), we obtain the estimate \(0.02\) for \(\tau\) which would correspond to a prior median of \(M = 0.02^2 = 0.0004\) for \(\tau^2\). With this choice of \(M\), the posterior mode of the shrinkage \(F\) is essentially 1, which means that the observed batting averages are shrunk completely towards the pooled estimate. This is not a surprising result, since the observed batting averages are only based on a few weeks of baseball (45 at-bats), and it makes sense that the prior information about the hitting probabilities will swamp the data information in this case.