Chapter 12 Bayesian Multiple Regression and Logistic Models
12.1 Introduction
In Chapter 11, we introduced simple linear regression where the mean of a continuous response variable was represented as a linear function of a single predictor variable. In this chapter, this regression scenario is generalized in several ways. In Section 12.2, the multiple regression setting is considered where the mean of a continuous response is written as a function of several predictor variables. Methodology for comparing different regression models is described in Section 12.3. The second generalization, described in Section 12.4, considers the case where the response variable is binary with two possible values. Here one is interested in modeling the probability of a particular response as a function of a predictor variable. Although these situations are more sophisticated, the Bayesian methodology for inference and prediction follows the general approach described in the previous chapters.
12.2 Bayesian Multiple Linear Regression
12.2.1 Example: expenditures of U.S. households
The U.S. Bureau of Labor Statistics (BLS) conducts the Consumer Expenditure Surveys (CE) through which the BLS collects data on expenditures, income, and tax statistics about households across the United States. Specifically, this survey provides information on the buying habits of U.S. consumers. The summary, domain-level statistics published by the CE are used for both policy-making and research, including the most widely used measure of inflation, the Consumer Price Index (CPI). In addition, the CE has measurements of poverty that determine thresholds for the U.S. Government’s Supplemental Poverty Measure.
The CE consists of two surveys. The Quarterly Interview Survey, taken each quarter, aims to capture large purchases (such as rent, utilities, and vehicles) and contains approximately 7000 interviews. The Diary Survey, administered on an annual basis, focuses on capturing small purchases (such as food, beverages, and tobacco) and contains approximately 14,000 interviews of households.
The CE publishes public-use microdata (PUMD), and a sample from the 2017 first-quarter Quarterly Interview Survey is drawn from the PUMD. This sample contains 1000 consumer units (CUs) and provides information on each CU’s total expenditure in the last quarter, the amount of CU income before taxes in the past 12 months, and the CU’s urban/rural status. Table 12.1 provides the description of each variable in the CE sample.
Table 12.1. The variable description for the CE sample.
Variable | Description
---|---
Expenditure | Continuous; CU’s total expenditures in the last quarter
Income | Continuous; the amount of CU income before taxes in the past 12 months
UrbanRural | Binary; the urban/rural status of the CU: 1 = Urban, 2 = Rural
Suppose someone is interested in predicting a CU’s expenditure from its urban/rural status and its income before taxes. In this example, one is treating expenditure as the response variable and the other two variables as predictors. To proceed, one needs to develop a model to express the relationship between expenditure and the other two predictors jointly. This requires extending the simple linear regression model introduced in Chapter 11 to the case with multiple predictors. This extension is known as multiple linear regression – the word multiple indicates two or more predictors are present in the regression model. This section describes how to set up a multiple linear regression model, how to specify prior distributions for regression coefficients of multiple predictors, and how to make Bayesian inferences and predictions in this setting.
Recall in Chapter 11, the mean response \(\mu_i\) was expressed as a linear function of the single continuous predictor \(x_i\) depending on an intercept parameter \(\beta_0\) and a slope parameter \(\beta_1\):
\[\begin{equation*} \mu_i = \beta_0 + \beta_1 x_i. \end{equation*}\]
In particular, the slope parameter \(\beta_1\) is interpreted as the change in the expected response \(\mu_i\) when the predictor \(x_i\) of record \(i\) increases by a single unit. In the household expenditures example, not only are there multiple predictors, but the predictors are of different types: one continuous predictor (income) and one binary categorical predictor (rural/urban status). As Chapter 11 focused on continuous-valued predictors, the interpretation of a regression coefficient for a binary categorical predictor is an important topic for discussion in this section.
12.2.2 A multiple linear regression model
Similar to a simple linear regression model, a multiple linear regression model assumes an observation-specific mean \(\mu_i\) for the \(i\)-th response variable \(Y_i\). \[\begin{equation} Y_i \mid \mu_i, \sigma \overset{ind}{\sim} \textrm{Normal}(\mu_i, \sigma), \, \, i = 1, \cdots, n. \tag{12.1} \end{equation}\] In addition, it assumes that the mean of \(Y_i\), \(\mu_i\), is a linear function of all predictors. In general, one writes \[\begin{equation} \mu_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_r x_{i,r}, \tag{12.2} \end{equation}\] where \(\mathbf{x}_i = (x_{i,1}, x_{i,2}, \cdots, x_{i,r})\) is a vector of \(r\) known predictors for observation \(i\), and \(\mathbf{\beta} = (\beta_0, \beta_1, \cdots, \beta_r)\) is a vector of unknown regression parameters (coefficients), shared among all observations.
For studies where all \(r\) predictors are continuous, one interprets the intercept parameter \(\beta_0\) as the expected response \(\mu_i\) for observation \(i\) when all of its predictors take values of 0 (i.e. \(x_{i,1} = x_{i,2} = \cdots = x_{i,r} = 0\)). One can also interpret the slope parameter \(\beta_j\) (\(j = 1, 2, \cdots, r\)) as the change in the expected response \(\mu_i\) when the \(j\)-th predictor, \(x_{i,j}\), of observation \(i\) increases by a single unit while all remaining \((r-1)\) predictors stay constant.
However, in the household expenditures example from the CE data sample, not all predictors are continuous. The urban/rural status variable is a binary categorical variable, taking a value of 1 if the CU is in an urban area and a value of 2 if the CU is in a rural area. It is possible to treat the variable as continuous and interpret the associated regression coefficient as the change in the expected response \(\mu_i\) when the CU’s urban/rural status changes by one unit from urban to rural (corresponding to a change from one to two). But it is much more common to treat this variable as a binary categorical variable that classifies the observations into two distinct groups: the urban group and the rural group. It will be seen that this classification puts an emphasis on the difference in the expected responses between the two distinct groups.
Consequently, consider the construction of a new indicator variable in place of the binary variable. This new indicator variable takes a value of 0 if the CU is in an urban area, and a value of 1 if the CU is in a rural area. To understand the implication of this indicator variable, it is helpful to consider a simplified regression model with a single predictor, the binary indicator for rural area \(x_i\). This simple linear regression model expresses the linear relationship as \[\begin{equation} \mu_i = \beta_0 + \beta_1 x_i = \begin{cases} \beta_0, & \text{ the urban group}; \\ \beta_0 + \beta_1, & \text{ the rural group}. \\ \end{cases} \tag{12.3} \end{equation}\] The expected response \(\mu_i\) for CUs in the urban group is given by \(\beta_0\), and the expected response \(\mu_i\) for CUs in the rural group is \(\beta_0 + \beta_1\). In this case \(\beta_1\) represents the change in the expected response \(\mu_i\) from the urban group to the rural group. That is, \(\beta_1\) represents the effect of being a member of the rural group.
Before continuing, there is a need for some data transformation. Both the expenditure and income variables are highly skewed, and both variables have more even distributions if we apply logarithm transformations. So the response variable will be the logarithm of the CU’s total expenditure and the continuous predictor will be the logarithm of the CU 12-month income. Figure 12.1 displays scatterplots of log income and log expenditure where the two panels correspond to urban and rural residents. Note that in each panel there appears to be a positive association between log income and log expenditure.
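As a concrete sketch, the two transformations might be carried out in R as follows. The names log_TotalExp, log_TotalIncome, and Rural are the variables used in the JAGS data list later in this section; the raw column names TotalExp, TotalIncome, and UrbanRural are assumed here for illustration.
CEsample$Rural <- ifelse(CEsample$UrbanRural == 2, 1, 0)  # 1 = rural, 0 = urban
CEsample$log_TotalExp <- log(CEsample$TotalExp)           # log of total expenditure
CEsample$log_TotalIncome <- log(CEsample$TotalIncome)     # log of 12-month income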
Now that the data transformations are completed, one is ready to set up a multiple linear regression model for the log expenditure response including one continuous predictor and one binary categorical predictor. The expected response \(\mu_i\) is expressed as a linear combination of the log income variable and the rural indicator variable. \[\begin{eqnarray} \mu_i = \beta_0 &+& \beta_1 x_{i, income} + \beta_2 x_{i, rural}. \tag{12.4} \end{eqnarray}\] The multiple linear regression model is written as \[\begin{eqnarray} Y_i \mid \beta_0, \beta_1, \beta_2, \sigma \overset{ind}{\sim} \textrm{Normal}(\beta_0 &+& \beta_1 x_{i, income} + \beta_2 x_{i, rural}, \sigma), \nonumber \\ \tag{12.5} \end{eqnarray}\] where \(\mathbf{x}_i = (x_{i, income}, x_{i, rural})\) is a vector of predictors and \(\sigma\) is the standard deviation in the Normal model shared among all responses \(Y_i\)’s.
The regression parameters have clear interpretations. The intercept parameter \(\beta_0\) is the expected log expenditure when both the remaining variables are 0’s: \(x_{i, income} = x_{i, rural} = 0\). This intercept represents the mean log expenditure for an urban CU with a log income of 0.
The regression slope coefficient \(\beta_1\) is associated with the continuous predictor variable, log income. This slope \(\beta_1\) can be interpreted as the change in the expected log expenditure when the predictor log income of record \(i\) increases by one unit, while all other predictors stay unchanged.
The remaining regression coefficient \(\beta_2\) represents the change in the expected log expenditure relative to the expected log expenditure of the associated reference category, while all other predictors stay unchanged. In other words, \(\beta_2\) is the change in the expected log expenditure of a rural CU compared to an urban CU, when the two CUs have the same log income.
With an understanding of the meaning of the regression coefficients, one can now proceed to a description of a prior and MCMC algorithm of this multiple linear regression model. Note that one needs to construct a prior distribution for the set of parameters \((\beta_0, \beta_1, \beta_2, \sigma)\). We begin by describing the weakly informative prior approach and the subsequent MCMC inference.
12.2.3 Weakly informative priors and inference through MCMC
In situations where the data analyst has limited prior information about the regression parameters or the standard deviation, it is desirable to assign a prior that has little impact on the posterior. Similar to the weakly informative prior for simple linear regression described in Chapter 11, one assigns a weakly informative prior for a multiple linear regression model using standard functional forms. Assuming independence, the prior density for the set of parameters \((\beta_0, \beta_1, \beta_2, \sigma)\) is written as a product of the component densities:
\[\begin{equation*} \pi(\beta_0, \beta_1, \beta_2, \sigma) = \pi(\beta_0) \pi(\beta_1) \pi(\beta_2) \pi(\sigma), \end{equation*}\] where \(\beta_0\) is \(\textrm{Normal}(m_0, s_0)\), \(\beta_1\) is \(\textrm{Normal}(m_1, s_1)\), \(\beta_2\) is \(\textrm{Normal}(m_2, s_2)\), and the precision parameter \(\phi = 1/\sigma^2\), the inverse of the variance \(\sigma^2\), is \(\textrm{Gamma}(a, b)\).
If one has little information about the location of the regression parameters \(\beta_0\), \(\beta_1\), and \(\beta_2\), one assigns the respective prior means to be 0 and the prior standard deviations to be large values, say 20. In similar fashion, if little knowledge exists about the location of the sampling standard deviation \(\sigma\), one assigns small values for the hyperparameters, \(a\) and \(b\), say \(a = b = 0.001\), for the Gamma prior placed on the precision \(\phi = 1/\sigma^2\).
One uses the JAGS software to draw MCMC samples from this multiple linear regression model. The process of using JAGS mimics the general approach used in earlier chapters.
Describe the model by a script
The first step in using JAGS writes the following script defining the multiple linear regression model, saving the script in the character string "modelString".
modelString <-"
model {
## sampling
for (i in 1:N){
y[i] ~ dnorm(beta0 + beta1*x_income[i] +
beta2*x_rural[i], invsigma2)
}
## priors
beta0 ~ dnorm(mu0, g0)
beta1 ~ dnorm(mu1, g1)
beta2 ~ dnorm(mu2, g2)
invsigma2 ~ dgamma(a, b)
sigma <- sqrt(pow(invsigma2, -1))
}
"
In the sampling section of the script, the iterative loop goes from 1 to N, where N is the number of observations with index i. Recall that the Normal distribution dnorm in JAGS is stated in terms of the mean and the precision, and the variable invsigma2 corresponds to the Normal sampling precision. The variable sigma is defined in the prior section of the script so one can track the simulated values of the standard deviation \(\sigma\). Also, the variables mu0, mu1, mu2 correspond to the means, and g0, g1, g2 correspond to the precisions of the Normal prior densities for the three regression parameters.
Define the data and prior parameters
The next step is to provide the observed data and the values for the prior parameters. In the R script below, a list the_data contains the vector of log expenditures, the vector of log incomes, the indicator variables for the categories of the binary categorical variable, and the number of observations. This list also contains the means and precisions of the Normal priors for beta0 through beta2 and the values of the two parameters a and b of the Gamma prior for invsigma2. The prior means of the Normal priors on the individual regression coefficients, mu0 through mu2, are 0. The prior standard deviations of the Normal priors on the individual regression coefficients are 20, and so the corresponding precision values, g0 through g2, are \(1/20^2 = 0.0025\).
y <- as.vector(CEsample$log_TotalExp)
x_income <- as.vector(CEsample$log_TotalIncome)
x_rural <- as.vector(CEsample$Rural)
N <- length(y)
the_data <- list("y" = y, "x_income" = x_income,
"x_rural" = x_rural, "N" = N,
"mu0" = 0, "g0" = 0.0025,
"mu1" = 0, "g1" = 0.0025,
"mu2" = 0, "g2" = 0.0025,
"a" = 0.001, "b" = 0.001)
Generate samples from the posterior distribution
The run.jags() function in the runjags package generates posterior samples by the MCMC algorithm using the JAGS software. The script below runs one MCMC chain with an adaption period of 1000 iterations, a burn-in period of 5000 iterations, and an additional set of 20,000 iterations to be run and collected for inference. By using the argument monitor = c("beta0", "beta1", "beta2", "sigma"), one keeps track of all four model parameters. The output variable posterior contains a matrix of simulated draws.
posterior <- run.jags(modelString,
n.chains = 1,
data = the_data,
monitor = c("beta0", "beta1",
"beta2", "sigma"),
adapt = 1000,
burnin = 5000,
sample = 20000)
MCMC diagnostics
To obtain valid inferences from the posterior draws from the MCMC simulation, one should assess convergence of the MCMC chain. The plot() function with the argument vars returns four diagnostic plots (trace plot, empirical CDF, histogram, and autocorrelation plot) for the specified parameter. For example, Figure 12.2 shows the diagnostic plots for the slope parameter \(\beta_1\) for the log income predictor, using the following code.
plot(posterior, vars = "beta1")
The upper left trace plot shows MCMC mixing for the 20,000 simulated draws of \(\beta_1\). In this example, the lower right autocorrelation plot indicates relatively large correlation values between adjacent posterior draws of \(\beta_1\). In this particular example, since the mixing was not great, it was decided to take a larger sample of 20,000 draws to get good estimates of the posterior distribution. In usual practice, one should perform these diagnostics for all parameters in the model.
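One convenient way to do so is to loop over the monitored parameters and produce the same set of plots for each; a minimal sketch is below.
for (parm in c("beta0", "beta1", "beta2", "sigma")) {
  plot(posterior, vars = parm)
}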
Summarization of the posterior
Posterior summaries of the parameters are obtained by use of the print(posterior, digits = 3) command. Note that these summaries are based on the 20,000 iterations from the sampling period excluding the samples from the adaption and burn-in periods.
print(posterior, digits = 3)
Lower95 Median Upper95 Mean SD Mode MCerr
beta0 4.59 4.95 5.36 4.95 0.201 -- 0.0166
beta1 0.328 0.365 0.4 0.365 0.0188 -- 0.00155
beta2 -0.482 -0.267 -0.0476 -0.269 0.112 -- 0.00112
sigma 0.735 0.769 0.802 0.769 0.0172 -- 0.000172
One way to determine if the two variables are useful predictors is to inspect the location of the 95% probability intervals reported in the output. The interval estimate for \(\beta_1\) (corresponding to log income) is (0.328, 0.400) and the corresponding estimate for \(\beta_2\) (corresponding to the rural variable) is \((-0.482, -0.048)\). Since neither interval covers zero, this indicates that both log income and the rural variable are helpful in predicting log expenditure.
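These intervals can also be computed directly from the matrix of simulated draws. The sketch below assumes the draws are first extracted from the runjags object into a matrix post; this matrix is also used in the prediction code later in this section.
post <- as.matrix(as.mcmc(posterior))    # one way to extract the simulated draws
apply(post[ , c("beta1", "beta2")], 2,
      quantile, probs = c(0.025, 0.975))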
Several types of summaries of the posterior distribution are illustrated. Suppose one is interested in learning about the expected log expenditure. From the regression model, the mean log expenditure is equal to \[\begin{equation} \beta_0 + \beta_1 x_{income} \tag{12.6} \end{equation}\] for urban CUs, and equal to \[\begin{equation} \beta_0 + \beta_1 x_{income} + \beta_2 \tag{12.7} \end{equation}\] for rural CUs. Figure 12.3 displays simulated draws from the posterior of the expected log expenditure superposed over the scatterplots of log income and log expenditure for the urban and rural cases. Note that there is more variation in the posterior draws for the rural units – this is reasonable since only a small portion of the data came from rural units.
Figure 12.4 displays the posterior density of the mean log expenditure for the predictor pairs (log Income = 9, Rural = 1), (log Income = 9, Rural = 0), (log Income = 12, Rural = 1), and (log Income = 12, Rural = 0). It is pretty clear from this graph that log income is the more important predictor. For both urban and rural CUs, the mean log total expenditure is much larger for log income = 12 than for log income = 9. Given a particular value of log income, the mean log expenditure is slightly higher for urban units (Rural = 0) compared to rural units.
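The posterior draws of the mean log expenditure behind Figure 12.4 can be obtained by evaluating Equations (12.6) and (12.7) at each simulated parameter draw. A minimal sketch in the style of the prediction function of the next subsection is shown below; post is the matrix of simulated draws and map2_df() comes from the purrr package.
one_expected <- function(x1, x2){
  lp <- post[ , "beta0"] + x1 * post[ , "beta1"] +
    x2 * post[ , "beta2"]
  data.frame(Value = paste("Log Income =", x1,
                           "Rural =", x2),
             Expected_log_TotalExp = lp)
}
df_expected <- map2_df(c(9, 12, 9, 12),
                       c(0, 0, 1, 1), one_expected)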
12.2.4 Prediction
A related problem is to predict a CU’s log expenditure for a particular set of predictor values. Let \(\tilde{Y}\) denote the future response value for the expenditure for given values of income \(x^*_{income}\) and rural value \(x^*_{rural}\). One represents the posterior predictive density of \(\tilde{Y}\) as \[\begin{equation} f(\tilde{Y} = \tilde{y} \mid y) = \int f(\tilde{y} \mid y, \beta, \sigma) \pi(\beta, \sigma \mid y) d\beta \, d\sigma, \tag{12.8} \end{equation}\] where \(\pi(\beta, \sigma \mid y)\) is the posterior density and \(f(\tilde{Y} = \tilde{y} \mid y, \beta, \sigma)\) is the Normal sampling density which depends on the predictor values.
R Work Since we have already produced simulated draws from the posterior distribution, it is straightforward to simulate from the posterior predictive distribution. One simulates a single draw from \(f(\tilde{Y} = \tilde{y} \mid y)\) by first simulating a value of \(({\mathbf{\beta}}, \sigma)\) from the posterior – call this draw \((\beta^{(s)}, \sigma^{(s)})\). Then one simulates a draw of \(\tilde{Y}\) from a Normal density with mean \(\beta_0^{(s)} + \beta_1^{(s)} x^*_{income} + \beta_2^{(s)} x^*_{rural}\) and standard deviation \(\sigma^{(s)}\). By repeating this process for a large number of iterations, the function one_predicted() below simulates a sample from the posterior predictive distribution for particular predictor values \(x^*_{income}\) and \(x^*_{rural}\).
one_predicted <- function(x1, x2){
  # post is the matrix of simulated posterior draws with columns
  # beta0, beta1, beta2, sigma
  lp <- post[ , "beta0"] + x1 * post[ , "beta1"] +
    x2 * post[ , "beta2"]
  y <- rnorm(length(lp), lp, post[ , "sigma"])
  data.frame(Value = paste("Log Income =", x1,
                           "Rural =", x2),
             Predicted_log_TotalExp = y)
}
df <- map2_df(c(12, 12),
c(0, 1), one_predicted)
This procedure is implemented for the two sets of predictor values (log income, rural) = (12, 1) and (log income, rural) = (12, 0). Figure 12.5 displays density estimates of the posterior predictive distributions for the two cases. Comparing Figures 12.4 and 12.5, note the increased width of the prediction densities relative to the expected response densities. One confirms this by computing interval estimates. For example, for the values (log income, rural) = (12, 1), a 90% interval for the expected log expenditure is (8.88, 9.25) and the 90% interval for the predicted log expenditure for the same predictor values is (7.81, 10.34).
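These interval estimates can be reproduced with the quantile() function applied to the simulated draws. A sketch for the case (log income, rural) = (12, 1) follows, again assuming post holds the matrix of posterior draws.
lp <- post[ , "beta0"] + 12 * post[ , "beta1"] + 1 * post[ , "beta2"]
quantile(lp, c(0.05, 0.95))                       # expected log expenditure
ytilde <- rnorm(length(lp), lp, post[ , "sigma"])
quantile(ytilde, c(0.05, 0.95))                   # predicted log expenditure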
12.3 Comparing Regression Models
When one fits a multiple regression model, there is a list of inputs, i.e. potential predictor variables, and there are many possible regression models to fit depending on which inputs are included in the model. In the household expenditures example, there are two possible inputs, the log total income and the rural/urban status, and so there are 2 x 2 = 4 possible models depending on the inclusion or exclusion of each input. When there are many inputs, the number of possible regression models can be quite large, and so there needs to be some method for choosing the “best” regression model. A simple example is used to describe what is meant by a best model, and then a general method is outlined for selecting between models.
Learning about a career trajectory
To discuss model selection in a simple context, consider a baseball modeling problem that will be more thoroughly discussed in Chapter 13. One is interested in seeing how a professional athlete ages during his or her career. In many sports, an athlete enters his/her professional career at a modest level of performance, gets better until a particular age when peak performance is achieved, and then declines in the level of performance until retirement. One can use a regression model to explore the pattern of performance over age – this pattern is typically called the athlete’s career trajectory.
We focus on a particular great historical baseball player Mike Schmidt who played in Major League Baseball from 1972 through 1989. Figure 12.6 first displays a scatterplot of the rate that Schmidt hit home runs as a function of his age. If \(y_i\) denotes Schmidt’s home run rate during the \(i\)-th season when his age was \(x_i\), Figure 12.6 further overlays fits from the following three career trajectory models:
- Model 1 - Linear: \[\begin{equation*} Y_i \mid \beta_0, \beta_1, x_i, \sigma \sim \textrm{Normal}(\beta_0 + \beta_1 (x_i - 30), \sigma). \end{equation*}\]
- Model 2 - Quadratic: \[\begin{equation*} Y_i \mid \beta_0, \beta_1, \beta_2, x_i, \sigma \sim \textrm{Normal}(\beta_0 + \beta_1 (x_i - 30) + \beta_2 (x_i - 30)^2, \sigma). \end{equation*}\]
- Model 3 - Cubic:
\[ Y_i \mid \beta_0, \beta_1, \beta_2, \beta_3, x_i, \sigma \sim \textrm{Normal}(\beta_0 + \beta_1 (x_i - 30)+ \beta_2 (x_i - 30)^2 + \beta_3 (x_i - 30)^3, \sigma). \]
Model 1 says that Schmidt’s true home run performance is a linear function of his age, Model 2 says that his home run performance follows a parabolic shape, and Model 3 indicates that his performance follows a cubic curve. Based on the earlier comments about the knowledge of shapes of career trajectories, the linear function of age given in Model 1 does not appear suitable for reflecting the "up, then down" trend that we see in the scatterplot. The fits of Models 2 and 3 appear to be similar in appearance, but there are differences in the interpretation of the fits. The quadratic fit (Model 2) indicates that Schmidt’s peak performance occurs about the age of 30 while the cubic fit (Model 3) indicates that his peak performance occurs around the age of 33. How can we choose between the two models?
Underfitting and overfitting
In model building, there are two ways of misspecifying a model, called “underfitting” and “overfitting”, that are described in the context of this career trajectory example. First, it is important that the model includes all inputs that are helpful in explaining the variation in the response variable. Failure to include relevant inputs in the model will result in underfitting. In our example, age is the predictor variable and the possible inputs are age, age\(^2\), and age\(^3\). If we use Model 1, which includes only the input age, this particular model appears to underfit the data since it does not reflect the increasing and decreasing pattern in the home run rates that we see in Figure 12.6.
At the other extreme, one should be careful not to include too many inputs in the model. When one includes more inputs in the regression model than needed, one has overfitting. Model 3 possibly overfits the data, since it may not be necessary to represent a player’s trajectory by a cubic curve – perhaps a quadratic curve is sufficient. In an extreme situation, by increasing the degree of the polynomial function of age, one can find a fitted curve that goes through most of the points in the scatterplot. This would be a severe case of overfitting since it is unlikely that a player’s true career trajectory is represented by a polynomial of a high degree.
Cross-validation
How does one choose a suitable regression model that avoids the underfitting and overfitting problems described above? A general method of comparing models is called cross-validation. In this method, one partitions the dataset into two parts – the “training” and “testing” components. One initially fits each regression model to the training dataset. Then one uses each fitted model to predict the response variable in the testing dataset. The model that is better in predicting observations in the future testing dataset is the preferred model.
Let’s describe how one implements cross-validation for our career trajectory example. In the example, Mike Schmidt had a total of 8170 at-bats for 13 seasons. One randomly divides these 8170 at-bats into two datasets – 4085 of the at-bats (and the associated home run and age variables) are placed in a training dataset and the remaining at-bats become the testing dataset. Let \(\{(x_i^{(1)}, y_i^{(1)})\}\) denote the age and home run rate variables from the training dataset and \(\{(x_i^{(2)}, y_i^{(2)})\}\) denote the corresponding variables from the testing dataset.
Suppose one considers the use of Model 1 where the home run rate \(Y_i^{(1)} \sim \textrm{Normal}(\mu_i, \sigma)\) with mean rate \(\mu_i = \beta_0 + \beta_1 (x_i^{(1)} - 30)\). One places a weakly informative prior on the vector of parameters \((\beta_0, \beta_1, \sigma)\) and defines the likelihood using the training data. One uses JAGS to simulate from the posterior distribution and obtains the fitted regression \[\begin{equation*} \mu = \tilde{\beta}_0 + \tilde{\beta}_1 (x - 30), \end{equation*}\] where \(\tilde{\beta}_0\) and \(\tilde{\beta}_1\) are the posterior means of the regression intercept and slope respectively.
One now uses this fitted regression to predict values of the home run rate from the testing dataset. One could simulate predictions from the posterior predictive distribution, but for simplicity, suppose one is interested in making a single prediction. For the \(i\)-th value of age \(x_i^{(2)}\) in the testing dataset, our “best” prediction of the \(i\)-th home run rate from Model 1 would be \(\tilde{y}_i^{(2)}\) where \[\begin{equation*} \tilde{y}_i^{(2)} = \tilde{\beta}_0 + \tilde{\beta}_1 (x_i^{(2)} - 30). \end{equation*}\] If one performs this computation for all ages, one obtains a set of predictions \(\{\tilde{y}_i^{(2)}\}\) that one would like to be close to the actual home run rates \(\{y_i^{(2)}\}\) in the testing dataset. It is unlikely that each prediction will be exactly on target, so one considers the prediction error, the difference between the prediction and the response, \(|\tilde{y}_i^{(2)} - y_i^{(2)}|\). One measures the closeness of the predictions by computing the sum of squared prediction errors (SSPE): \[\begin{equation} SSPE = \sum (\tilde{y}_i^{(2)} - y_i^{(2)})^2. \tag{12.9} \end{equation}\]
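As a small sketch of this computation, suppose x_test and y_test hold the ages and home run rates from the testing dataset and beta0_tilde and beta1_tilde are the posterior means estimated from the training dataset (all four names are hypothetical).
y_pred <- beta0_tilde + beta1_tilde * (x_test - 30)  # Model 1 predictions
SSPE <- sum((y_pred - y_test) ^ 2)                   # Equation (12.9)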
The measure \(SSPE\) describes how well the fitted model predicts home run rates in the testing dataset. One uses this measure to compare predictions from alternative regression models. Specifically, suppose each of the regression models (Model 1, Model 2, and Model 3) is fit to the training dataset and each of the fitted models is used to predict the home run rates of the testing dataset. Suppose the sums of squared prediction errors for the three fitted models are \(SSPE_1\), \(SSPE_2\), and \(SSPE_3\). The best model is the model corresponding to the smallest value of \(SSPE\). If this model turns out to be Model 2, then we say that Model 2 is best in the sense that it is best in predicting home run rates in a future or “out-of-sample” dataset.
Approximating cross-validation by DIC
The cross validation method of assessing model performance can be generally applied in many situations. However, there are complications in implementing cross validation in practice. One issue is how the data should be divided into the training and testing components. In our example, the data was divided into two datasets of equal size, but it is unclear if this division scheme is best in practice. Another issue is that the two datasets were divided using a random mechanism. The problem is that the predictions and the sum of squared prediction errors can depend on the random assignment of the two groups. That raises the question – is it necessary to perform cross validation to compare the predictive performance of two models?
A best regression model is one that provides the best predictions of the response variable in an out-of-sample or future dataset. Fortunately, it is not necessary in practice to go through the cross-validation process. It is possible to compute a measure, called the Deviance Information Criterion or DIC, from the simulated draws from the posterior distribution that approximates a model’s out-of-sample predictive performance. The description and derivation of the DIC measure is outside of the scope of this text – a brief description of this method is contained in the appendix. But we illustrate the use of the DIC measure for the career trajectory example. It can be applied generally and is helpful for comparing the predictive performance of several Bayesian models.
Example of model comparison
To illustrate the application of DIC, let’s return to the career trajectory example. As is usual practice, JAGS will be used to fit a specific Bayesian model. To fit the quadratic model \(M_2\), one writes the following JAGS model description.
At the sampling stage, the home run rates y[i] are assumed to be a quadratic function of the ages x[i], and at the prior stage, the regression coefficients beta0, beta1, beta2, and the precision phi are assigned weakly informative priors. The variable the_data is a list containing the observed home run rates, ages, and sample size.
modelString = "
model {
for (i in 1:N){
y[i] ~ dnorm(mu[i], phi)
mu[i] <- beta0 + beta1 * (x[i] - 30) +
beta2 * pow(x[i] - 30, 2)
}
beta0 ~ dnorm(0, 0.001)
beta1 ~ dnorm(0, 0.001)
beta2 ~ dnorm(0, 0.001)
phi ~ dgamma(0.001, 0.001)
}
"
d <- filter(sluggerdata,
Player == "Schmidt", AB >= 200)
the_data <- list(y = d$HR / d$AB,
x = d$Age,
N = 16)
The model is fit using the run.jags() function. To compute DIC, it is necessary to run multiple chains; the argument n.chains = 2 indicates that two chains will be used.
post2 <- run.jags(modelString,
n.chains = 2,
data = the_data,
monitor = c("beta0", "beta1",
"beta2", "phi"))
To compute DIC, the extract.runjags() function is applied to the runjags object post2. The “Penalized deviance” output is the value of DIC computed on the simulated MCMC output.
extract.runjags(post2, "dic")
Mean deviance: -88.98
penalty 4.817
Penalized deviance: -84.17
The value of DIC = \(-84.17\) for this single quadratic regression model is not meaningful by itself, but one compares values of DIC for competing models. Suppose one wishes to compare models \(M_1, M_2, M_3\) and a quartic regression where one represents the home run rate as a polynomial of fourth degree in the age. For each model, a JAGS script is written where the regression coefficients and the precision parameter are assigned weakly informative priors. The run.jags() function is applied to produce a posterior sample, and the extract.runjags() function with the dic argument is used to extract the value of DIC. Table 12.2 displays the values of DIC for the four regression models; a sketch of this workflow is given after the table. The "best" model is the model with the smallest value of DIC. Looking at the values in Table 12.2, one sees that the quadratic model has the smallest value of \(-84.2\). The interpretation is that the quadratic model is best in the sense that it will provide the best out-of-sample predictions.
Table 12.2. DIC values for four regression models fit to Mike Schmidt’s home run rates.
Model | DIC |
---|---|
Linear | \(-80.4\) |
Quadratic | \(-84.2\) |
Cubic | \(-82.1\) |
Quartic | \(-79.0\) |
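A sketch of how this comparison might be carried out is below; post1, post3, and post4 denote hypothetical runjags fits of the linear, cubic, and quartic models, each produced with n.chains = 2 as was done above for the quadratic model.
extract.runjags(post1, "dic")   # linear model
extract.runjags(post2, "dic")   # quadratic model (fit above)
extract.runjags(post3, "dic")   # cubic model
extract.runjags(post4, "dic")   # quartic model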
12.4 Bayesian Logistic Regression
12.4.1 Example: U.S. women labor participation
The University of Michigan Panel Study of Income Dynamics (PSID) is the longest running longitudinal household survey in the world. The study began in 1968 with a nationally representative sample of over 18,000 individuals living in 5000 families in the United States. Information on these individuals and their descendants has been collected continuously, including data covering employment, income, wealth, expenditures, health, marriage, childbearing, child development, philanthropy, education, and numerous other topics.
The PSID 1976 survey has attracted particular attention since it interviewed wives in the households directly about their work in the previous year. The survey provides helpful self-reported data for studies of married women’s labor supply. A sample includes information on the family income exclusive of the wife’s income (in $1000) and the wife’s labor participation (yes or no). This PSID sample contains 753 observations and two variables. Table 12.3 provides the description of each variable in the PSID sample.
Table 12.3. The variable description for the PSID sample.
Variable | Description
---|---
LaborParticipation | Binary; the labor participation status of the wife: 1 = yes, 0 = no
FamilyIncome | Continuous; the family income exclusive of the wife’s income, in $1000 (1975 U.S. dollars)
Suppose one is interested in predicting a wife’s labor participation status from the family income exclusive of her income. In this example, one is treating labor participation as the response variable and the income variable as a predictor. Furthermore, the response variable is not continuous, but binary – either the wife is working or she is not. To analyze a binary response such as labor participation, one is interested in estimating the probability of a labor participation (yes) as a function of the predictor variable, family income exclusive of her income. This requires a new model that can express the probability of a yes as a function of the predictor variable.
Figure 12.7 displays a scatterplot of the family income against the labor participation status. Since the labor participation variable is binary, the points are jittered in the vertical direction. From this graph, we see that roughly half of the wives are working and it is difficult to see if the family income is predictive of the participation status.
Recall in Chapter 11, when one had a continuous-valued response variable and a single continuous predictor, the mean response \(\mu_i\) was expressed as a linear function of the predictor through an intercept parameter \(\beta_0\) and a slope parameter \(\beta_1\): \[\begin{equation} \mu_i = \beta_0 + \beta_1 x_i. \tag{12.10} \end{equation}\] Moreover, it was reasonable to use a Normal regression model where the response \(Y_i\) is Normally distributed with mean \(\mu_i\) given by the linear function in Equation (12.10): \[\begin{equation*} Y_i \mid \mu_i, \sigma \overset{ind}{\sim} \textrm{Normal}(\mu_i, \sigma), \,\,\, i = 1, \cdots, n. \end{equation*}\]
However, such a Normal density setup is not sensible for this labor participation example. For a binary response \(Y_i\), the mean is a probability \(\mu_i\) that falls in the interval from 0 to 1. Thus the model \(\mu_i = \beta_0 + \beta_1 x_i\) is not sensible since the linear component \(\beta_0 + \beta_1 x_i\) is on the real line, not in the interval [0, 1].
The upcoming subsections describe how to construct a regression model for binary responses using a linear function. In addition, they describe how to interpret the regression coefficients, how to specify prior distributions for these coefficients, and how to simulate posterior samples for these models.
12.4.2 A logistic regression model
Recall from Chapter 1 and Chapter 7 the definition of odds – the odds is the ratio of the probability that some event will take place to the probability that the event will not take place. The notion of odds is used to represent the probability of the response in the regression model.
In the PSID example, let \(p_i\) be the probability of labor participation of married woman \(i\), and the corresponding odds of participation is \(\frac{p_i}{1 - p_i}\). The probability \(p_i\) falls in the interval [0, 1] and the odds is a positive real number. If one applies the logarithm transformation on the odds, one obtains a quantity, called a log odds or logit, that can take both negative and positive values on the real line. One obtains a linear regression model for a binary response by writing the logit in terms of the linear predictor.
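A small numerical illustration of the odds and logit transformations:
p <- c(0.10, 0.50, 0.75)
p / (1 - p)        # odds: 0.111, 1, 3
log(p / (1 - p))   # logits: about -2.20, 0.00, 1.10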
The binary response \(Y_i\) is assumed to have a Bernoulli distribution with probability of success \(p_i\).
\[\begin{equation}
Y_i \mid p_i \overset{ind}{\sim} \textrm{Bernoulli}(p_i), \,\,\, i = 1, \cdots, n.
\tag{12.11}
\end{equation}\]
The logistic regression model writes that the logit of the probability \(p_i\) is a linear function of the predictor variable \(x_i\):
\[\begin{equation}
\textrm{logit}(p_i) = \textrm{log}\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i.
\tag{12.12}
\end{equation}\]
It is more challenging to interpret the regression coefficients in a logistic model. In simple linear regression with one predictor, the interpretation of the intercept and the slope is relatively straightforward, as the linear function is directly assigned to the mean \(\mu_i\). With the logit function as in Equation (12.12), one sees that the regression coefficients \(\beta_0\) and \(\beta_1\) are directly related to the log odds \(\textrm{log}\left(\frac{p_i}{1 - p_i}\right)\) instead of \(p_i\).
For example, the intercept \(\beta_0\) is the log odds \(\textrm{log}\left(\frac{p_i}{1 - p_i}\right)\) for observation \(i\) when the predictor takes a value of 0. In the PSID example, it refers to the log odds of labor participation of a married woman, whose family has 0 family income exclusive of her income.
The slope \(\beta_1\) refers to the change in the expected log odds of labor participation of a married woman who has an additional $1000 family income exclusive of her own income.
By rearranging the logistic regression Equation (12.12), one expresses the regression as a nonlinear equation for the probability of success \(p_i\): \[\begin{eqnarray} \textrm{log}\left(\frac{p_i}{1 - p_i}\right) &=& \beta_0 + \beta_1 x_i \nonumber \\ \frac{p_i}{1 - p_i} &=& \exp(\beta_0 + \beta_1 x_i) \nonumber \\ \end{eqnarray}\] \[\begin{equation} p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}. \tag{12.13} \end{equation}\] Equation (12.13) shows that the logit function guarantees that the probability \(p_i\) lies in the interval [0, 1].
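Equation (12.13) is easy to evaluate; the sketch below converts a value of the linear predictor into a probability, using purely illustrative values for the coefficients and the income.
inv_logit <- function(lp) exp(lp) / (1 + exp(lp))
inv_logit(0.36 - 0.005 * 40)   # about 0.54 for illustrative beta0, beta1, and x = 40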
With these building blocks, one proceeds to prior specification and MCMC posterior inference of this logistic regression model. Note that a prior distribution is needed for the set of regression coefficient parameters: \((\beta_0, \beta_1)\). In the next subsections, a conditional means prior approach is explored in this prior construction and the subsequent MCMC inference.
12.4.3 Conditional means priors and inference through MCMC
A conditional means prior can be constructed in a straightforward manner for logistic regression with a single predictor. This type of prior was previously constructed in Chapter 11 for a Normal regression problem in the gas bill example. A weakly informative prior can always be used when little prior information is available. In contrast, the conditional means prior allows the data analyst to incorporate useful prior information about the probabilities at particular observation values.
The task is to construct a prior on the vector of regression coefficients \(\beta = (\beta_0, \beta_1)\).
Since the linear component \(\beta_0 + \beta_1 x\) is only indirectly related to the probability \(p\), it is generally difficult to think directly about plausible values of the intercept \(\beta_0\) and slope \(\beta_1\), and about the relationship between these regression parameters.
Instead of constructing a prior on \(\beta\) directly, a conditional means prior indirectly specifies a prior by constructing priors on the probability values \(p_1^*\) and \(p_2^*\) corresponding to two predictor values \(x_1^*\) and \(x_2^*\).
By assuming independence of one’s beliefs about \(p_1^*\) and \(p_2^*\), this implies a prior on the probability vector \((p_1^*, p_2^*)\). Since the regression coefficients \(\beta_0\) and \(\beta_1\) are functions of the probability values, this process essentially specifies a prior on the vector \(\beta\).
A conditional means prior
To construct a conditional means prior, one considers two values of the predictor \(x_1^*\) and \(x_2^*\) and constructs independent Beta priors for the corresponding probabilities of success.
For the first predictor value \(x_1^*\), construct a Beta prior for the probability \(p^*_1\) with shape parameters \(a_1\) and \(b_1\).
Similarly, for the second predictor value \(x_2^*\), construct a Beta prior for the probability \(p^*_2\) with shape parameters \(a_2\) and \(b_2\).
If one’s beliefs about the probabilities \(p^*_1\) and \(p^*_2\) are independent, the joint prior for the vector \((p^*_1, p^*_2)\) has the form \[\begin{equation*} \pi(p^*_1, p^*_2) = \pi(p^*_1) \pi(p^*_2). \end{equation*}\]
The prior on \((p^*_1, p^*_2)\) implies a prior on the regression coefficient vector (\(\beta_0, \beta_1)\). First write the two conditional probabilities \(p^*_1\) and \(p^*_2\) as function of the regression coefficient parameters \(\beta_0\) and \(\beta_1\), as in Equation (12.13). By solving these two equations for the regression coefficient parameters, one expresses each regression parameter as a function of the conditional probabilities.
\[\begin{equation} \beta_1 = \frac{\textrm{logit}(p_1^*) - \textrm{logit}(p_2^*)}{x_1^* - x_2^*}, \tag{12.14} \end{equation}\]
\[\begin{equation} \beta_0 = \textrm{log}\left(\frac{p^*_1}{1-p^*_1}\right) - \beta_1 x_1^*. \tag{12.15} \end{equation}\]
Let’s illustrate constructing a conditional means prior for our example. Consider two different values of family income (exclusive of the wife’s income), say $20,000 and $80,000 (the predictor is in $1000 units). For each family income, a Beta prior is constructed for the probability of the wife’s labor participation. As in Chapter 7, a Beta prior is assessed by specifying two quantiles of the prior distribution and finding the values of the shape parameters that match those specific quantile values.
Consider the labor participation probability \(p_1^*\) for the value \(x = 20\), corresponding to a $20,000 family income. Suppose one believes the median of this probability is 0.10 and the 90th percentile is equal to 0.2. Using the R function beta_select(), this belief is matched to a Beta prior with shape parameters 2.52 and 20.08. Next, consider the participation probability \(p_2^*\) for the value \(x = 80\), corresponding to an $80,000 family income. The median and 90th percentile of this probability are thought to be 0.7 and 0.8, respectively, and this information is matched to a Beta prior with shape parameters 20.59 and 9.01.
Figure 12.8 illustrates the conditional means prior for this example. Each bar displays the 90% interval estimate for the participation probability for a particular value of the family income.
Assuming independence of the prior beliefs about the two probabilities, one represents the joint prior density function for (\(p_1^*, p_2^*\)) as the product of densities \[\begin{equation} \pi(p_1^*, p_2^*) = \pi_B(p_1^*, 2.52, 20.08)\pi_B(p_2^*, 20.59, 9.01), \tag{12.16} \end{equation}\] where \(\pi_B(y, a, b)\) denotes the Beta density with shape parameters \(a\) and \(b\).
As said earlier, this prior distribution on the two probabilities implies a prior distribution on the regression coefficients. To simulate pairs \((\beta_0, \beta_1)\) from the prior distribution, one simulates values of the probabilities \(p_1^*\) and \(p_2^*\) from the independent Beta distributions in Equation (12.16) and applies the expressions in Equation (12.14) and Equation (12.15). One then obtains prior draws of the regression coefficient pair \((\beta_0, \beta_1)\). Figure 12.9 displays a scatterplot of the simulated pairs \((\beta_0, \beta_1)\) from the prior. Note that, although the two probabilities \(p_1^*\) and \(p_2^*\) have independent priors, the implied prior on the regression coefficient vector \(\beta\) indicates strong negative dependence between the intercept \(\beta_0\) and the slope \(\beta_1\).
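A sketch of this prior simulation, using the assessed Beta shape parameters and the two incomes \(x_1^* = 20\) and \(x_2^* = 80\):
S <- 1000
p1 <- rbeta(S, 2.52, 20.08)               # prior draws of p1*
p2 <- rbeta(S, 20.59, 9.01)               # prior draws of p2*
beta1 <- (log(p1 / (1 - p1)) -
          log(p2 / (1 - p2))) / (20 - 80) # Equation (12.14)
beta0 <- log(p1 / (1 - p1)) - beta1 * 20  # Equation (12.15)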
Inference using MCMC
Once the prior on the regression coefficients is defined, it is straightforward to simulate from the Bayesian logistic model by MCMC and the JAGS software.
The JAGS script
As usual, the first step in using JAGS is writing a script defining the logistic regression model, and saving the script in the character string modelString.
modelString <-"
model {
## sampling
for (i in 1:N){
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 + beta1*x[i]
}
## priors
beta1 <- (logit(p1) - logit(p2)) / (x1 - x2)
beta0 <- logit(p1) - beta1 * x1
p1 ~ dbeta(a1, b1)
p2 ~ dbeta(a2, b2)
}
"
In the sampling section of the script, the loop goes from 1 to N, where N is the number of observations with index i. Since \(Y_i \mid p_i \overset{ind}{\sim} \textrm{Bernoulli}(p_i)\), one uses dbern() for y[i]. In addition, since \(\textrm{logit}(p_i) = \beta_0 + \beta_1 x_i\), logit() is used to establish this linear relationship.
In the prior section of the script, one expresses beta0 and beta1 according to the expressions in Equation (12.14) and Equation (12.15), in terms of p1, p2, x1, and x2. One also assigns Beta priors to p1 and p2, according to the conditional means prior discussed previously. Recall that the Beta distribution is represented by dbeta() in the JAGS code, where the arguments are the associated shape parameters.
Define the data and prior parameters
The next step is to provide the observed data and the values for the prior parameters. In the R script below, a list the_data contains the vector of binary labor participation status values, the vector of family incomes (in $1000), and the number of observations. It also contains the shape parameters of the Beta priors on \(p_1^*\) and \(p_2^*\) and the values of the two incomes, \(x_1^*\) and \(x_2^*\).
y <- as.vector(LaborParticipation$Participation)
x <- as.vector(LaborParticipation$FamilyIncome)
N <- length(y)
the_data <- list("y" = y, "x" = x, "N" = N,
"a1" = 2.52, "b1" = 20.08,
"a2" = 20.59, "b2" = 9.01,
"x1" = 20, "x2" = 80)
Generate samples from the posterior distribution
The run.jags() function in the runjags package generates posterior samples by the MCMC algorithm using the JAGS software. The script below runs one MCMC chain with an adaption period of 1000 iterations, a burn-in period of 5000 iterations, and an additional set of 5000 iterations to be simulated. By using the argument monitor = c("beta0", "beta1"), one keeps track of the two regression coefficient parameters. The output variable posterior contains a matrix of simulated draws.
posterior <- run.jags(modelString,
n.chains = 1,
data = the_data,
monitor = c("beta0", "beta1"),
adapt = 1000,
burnin = 5000,
sample = 5000)
MCMC diagnostics and summarization
Once the simulated values are found, one applies several diagnostic procedures to check if the simulations appear to converge to the posterior distribution. Figures 12.10 and 12.11 display MCMC diagnostic plots for the regression parameters \(\beta_0\) and \(\beta_1\). From viewing these graphs, it appears that there is a small amount of autocorrelation in the simulated draws and the draws appear to have converged to the posterior distributions.
By use of the print() function, posterior summaries are displayed for the regression parameters. One primary question is whether the family income is predictive of the labor participation status, and so the key parameter of interest is the regression slope \(\beta_1\). From the output, one sees that the posterior median for \(\beta_1\) is \(-0.0052\) and a 95% interval estimate is \((-0.0143, 0.0029)\). This tells us several things. First, since the regression slope is negative, there is a negative relationship between family income and labor participation – wives from families with larger income (exclusive of the wife’s income) tend not to work. Second, this relationship does not appear to be strong since the value 0 is included in the 95% interval estimate.
print(posterior, digits = 3)
Lower95 Median Upper95 Mean SD Mode MCerr
beta0 0.101 0.358 0.59 0.36 0.125 -- 0.00214
beta1 -0.0143 -0.00524 0.00285 -0.00532 0.00438 -- 7.69e-05
One difficulty in interpreting a logistic regression model is that the linear component \(\beta_0 + \beta_1 x\) is on the logit scale. It is easier to understand the fitted model when one expresses the model in terms of the probability of participation \(p_i\): \[\begin{equation} p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}. \tag{12.17} \end{equation}\] For a specific value of the predictor \(x_i\), it is straightforward to simulate the posterior distribution of the probability \(p_i\). If \((\beta_0^{(s)}, \beta_1^{(s)})\) represents a simulated draw from the posterior of \(\beta\), and one computes \(p_i^{(s)}\) using Equation (12.13) from the simulated draw, then \(p_i^{(s)}\) is a simulated draw from the posterior of \(p_i\).
This process was used to obtain simulated samples from the posterior distribution of the probability \(p_i\) for the income variable values 10, 20, …, 70. In Figure 12.12 the posterior medians of the probabilities \(p_i\) are displayed as a line graph and 90% posterior interval estimates are shown as vertical bars. The takeaway message from this figure is that the probability of labor participation is close to one-half and this probability slightly decreases as the family income increases. Also note that the length of the posterior interval estimate increases for larger family incomes – this is expected since much of the data is for small income values.
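A sketch of the computation behind Figure 12.12 is below, written in the same style as the prediction function of the next subsection; post is assumed to be the matrix of simulated draws with the beta0 draws in the first column and the beta1 draws in the second.
prob_interval <- function(x, post){
  lp <- post[ , 1] + x * post[ , 2]
  p <- exp(lp) / (1 + exp(lp))
  quantile(p, c(0.05, 0.50, 0.95))
}
out_p <- sapply(seq(10, 70, by = 10),
                prob_interval, post)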
12.4.4 Prediction
We have considered learning about the probability \(p_i\) of labor participation for a specific income value \(x^*_i\). A related problem is to predict the fraction of labor participation for a sample of \(n\) women with a specific family income. If \(\tilde{y}_i\) represents the number of women who work among a sample of \(n\) with family income \(x_i\), then one would be interested in the posterior predictive distribution of the fraction \(\tilde{y}_i / n\).
One represents this predictive density of \(\tilde{y}_i\) as \[\begin{equation} f(\tilde{Y}_i = \tilde{y}_i \mid y) = \int \pi(\beta \mid y) f(\tilde{y}_i \mid \beta) d\beta, \tag{12.18} \end{equation}\] where \(\pi(\beta \mid y)\) is the posterior density of \(\beta = (\beta_0, \beta_1)\) and \(f(\tilde{y}_i \mid \beta)\) is the Binomial sampling density of \(\tilde{y}_i\) conditional on the regression vector \(\beta\).
A strategy for simulating the predictive density is implemented similar to what was done in the linear regression setting. Suppose that one focuses on the predictor value \(x^*_i\) and one wishes to consider a future sample of \(n = 50\) of women with that income level. The simulated draws from the posterior distribution of \(\beta\) are stored in a matrix post. For each of the simulated parameter draws, one computes the probability of labor participation \(p^{(s)}\) for that income level – these values represent posterior draws of the probability \(\{p^{(s)}\}\). Given those probability values, one simulates Binomial samples of size \(n = 50\) where the probability of successes are given by the simulated \(\{p^{(s)}\}\) – the variable \(\tilde{y}\) represents the simulated Binomial variable. By dividing \(\tilde{y}\) by \(n\), one obtains simulated proportions of labor participation for that income level. Each group of simulated draws from the predictive distribution of the labor proportion is summarized by the median, 5th, and 95th percentiles.
In the following R script, the function prediction_interval() obtains the quantiles of the prediction distribution of \(\tilde{y}/ n\) for a fixed income level, and the sapply() function computes these predictive quantities for a range of income levels. Figure 12.13 graphs the predictive median and interval bounds against the income variable. By comparing Figure 12.12 and Figure 12.13, note that one is much more certain about the probability of labor participation than the fraction of labor participation in a future sample of 50.
prediction_interval <- function(x, post, n = 20){
  # linear predictor and participation probability at income x
  lp <- post[, 1] + x * post[, 2]
  p <- exp(lp) / (1 + exp(lp))
  # simulate Binomial counts of size n and return quantiles of the fraction
  y <- rbinom(length(p), size = n, prob = p)
  quantile(y / n,
           c(.05, .50, .95))
}
out <- sapply(seq(10, 70, by = 10),
prediction_interval, post, n = 50)
12.5 Exercises
- Olympic Swimming Times
The dataset olympic_butterfly.csv contains the winning time in seconds for the men’s and women’s 100m butterfly race for the Olympics from 1964 through 2016. Let \(y_i\) and \(x_i\) denote the winning time and year for the \(i\)-th Olympics. In addition, let \(w_i\) denote an indicator variable that is 1 for the women’s race and 0 for the men’s race. Consider the regression model \(Y_i \sim \textrm{Normal}(\mu_i, \sigma)\), where the mean is given by
\[
\mu_i = \beta_0 + \beta_1 (x_i - 1964) + \beta_2 w_i.
\]
- Interpret the parameter \(\beta_0\) in terms of the winning time in the race.
- Interpret the parameter \(\beta_0 + \beta_2\).
- Interpret the parameter \(\beta_0 + 8 \beta_1\).
- Interpret the parameter \(\beta_0 + 8 \beta_1 + \beta_2\).
- Olympic Swimming Times (continued)
Consider the regression model for the 100m Olympic butterfly race times described in Exercise 1. Suppose the regression parameters \(\beta_0, \beta_1, \beta_2\) and the precision parameter \(\phi = 1 / \sigma^2\) are assigned weakly informative priors.
- Using JAGS, sample 5000 draws from the joint posterior distribution of all parameters.
- Construct 90% interval estimates for each of the regression coefficients.
- Based on your work, describe how the mean winning time in the butterfly race has changed over time. In addition, describe how the men’s times differ from the women’s times.
- Construct 90% interval estimates for the predictive residuals \(r_i = y_i - \tilde y_i\) where \(\tilde y_i\) is simulated from the posterior predictive distribution. Plot these interval estimates and comment on any interval that does not include zero.
- Olympic Swimming Times (continued)
For the 100m Olympic butterfly race times described in Exercise 1 consider the regression model where the mean race time has the form \[ \mu_i = \beta_0 + \beta_1 (x_i - 1964) + \beta_2 w_i + \beta_3 (x_i - 1964) w_i, \] where \(x_i\) denotes the year for the \(i\)-th Olympics and \(w_i\) denote an indicator variable that is 1 for the women’s race and 0 for the men’s race.
- Write the expression for the mean time for the men’s race and for the mean time for the women’s race. Using these expressions, interpret the parameters \(\beta_2\) and \(\beta_3\).
- Using weakly informative priors for all parameters, use JAGS to draw a sample of 5000 draws from the joint posterior distribution.
- Based on your work, is there evidence that the regression relationship between year and mean race time differs between men and women?
- Prices of Personal Computers
What factors determined the price of a personal computer in the early days of personal computing? A sample of 500 personal computer sales was collected from 1993 to 1995 in the United States. In addition to the sale price (the price in US dollars of 486 PCs), information on clock speed in MHz, size of hard drive in MB, size of RAM in MB, and premium status of the manufacturer (e.g. IBM, COMPAQ) was collected. The dataset is in ComputerPriceSample.csv
. Suppose one considers the regression model \(Y_i \sim \textrm{Normal}(\mu_i, \sigma)\) where
\[\begin{equation*}
\mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2},
\end{equation*}\]
Here \(y_i\) is the sale price, \(x_{i1}\) is the clock speed, and \(x_{i2}\) is the logarithm of the hard drive size.
- Using a weakly informative prior on \(\beta = (\beta_0, \beta_1, \beta_2)\) and \(\sigma\), use JAGS to produce a simulated sample of size 5000 from the posterior distribution on \((\beta, \sigma)\).
- Obtain 95% interval estimates for \(\beta_1\) and \(\beta_2\).
- On the basis of your work, are both clock speed and hard drive size useful predictors of the sale price?
5. Prices of Personal Computers (continued)
- Suppose a consumer is interested in a computer with a clock speed of 33 MHz and a 540 MB hard drive (so \(x_2 = \log 540 \approx 6.3\)). Simulate 5000 draws from the posterior distribution of the expected selling price \(\beta_0 + \beta_1 x_1 + \beta_2 x_2\) for a computer with this clock speed and hard drive size. Construct a 90% interval estimate for the expected sale price.
- Instead suppose the consumer wishes to predict the selling price of a computer with this clock speed and hard drive size. Simulate 5000 draws from the posterior predictive distribution and use these simulated draws to find a 90% prediction interval.
6. Salaries for Professors
A sample contains the 2008-09 nine-month academic salary for Assistant Professors, Associate Professors, and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college's administration to monitor salary differences between male and female faculty members. In addition to the nine-month salary (in US dollars), information on gender, rank (Assistant Professor, Associate Professor, Professor), discipline (A is “theoretical” and B is “applied”), years since PhD, and years of service was collected. The dataset is in ProfessorSalary.csv
. Suppose that the salary of the \(i\)-th professor, \(y_i\), is distributed Normal with mean \(\mu_i\) and standard deviation \(\sigma\), where the mean is given by
\[\begin{equation*} \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}, \end{equation*}\] where \(x_{i1}\) is the years of service and \(x_{i2}\) is the gender (where 1 corresponds to male and 0 to female).
- Assuming a weakly informative prior on \(\beta\) and \(\sigma\), use JAGS to simulate a sample of 5000 draws from the posterior distribution on \((\beta, \sigma)\).
- Simulate 1000 draws from the posterior of \(\beta_0 + 10 \beta_1\), the mean salary among all female professors with 10 years of service.
- Simulate 1000 draws from the posterior of the mean salary of male professors with 10 years of service, \(\beta_0 + 10 \beta_1 + \beta_2\).
- By comparing interval estimates constructed from the simulated draws in parts (b) and (c), is there a substantial difference in the mean salaries of male and female professors with 10 years of service?
7. Salaries for Professors (continued)
- Suppose the college is interested in predicting the salary of a female professor with 10 years of service. By simulating 5000 draws from the posterior predictive distribution, construct a 90% prediction interval for this salary.
- Use a similar method to obtain a 90% prediction interval for the salary of a male professor with 10 years of service.
8. Graduate School Admission
What factors determine admission to graduate school? In a study, data on 400 graduate school admission cases were collected. Admission is a binary response, with 0 indicating not admitted and 1 indicating admitted. Moreover, the applicant's GRE score and undergraduate grade point average (GPA) are available. The dataset is in GradSchoolAdmission.csv
(GRE score is out of 800).
Let \(p_i\) denote the probability that the \(i\)-th student is admitted. Consider the logistic model
\[
\log \left( \frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2},
\]
where \(x_{i1}\) and \(x_{i2}\) are respectively the GRE score and the GPA for the \(i\)-th student.
- Assuming weakly informative priors on \(\beta_0, \beta_1\), and \(\beta_2\), write a JAGS script defining the Bayesian model.
- Take a sample of 5000 draws from the posterior distribution of \(\beta = (\beta_0, \beta_1, \beta_2)\).
- Consider a student with a 550 GRE score and a GPA of 3.50. Construct a 90% interval estimate for the probability that this student is admitted to graduate school.
- Construct a 90% interval estimate for the probability a student with a 500 GRE score and a 3.2 GPA is admitted to graduate school.
9. Graduate School Admission (continued)
Consider the logistic model described in Exercise 8, where the logit probability of being admitted to graduate school is a linear function of the student's GRE score and GPA. It is assumed that JAGS is used to obtain a simulated sample from the posterior distribution of the regression vector.
- Consider a student with a 580 GRE score. Construct 90% posterior interval estimates for the probability that this student achieves admission for GPA values equally spaced from 3.0 to 3.8. Graph these posterior interval estimates as a function of the GPA.
- Consider a student with a 3.4 GPA. Find 90% interval estimates for the probability this student is admitted for GRE score values equally spaced from 520 to 700. Graph these interval estimates as a function of the GRE score.
10. Personality Determinants of Volunteering
In a study of the personality determinants of volunteering for psychological research, a subject's neuroticism (scale from the Eysenck personality inventory), extraversion (scale from the Eysenck personality inventory), gender, and volunteering status were collected. One wishes to learn which personality characteristics affect a person's decision to volunteer. The dataset is in Cowles.csv
.
Let \(p_i\) denote the probability that the \(i\)-th subject elects to volunteer. Consider the logistic model
\[
\log \left( \frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2},
\]
where \(x_{i1}\) and \(x_{i2}\) are respectively the neuroticism and extraversion measures for the \(i\)-th subject.
- Assuming weakly informative priors on \(\beta_0, \beta_1\), and \(\beta_2\), write a JAGS script defining the model and draw a sample of 5000 draws from the posterior distribution of \(\beta = (\beta_0, \beta_1, \beta_2)\).
- By inspecting the locations of the posterior distributions of \(\beta_1\) and \(\beta_2\), which personality characteristic is most important in determining a person’s volunteering choice?
- Let \(O = p / (1 - p)\) denote the odds of volunteering. Construct a 90% interval estimate for the odds that a subject with a neuroticism score of 12 and an extraversion score of 13 elects to volunteer.
11. The Divide by Four Rule
Suppose one considers the logistic model \(\log \left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x\). This model is rewritten as \[ p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}. \]
- Show that the derivative of \(p\) with respect to \(x\) is written as \[ \frac{dp}{dx} = p (1 - p) \beta_1. \]
- Suppose the probability is close to the value 0.5. Using part (a), what is the approximate derivative of \(p\) with respect to \(x\) in this region?
- Fill in the blank in the following sentence. In this logistic model, the quantity \(\beta_1 / 4\) can be interpreted as the change in the ________ when \(x\) increases by one unit.
- Suppose one is interested in fitting the logistic model \(\log \frac{p}{1-p} = \beta_0 + \beta_1 x\) where \(x\) is the number of study hours and \(p\) is the probability of passing an exam. One obtains the fitted model \[ \log \frac{\hat p}{1-\hat p} = -1 + 0.2 x. \] Using your work in parts (b) and (c), what is the (approximate) change in the fitted pass probability if a student studies an additional hour for the exam?
12. Football Field Goal Kicking
The data file football_field_goal.csv
contains data on field goal attempts for professional football kickers. Focus on the kickers who played during the 2015 season. Let \(y_i\) denote the response (success or failure) of a field goal attempt from \(x_i\) yards. One is interested in fitting the logistic model
\[
\log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_i,
\]
where \(p_i\) is the probability of a successful attempt.
- Using weakly informative priors on \(\beta_0\) and \(\beta_1\), use JAGS to take a simulated sample from the posterior distribution of \((\beta_0, \beta_1)\).
- Suppose a kicker is attempting a field goal from 40 yards. Construct a 90% interval estimate for the probability of a success.
- Suppose instead that one is interested in estimating the yardage \(x^*\) where the probability of a success is equal to 0.8. First express the yardage \(x^*\) as a function of \(\beta_0\) and \(\beta_1\), and then find a 90% interval estimate for \(x^*\).
- Suppose 50 field goals are attempted at a distance of 40 yards. Simulate from the posterior predictive distribution to construct a 90% interval estimate for the number of successful attempts.
13. Predicting Baseball Batting Averages
The data file batting_2018.csv
contains batting data for every player in the 2018 Major League Baseball season. The variables AB.x
and H.x
in the dataset contain the number of at-bats (opportunities) and number of hits of each player in the first month of the baseball season.
The variables AB.y
and H.y
in the dataset contain the at-bats and hits of each player for the remainder of the season.
Take a random sample of size 50 from batting_2018.csv
. Suppose one is interested in predicting the players’ batting averages \(H.y / AB.y\) for the remainder of the season. Consider the following three estimates:
- Individual Estimate: Use the player’s first month batting average \(H.x / AB.x\).
- Pooled Estimate: Use the pooled estimate \(\sum H.x / \sum AB.x\).
- Compromise Estimate: Use the shrinkage estimate
\[ \frac{AB.x}{AB.x + 135} \frac{H.x}{AB.x} + \frac{135}{AB.x + 135} \frac{\sum H.x}{\sum AB.x}. \]
For your sample, compute values of the individual, pooled, and compromise estimates. For each set of estimates, compute the sum of squared prediction errors, where the prediction error is defined to be the difference between the estimate and the batting average in the remainder of the season. Which estimate do you prefer? Why?
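To make the compromise formula concrete, here is a worked example with hypothetical numbers (not taken from the dataset). A player with \(H.x = 30\) hits in \(AB.x = 100\) at-bats, when the pooled first-month average is \(\sum H.x / \sum AB.x = 0.250\), has compromise estimate
\[ \frac{100}{235} (0.300) + \frac{135}{235} (0.250) \approx 0.271, \]
a value that shrinks the individual average 0.300 toward the pooled average 0.250.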
14. Predicting Baseball Batting Averages (continued)
In Exercise 13, for the \(i\)-th player in the sample of 50 one observes the number of hits \(y_i\) (variable H.x
) distributed Binomial with sample size \(n_i\) (variable AB.x
) and probability of success \(p_i\). Consider the logistic model
\[
\log \left( \frac{p_i}{1-p_i} \right) = \gamma_i.
\]
Use JAGS to simulate from the following three models:
- Individual Model: Assume the \(\gamma_i\) values are distinct and assign each parameter a weakly informative normal distribution.
- Pooled Model: Assume that \(\gamma_1 = \cdots = \gamma_{50} = \gamma\) and assign the single \(\gamma\) parameter a weakly informative normal distribution.
- Partially Pooled Hierarchical Model: Assume that \(\gamma_i \sim \textrm{Normal}(\mu, \tau)\), where \(\mu\) and the precision \(P = 1 / \tau^2\) are assigned weakly informative distributions. (A minimal sketch of this model appears after this exercise.)
- Focus on a particular player corresponding to the index \(k\). Contrast 90% interval estimates for \(p_k\) using the individual, pooled, and partially pooled hierarchical models fit in parts (a), (b), and (c).
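The following is a minimal sketch of one possible JAGS formulation of the partially pooled hierarchical model in part (c); the data names y, n, and N (the number of players) and the particular weakly informative priors are assumptions made only for illustration.
modelString <- "
model {
  for (i in 1:N) {
    y[i] ~ dbin(p[i], n[i])      # hits out of at-bats for player i
    logit(p[i]) <- gamma[i]
    gamma[i] ~ dnorm(mu, P)      # partial pooling toward the common mean mu
  }
  mu ~ dnorm(0, 0.0001)          # weakly informative priors
  P ~ dgamma(0.001, 0.001)
}
"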
15. Comparing Career Trajectory Models
In Section 12.3, the Deviance Information Criterion (DIC) was used to compare four regression models for Mike Schmidt's career trajectory of home run rates. By fitting each model using JAGS and using the extract.runjags()
function, find the DIC values for the linear, cubic, and quartic models and compare your answers with the values reported in that section. For each model, assume that the regression parameters and the precision parameter have weakly informative priors.
16. Comparing Models for the CE Sample Example
For the Consumer Expenditure Survey (CE) example, the objective was to learn about a CU's expenditure from its income before taxes and its urban/rural status. There are four possible regression models depending on the inclusion or exclusion of each predictor. Use JAGS to fit each of the possible models and compute the value of DIC. For each model, assume that the regression parameters and the precision parameter have weakly informative priors. By comparing the DIC values, decide on the most appropriate model and compare your results with the discussion in Section 12.2.
17. Grades in a Calculus Class
Suppose one is interested in how the grade in a calculus class depends on the grade in the prerequisite math course. One wishes to fit the logistic model \[ \log \left( \frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i, \] where \(p_i\) is the probability that the \(i\)-th student receives an A and \(x_i\) represents the \(i\)-th student's grade in the previous math course (1 if an A was received, and 0 otherwise).
- Suppose one believes a Beta(12, 8) prior reflects the belief about the probability of an A for a student who has received an A in the previous math course, and a Beta(5, 15) prior reflects the belief about the probability of an A for a student who has not received an A in the previous course. Use JAGS to simulate 1000 draws from the prior of \((\beta_0, \beta_1)\).
- Data for 100 students is contained in the data file "calculus.grades.csv". Use JAGS to simulate 5000 draws from the posterior of \((\beta_0, \beta_1)\).
- Construct a 90% interval estimate for \(\beta_1\). Is there evidence that the grade in the prerequisite math course is helpful in explaining the grade in the calculus class?
18. Grades in a Calculus Class (continued)
The traditional way of fitting the logistic model in Exercise 17 is by maximum likelihood. The variables grade
and prev.grade
contain the relevant variables in the data frame calculus.grades
. The maximum likelihood fit is obtained with the function glm()
with the family = binomial
option.
fit <- glm(grade ~ prev.grade, data = calculus.grades,
family = binomial)
summary(fit)
Look at the estimates and associated standard errors of the regression coefficients and contrast these values with the posterior means and standard deviations from the informative prior Bayesian analysis in Exercise 17.
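As a pointer (not part of the original exercise), the estimates and standard errors can be read directly from the summary object of the fit above:
# matrix of maximum likelihood estimates and standard errors
summary(fit)$coefficients[, c("Estimate", "Std. Error")]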
19. Logistic Model to Compare Proportions
In Exercise 19 of Chapter 7, one was comparing proportions of science majors for two years at some liberal arts colleges. One can formulate this problem in terms of logistic regression. Let \(y_i\) denote the number of science majors out of a sample of \(n_i\) for the \(i\)-th year. One assumes that \(y_i\) is distributed Binomial\((n_i, p_i)\), where \(p_i\) satisfies the logistic model \[ \log \left(\frac{p_i}{1-p_i} \right) = \beta_0 + \beta_1 x_i, \] where \(x_i = 0\) for year 2005 and \(x_i = 1\) for year 2015.
- Assuming that \(\beta_0\) and \(\beta_1\) are independent with weakly informative priors, use JAGS to simulate a sample of 5000 draws from the posterior distribution. (In the JAGS script, dbin(p, n) denotes the Binomial distribution with probability p and sample size n; a minimal sketch of such a model block is shown after this exercise.)
- Find a 90% interval estimate for \(\beta_1\).
- Use the result in (b) to describe how the proportion of science majors has changed (on the logit scale) from 2005 to 2015.
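The sketch below shows one possible form of the JAGS model block for this two-year comparison; it is included only to illustrate the dbin() syntax, and the data names y, n, and x (each of length 2) as well as the particular priors are assumptions.
modelString <- "
model {
  for (i in 1:2) {
    y[i] ~ dbin(p[i], n[i])               # science majors out of n[i] sampled
    logit(p[i]) <- beta0 + beta1 * x[i]   # x = 0 for 2005, x = 1 for 2015
  }
  beta0 ~ dnorm(0, 0.0001)                # weakly informative priors
  beta1 ~ dnorm(0, 0.0001)
}
"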
20. Separation in Logistic Regression
Consider data in Table 12.4 that gives the number of class absences and the grade (1 for passing and 0 for failure) for ten students. If \(p_i\) denotes the probability the \(i\)th student passes the class, then consider the logistic model \[ \log \left( \frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i, \] where \(x_i\) is the number of absences.
Table 12.4. Number of absences and grades for ten students.
Student | Absences | Grade | Student | Absences | Grade |
---|---|---|---|---|---|
1 | 0 | 1 | 6 | 2 | 1 |
2 | 0 | 1 | 7 | 2 | 1 |
3 | 0 | 1 | 8 | 5 | 0 |
4 | 1 | 1 | 9 | 8 | 0 |
5 | 1 | 1 | 10 | 10 | 0 |
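For convenience, here is a minimal R sketch that enters the Table 12.4 data; the variable names absences and grade are chosen only for illustration.
absences <- c(0, 0, 0, 1, 1, 2, 2, 5, 8, 10)
grade    <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
class_data <- data.frame(absences, grade)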
- Using the glm() function as shown in Exercise 18, find maximum likelihood estimates of \(\beta_0\) and \(\beta_1\).
- Comment on the output of implementing the glm() function. (The strange behavior is related to the problem of separation in logistic regression.) Do some research on this topic and describe why one is observing this unusual behavior.
- By use of a weakly informative prior, use JAGS to simulate a sample of 5000 draws from the posterior distribution.
- Compute posterior means and standard deviations of \(\beta_0\) and \(\beta_1\) and compare your results with the traditional fit in part (a).