Introduction to Bayesian Inference

Author

Jim Albert

Published

September 25, 2022

1 Introduction

1.1 Preface

This is a collection of lecture notes used in the teaching of MATH 6480 Bayesian Inference at Bowling Green State University. It is intended to accompany the material in Bayesian Computation with R, second edition, published by Chapman and Hall.

1.2 A Brief History of Statistics

Statistics, the science of learning from data, is a relatively new discipline. One can divide the history of Statistics into three periods using the years 1900 and 1960.

In the early days of Statistics (before 1900), much of the statistical work was devoted to data analysis including the construction of graphical displays. There was little work done on inferential statistics. The foundations of Bayesian inference had been developed by Bayes and Laplace in the 18th century.
The foundations of statistical inference were developed in the period between 1900 and 1970. Karl Pearson developed the chi-square goodness of fit procedure around the year 1900 and R. A. Fisher developed the notions of sufficiency and maximum likelihood in this period. Statistical procedures are evaluated in terms of their long-run behavior in repeated sampling. For this reason, these procedures are known as frequentist methods. Properties such as unbiasedness and mean square error are used to evaluate procedures. Some prominent Bayesians such as Harold Jeffreys, Jimmie Savage, and I. J. Good made substantial contributions during this period, but the frequentist methods became the standard inferential methods in the statistician’s toolkit.
In the last 40 years, there has been a great development in new statistical methods, especially computational demanding methods such as the bootstrap and nonparametric smoothing. Due to the recent availability of high-speed computers together with new simulation-based fitted algorithms, Bayesian methods have become increasingly popular. In contrast to the middle period of statistics where frequentist methods were dominate, we currently live in a frequentist/Bayesian world where statisticians routinely use Bayesian methods in situations where this inferential perspective has particular advantages.

1.3 An Example

One fundamental inference problem is learning about the association pattern in a 2 by 2 contingency table. Suppose we sample data values that are categorized with respect to the presence and absence of two variables \(A\) and \(B\) and one observes the following table of counts.

	Var B
Var A	yes	no
yes	\(a\)	\(b\)
no	\(c\)	\(d\)

There are two common questions that one is interested in answering. First, is there a significant association structure in the table? Second, if variables \(A\) and \(B\) are indeed dependent, one is interested in estimating the strength of the association.

As an example, suppose one observes the following table counts.

	Var B
Var A	yes	no
yes	\(10\)	\(0\)
no	\(2\)	\(5\)

One constructs a statistical test of the hypothesis of independence to see if there is significant association in the table. The standard test of independence is based on the Pearson’s chi-squared test. One implements this testing procedure on R by the function chisq.test() and one observes the following output for these data.

Code

counts <- matrix(c(10, 2, 0, 5), 2, 2)
chisq.test(counts)

Warning in chisq.test(counts): Chi-squared approximation may be incorrect


    Pearson's Chi-squared test with Yates' continuity correction

data:  counts
X-squared = 6.971, df = 1, p-value = 0.008284

We note that the p-value of the test statistic is 0.008284 which indicates that there is significant evidence that the two variables are dependent. But we see a warning in the output saying that the accuracy of this p-value computation is in doubt.

What is going wrong? The chi-squared test is based on the test statistic \[ X = \sum \frac{(o - e)^2}{e}, \] where \(o\) and \(e\) represent, respectively, the observed cell count and estimated expected cell count under the independence assumption. Asymptotically, under the assumption of independence, \(X\) has a chi-squared distribution with one degree of freedom. The displayed p-value is the tail probability of a chi-square(1) random variable. When the cell counts are large, the distribution of \(X\) is approximately chi-square. But, when the counts are small (as in this example), the distribution of \(X\) may not be approximately chi-square(1) and so the accuracy of the p-value calculation is in doubt.

What can one do in this situation? A standard alternative test procedure is Fisher’s exact test where the p-value is computed based on the hypergeometric distribution. If one implements this test using the R function fisher.test(), one sees the following output.

Code

counts <- matrix(c(10, 2, 0, 5), 2, 2)
fisher.test(counts)


    Fisher's Exact Test for Count Data

data:  counts
p-value = 0.003394
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.164093      Inf
sample estimates:
odds ratio 
       Inf

One obtains a new p-value of 0.003394 which is significantly different from the ``large sample” p-value of 0.008284, indicating that the accuracy of the chi-square approximation was relatively poor. But this analysis raises new questions.

What sampling model?

There are different sampling models that can produce the observed table counts. For example, one may be taking a single random sample of a particular size and classifying each observation with respect to the two variables – this is the multinomial sampling model. Alternatively, one may be taking two independent samples; the “A-sample” is classified with respect to variable B, and a second “not A-sample” is also classified with respect to variable B – this is the product of binomials sampling model. Or perhaps one assumes that the observed margins of the table are fixed and the only random quantity is the one count in the top left of the table – this gives rise to the hypergeometric distribution under independence that is the basis for Fisher’s exact test.

Does the choice of sampling model matter?

If one is unsure about the sampling method that produces the table, one might hope that the test of significance is insensitive to the choice of sampling model. But this is not the case. The p-value is dependent on the choice of model. Actually, there is a debate among frequentists on the “proper” choice of sampling model in the test of independence.

What about estimating the association?

In this example, since the test of independence seems to be clearly rejected, the focus should be on the estimation of the association. A standard measure of association in a two by two table is the odds ratio defined by \[ \alpha = \frac{p_{11} p_{22}}{p_{12} p_{21}}, \] where (assuming a multinomial sampling model) \(p_{ij}\) is the probability of an observation in the \(i\)th row and \(j\)th column of the table. The maximum likelihood estimate of \(\alpha\) is given by \[ \hat \alpha = \frac{a d}{b c}. \] For these data, we observe \(a = 10\), \(b = 0\), \(c = 2\) and \(d = 5\), resulting in an infinite estimate for \(\alpha\). This is indicated by the fisher.exact() output. Also the standard (asymptotic) 95% confidence interval for \(\alpha\) for these data is given by (2.164093, \(\infty)\). We see that the observed zero count has made it difficult to get reasonable point and interval estimates of the odds ratio.

In this problem, we see some pitfalls in applying frequentist testing methods for this simple problem. Since there are small counts, standard methods relying on asymptotic approximations seem unsuitable. But the computation of an “exact” p-value is also unclear in this situation, since this computation relies on the sampling distribution which may be unknown.

Frequentist methods also perform poorly for the estimation problem since the infinite estimate of \(\alpha\) is not reasonable. If one thinks about the cell probabilities, then one would think that all of these probabilities would be positive, resulting in a finite value of the odds-ratio. But there are no ways to include these ``prior beliefs” that the probabilities are positive in the estimation problem. A standard ad-hoc solution to this problem is to add a fake count of 1/2 to each cell count, and estimate alpha by computing the maximum likelihood estimate on these adjusted counts: \[ \hat \alpha = \frac{(a+1/2) (d+1/2)}{(b+1/2) (c+1/2)}. \] But it is not obvious that 1/2 is the correct choice of fake count to get a “best” estimate of the odds ratio.

1.4 Pro and Cons of the Two Modes of Statistical Inference

The previous example illustrates some of the problems in applying frequentist inferential methods and so it desirable to consider the alternative Bayesian paradigm for inference. Here is a short list of some positive and negative aspects of the frequentist and Bayesian approaches to inference.

Positive Aspects of Frequentist Inference:

There are a number of good methods such as maximum likelihood and most powerful tests and good criteria for evaluating procedures such as unbiased and mean square error.
These methods are automatic to apply and have wide applicability.
One is generally interested in evaluating procedures by their performance in repeated sampling.

Negative Aspects of Frequentist Inference:

There is no general method for inference. One has to be clever to devise good statistical procedures in situations where standard methods fail.
Frequentist methods can perform poorly. For example, frequentist methods do not perform well for sparse contingency tables with one or more observed zeros.
One is unable to incorporate prior knowledge into the inference.

Positive Aspects of Bayesian Inference:

One has one recipe (Bayes’ rule) for statistical inference.
One can formally incorporate prior information into the analysis.
Nuisance parameters are easily handled in a Bayesian analysis.

Negative Aspects of Bayesian Inference:

Bayesian thinking requires more thought with the introduction of a prior distribution.
From a calculation perspective, it can be difficult to implement Bayesian methods, although powerful computational tools exist.