3  Bayes' Rule

3.1 Introduction

Here is a basic exposition of Bayes' rule. Suppose you have events \(E_1, ..., E_k\) that form a partition of the sample space, and let \(A\) be some other event. Suppose the following probabilities are known:

  1. \(P(E_i), i = 1, ..., k\)

  2. \(P(A | E_i), i = 1, ..., k\)

One is interested in computing the probabilities \(P(E_i | A), i = 1, ..., k\). By a standard manipulation of conditional probabilities, one obtains the result:

\[ P(E_i | A) = \frac{P(E_i) P(A | E_i)} {\sum_{j=1}^k P(E_j) P(A | E_j)} . \]
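This computation is straightforward to carry out in code. Below is a minimal sketch in Python (the function name `bayes_rule` is our own choice, not standard):

```python
def bayes_rule(priors, likelihoods):
    """Apply Bayes' rule: given priors P(E_i) and likelihoods P(A | E_i),
    return the posterior probabilities P(E_i | A)."""
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)  # this is P(A), by the law of total probability
    return [prod / total for prod in products]
```

Each posterior is the product of prior and likelihood, normalized so the results sum to 1.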

3.2 Illustrations of Bayes’ Rule

3.2.1 Example: Student Takes a Test

Suppose a student is taking a one-question multiple-choice test with four possible choices. Either the student knows the material or she doesn’t; denote these two possibilities by \(K\) and “not \(K\)”. Based on previous work, the teacher decides the student likely knows the material and so assigns \(P(K) = 0.7\). Therefore, the probability the student doesn’t know the material is \(P({\rm not} \, K) = 1 - 0.7 = 0.3.\) The student will take the one-question test and either get the question correct, which we denote by \(C\), or get it wrong. If the student knows the material, then the chance she will get the question correct is 90%. On the other hand, if the student doesn’t know the material, then she will guess and obtain the correct answer with probability 25%. Suppose the student takes the test and gets the question correct – what is the probability she really knows the material?

Here the events \(K\) and “not \(K\)” form a partition of the sample space and we are given the probabilities of these two events. The probability the student gets the question correct depends on whether she knows the material – we are given that

\[ P(C | K) = 0.9, \quad P(C | {\rm not} \, K) = 0.25. \]

Given that the student gets the question correct, we’re interested in determining the probability of \(K\); that is, we wish to compute \(P(K | C)\). By Bayes’ rule, this is given by

\[ P(K | C) = \frac{P(K) P(C | K)} {P(K) P(C | K) + P({\rm not} \, K) P(C | {\rm not} \, K)}. \]

Substituting in the given values, we obtain

\[ P(K | C) = \frac{0.7 \times 0.9} {0.7 \times 0.9 + 0.3 \times 0.25} = \frac{0.63}{0.63+0.075} = 0.894 . \]

Does this answer make sense? Before the test, the teacher believed that the student knew the material with probability 0.7. The student got the question correct which intuitively should increase the teacher’s probability that the student was a good student. Bayes’ rule allows us to explicitly compute how the probability should increase – the probability has increased from \(P(K) = 0.7\) to \(P(K | C) = 0.894\).
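As a quick numerical check, this update can be reproduced in a few lines of Python:

```python
# Student-test example: P(K) = 0.7, P(C | K) = 0.9, P(C | not K) = 0.25
prior_K, prior_notK = 0.7, 0.3
like_K, like_notK = 0.9, 0.25

numerator = prior_K * like_K                              # 0.63
denominator = prior_K * like_K + prior_notK * like_notK   # 0.705
posterior_K = numerator / denominator
print(round(posterior_K, 3))  # 0.894
```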

3.2.2 Example: Balls in a Bag

Suppose a bag contains exactly one white ball. You roll a die and, if the outcome of the roll is \(i\), you add \(i\) red balls to the bag. You then select a ball at random from the bag and observe that its color is red. What is the chance that the die roll was \(i\)?

In this example, let \(D_i\) denote the outcome that the die roll lands \(i\) and let \(R\) denote the outcome that a red ball is chosen. If we assume a fair die, then the six possible die rolls are equally likely, so \(P(D_1) = P(D_2) = ... = P(D_6) = 1/6\).

The probability of observing a red ball depends on the die roll. If the die roll is \(i\), one adds \(i\) red balls to the bag and the chance of choosing a red ball will be \(i/(i+1)\), so

\[ P(R | D_i) = \frac{i}{i+1}, \quad i = 1, ..., 6. \]

In this story, a red ball is observed and we are interested in computing \(P(D_i | R)\). By applying Bayes’ rule,

\[ P(D_i | R) = \frac{P(D_i) P(R | D_i)}{\sum_{j=1}^6 P(D_j) P(R | D_j)}. \]

By substituting the known quantities, we have

\[ P(D_i | R) = \frac{\frac{1}{6} \times \frac{i}{i+1}}{\sum_{j=1}^6 \frac{1}{6} \times \frac{j}{j+1}}. \]

A convenient way of computing the die roll probabilities is by use of a table. In Table 2.1, each row corresponds to a specific die roll – we call these alternatives in the table. For each die roll, the table gives the initial probability \(P(D_i)\) and the probability of observing red for that die roll \(P(R | D_i)\). The updated probability \(P(D_i | R)\) is proportional to the product \(P(D_i) P(R | D_i)\) and the products are shown in the table.

One computes the updated probabilities by dividing each product by the sum of the products. For example, the updated probability \(P(D_1 | R)\) is given by the product \((1/6)(1/(1+1)) = 1/12\) divided by the sum of the products \(1/12 + 2/18 + \cdots + 6/42 = 0.7345\), which gives 0.113.

| Alternative | Probability | \(P(R \mid {\rm Die \, Roll})\) | Product |
| --- | --- | --- | --- |
| \(D_1\) | 1/6 | 1/(1+1) | 1/12 |
| \(D_2\) | 1/6 | 2/(2+1) | 2/18 |
| \(D_3\) | 1/6 | 3/(3+1) | 3/24 |
| \(D_4\) | 1/6 | 4/(4+1) | 4/30 |
| \(D_5\) | 1/6 | 5/(5+1) | 5/36 |
| \(D_6\) | 1/6 | 6/(6+1) | 6/42 |

To make sense of these calculations, recall that we started by assuming that all six possible rolls of the die were equally likely. With the observation of a red ball, the updated probabilities are unequal and give greater support to the larger rolls of the die.
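The table computation can be mirrored in Python; using exact fractions (via the standard `fractions` module) avoids any rounding in the intermediate products:

```python
from fractions import Fraction

# Priors P(D_i) = 1/6 and likelihoods P(R | D_i) = i/(i+1), i = 1..6
priors = [Fraction(1, 6)] * 6
likelihoods = [Fraction(i, i + 1) for i in range(1, 7)]

products = [p * l for p, l in zip(priors, likelihoods)]
total = sum(products)  # the sum of the products, about 0.7345
posteriors = [prod / total for prod in products]

for i, post in enumerate(posteriors, start=1):
    print(i, round(float(post), 3))
```

The first updated probability comes out to about 0.113, matching the hand calculation, and the posteriors increase with the size of the die roll.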

3.3 New Terminology

In general, we are interested in learning about \(k\) different models that we denote by \(M_1, ..., M_k\). Initially, we have beliefs about the plausibility of these models that we express through the probabilities \(P(M_1), ..., P(M_k)\). We refer to these as prior probabilities since these express our opinions about the models before or prior to observing any data. Next, we observe data denoted by \(D\) that will give us information about the models. We are given the probabilities of each data outcome for each model, that is, \(P(D|M_1), ..., P(D|M_k)\); these are called the likelihoods.

Now a particular data result \(D\) is observed. How has this data result changed our beliefs about the \(k\) models? Bayes’ rule is the recipe for modifying the model probabilities. It says that the new probability for model \(M_i\) is proportional to the product of the prior probability and the likelihood:

\[ P(M_i | D) \propto P(M_i) P(D | M_i). \]

The updated probabilities \(P(M_i | D)\) are called posterior probabilities since they reflect our opinions about the models after observing the data. In words, we can write

POSTERIOR \(\propto\) PRIOR \(\times\) LIKELIHOOD.

A convenient way to perform the Bayes’ rule calculations is by use of a table similar to the one in the previous example. We illustrate the use of the new terminology and the table calculations in two additional examples.

3.4 Example: Testing for a Disease

Suppose you are one of the many people who are tested for a rare disease. From reports, you know that the incidence of this disease is 1 out of 5000. You take the test and there are two possible results: either you are told “ok” or “see your doctor for further checks.” How should you feel on the basis of these two results?

There are two possible models in this example: you are either “diseased” or “not diseased”. Assuming that you are a representative person from your community, your prior beliefs are that

\[ P({\rm diseased}) = \frac{1}{5000} = 0.0002, \quad P({\rm not \, diseased}) = \frac{4999}{5000} = 0.9998 . \]

The “data” in this example is the screening test result. There are two outcomes: either the test will be “positive” or “+”, which is some indication that you have the disease, or “negative” or “-” which is good news. From past experience, the screening test has 5% false positives and 2% false negatives.

This means that if you really don’t have the disease, the chance you get a positive result is 0.05; that is,

\[ P(+ \, {\rm result} | {\rm not \, diseased}) = 0.05, \quad P(- \, {\rm result} | {\rm not \, diseased}) = 0.95. \]

Similarly, if you really have the disease, the chance of an incorrect negative result is 0.02:

\[ P(- \, {\rm result} | {\rm diseased}) = 0.02, \quad P(+ \, {\rm result} | {\rm diseased}) = 0.98. \]

These values are the likelihoods – the probabilities of the data outcomes for each model.

Suppose you have a positive test result (\(+\)). We can find the new probabilities of diseased and not diseased by Bayes’ rule, which we present in table format in Table 2.2.

| Model | Prior | \(P(+ \mid {\rm Model})\) | Product | Posterior |
| --- | --- | --- | --- | --- |
| Not diseased | 0.9998 | 0.05 | 0.04999 | 0.9961 |
| Diseased | 0.0002 | 0.98 | 0.000196 | 0.0039 |

Before the test, your probability of having the disease was 0.0002 and after getting the positive test result, this probability has increased to 0.0039. This new probability is almost twenty times the initial probability, but you are still very unlikely to have the disease.

What if you received a negative test result? We repeat the Bayes’ rule calculations in Table 2.3 with a change in the likelihood values.

| Model | Prior | \(P(- \mid {\rm Model})\) | Product | Posterior |
| --- | --- | --- | --- | --- |
| Not diseased | 0.9998 | 0.95 | 0.949810 | 0.999996 |
| Diseased | 0.0002 | 0.02 | 0.000004 | 0.000004 |

The probability of having the disease has decreased from 0.0002 to 0.000004.

These results are often surprising to doctors and patients. It is difficult to update probabilities accurately in one’s head, and people faced with a positive test result typically believe much more strongly that they have the disease than the calculation warrants.
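Both test results can be checked with a short Python sketch (the helper name `update` is our own):

```python
def update(priors, likelihoods):
    """One application of Bayes' rule: posterior is proportional
    to prior times likelihood."""
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)
    return [prod / total for prod in products]

priors = [0.9998, 0.0002]                 # [not diseased, diseased]

post_pos = update(priors, [0.05, 0.98])   # positive test result
post_neg = update(priors, [0.95, 0.02])   # negative test result

print(round(post_pos[1], 4))   # P(diseased | +) is about 0.0039
print(round(post_neg[1], 6))   # P(diseased | -) is about 0.000004
```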

3.5 Example: The Three Door Problem

There is a famous probability problem, called the Three Door Problem or the Car and the Goats, that can be addressed by Bayes’ rule. On a TV show, a contestant is shown three numbered doors, Door 1, Door 2, and Door 3, where one door hides a car and the other two doors hide goats. The contestant is allowed to choose a door and win the corresponding prize. The contestant chooses Door 1. The host, who knows which door hides the car, then opens Door 2 to reveal a goat. The contestant is given the opportunity to change her selection. Should she switch her choice to Door 3?

In this example the unknown model is the location of the car. We let \(C_i\) denote the event that the car is behind Door \(i\), \(i = 1, 2, 3\). Initially, the contestant believes the car is equally likely to be behind each of the three doors, so

\[ P(C_1) = P(C_2) = P(C_3) = \frac{1}{3}. \]

Here the data is the event that the host showed Door 2 – we’ll call this event \(H\). We wish to find the new probabilities of \(C_1, C_2\) and \(C_3\) conditional on the new information \(H\). We put the given information in the “Bayes’ table”:

| Model | Prior | \(P(H \mid {\rm Model})\) | Product | Posterior |
| --- | --- | --- | --- | --- |
| \(C_1\) | 1/3 | \(P(H \mid C_1)\) | | |
| \(C_2\) | 1/3 | \(P(H \mid C_2)\) | | |
| \(C_3\) | 1/3 | \(P(H \mid C_3)\) | | |

Let’s consider the likelihood \(P(H | C_i)\) that represents the probability the host shows Door 2 if the car is behind Door \(i\). Remember that the contestant chose Door 1, so the host cannot choose this door.

  1. If the car is really behind door 1, the host can either show Door 2 or Door 3. We will assume that the probability he shows Door 2 is a number \(q\) between 0 and 1, so \(P(H | C_1) = q\).

  2. If the car is behind door 2, the host cannot show this door. So \(P(H | C_2) = 0\).

  3. If the car is behind door 3, the host cannot show this door -- he has to show Door 2. So \(P(H | C_3) = 1\).

We complete the table in Table 2.5 by filling in the likelihoods and computing the posterior probabilities.

| Model | Prior | \(P(H \mid {\rm Model})\) | Product | Posterior |
| --- | --- | --- | --- | --- |
| \(C_1\) | 1/3 | \(q\) | \(q/3\) | \(q/(q+1)\) |
| \(C_2\) | 1/3 | 0 | 0 | 0 |
| \(C_3\) | 1/3 | 1 | 1/3 | \(1/(q+1)\) |

Let’s return to our question. Remember the contestant chose Door 1 and has the opportunity to switch to Door 3. Given the data “host shows Door 2”, we have found that the probability the car is behind Door 1 is \(q/(q+1)\) and the probability the car is behind Door 3 is \(1/(q+1)\). The contestant should switch if the probability of \(C_3\) is greater than the probability of \(C_1\), that is,

\[ P(C_3 | H) > P(C_1 | H) \]

or

\[ \frac{1}{q+1} > \frac{q}{q+1} \]

which is true provided \(q < 1\); when \(q = 1\) the two probabilities are equal. So the contestant can never lower, and will typically increase, her probability of winning by switching. Remember \(q\) is the probability the host will show Door 2 instead of Door 3 if he has a choice. If we assume \(q = 1/2\), that is, the host chooses between the two doors at random, then the probability the car is behind Door 3 is \(1/(1/2 + 1) = 2/3\).
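The posterior probabilities as a function of \(q\) can be sketched in Python with exact fractions (the function name `three_door_posteriors` is ours):

```python
from fractions import Fraction

def three_door_posteriors(q):
    """Posteriors of C1, C2, C3 given H = 'host opens Door 2',
    where q = P(host opens Door 2 | car behind Door 1)."""
    priors = [Fraction(1, 3)] * 3
    likelihoods = [q, Fraction(0), Fraction(1)]   # P(H|C1), P(H|C2), P(H|C3)
    products = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(products)
    return [prod / total for prod in products]

# Host chooses at random when the car is behind Door 1 (q = 1/2):
# the posteriors are 1/3, 0, 2/3, so switching doubles her chance.
posts = three_door_posteriors(Fraction(1, 2))
```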

3.6 Sequential Learning

A machine in a small factory is producing a particular automotive component. Most of the time (specifically, 90% from historical records), the machine is working well and produces 95% good parts. Some days, the machine doesn’t work as well (it’s broken) and produces only 70% good parts. A worker inspects the first dozen parts produced by this machine on a particular morning and obtains the following results (\(g\) represents a good component and \(b\) a bad component):

\[ g, b, g, g, g, g, g, g, g, b, g, b. \]

The worker is interested in assessing the probability the machine is working well.

In this problem there are two models – either the machine is working well (“working” for short) or it is “broken”.

Based on the historical data, the worker assigns prior probabilities of 0.90 and 0.10 to the models “working” and “broken”. The data are the results of the inspection on the 12 parts. To understand the relationship between the data and the models, we compute the sampling probabilities, the probabilities of each data outcome for each model. If the machine is working, the probabilities of a good (g) part and a bad (b) part are 0.95 and 0.05, respectively. So

\[ P(g | {\rm working}) = 0.95,\, P(b | {\rm working}) = 0.05 . \]

If instead the machine is broken, the probabilities of good and bad part are 0.70 and 0.30, respectively:

\[ P(g | {\rm broken}) = 0.70, \, P(b | {\rm broken}) = 0.30 . \]

Now we’re ready to do the Bayes’ rule computation. The outcomes of twelve inspections of parts are the data:

\[ {\rm DATA} = \{g, b, g, g, g, g, g, g, g, b, g, b\}. \]

The likelihoods are the probabilities of this data result for each of the two models. Assuming independence of individual outcomes, the likelihood of the working model is given by

\[ \begin{aligned} {\rm LIKE}({\rm working}) &= P(\{g, b, g, g, g, g, g, g, g, b, g, b\} | {\rm working}) \\ &= P(g | {\rm working}) \times \cdots \times P(b | {\rm working}) \\ &= 0.95 \times 0.05 \times 0.95 \times \cdots \times 0.05 \\ &= 0.00007878. \end{aligned} \]

Similarly, the likelihood of the broken model is

\[ \begin{aligned} {\rm LIKE}({\rm broken}) &= P(\{g, b, g, g, g, g, g, g, g, b, g, b\} | {\rm broken}) \\ &= 0.70 \times 0.30 \times 0.70 \times \cdots \times 0.30 \\ &= 0.00108955. \end{aligned} \]

Using the “prior times likelihood” recipe, we compute the posterior probabilities in the following table.

| Model | Prior | Likelihood | Product | Posterior |
| --- | --- | --- | --- | --- |
| Working | 0.90 | 0.00007878 | 0.000070902 | 0.3942 |
| Broken | 0.10 | 0.00108955 | 0.000108955 | 0.6058 |

We see that the posterior probability that the machine is broken is over 60% and perhaps the machine should be stopped for inspection and repair.
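The likelihoods and the posterior computation can be verified with a short Python sketch:

```python
data = list("gbgggggggbgb")   # the twelve inspection results

def sequence_likelihood(p_good, outcomes):
    """Probability of the observed sequence, assuming independent outcomes."""
    like = 1.0
    for x in outcomes:
        like *= p_good if x == "g" else (1 - p_good)
    return like

like_working = sequence_likelihood(0.95, data)   # about 0.00007878
like_broken = sequence_likelihood(0.70, data)    # about 0.00108955

products = [0.90 * like_working, 0.10 * like_broken]
total = sum(products)
print(round(products[0] / total, 4))   # P(working | data) = 0.3942
```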

There is another way to implement Bayes’ rule when the data are observed in a sequential manner. Before any data are collected, the inspector’s probabilities of the two states of the machine, working and broken, are given by 0.90 and 0.10. He observes the quality of the first part – “g” – and then he can immediately update his probabilities by Bayes’ rule.

| Model | Prior | Likelihood | Product | Posterior |
| --- | --- | --- | --- | --- |
| Working | 0.90 | 0.95 | 0.855 | 0.9243 |
| Broken | 0.10 | 0.70 | 0.070 | 0.0757 |

After this single observation, he is slightly more confident (with probability 0.9243) that the machine is working.

The inspector’s current probabilities of the two models are 0.9243 and 0.0757. He observes the quality of the next part – “b” – and again he can update his probabilities by Bayes’ rule. In this table “Prior” refers to his beliefs before observing the data.

| Model | Prior | Likelihood | Product | Posterior |
| --- | --- | --- | --- | --- |
| Working | 0.9243 | 0.05 | 0.046215 | 0.6705 |
| Broken | 0.0757 | 0.30 | 0.022710 | 0.3295 |

We see that, after observing two parts, the inspector’s probability that the machine is working is 0.6705.

One can continue learning in this sequential manner. As one observes the quality of each single part, the inspector can update his probability of the two models by Bayes’ rule. Table 2.9 summarizes the results of this sequential learning. The first row of the table displays the prior probabilities of the working and broken models and the following rows display the probabilities after each outcome is observed. Note that the final row indicates that the probabilities after observing the 12 parts are equal to 0.3942 and 0.6058. As expected, these posterior probabilities are the same as the ones computed using the group of 12 observations as data.

| Observation | P(Working) | P(Broken) |
| --- | --- | --- |
| Prior | 0.9000 | 0.1000 |
| g | 0.9243 | 0.0757 |
| b | 0.6706 | 0.3294 |
| g | 0.7342 | 0.2658 |
| g | 0.7894 | 0.2106 |
| g | 0.8358 | 0.1642 |
| g | 0.8735 | 0.1265 |
| g | 0.9036 | 0.0964 |
| g | 0.9271 | 0.0729 |
| g | 0.9452 | 0.0548 |
| b | 0.7421 | 0.2579 |
| g | 0.7961 | 0.2039 |
| b | 0.3942 | 0.6058 |
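The sequential learning scheme can be reproduced with a short loop in Python, updating the two model probabilities after each observed part:

```python
data = "gbgggggggbgb"
p_work, p_broke = 0.90, 0.10   # prior probabilities of working and broken

for outcome in data:
    # likelihood of this single outcome under each model
    lw = 0.95 if outcome == "g" else 0.05
    lb = 0.70 if outcome == "g" else 0.30
    num_w, num_b = p_work * lw, p_broke * lb
    total = num_w + num_b
    p_work, p_broke = num_w / total, num_b / total
    print(outcome, round(p_work, 4), round(p_broke, 4))

# The final probabilities agree with the all-at-once computation:
# 0.3942 and 0.6058.
```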