Tuesday, November 10, 2015

A plan for textbook Bayes' rule problems

I like (love) Bayes' rule, as a few (many) of you know. It's applicable in many situations (every situation), and employers may (should) want to know that you can use it properly. To do that, they might present you with problems like this:
A friend who works in a big city owns two cars, one small and one large. Three-quarters of the time he drives the small car to work, and one-quarter of the time he drives the large car. If he takes the small car, he usually has little trouble parking, and so is at work on time with probability 0.9. If he takes the large car, he is at work on time with probability 0.6. Given that he was on time on a particular morning, what is the probability that he drove the small car? [from here]

The point of these problems, and the main function of Bayes' rule, is to combine new evidence (data) with prior information to update our beliefs about a set of hypotheses. For a good introduction to the Bayesian way of thinking, check out E.T. Jaynes' book Probability Theory: The Logic of Science. Here, I want to provide a protocol for attacking these problems that should elucidate the process, and maybe clear up some confusion about Bayes' rule.

Bayes' Rule

I like to write Bayes' rule like this:
$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)},$$
where the symbols mean the following:
  • $H$: A hypothesis, which is a proposition like "She's a witch," or "I like pizza, and bats are reptiles." It can always be written as a full grammatical sentence.
  • $D$: The data, which may consist of many individual data points.
  • $X$: The prior information. This always includes the problem statement, and may include other things. In principle, it includes everything you believe to be true about the universe in which the problem takes place. If you include irrelevant information in $X$, it will have no effect on the problem.
You'll see slightly different representations elsewhere, notably omitting the $X$ (leaving it implicit), and using other symbols for $D$ and $H$. I like this representation because it has a clear narrative. There are four quantities here, which can be interpreted as:
  • The posterior probability, $P\left(H|DX\right)$: The probability that the hypothesis is true, given the prior information AND the data. This is what we want to calculate.
  • The prior probability, $P\left(H|X\right)$: The probability that the hypothesis is true given the prior information, but without knowing the data.
  • The sampling distribution, $P\left(D|HX\right)$: The probability that we would see this data set if the hypothesis and the prior information were both true.
  • A normalizing constant, $P\left(D|X\right)$: The probability that we would observe the data, averaged over all of the hypotheses. Sometimes called the marginal likelihood or the evidence.

The plan 

The plan for solving problems like this is the following:
  1. Write down Bayes' rule.
  2. Write out all of the hypotheses as English sentences.
  3. Write down the data.
  4. Find all of the prior probabilities.
  5. Find all of the sampling distributions.
  6. Construct the normalizing term.
  7. Shut up and calculate.
Let's take a closer look at the example problem from above.
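Before we do that, it's worth seeing how mechanical steps 4 through 7 become once the pieces are written down. Here's a minimal Python sketch of that bookkeeping for a discrete set of hypotheses; the function and its name are my own illustration, not anything standard:

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule to a set of mutually exclusive, exhaustive hypotheses.

    priors      -- list of prior probabilities P(H_i | X)
    likelihoods -- list of sampling distributions P(D | H_i X), in the same order
    """
    # Normalizing constant: P(D | X) = sum_i P(D | H_i X) P(H_i | X)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    # Posterior for each hypothesis: P(H_i | D X) = P(H_i | X) P(D | H_i X) / P(D | X)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]
```

We'll plug real numbers into this once we've walked through the steps.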

Write down Bayes' rule

This seems obvious, but go ahead and do it anyway. In fact, write it using the notation above, since it's hard to forget what you're doing that way. I would give you about 80% credit as an interviewer if you got this far and were able to explain the terms.

Write out the hypotheses

In the Bayesian view, there is a space of hypotheses that describe every way the universe can be. For each problem, there is a set of mutually exclusive hypotheses that span that space. We might label them $\left(H_1, H_2,...,H_N \right)$. If you don't know what every member of this space is, you can't do the problem. On the other hand, this space reflects your view of the universe, so you can always define it in principle.

For the above problem, we have the hypotheses
$$H_1 = \textrm{Your friend drove the small car.}$$ $$H_2 = \textrm{Your friend did not drive the small car.}$$
Mutually exclusive and exhaustive. Lovely.

Write down the data

The data are the things that let us update our beliefs about a set of propositions. Often, they're the things we measure. They can also be written as full sentences, and might be something like $D = \textrm{"I saw three ships come sailing in,"}$ or $D = \textrm{"Out of six die rolls, two of them resulted in a 4."}$

The data can also be a complicated logical statement, which lets us join together a bunch of points. For example, $D = \textrm{"The first roll was a 1 AND the second roll was a 5 AND..."}$, which can be represented as $D = D_1D_2D_3...$.

In the example problem, $D=\textrm{"Your friend was on time this morning."}$

Find the prior probabilities

In 99.999% of textbook problems, this step is as easy as reading some numbers from the problem statement. Prior probabilities will be provided directly, and we remain agnostic about where they came from. This leads to a great deal of confusion and skepticism about Bayes' rule, which I'll elaborate on another time. For now, be assured that for any set of prior information, there is one correct prior probability on each hypothesis.

Let's take the given priors for hypotheses $H_1$ and $H_2$:
$$P\left(H_1|X\right) = 3/4$$
or, "the probability that your friend drove the small car given the problem statement but without knowing whether he was on time is equal to $3/4$. And:
$$P\left(H_2|X\right) = 1/4$$.
Since the set of $H_i$ covers the whole of possible reality, the prior probabilities had better sum to unity, and they do. 

Find the sampling distributions

If each hypothesis were true, how likely is it that we would have seen this exact data set? To generate these numbers, we might need to pull in some expertise from combinatorics or statistics, or we might read them from the problem statement. The example problem is a case of the latter: $$P\left(D|H_1 X\right) = 0.9,$$
or "The probability that your friend arrives on time given that he drove the small car is 0.9." Similarly, $$P\left(D|H_2 X\right) = 0.6$$
If the data consist of more than one thing, remember that we can always expand the joint sampling distribution like this:
$$P\left(D|HX\right) = P\left(D_1D_2...D_N|HX\right) = P\left(D_1|D_2...D_NHX\right)P\left(D_2...D_N|HX\right) = ...$$
Also, if the data are independent of each other (if we're looking at dice rolls, no roll affects any other roll), then the joint sampling distribution is just the product of the individual sampling distributions, i.e.
$$P\left(D|HX\right) = \prod_i P\left(D_i|HX\right).$$
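For instance, here's a toy sketch of that product for a handful of independent rolls of a die we hypothesize to be fair (the rolls themselves are made up for illustration):

```python
from math import prod

# Hypothesis H: the die is fair, so every face has probability 1/6.
rolls = [1, 5, 3, 5]                 # four independent observations D_1..D_4
per_roll = [1 / 6 for _ in rolls]    # P(D_i | H X) for each roll
joint = prod(per_roll)               # P(D | H X) = product over i
print(joint)                         # 1/1296, about 0.00077
```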

Build the normalization term

To get the normalization term into a form we can calculate, we need to do a little massaging. Any probability can be broken into a sum of joint probabilities with another variable, i.e. $P\left(A\right) = \sum_i P\left(AB_i\right)$, where the $B_i$ are mutually exclusive and exhaustive. The normalization constant in particular can be broken into a sum of joint probabilities with each hypothesis:
$$P\left(D|X\right) = \sum_i P\left(DH_i|X\right).$$
Then we can use the product rule to transform the thing in the sum into this:
$$P\left(D|X\right) = \sum_i P\left(D|H_iX\right)P\left(H_i|X\right).$$
The cool thing about this form is that we've already calculated everything in it. The sum contains each prior probability multiplied by its associated sampling distribution. That means we don't have to do any additional thinking; we just add together the numbers we already thought about.

For the example problem, $$P\left(D|X\right) = P\left(H_1|X\right)P\left(D|H_1X\right) + P\left(H_2|X\right)P\left(D|H_2X\right) = 0.75*0.9 + 0.25*0.6 = 0.825$$

Shut up and calculate:

We have all of the pieces, so let's get the posterior probability of each hypothesis:
$$P\left(H_1|DX\right) = P\left(H_1|X\right) \frac{P\left(D|H_1X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.75\frac{0.9}{0.75*0.9 + 0.25*0.6} \approx 0.82.$$
So the answer to the problem is that the probability that your friend drove the small car is 82%.

For completeness, we could use the fact that $P\left(H_2|DX\right) = 1-P\left(H_1|DX\right) $ to find the posterior probability on $H_2$, or we can calculate it in the same way:
$$P\left(H_2|DX\right) = P\left(H_2|X\right) \frac{P\left(D|H_2X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.25\frac{0.6}{0.75*0.9 + 0.25*0.6} \approx 0.18$$
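Plugging these numbers into the posteriors() sketch from earlier gives the same answers:

```python
print(posteriors(priors=[0.75, 0.25], likelihoods=[0.9, 0.6]))
# approximately [0.818, 0.182]
```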

Compound data example

I want to cover a problem with a more complex data set. Consider the following problem that I just made up:
To make a good espresso, you need a good machine and a skilled barista. A local coffee shop has two espresso machines: a good one and a bad one. The good one makes a good espresso 95% of the time and a terrible one 5% of the time, even if the operator is perfect. The bad one makes only 50% good and 50% bad with perfect operation. 
There are also two baristi at this shop: the owner and a trainee. The owner is always working, and is a perfect operator of espresso machines. The trainee works half of the time, and ruins the espresso 30% of the time, regardless of how the machine performs. If both people are working on a particular day, they're equally likely to be on espresso duty. 
This morning you ordered two espressi. One was good and one was bad. How likely is each combination of barista and machine?
First of all, how cool is it that we can solve this problem? It doesn't feel like we have enough information, but part of the beauty of Bayes is that there are no questions which are impossible to ask. You might get a more or less informative answer, but you can always ask. Let's do this.

Write down Bayes' rule:

$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}$$
We're so good at this!

Write the hypotheses:

$$H_1=\textrm{The owner is using the good machine.}$$
$$H_2=\textrm{The owner is using the bad machine.}$$
$$H_3=\textrm{The trainee is using the good machine.}$$
$$H_4=\textrm{The trainee is using the bad machine.}$$
Starting to think maybe they should stop using the bad machine.

Write down the data:

$$D=D_1D_2$$
where
$$D_1=\textrm{The first espresso was good,}$$
$$D_2=\textrm{The second espresso was bad,}$$
and the notation $D_1D_2$ means $D_1$ AND $D_2$. (We don't actually know which of the two was the good one, so below we'll work with the statement "exactly one of the two espressi was good"; the counting factor that introduces is the same for every hypothesis, so the posteriors come out the same.)

Find the priors:

It's easiest to think through this with a probability tree. Either both baristi are working or else just the owner (even chances), and if they're both working they have an equal chance of running the machine. In either case, it's an even split on which machine it is.


so:
$$P\left(H_1|X\right) = P\left(H_2|X\right) = 3/8$$
and
$$P\left(H_3|X\right) = P\left(H_4|X\right) = 1/8.$$
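If the tree bookkeeping feels error-prone, you can have code enumerate the branches instead. This is just a sketch of the same reasoning, with the staffing probabilities taken straight from the problem statement:

```python
from collections import defaultdict

priors = defaultdict(float)
# Branch 1: only the owner is working (probability 1/2).
# Branch 2: both are working (probability 1/2), equally likely to be on duty.
staffing = [(0.5, ["owner"]), (0.5, ["owner", "trainee"])]

for p_staff, on_duty in staffing:
    for barista in on_duty:
        for machine in ["good", "bad"]:
            # even split on who's on duty within a branch, and on which machine
            priors[(barista, machine)] += p_staff * (1 / len(on_duty)) * 0.5

print(dict(priors))
# {('owner', 'good'): 0.375, ('owner', 'bad'): 0.375,
#  ('trainee', 'good'): 0.125, ('trainee', 'bad'): 0.125}
```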

Find the sampling distributions:

For each hypothesis, how likely are we to have gotten exactly one good and one bad espresso out of two? We get a good espresso if the machine works and the operator doesn't mess up. In the problem statement, I was careful to note that these probabilities were independent. That is, the chance of operator error doesn't depend on the machine, and the machine error rate doesn't depend on the operator. So the probability of a single espresso being good under each of the hypotheses is:
$$p_1 = 0.95*1 = 0.95,$$ $$p_2 = 0.5*1 = 0.5,$$ $$p_3 = 0.95*0.7 = 0.665,$$ and $$p_4 = 0.5*0.7 = 0.35$$

If you order $N$ espressi, the chance of getting exactly $g$ good ones follows a binomial distribution, so:
$$P\left(g|H_i X\right) = {N \choose g} \left(p_i\right)^g \left(1-p_i\right)^{\left(N-g\right)},$$
where I'm using $g$ as shorthand for "exactly $g$ good espressi were made," and the $p_i$ are calculated above.

For this problem, $N=2$ and $g=1$. Then
$$P\left(D|H_1 X\right) = {2 \choose 1}\left(0.95\right)^1\left(1-0.95\right)^1 = 0.095$$
or just under 10%. It's low because with a working machine and an expert operator, it's unlikely we would get a bad espresso. Note that it's a bit of overkill to use a binomial distribution here, since this expression is equivalent to $2*p_{good}*p_{bad}$, but the same method applies to longer strings of data as well. Similarly,
$$P\left(D|H_2 X\right) =  0.5,$$ $$P\left(D|H_3 X\right) =  0.45,$$ and $$P\left(D|H_4 X\right) =  0.45.$$
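Here's the same binomial bookkeeping as a short sketch, using only the standard library (math.comb needs Python 3.8 or newer):

```python
from math import comb

def binom_pmf(g, N, p):
    """P(exactly g good espressi out of N | per-espresso probability p)."""
    return comb(N, g) * p**g * (1 - p)**(N - g)

p_good = [0.95, 0.5, 0.95 * 0.7, 0.5 * 0.7]        # p_1 .. p_4 from above
sampling = [binom_pmf(1, 2, p) for p in p_good]    # P(D | H_i X)
print([round(s, 3) for s in sampling])             # [0.095, 0.5, 0.446, 0.455]
```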

Construct the normalizing term:

We already have all of the pieces.
$$P\left(D|X\right)=\sum_i P\left(D|H_i X\right)P\left(H_i|X\right)$$
$$= 0.1*0.375 + 0.5*0.375 + 0.45*0.125 + 0.45*0.125 \approx 0.34.$$

Calculate the posteriors:

$$P\left(H_1|DX\right) = P\left(H_1|X\right)\frac{P\left(D|H_1X\right)}{P\left(D|X\right)} = 3/8*\frac{0.1}{0.34} \approx 0.11$$
$$P\left(H_2|DX\right) = 3/8*\frac{0.5}{0.34} \approx 0.55$$
$$P\left(H_3|DX\right) = P\left(H_4|DX\right) = 1/8*\frac{0.45}{0.34} \approx 0.17.$$
So in the end, we find that the most likely hypothesis is $H_2$, that the owner is using the bad machine. At first it seems counterintuitive that the two trainee hypotheses are equally likely, but it worked out this way because the probabilities of getting a good espresso under those hypotheses are roughly complementary, i.e. $p_3 \approx 1- p_4$, and for exactly one good espresso out of two the binomial probability $2p\left(1-p\right)$ is unchanged when you swap $p$ for $1-p$.
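For the record, running the same calculation at full precision with the posteriors() sketch from the first example gives essentially the numbers above; the tiny differences are just rounding:

```python
post = posteriors(priors=[3/8, 3/8, 1/8, 1/8],
                  likelihoods=[0.095, 0.5, 0.4455, 0.455])
print([round(x, 2) for x in post])   # [0.11, 0.56, 0.17, 0.17]
```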

Hopefully this problem demonstrated how to deal with slightly more complicated priors and data sets. The point is that the protocol for solving these problems is always the same. 
