Sunday, November 15, 2015

Bayes: Thoughts on prior information

Last time I presented a protocol for solving textbook Bayes rule problems, in which I advocated tacking the prior information $X$ onto each of the terms, like so: $$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}.$$
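Just to make the bookkeeping concrete, here's a minimal Python sketch of that update. The function name and the example numbers are my own illustration, not anything from the earlier post.

```python
# Minimal sketch of the update rule above. The names and example numbers
# are illustrative, not taken from any particular textbook problem.

def posterior(prior_h, likelihood, evidence):
    """P(H|DX) = P(H|X) * P(D|HX) / P(D|X)."""
    return prior_h * likelihood / evidence

# Example: P(H|X) = 0.01, P(D|HX) = 0.9, and
# P(D|X) = 0.9 * 0.01 + 0.1 * 0.99 = 0.108 (assuming a 10% false-positive rate).
print(posterior(0.01, 0.90, 0.108))  # ~0.083
```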

Here I'd like to talk briefly about why I think that's a good idea.

1. It's good for the narrative

Each term in Bayes' rule has a straightforward interpretation, which I explained last time. But if we leave out $X$, things get a bit ambiguous. Specifically, suppose we write the prior as simply $P\left(H\right)$. This can be read as "the probability that hypothesis $H$ is true." But isn't that what we were trying to calculate? Personally, I find it clearer to write $P\left(H|X\right)$ and read it as "the probability that hypothesis $H$ is true given (only) the prior information."


2. Prior information is sneaky

It might be useful to remember that $X$ includes the problem statement, and in most textbook problems, that's the only thing in $X$. But sometimes a problem assumes that you know something about how the world works. For example, there are a lot of Bayes problems floating around about twins (for some reason) that require you to know the incidence of identical and non-identical twins.
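As a concrete instance, here's a minimal sketch of one such twins problem in Python. The incidence of identical twins is an illustrative placeholder; the point is that it lives in $X$, not in the problem statement.

```python
# One such twins problem, worked through. The incidence of identical twins
# is an illustrative placeholder here; it belongs to the prior information X,
# not to the problem statement.

p_identical = 1 / 3          # P(identical | X), assumed background rate
p_fraternal = 1 - p_identical

p_same_sex_identical = 1.0   # identical twins always share a sex
p_same_sex_fraternal = 0.5   # fraternal twins share a sex half the time

# P(D|X): the probability that a pair of twins is the same sex
p_same_sex = (p_identical * p_same_sex_identical
              + p_fraternal * p_same_sex_fraternal)

# P(identical | same sex, X) via Bayes' rule
print(p_identical * p_same_sex_identical / p_same_sex)  # 0.5
```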
Beyond simple statistics, we sometimes use as prior information what was left out of the problem statement. There is perhaps no better illustration of this than Bertrand's Paradox, which asks about chords drawn at random in a circle, compared with the side of an inscribed equilateral triangle. Jaynes suggests that the problem can be resolved by noting that it says nothing about where the circle is or how large it is. If the problem is to have a unique solution, that solution must be invariant under translations and rescalings of the circle.

Even if you don't like Jaynes' thoughts on Bertrand, other transformation invariances have to be considered. A subtle one is the following: Suppose we come up with a prior probability $P\left(H|X\right)$. If we imagine an ensemble of experiments, the $j^{th}$ of which generates data $D_j$, we could calculate all of the possible posteriors, $P\left(H|D_j X\right)$. If we sum these over $j$, weighting by how likely each result is, we must get the prior back.* If we don't, then we have the wrong prior. So here, we can consider $X$ to include the statement that the prior is constrained by this invariance.

* The proof for this is trivial. We just expand the prior in the $D_j$ basis and then use the product rule: $$P\left(H|X\right) = \sum_j P\left(HD_j|X\right) = \sum_j P\left(H|D_jX\right)P\left(D_j|X\right).$$ This works because when we imagine the ensemble of experiments, we can only use our prior information to do so. So if we can construct the terms on the right-hand side, then we must be able to construct the one on the left.
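If you'd like to see this invariance in action rather than just in algebra, here's a quick numerical check with a small urn model of my own invention (a handful of discrete hypotheses and a binomial likelihood):

```python
# A quick numerical check of the consistency requirement above, using a
# small made-up urn model: H ranges over a few possible red-ball fractions
# with a uniform prior, and D_j is the number of reds in n draws with
# replacement.

from math import comb

fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {f: 1 / len(fractions) for f in fractions}   # P(H|X)
n = 10

def likelihood(k, f):
    """P(D_j = k reds in n draws | H = f, X): binomial."""
    return comb(n, k) * f**k * (1 - f)**(n - k)

for h in fractions:
    total = 0.0
    for k in range(n + 1):
        p_d = sum(likelihood(k, f) * prior[f] for f in fractions)  # P(D_j|X)
        p_h_given_d = prior[h] * likelihood(k, h) / p_d            # Bayes' rule
        total += p_h_given_d * p_d
    print(h, round(total, 12))  # each line recovers P(H|X) = 0.2
```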


3. Probability is subjective

...but not arbitrary. That is, there's a unique posterior distribution for each set of prior information. Suppose Alice and Bob are each trying to estimate the proportion of red and white balls in an urn based on the next 10 balls drawn with replacement. But while Bob has just shown up, Alice has been watching the urn for hours and has seen 200 balls drawn already. They'll rightly calculate two different posteriors for the proportion of balls, which we might label $$P\left(H|DX_A\right) = P\left(H|X_A\right) \frac{P\left(D|HX_A\right)}{P\left(D|X_A\right)}$$ and $$P\left(H|DX_B\right) = P\left(H|X_B\right) \frac{P\left(D|HX_B\right)}{P\left(D|X_B\right)}.$$
Clearly, we would be in trouble if we didn't include $X_A$ and $X_B$ here, since these would look like identical calculations. This is really only a problem because of the common misconception that the probability distribution is an aspect of the urn rather than an expression of Alice's and Bob's ignorance about the urn.
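To make the asymmetry concrete, here's a minimal Beta-Binomial sketch of that situation. The specific counts (120 reds in Alice's 200 draws, 7 reds in the shared 10 draws) are assumptions I've added for illustration, not part of the story above.

```python
# A sketch of Alice's and Bob's updates with a Beta-Binomial model. The
# counts below are made-up numbers for illustration.

# Bob's prior information is just the problem statement: a uniform
# prior over the red-ball fraction, Beta(1, 1).
bob_a, bob_b = 1, 1

# Alice has already watched 200 draws, say 120 red and 80 white, so her
# prior going into the next 10 draws is Beta(1 + 120, 1 + 80).
alice_a, alice_b = 1 + 120, 1 + 80

# Both now see the same data D: 7 red and 3 white in the next 10 draws.
red, white = 7, 3

def posterior_mean(a, b):
    """Mean of the Beta(a + red, b + white) posterior."""
    return (a + red) / (a + red + b + white)

print("Alice:", round(posterior_mean(alice_a, alice_b), 3))  # ~0.604, dominated by her 200 draws
print("Bob:  ", round(posterior_mean(bob_a, bob_b), 3))      # ~0.667, dominated by the 10 new draws
```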

Hopefully this is enough to convince you to include $X$ when you write down Bayes' rule. If you do, I'll thank you, since it'll be less confusing for me.
