Friday, November 27, 2015

What has surprised me about data science

It's been six months since I decided to become a data scientist, and I want to take a moment to reflect on what's been surprising about the journey. When I started, I had a study plan that involved machine learning and statistics courses, and plenty of programming practice.

Surprising things about the physics -> data science transition:

  1. Data scientists love hypothesis testing. Specifically, they are serious about binary hypothesis testing, and they tend to take the classical view of null and alternative hypotheses. In business this is called A/B testing, and it's used to make high-impact decisions at major companies every day. This surprises me because I didn't run into formal hypothesis testing very much in physics (although it's more popular in biology, for example). On the other hand, we expect that anywhere people stand to make or lose money based on decisions, those decisions should probably considered formally. The emphasis on hypothesis testing is symptomatic of rationality, but we have to be careful about defining the hypothesis space. Not everything that seems binary is binary.
  2. Cognitive bias takes a back seat. I'm used to thinking about rationality as the study of a trifecta of concepts: decision theory, probability, and cognitive bias. People in DS seem very concerned about the first two, while the third is just kind of kept in mind while we do our analysis. It's a hidden variable that causes the results of A/B tests to come out the way they do, and it plays a role in the way we communicate our results to the decision makers. But there doesn't seem to be much attention paid to the pitfalls of cognitive bias in our own analysis. I also haven't found anyone trying to exploit it to influence decisions, even though many data scientists work in marketing. I wonder if this is a place where I can contribute with a unique skill set.
  3. Statistics is hard. I thought I would be able to knock this out really quickly since I'm extremely comfortable with probability, but that has not been the case. Part of the problem is that it's typically taught from a tool-oriented viewpoint. We learn that in situation $A$, with $N$ samples from $M$ ensembles, test $X$ is appropriate. Contrast this with the bottom-up approach of Bayesian probability, where we start with the question of how to define a measure of likelihood, and we write out a complete hypothesis space before any problem. This may be why concepts like confidence intervals and p-values are commonly misunderstood even by expert practitioners. I have struggled to reconcile the tools of statistics with the formal logic of probability theory, but I can at least use the tools appropriately. The rest will come with practice.
  4. Machine learning is easy. Or at least, for most problems. ML makes up a large percentage of the data science news and tutorials I see, and there's a lot of emphasis on figuring out when to apply which method (again, a tool-oriented approach). But in basically every real example I've found, you can throw any classifier at the problem and it's pretty much OK. Furthermore, unless you have a huge amount of data or a very large parameter space, you can set up a pipeline in Python that tries different classifiers with different hyperparameters (like regularization parameters) and find the one that performs best. You just have to be careful about setting aside test and validation sets. All this to say that we can often afford a brute-force approach to machine learning.
  5. Interviews take a lot of resources. I had assumed that it would be rational for companies to do phone screens for any candidate who was possibly qualified for a position, since it's well known that resumes are not good predictors of success. By asking specific technical and non-technical questions, I assume that a competent hiring manager could separate the wheat from the chaff. Sending a data challenge seems like a similarly good idea. But this ignores the fact that even half-hour phone screens require time and effort that's not being put toward high-priority projects, and that someone has to review the data challenge results, which is a lot like grading a test. Which sucks. So it seems like companies are still sort of stuck choosing interviewees based on resume keywords. Seems like a bad idea, but I don't have a great solution.
  6. People are afraid of hiring academics. This is something Kathy Copic mentioned at a panel I attended, and it sounded ridiculous at the time, but I can tell you that it's absolutely true: managers are afraid of hiring "stereotypical" academic researchers, who prefer to work alone on very difficult problems for a long time, and generate theoretically perfect results that are of no use to anyone. They also prefer well-defined problems, are not intellectually agile, and are culturally incompetent. I don't know if this fear is founded - maybe there are horror stories about previous hires who fit this description. But a good academic does none of these things either: she works efficiently on small problems in pursuit of bigger ones, adapts her strategy according to previous successes or failures, and is able to collaborate with others and communicate her results.
That's it for now. Thanks for reading!


Sunday, November 22, 2015

Digit recognition part 1: visualization

I'd like to move more toward problem-based posts rather than tool-based posts, and today I'm looking at the famous MNIST (Modified National Institute of Standards and Technology) dataset of handwritten digits. These data are available at Kaggle, for example, where there is a training "competition" to use machine learning to identify new digits based on a training set. I'm going to ignore the test set for now and just work with the training set.

In the first part of this series, I just want to get a feel for the data: visualize it and do some preliminary analysis, as if I were not familiar with the problem at all. I'll be working in an IPython notebook. If you want to follow along, I uploaded it to Github.

The data

The data consist of about 40000 digitized handwritten numerals $\in$ [0, 9]. Each numeral was digitized on a 28x28 pixel grid. These grids have been flattened into rows of 784 pixles and stacked together into a 2D array that looks like this:


Each pixel has an intensity value $\in$ [1, 256]. Here's what the raw data array looks like if we just plot it:



Each row can be reconstructed into a 28x28 array. Here are some examples of random reshaped rows:


The information

Eventually we'll want to train a classifier with these digits, and will want to figure out which features (pixels) are important, and which ones aren't. There are plenty of ways to do that automatically, but it's also straightforward to poke at the data a bit and figure it out ahead of time. First, let's make sure we have the same number of examples for each numeral. The raw data array actually has a prepended "label" column that I didn't mention above, which is just the numeral index, 0-9. We can use this to make a histogram:

where we see that there are about 4000 examples of each numeral.

Next, let's look at what range of intensities we have. Making a histogram of all of the intensities for the entire data set, we see the following:
where it's clear that the data mostly consist of black and white pixels, and not much in between. This gives us a hint that we could binarize the data set without losing much information. It also might mean that there's been some pre-processing done on the data set.

How much does each numeral change throughout the data set? We can get some intuition by grouping examples of each numeral and then plotting their means and variances.
Means of each numeral
Variances of each numeral

These two plots contain more or less the same information - that the variance is concentrated near the edges of each numeral, and there isn't too much variance within each numeral, i.e. there's nothing crazy going on here. Compare this to the average and variance taken over all of the numbers.


We expect variance to correlate well to information, so these images show that most of the information is clustered in a blob near the center, which isn't too surprising. We could probably throw away a bunch of the outer pixels without too much trouble. Let's quantify that.

Suppose we order these pixels by variance (descending) and plot the cumulative variance by pixel. That is, show how adding the next largest variance increases the total variance. That gives us an idea of how many pixels we need to capture most of the variance, which we can expect to be related to information content.

The green line is set at 90% of the total variance, which intersects the cumulative variance at about 290 pixels. So through this exercise we've learned that we can throw away almost 2/3 of the pixels and still capture about 90% of the information, which will be useful when we start training a classfier.

Sunday, November 15, 2015

Bayes: Thoughts on prior information

Last time I presented a protocol for solving textbook Bayes rule problems, in which I advocated tacking the prior information $X$ onto each of the terms, like so: $$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}.$$

Here I'd like to talk briefly about why I think that's a good idea.

1. It's good for the narrative

Each term in Bayes' rule has a straightforward interpretation, which I explained last time. But if we leave out $X$, things get a bit ambiguous. Specifically, suppose we write the prior as simply $P\left(H\right)$. This can be read as "the probability that hypothesis $H$ is true." But isn't that what we were trying to calculate? Personally, I find it clearer to write $P\left(H|X\right)$ and read it as "the probability that hypothesis $H$ is true given (only) the prior information."

 

2. Prior information is sneaky

It might be useful to remember that $X$ includes the problem statement, and in most textbook problems, that's the only thing in $X$. But sometimes a problem assumes that you know something about how the world works. For example, there are a lot of Bayes problems floating around about twins (for some reason) that require you to know the incidence of identical and non-identical twins.
Beyond simple statistics, we sometimes use as prior information what was left out  of the problem statement. There is perhaps no better illustration of this than Bertrand's Paradox, which asks about lines drawn in a circle inscribed by a triangle. Jaynes suggests that the problem can be resolved by noting that it assumes nothing about where or how large the circle is. If the problem is to have a unique solution, it must obey transformation invariances under these parameters.

Even if you don't like Jaynes' thoughts on Bertrand, other transformation invariances have to be considered. A subtle one is the following: Suppose we come up with a prior probability $P\left(H|X\right)$. If we imagine an ensemble of experiments, the $j^{th}$ of which generates data $D_j$, we could calculate all of the possible posteriors, $P\left(H|D_j X\right)$. If we sum these over $j$, weighting by how likely each result is, we must get the prior back.* If we don't, then we have the wrong prior. So here, we can consider $X$ to include the statement that the prior is constrained by this invariance.

* The proof for this is trivial. We just expand the prior in the $D_j$ basis and then use the product rule: $$P\left(H|X\right) = \sum_j P\left(HD_j|X\right) = \sum_j P\left(H|D_jX\right)P\left(D_j\right).$$ This happens because when we imagine the ensemble of experiments, we can only use our prior information to do so. So if we can construct the term on the right hand side, then we must be able to construct the one on the left.

   

3. Probability is subjective

...but not arbitrary. That is, there's a unique posterior distribution for each set of prior information. Suppose Alice and Bob are each trying to estimate the proportion of red and white balls in an urn based on the next 10 balls drawn with replacement. But while Bob has just shown up, Alice has been watching the urn for hours, and has seen 200 balls drawn already. They'll rightly cacluate two different posteriors on the proportion of balls, which we might label $$P\left(H|DX_A\right) = P\left(H|X_A\right) \frac{P\left(D|HX_A\right)}{P\left(D|X_A\right)}$$ and $$P\left(H|DX_B\right) = P\left(H|X_B\right) \frac{P\left(D|HX_B\right)}{P\left(D|X_B\right)}.$$
Clearly, we would be in trouble if we didn't include $X_A$ and $X_B$ here, since these would look like identical calculations. This is really only a problem because of the common misconception that the probability distribution is an aspect of the urn rather than an expression of Alice's and Bob's ignorance about the urn.

Hopefully this is enough to convince you to include $X$ when you write down Bayes' rule. If you do, I'll thank you, since it'll be less confusing for me.

Tuesday, November 10, 2015

A plan for textbook Bayes' rule problems

I like (love) Bayes' rule, as a few (many) of you know. It's applicable in many situations (every situation), and employers may (should) want to know that you can use it properly. To do that, they might present you with problems like this:
A friend who works in a big city owns two cars, one small and one large. Three-quarters of the time he drives the small car to work, and one-quarter of the time he drives the large car. If he takes the small car, he usually has little trouble parking, and so is at work on time with probability 0.9. If he takes the large car, he is at work on time with probability 0.6. Given that he was on time on a particular morning, what is the probability that he drove the small car? [from here

The point of these problems, and the main function of Bayes' rule, is to combine new evidence (data) with prior information to update our beliefs about a set of hypotheses. For a good introduction to the Bayesian way of thinking, check out E.T. Jaynes' book Probability Theory: The Logic of Science. Here, I want to provide a protocol for attacking these problems that should elucidate the process, and maybe clear up some confusion about Bayes' rule.

Bayes' Rule

I like to write Bayes' rule like this:
$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)},$$
where the symbols mean the following:
  • $H$: A hypothesis, which is a proposition like "She's a witch," or "I like pizza, and bats are reptiles." It can always be written as a full grammatical sentence.
  • $D$: The data, which may consist of many indiviudal data points.
  • $X$: The prior information. This always includes the problem statement, and may include other things. In principle, it includes everything you believe to be true about the universe in which the problem takes place. If you include irrelevant information in $X$, it will have no effect on the problem.
You'll see slightly different representations elsewhere, notably omitting the $X$ (leaving it implicit), and using other symbols for $D$ and $H$. I like this representation because it has a clear narrative. There are four quantities here, which can be interpreted as:
  • The posterior probabilty, $P\left(H|DX\right)$: The probability that the hypothesis is true, given the prior information AND the data. This is what we want to calculate.
  • The prior probability, $P\left(H|X\right)$: The probability that the hypothesis is true given the prior information, but without knowing the data.
  • The sampling distribution, $P\left(D|HX\right)$: The probability that we would see this data set if the hypothesis and the prior information were both true.
  • A normalizing constant, $P\left(D|X\right)$: The probability that we would observe the data regardless of the hypothesis. Sometimes called the marginal.

The plan 

The plan for solving problems like this is the following:
  1. Write down Bayes' rule.
  2. Write out all of the hypotheses as English sentences.
  3. Write down the data.
  4. Find all of the prior probabilities.
  5. Find all of the sampling distributions.
  6. Construct the normalizing term.
  7. Shut up and calculate.
Let's take a closer look at the example problem from above.

Write down Bayes' rule

This seems obvious, but go ahead and do it anyway. In fact, write it using the notation above, since it's hard to forget what you're doing that way. I would give you about 80% credit as an interviewer if you got this far and were able to explain the terms.

Write out the hypotheses

In the Bayesian view, there is a space of hypotheses that describe every way the universe can be. For each problem, there is a set of mutually exclusive hypotheses that span that space. We might label them $\left(H_1, H_2,...,H_N \right)$. If you don't know what every member of this space is, you can't do the problem. On the other hand, this space reflects your view of the universe, so you can always define it in principle.

For the above problem, we have the hypotheses
$$H_1 = \textrm{Your friend drove the small car.}$$ $$H_2 = \textrm{Your friend did not drive the small car.}$$
Mutually exclusive and exhaustive. Lovely.

Write down the data

The data are the things that let us update our beliefs about a set of propositions. Often, they're the things we measure. They can also be written as full sentences, and might be something like $D = \textrm{"I saw three ships come sailing in,"}$ or $D = \textrm{"Out of six die rolls, two of them resulted in a 4."}$

The data can also be a complicated logical statement, which lets us join together a bunch of points. For example, $D = \textrm{"The first roll was a 1 AND the second roll was a 5 AND..."}$, which can be represented as $D = D_1D_2D_3...$.

In the example problem, $D=\textrm{"Your friend was on time this morning."}$

Find the prior probabilities

In 99.999% of textbook problems, this step is as easy as reading some numbers from the problem statment. Prior probabilities will be provided directly, and we remain agnostic about where they came from. This leads to a great deal of confusion and skepticism about Bayes' rule, which I'll elaborate on another time. For now, be assured that for any set of prior information, there is one correct prior probability on each hypothesis.

Let's take the given priors for hypothesis $H_1$ and $H_2$:
$$P\left(H_1|X\right) = 3/4$$
or, "the probability that your friend drove the small car given the problem statement but without knowing whether he was on time is equal to $3/4$. And:
$$P\left(H_2|X\right) = 1/4$$.
Since the set of $H_i$ include the whole of possible reality, the prior probabilities had better sum to unity, and they do. 

Find the sampling distributions

If each hypothesis were true, how likely is it that we would have seen this exact data set? To generate these numbers, we might need to pull in some expertise from combinatorics or statistics, or we might read it from the problem statement. The example problem is a case of the latter: $$P\left(D|H_1 X\right) = 0.9,$$
or "The probability that your friend arrives on time given that he drove the small car is 0.9." Similarly, $$P\left(D|H_2 X\right) = 0.6$$
If the data consist of more than one thing, remember that we can always expand the joint sampling distribution like this:
$$P\left(D|HX\right) = P\left(D_1D_2...D_N|HX\right) = P\left(D_1|D_2...D_NHX\right) = ...$$
Also, if the data are independent from each other (if we're looking at dice rolls, no roll affects any other roll), then the joint sampling distribution is just a product of each sampling distribution, i.e.
$$P\left(D|HX\right) = \Pi_i P\left(D_i|HX\right).$$

Build the normalization term

To get the normalization term into a form we can calculate, we need to do a little massaging. Any probability can be broken into a sum of joint probabilities with another variable, i.e. $P\left(A\right) = \sum_i P\left(AB_i\right)$, where the $B_i$ are mutually exclusive and exhaustive. The normalization constant in particular can be broken into a sum of joint probabilities with each hypothesis:
$$P\left(D|X\right) = \sum_i P\left(DH_i|X\right).$$
Then we can use the product rule to transform the thing in the sum into this:
$$P\left(D|X\right) = \sum_i P\left(D|H_iX\right)P\left(H_i|X\right).$$
The cool thing about this form is that we've already calculated everything in it. The sum contains each prior probability with its associated sampling distrubution. That means we don't have to do any additional thinking - we just add together the numbers we already thought about.

For the example problem, $$P\left(D|X\right) = P\left(H_1|X\right)P\left(D|H_1X\right) + P\left(H_2|X\right)P\left(D|H_2X\right) = 0.75*0.9 + 0.25*0.6 = 0.825$$

Shut up and calculate:

We have all of the pieces, so let's get the posterior probability of each hypothesis:
$$P\left(H_1|DX\right) = P\left(H_1|X\right) \frac{P\left(D|H_1X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.75\frac{0.9}{0.75*0.9 + 0.25*0.6} \approx 0.82,$$
So the answer to the problem is that the probability that your friend drove the small car is 82%.

For completeness, we could use the fact that $P\left(H_2|DX\right) = 1-P\left(H_1|DX\right) $ to find the posterior probability on $H_2$, or we can calculate it in the same way:
$$P\left(H_2|DX\right) = P\left(H_2|X\right) \frac{P\left(D|H_2X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.25\frac{0.6}{0.75*0.9 + 0.25*0.6} \approx 0.18$$

Compound data example

I want to cover a problem with a more complex data set. Consider the following problem that I just made up:
To make a good espresso, you need a good machine and a skilled bariste. A local coffee shop has two espresso machines: a good one and a bad one. The good one makes a good espresso 95% of the time and a terrible one 5%, even if the operator is perfect. The bad one makes only 50% good and 50% bad with perfect operation. 
There are also two baristi at this shop: the owner and a trainee. The owner is always working, and is a perfect operator of espresso machines. The trainee works half of the time, and ruins the espresso 30% of the time, regardess of how the machine performs. If both people are working on a particular day, they're equally likely to be on espresso duty. 
This morning you ordered two espressi. One was good and one was bad. How likely is each combination of bariste and machine?
First of all, how cool is it that we can solve this problem? It doesn't feel like we have enough information, but part of the beauty of Bayes is that there are no questions which are impossible to ask. You might get a more or less informative answer, but you can always ask. Let's do this.

Write down Bayes' rule:

$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}$$
We're so good at this!

Write the hypotheses:

$$H_1=\textrm{The owner is using the good machine.}$$
$$H_2=\textrm{The owner is using the bad machine.}$$
$$H_3=\textrm{The trainee is using the good machine.}$$
$$H_4=\textrm{The trainee is using the bad machine.}$$
Starting to think maybe they should stop using the bad machine.

Write down the data:

$$D=D_1D_2$$
where
$$D_1=\textrm{The first espresso is good,}$$
$$D_2=\textrm{The second espresso is good,}$$
and the notation $D_1D_2$ means $D_1$ AND $D_2$.

Find the priors:

It's easiest to think through this with a decision tree. Either both baristi are working or else just the owner (even chances), and if they're both working they have an equal chance of running the machine. In either case, it's an even split on which machine it is.


so:
$$P\left(H_1|X\right) = P\left(H_2|X\right) = 3/8$$
and
$$P\left(H_3|X\right) = P\left(H_4|X\right) = 1/8.$$

Find the sampling distributions:

For each hypothesis, how likely are we to get two good espressi? We get a good espresso if the machine works and the operator doesn't mess up. In the problem statement, I was careful to note that these probabilities were independent. That is, the chance of operator error doesn't depend on the machine, and the machine error rate doesn't depend on the operator. So the probability of a single espresso being good under each of the hypotheses is:
$$p_1 = 0.95*1 = 0.95,$$ $$p_2 = 0.5*1 = 0.5,$$ $$p_3 = 0.95*0.7 = 0.665,$$ and $$p_4 = 0.5*0.7 = 0.35$$

If you order $N$ espressi, the chance of getting exactly $g$ good ones follows a binomial distribution, so:
$$P\left(g|H_i X\right) = {N \choose g} \left(p_i\right)^g \left(1-p_i\right)^{\left(N-g\right)},$$
where I'm using $g$ as shorthand for "Exactly $g$ good espressi were made," where the $p_i$ are calculated above.

For this problem, $n=2$ and $g=1$. Then
$$P\left(D|H_1 X\right) = {2 \choose 1}\left(0.95\right)^1\left(1-0.95\right)^1 = 0.095$$
or just under 10%. It's low because with a working machine and an expert operator, it's unlikely we would get a bad espresso. Note that it's a little bit of an overkill to use a binomial distribution here, since this expression is equivalent to $2*p_{bad}*p_{good}$, but this method is applicable for longer strings of data as well. Similarly,
$$P\left(D|H_2 X\right) =  0.5,$$ $$P\left(D|H_3 X\right) =  0.45,$$ and $$P\left(D|H_4 X\right) =  0.45.$$

Construct the normalizing term:

We already have all of the pieces.
$$P\left(D|X\right)=\sum_i P\left(D|H_i X\right)P\left(H_i|X\right)$$
$$= 0.1*0.375 + 0.5*0.375 + 0.45*0.125 + 0.45*0.125 \approx 0.34.$$

Calculate the posteriors:

$$P\left(H_1|DX\right) = P\left(H_1|X\right)\frac{P\left(D|H_1X\right)}{P\left(D|X\right)} = 3/8*\frac{0.1}{0.34} \approx 0.11$$
$$P\left(H_2|DX\right) = 3/8*\frac{0.5}{0.34} \approx 0.55$$
$$P\left(H_3|DX\right) = P\left(H_3|DX\right) = 1/8*\frac{0.45}{0.34} \approx 0.17.$$
So in the end, we find that the most likely hypothesis is $H_2$, that the owner is using the bad machine. At first it seems counterintuitive that the two trainee hypotheses are equally likely, but it worked out this way because the probabilities of getting a good espresso from those hypotheses were complementary, i.e. $p_3 \approx 1- p_4$.

Hopefully this problem demonstrated how to deal with slightly more complicated priors and data sets. The point is that the protocol for solving these problems is always the same. 

Thursday, November 5, 2015

Blogosphere reentry

I've held off on blogging for a while here while I've been at insight, but the program has officially ended. I'm still going into the office as I prepare for interviews, but I'm not actively working on a project right now, and I have a long backlog of things I'd like to talk about here.

Here are some things I'd like to write about:

  • Setting up Bayes rule problems: Apparently these textbook problems are asked commonly in interviews. They're all pretty much the same, and I want to show a nice protocol for solving them. Hopefully it will also help you think about non-textbook problems.
  • Statistics for probabilists: Coming from a probability background, I find statistics a little maddening. What question are we asking? What assumptions are we making? How good is the answer? What is the equivalent procedure in probability? This will be a multi-part blog, and I'm not quite qualified to write it yet.
  • Machine learning comparison: random forest, support vector machines, etc. I want to talk about the kernel trick.
  • Regularization: should probably be its own topic. It guards against overfitting and helps us choose a solution in ill-posed problem. Regularization methods have strange names in data science.
  • My Insight project (BackTweet Driver): helps people refine their tweets to increase the chance of being retweeted. You can see it at backtweetdriver.com
...and many more ideas, but you get the idea.

See you soon.