Sunday, November 22, 2015

Digit recognition part 1: visualization

I'd like to move more toward problem-based posts rather than tool-based posts, and today I'm looking at the famous MNIST (Modified National Institute of Standards and Technology) dataset of handwritten digits. These data are available at Kaggle, for example, where there is a training "competition" to use machine learning to identify new digits based on a training set. I'm going to ignore the test set for now and just work with the training set.

In the first part of this series, I just want to get a feel for the data: visualize it and do some preliminary analysis, as if I were not familiar with the problem at all. I'll be working in an IPython notebook. If you want to follow along, I uploaded it to Github.

The data

The data consist of about 40000 digitized handwritten numerals $\in$ [0, 9]. Each numeral was digitized on a 28x28 pixel grid. These grids have been flattened into rows of 784 pixels and stacked together into a 2D array that looks like this:


Each pixel has an intensity value $\in$ [0, 255]. Here's what the raw data array looks like if we just plot it:



Each row can be reconstructed into a 28x28 array. Here are some examples of random reshaped rows:
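
Here's a minimal sketch of the reshaping step, assuming the Kaggle train.csv layout (a "label" column followed by 784 pixel columns):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")                       # assumed Kaggle file name/layout
pixels = df.drop("label", axis=1).values            # shape: (n_examples, 784)

# pick a few random rows and reshape each back into a 28x28 image
fig, axes = plt.subplots(1, 4)
for ax, i in zip(axes, np.random.choice(len(pixels), 4, replace=False)):
    ax.imshow(pixels[i].reshape(28, 28), cmap="gray")
    ax.set_title(str(df["label"].iloc[i]))
    ax.axis("off")
plt.show()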


The information

Eventually we'll want to train a classifier with these digits, and will want to figure out which features (pixels) are important, and which ones aren't. There are plenty of ways to do that automatically, but it's also straightforward to poke at the data a bit and figure it out ahead of time. First, let's make sure we have the same number of examples for each numeral. The raw data array actually has a prepended "label" column that I didn't mention above, which is just the numeral index, 0-9. We can use this to make a histogram:

where we see that there are about 4000 examples of each numeral.
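
A quick sketch of that check (same assumed train.csv layout):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")                       # assumed Kaggle file name/layout
print(df["label"].value_counts().sort_index())      # roughly 4000 of each digit
df["label"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("numeral"); plt.ylabel("count")
plt.show()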

Next, let's look at what range of intensities we have. Making a histogram of all of the intensities for the entire data set, we see the following:
where it's clear that the data mostly consist of black and white pixels, and not much in between. This gives us a hint that we could binarize the data set without losing much information. It also might mean that there's been some pre-processing done on the data set.
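
A sketch of the intensity histogram and the sort of binarization hinted at here (the 128 threshold is an arbitrary choice for illustration):

import pandas as pd
import matplotlib.pyplot as plt

pixels = pd.read_csv("train.csv").drop("label", axis=1).values   # assumed layout
plt.hist(pixels.ravel(), bins=50)
plt.xlabel("pixel intensity"); plt.ylabel("count")
plt.show()

# since the distribution is mostly black-or-white, a hard threshold loses little
binary = (pixels > 128).astype("uint8")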

How much does each numeral change throughout the data set? We can get some intuition by grouping examples of each numeral and then plotting their means and variances.
Means of each numeral
Variances of each numeral

These two plots contain more or less the same information - that the variance is concentrated near the edges of each numeral, and there isn't too much variance within each numeral, i.e. there's nothing crazy going on here. Compare this to the average and variance taken over all of the numbers.
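
A sketch of how those images can be computed with a groupby on the label column (same assumed layout):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")                       # assumed Kaggle file name/layout
means = df.groupby("label").mean()                  # one 784-pixel "mean image" per digit
variances = df.groupby("label").var()

fig, axes = plt.subplots(2, 10, figsize=(12, 3))
for d in range(10):
    axes[0, d].imshow(means.loc[d].values.reshape(28, 28), cmap="gray")
    axes[1, d].imshow(variances.loc[d].values.reshape(28, 28), cmap="gray")
    axes[0, d].axis("off"); axes[1, d].axis("off")
plt.show()

# and the same quantities over all numerals together
overall_mean = df.drop("label", axis=1).mean().values.reshape(28, 28)
overall_var = df.drop("label", axis=1).var().values.reshape(28, 28)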


We expect variance to correlate well to information, so these images show that most of the information is clustered in a blob near the center, which isn't too surprising. We could probably throw away a bunch of the outer pixels without too much trouble. Let's quantify that.

Suppose we order these pixels by variance (descending) and plot the cumulative variance by pixel. That is, show how adding the next largest variance increases the total variance. That gives us an idea of how many pixels we need to capture most of the variance, which we can expect to be related to information content.

The green line is set at 90% of the total variance, which intersects the cumulative variance at about 290 pixels. So through this exercise we've learned that we can throw away almost 2/3 of the pixels and still capture about 90% of the information, which will be useful when we start training a classifier.
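
For reference, a sketch of that cumulative-variance calculation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

pixels = pd.read_csv("train.csv").drop("label", axis=1).values   # assumed layout
var = pixels.var(axis=0)
cumulative = np.cumsum(np.sort(var)[::-1]) / var.sum()

n_90 = np.searchsorted(cumulative, 0.90) + 1        # about 290 pixels
print(n_90, "pixels capture 90% of the variance")

plt.plot(cumulative)
plt.axhline(0.90, color="green")
plt.xlabel("number of pixels (sorted by variance)")
plt.ylabel("fraction of total variance")
plt.show()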

Sunday, November 15, 2015

Bayes: Thoughts on prior information

Last time I presented a protocol for solving textbook Bayes rule problems, in which I advocated tacking the prior information $X$ onto each of the terms, like so: $$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}.$$

Here I'd like to talk briefly about why I think that's a good idea.

1. It's good for the narrative

Each term in Bayes' rule has a straightforward interpretation, which I explained last time. But if we leave out $X$, things get a bit ambiguous. Specifically, suppose we write the prior as simply $P\left(H\right)$. This can be read as "the probability that hypothesis $H$ is true." But isn't that what we were trying to calculate? Personally, I find it clearer to write $P\left(H|X\right)$ and read it as "the probability that hypothesis $H$ is true given (only) the prior information."

 

2. Prior information is sneaky

It might be useful to remember that $X$ includes the problem statement, and in most textbook problems, that's the only thing in $X$. But sometimes a problem assumes that you know something about how the world works. For example, there are a lot of Bayes problems floating around about twins (for some reason) that require you to know the incidence of identical and non-identical twins.
Beyond simple statistics, we sometimes use as prior information what was left out of the problem statement. There is perhaps no better illustration of this than Bertrand's Paradox, which asks for the probability that a randomly drawn chord of a circle is longer than the side of an equilateral triangle inscribed in that circle. Jaynes suggests that the problem can be resolved by noting that it says nothing about where the circle is or how large it is. If the problem is to have a unique solution, that solution must be invariant under transformations of those parameters.

Even if you don't like Jaynes' thoughts on Bertrand, other transformation invariances have to be considered. A subtle one is the following: Suppose we come up with a prior probability $P\left(H|X\right)$. If we imagine an ensemble of experiments, the $j^{th}$ of which generates data $D_j$, we could calculate all of the possible posteriors, $P\left(H|D_j X\right)$. If we sum these over $j$, weighting by how likely each result is, we must get the prior back.* If we don't, then we have the wrong prior. So here, we can consider $X$ to include the statement that the prior is constrained by this invariance.

* The proof for this is trivial. We just expand the prior in the $D_j$ basis and then use the product rule: $$P\left(H|X\right) = \sum_j P\left(HD_j|X\right) = \sum_j P\left(H|D_jX\right)P\left(D_j|X\right).$$ This happens because when we imagine the ensemble of experiments, we can only use our prior information to do so. So if we can construct the term on the right-hand side, then we must be able to construct the one on the left.

   

3. Probability is subjective

...but not arbitrary. That is, there's a unique posterior distribution for each set of prior information. Suppose Alice and Bob are each trying to estimate the proportion of red and white balls in an urn based on the next 10 balls drawn with replacement. But while Bob has just shown up, Alice has been watching the urn for hours, and has seen 200 balls drawn already. They'll rightly calculate two different posteriors on the proportion of balls, which we might label $$P\left(H|DX_A\right) = P\left(H|X_A\right) \frac{P\left(D|HX_A\right)}{P\left(D|X_A\right)}$$ and $$P\left(H|DX_B\right) = P\left(H|X_B\right) \frac{P\left(D|HX_B\right)}{P\left(D|X_B\right)}.$$
Clearly, we would be in trouble if we didn't include $X_A$ and $X_B$ here, since these would look like identical calculations. This is really only a problem because of the common misconception that the probability distribution is an aspect of the urn rather than an expression of Alice's and Bob's ignorance about the urn.
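
To make this concrete, here's a toy sketch that is not part of the urn problem as stated: encode Alice's and Bob's different prior information as Beta distributions (a convenient conjugate choice), with made-up counts, and update both on the same new draws.

from scipy import stats

# Bob knows only the problem statement: a flat Beta(1, 1) prior on the red fraction.
bob_a, bob_b = 1, 1
# Alice has already watched 200 draws, say 130 red and 70 white (hypothetical numbers).
alice_a, alice_b = 1 + 130, 1 + 70

# Both now see the same 10 draws: 7 red, 3 white (also hypothetical).
red, white = 7, 3

for name, (a, b) in [("Bob", (bob_a, bob_b)), ("Alice", (alice_a, alice_b))]:
    posterior = stats.beta(a + red, b + white)      # conjugate Beta-binomial update
    print(f"{name}: posterior mean for the red fraction = {posterior.mean():.3f}")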

Hopefully this is enough to convince you to include $X$ when you write down Bayes' rule. If you do, I'll thank you, since it'll be less confusing for me.

Tuesday, November 10, 2015

A plan for textbook Bayes' rule problems

I like (love) Bayes' rule, as a few (many) of you know. It's applicable in many situations (every situation), and employers may (should) want to know that you can use it properly. To do that, they might present you with problems like this:
A friend who works in a big city owns two cars, one small and one large. Three-quarters of the time he drives the small car to work, and one-quarter of the time he drives the large car. If he takes the small car, he usually has little trouble parking, and so is at work on time with probability 0.9. If he takes the large car, he is at work on time with probability 0.6. Given that he was on time on a particular morning, what is the probability that he drove the small car? [from here]

The point of these problems, and the main function of Bayes' rule, is to combine new evidence (data) with prior information to update our beliefs about a set of hypotheses. For a good introduction to the Bayesian way of thinking, check out E.T. Jaynes' book Probability Theory: The Logic of Science. Here, I want to provide a protocol for attacking these problems that should elucidate the process, and maybe clear up some confusion about Bayes' rule.

Bayes' Rule

I like to write Bayes' rule like this:
$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)},$$
where the symbols mean the following:
  • $H$: A hypothesis, which is a proposition like "She's a witch," or "I like pizza, and bats are reptiles." It can always be written as a full grammatical sentence.
  • $D$: The data, which may consist of many individual data points.
  • $X$: The prior information. This always includes the problem statement, and may include other things. In principle, it includes everything you believe to be true about the universe in which the problem takes place. If you include irrelevant information in $X$, it will have no effect on the problem.
You'll see slightly different representations elsewhere, notably omitting the $X$ (leaving it implicit), and using other symbols for $D$ and $H$. I like this representation because it has a clear narrative. There are four quantities here, which can be interpreted as:
  • The posterior probability, $P\left(H|DX\right)$: The probability that the hypothesis is true, given the prior information AND the data. This is what we want to calculate.
  • The prior probability, $P\left(H|X\right)$: The probability that the hypothesis is true given the prior information, but without knowing the data.
  • The sampling distribution, $P\left(D|HX\right)$: The probability that we would see this data set if the hypothesis and the prior information were both true.
  • A normalizing constant, $P\left(D|X\right)$: The probability that we would observe the data regardless of the hypothesis. Sometimes called the marginal.

The plan 

The plan for solving problems like this is the following:
  1. Write down Bayes' rule.
  2. Write out all of the hypotheses as English sentences.
  3. Write down the data.
  4. Find all of the prior probabilities.
  5. Find all of the sampling distributions.
  6. Construct the normalizing term.
  7. Shut up and calculate.
Let's take a closer look at the example problem from above.

Write down Bayes' rule

This seems obvious, but go ahead and do it anyway. In fact, write it using the notation above, since it's hard to forget what you're doing that way. I would give you about 80% credit as an interviewer if you got this far and were able to explain the terms.

Write out the hypotheses

In the Bayesian view, there is a space of hypotheses that describe every way the universe can be. For each problem, there is a set of mutually exclusive hypotheses that span that space. We might label them $\left(H_1, H_2,...,H_N \right)$. If you don't know what every member of this space is, you can't do the problem. On the other hand, this space reflects your view of the universe, so you can always define it in principle.

For the above problem, we have the hypotheses
$$H_1 = \textrm{Your friend drove the small car.}$$ $$H_2 = \textrm{Your friend did not drive the small car.}$$
Mutually exclusive and exhaustive. Lovely.

Write down the data

The data are the things that let us update our beliefs about a set of propositions. Often, they're the things we measure. They can also be written as full sentences, and might be something like $D = \textrm{"I saw three ships come sailing in,"}$ or $D = \textrm{"Out of six die rolls, two of them resulted in a 4."}$

The data can also be a complicated logical statement, which lets us join together a bunch of points. For example, $D = \textrm{"The first roll was a 1 AND the second roll was a 5 AND..."}$, which can be represented as $D = D_1D_2D_3...$.

In the example problem, $D=\textrm{"Your friend was on time this morning."}$

Find the prior probabilities

In 99.999% of textbook problems, this step is as easy as reading some numbers from the problem statement. Prior probabilities will be provided directly, and we remain agnostic about where they came from. This leads to a great deal of confusion and skepticism about Bayes' rule, which I'll elaborate on another time. For now, be assured that for any set of prior information, there is one correct prior probability on each hypothesis.

Let's take the given priors for hypothesis $H_1$ and $H_2$:
$$P\left(H_1|X\right) = 3/4$$
or, "the probability that your friend drove the small car given the problem statement but without knowing whether he was on time is equal to $3/4$. And:
$$P\left(H_2|X\right) = 1/4$$.
Since the set of $H_i$ include the whole of possible reality, the prior probabilities had better sum to unity, and they do. 

Find the sampling distributions

If each hypothesis were true, how likely is it that we would have seen this exact data set? To generate these numbers, we might need to pull in some expertise from combinatorics or statistics, or we might read it from the problem statement. The example problem is a case of the latter: $$P\left(D|H_1 X\right) = 0.9,$$
or "The probability that your friend arrives on time given that he drove the small car is 0.9." Similarly, $$P\left(D|H_2 X\right) = 0.6$$
If the data consist of more than one thing, remember that we can always expand the joint sampling distribution like this:
$$P\left(D|HX\right) = P\left(D_1D_2\ldots D_N|HX\right) = P\left(D_1|D_2\ldots D_NHX\right)P\left(D_2\ldots D_N|HX\right) = \ldots$$
Also, if the data are independent from each other (if we're looking at dice rolls, no roll affects any other roll), then the joint sampling distribution is just a product of each sampling distribution, i.e.
$$P\left(D|HX\right) = \prod_i P\left(D_i|HX\right).$$

Build the normalization term

To get the normalization term into a form we can calculate, we need to do a little massaging. Any probability can be broken into a sum of joint probabilities with another variable, i.e. $P\left(A\right) = \sum_i P\left(AB_i\right)$, where the $B_i$ are mutually exclusive and exhaustive. The normalization constant in particular can be broken into a sum of joint probabilities with each hypothesis:
$$P\left(D|X\right) = \sum_i P\left(DH_i|X\right).$$
Then we can use the product rule to transform the thing in the sum into this:
$$P\left(D|X\right) = \sum_i P\left(D|H_iX\right)P\left(H_i|X\right).$$
The cool thing about this form is that we've already calculated everything in it. The sum contains each prior probability with its associated sampling distribution. That means we don't have to do any additional thinking - we just add together the numbers we already thought about.

For the example problem, $$P\left(D|X\right) = P\left(H_1|X\right)P\left(D|H_1X\right) + P\left(H_2|X\right)P\left(D|H_2X\right) = 0.75*0.9 + 0.25*0.6 = 0.825$$

Shut up and calculate:

We have all of the pieces, so let's get the posterior probability of each hypothesis:
$$P\left(H_1|DX\right) = P\left(H_1|X\right) \frac{P\left(D|H_1X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.75\frac{0.9}{0.75*0.9 + 0.25*0.6} \approx 0.82.$$
So the answer to the problem is that the probability that your friend drove the small car is 82%.

For completeness, we could use the fact that $P\left(H_2|DX\right) = 1-P\left(H_1|DX\right) $ to find the posterior probability on $H_2$, or we can calculate it in the same way:
$$P\left(H_2|DX\right) = P\left(H_2|X\right) \frac{P\left(D|H_2X\right)}{\sum_i P\left(D|H_iX\right)P\left(H_i|X\right)} = 0.25\frac{0.6}{0.75*0.9 + 0.25*0.6} \approx 0.18$$
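
If you'd rather let a computer do the arithmetic, the whole calculation is a few lines (a sketch using the numbers above):

priors = {"small": 0.75, "large": 0.25}             # P(H_i | X)
likelihoods = {"small": 0.9, "large": 0.6}          # P(D | H_i X), with D = "on time"

evidence = sum(priors[h] * likelihoods[h] for h in priors)            # P(D | X) = 0.825
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
print(posteriors)                                   # {'small': ~0.82, 'large': ~0.18}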

Compound data example

I want to cover a problem with a more complex data set. Consider the following problem that I just made up:
To make a good espresso, you need a good machine and a skilled barista. A local coffee shop has two espresso machines: a good one and a bad one. The good one makes a good espresso 95% of the time and a terrible one 5%, even if the operator is perfect. The bad one makes only 50% good and 50% bad with perfect operation. 
There are also two baristi at this shop: the owner and a trainee. The owner is always working, and is a perfect operator of espresso machines. The trainee works half of the time, and ruins the espresso 30% of the time, regardless of how the machine performs. If both people are working on a particular day, they're equally likely to be on espresso duty. 
This morning you ordered two espressi. One was good and one was bad. How likely is each combination of barista and machine?
First of all, how cool is it that we can solve this problem? It doesn't feel like we have enough information, but part of the beauty of Bayes is that there are no questions which are impossible to ask. You might get a more or less informative answer, but you can always ask. Let's do this.

Write down Bayes' rule:

$$P\left(H|DX\right) = P\left(H|X\right) \frac{P\left(D|HX\right)}{P\left(D|X\right)}$$
We're so good at this!

Write the hypotheses:

$$H_1=\textrm{The owner is using the good machine.}$$
$$H_2=\textrm{The owner is using the bad machine.}$$
$$H_3=\textrm{The trainee is using the good machine.}$$
$$H_4=\textrm{The trainee is using the bad machine.}$$
Starting to think maybe they should stop using the bad machine.

Write down the data:

$$D=D_1D_2$$
where
$$D_1=\textrm{The first espresso is good,}$$
$$D_2=\textrm{The second espresso is good,}$$
and the notation $D_1D_2$ means $D_1$ AND $D_2$.

Find the priors:

It's easiest to think through this with a decision tree. Either both baristi are working or else just the owner (even chances), and if they're both working they have an equal chance of running the machine. In either case, it's an even split on which machine it is.


so:
$$P\left(H_1|X\right) = P\left(H_2|X\right) = 3/8$$
and
$$P\left(H_3|X\right) = P\left(H_4|X\right) = 1/8.$$

Find the sampling distributions:

For each hypothesis, how likely are we to get two good espressi? We get a good espresso if the machine works and the operator doesn't mess up. In the problem statement, I was careful to note that these probabilities were independent. That is, the chance of operator error doesn't depend on the machine, and the machine error rate doesn't depend on the operator. So the probability of a single espresso being good under each of the hypotheses is:
$$p_1 = 0.95*1 = 0.95,$$ $$p_2 = 0.5*1 = 0.5,$$ $$p_3 = 0.95*0.7 = 0.665,$$ and $$p_4 = 0.5*0.7 = 0.35$$

If you order $N$ espressi, the chance of getting exactly $g$ good ones follows a binomial distribution, so:
$$P\left(g|H_i X\right) = {N \choose g} \left(p_i\right)^g \left(1-p_i\right)^{\left(N-g\right)},$$
where I'm using $g$ as shorthand for "Exactly $g$ good espressi were made," and the $p_i$ are as calculated above.

For this problem, $N=2$ and $g=1$. Then
$$P\left(D|H_1 X\right) = {2 \choose 1}\left(0.95\right)^1\left(1-0.95\right)^1 = 0.095$$
or just under 10%. It's low because with a working machine and an expert operator, it's unlikely we would get a bad espresso. Note that it's a bit of overkill to use a binomial distribution here, since this expression is equivalent to $2*p_{bad}*p_{good}$, but this method is applicable for longer strings of data as well. Similarly,
$$P\left(D|H_2 X\right) =  0.5,$$ $$P\left(D|H_3 X\right) =  0.45,$$ and $$P\left(D|H_4 X\right) =  0.45.$$

Construct the normalizing term:

We already have all of the pieces.
$$P\left(D|X\right)=\sum_i P\left(D|H_i X\right)P\left(H_i|X\right)$$
$$= 0.1*0.375 + 0.5*0.375 + 0.45*0.125 + 0.45*0.125 \approx 0.34.$$

Calculate the posteriors:

$$P\left(H_1|DX\right) = P\left(H_1|X\right)\frac{P\left(D|H_1X\right)}{P\left(D|X\right)} = 3/8*\frac{0.1}{0.34} \approx 0.11$$
$$P\left(H_2|DX\right) = 3/8*\frac{0.5}{0.34} \approx 0.55$$
$$P\left(H_3|DX\right) = P\left(H_4|DX\right) = 1/8*\frac{0.45}{0.34} \approx 0.17.$$
So in the end, we find that the most likely hypothesis is $H_2$, that the owner is using the bad machine. At first it seems counterintuitive that the two trainee hypotheses are equally likely, but it worked out this way because the probabilities of getting a good espresso under those hypotheses were complementary, i.e. $p_3 \approx 1- p_4$, and the binomial factor $2p\left(1-p\right)$ is symmetric under $p \to 1-p$.
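
As a sanity check, here's a short sketch that reproduces these numbers end to end:

from math import comb

priors = [3/8, 3/8, 1/8, 1/8]                              # H1..H4 from the decision tree
p_good = [0.95 * 1.0, 0.5 * 1.0, 0.95 * 0.7, 0.5 * 0.7]    # single-espresso success rates

N, g = 2, 1                                                # two espressi ordered, one good
sampling = [comb(N, g) * p**g * (1 - p)**(N - g) for p in p_good]

evidence = sum(pr * s for pr, s in zip(priors, sampling))
posteriors = [pr * s / evidence for pr, s in zip(priors, sampling)]
print([round(x, 2) for x in posteriors])                   # [0.11, 0.56, 0.17, 0.17]
# (0.56 vs. the 0.55 above is just rounding of the intermediate values)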

Hopefully this problem demonstrated how to deal with slightly more complicated priors and data sets. The point is that the protocol for solving these problems is always the same. 

Thursday, November 5, 2015

Blogosphere reentry

I've held off on blogging here while I've been at Insight, but the program has officially ended. I'm still going into the office as I prepare for interviews, but I'm not actively working on a project right now, and I have a long backlog of things I'd like to talk about here.

Here are some things I'd like to write about:

  • Setting up Bayes' rule problems: Apparently these textbook problems are commonly asked in interviews. They're all pretty much the same, and I want to show a nice protocol for solving them. Hopefully it will also help you think about non-textbook problems.
  • Statistics for probabilists: Coming from a probability background, I find statistics a little maddening. What question are we asking? What assumptions are we making? How good is the answer? What is the equivalent procedure in probability? This will be a multi-part blog, and I'm not quite qualified to write it yet.
  • Machine learning comparison: random forest, support vector machines, etc. I want to talk about the kernel trick.
  • Regularization: should probably be its own topic. It guards against overfitting and helps us choose a solution in ill-posed problems. Regularization methods have strange names in data science.
  • My Insight project (BackTweet Driver): helps people refine their tweets to increase the chance of being retweeted. You can see it at backtweetdriver.com
...and many more ideas, but you get the idea.

See you soon.

Sunday, September 6, 2015

Project: Hearthstone tracker [code]

I play a certain undisclosed amount of Hearthstone, Blizzard's online collectible card game. In the game, each of two players has a deck of cards which get played on alternating turns. There are many different kinds of decks belonging to nine "classes" of characters, and something like 700 unique cards.

Hearthstone screenshot: our character is at the bottom, and the opponent is at the top.

Last night I discovered that Hearthstone can be configured to create a log of every game-state change, which is awesome. People have used this information to create "deck trackers," which keep track of which cards you've drawn, so you know what's left in your deck. This is just scratching the surface, though. In principle, one could use a library of these log files to put together a sophisticated analysis of what kinds of decks people are playing, what cards are likely to appear together in a deck, and even list the most likely plays of an opponent next turn.

To start, I wanted to look at the log file and see what I could parse. Here is a link to a github repo where I've included a log file from an example game, and some Python code that reads the file and creates a DataFrame object with a list of all of the entities in the game (an entity can be a card, player, hero power, enchantment, and probably some other stuff). If at any time in the game the entity name is revealed, I update the DataFrame to include its name. The output looks like this:

where the "NaN" labels correspond to entities that were never revealed during the course of the game.

There's also a function that simply lists all of the cards I draw during the game and all of the cards my opponent plays at some point; its output looks like this:

where the first and second blocks are my cards and my opponent's cards, respectively. The next step is to integrate more card information into the DataFrame. I'm using regular expressions to do the parsing, so it's slow work. I'm not worried about putting a front end on this any time soon.
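
For a taste of what the parsing looks like, here's a toy sketch with a made-up line format (not the real Hearthstone log syntax), just to illustrate the regex-to-DataFrame idea:

import re
import pandas as pd

# made-up log lines: entity ids with names that may or may not be revealed
sample_lines = [
    "TAG_CHANGE Entity=[name=UNKNOWN id=12]",
    "SHOW_ENTITY Entity=[name=Fireball id=37]",
    "TAG_CHANGE Entity=[name=Fireball id=37]",
]

pattern = re.compile(r"name=(?P<name>[^\]]+?) id=(?P<id>\d+)")

entities = {}
for line in sample_lines:
    match = pattern.search(line)
    if match:
        eid = int(match.group("id"))
        # record the name once it's revealed; otherwise keep the placeholder
        if entities.get(eid) in (None, "UNKNOWN"):
            entities[eid] = match.group("name")

df = pd.DataFrame(sorted(entities.items()), columns=["entity_id", "name"])
print(df)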

Saturday, August 22, 2015

Data visualization using Seaborn in Python

Last time, I presented an analysis of some education data available from IPEDS. For the visualization, I used a Python package called Seaborn. After about eight years of using MATLAB and Mathematica for plotting, I was astounded by the quality of the plots. Here, I want to talk a bit about Seaborn, and the learning curve I ascended.

If you'd like to follow along, here's a link to the .csv files I'm using for this post.

Seaborn

Seaborn specializes in visualizing statistical data: categorical comparisons, distributions, and linear relationships. It handles uncertainties very well, plotting confidence intervals and linear regressions by default. It runs on top of Matplotlib, another plotting package.

The Seaborn website includes a "tutorial" and a gallery, but the tutorial is very limited, and frankly, not basic enough for me. Here, I'll show a couple of examples in more detail.

Here's an example of what Seaborn can do:


Input data

You can feed Seaborn a variety of data formats, but it's convenient to use DataFrames, since (1) they're the native data structure in Pandas, and (2) they're a damned elegant way to represent data. I didn't know anything about Pandas when I started this project, and it took me a few false starts to get the .csv files into a form that Seaborn liked. Here's how I did it.

.csv to DataFrame

A comma separated value file looks like this:

while a dataframe of the same simple data looks like this:

Some differences are:
1) The column headers have become labels, and are no longer part of the columns
2) A new index column was added to act as an identifying key
3) The data type of each column is stored in memory

Pandas gives us a way to import data from a .csv directly into a dataframe using the read_csv function:
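
A sketch of that step; the file name is a stand-in for whichever of the provided .csv files you're using:

import pandas as pd               # for DataFrames and read_csv
import seaborn as srs             # plotting (aliased; used in the plots below)
import matplotlib.pyplot as plt   # for showing/saving figures

df = pd.read_csv("degrees_by_state.csv")    # stand-in file name
print(df.head())                  # headers have become column labels, plus an index
print(df.dtypes)                  # each column's data type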

Note that we loaded some packages here, and called them by shorter aliases so we don't have to type long names later.

Plotting

I used the function regplot to generate the above plot. It generates a scatter plot, and it automatically does a regression and plots the best fit along with the 95% confidence predictions [Note: It causes me physical pain to plot a linear regression and confidence interval when I have no reliable information about the random process generating my data. In my defense, I'm ignoring the results of the regression entirely]. As inputs, it takes the DataFrame containing the data, and it takes references (in the form of the column label strings) to the columns we want to plot. The above plot just needs two columns - one for each axis.

Calling regplot returns an "axis" object. Next, we have to tell Python to put that object into a plot and show it. We use the srs.plt.show() command:
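
Something like this, where the column names are stand-ins for whatever two columns you're plotting:

import pandas as pd
import seaborn as srs
import matplotlib.pyplot as plt

df = pd.read_csv("degrees_by_state.csv")                       # stand-in file name
ax = srs.regplot(x="population", y="total_degrees", data=df)   # scatter + fit + CI
plt.show()                        # plt.show() does the same job as srs.plt.show()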

We have dozens of options to tweak the appearance of this graph, but the raw output already looks better than about 95% of the graphs I've published. The plot window has a save option, and you can export the figure as a .pdf and then edit it in any vector graphic program (I use both Adobe Illustrator and Inkscape). But if you're a baller or a masochist, you might prefer to modify it in Python.
The plots take something like a style sheet, where you can choose a theme based on what you're using the graphic for. It changes line thicknesses and font sizes, among other things, for slides, papers, or posters. Change it with the set_context command.
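
For example:

import seaborn as srs
srs.set_context("poster")         # other built-in contexts: "paper", "notebook", "talk"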

Color schemes

What would we sink our copious free time into if it weren't for color scheme choices?

You're free to define whatever colors you want in Seaborn plots, but as I'm learning, nobody does original work in data science (I kid!). Seaborn can tap into colorbrewer, whose color schemes are illustrated here. As an example, here's a horizontal bar chart using some of the data I provided:


which was generated with the following code:
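
Roughly, something like this (the column names are stand-ins; "Blues_d" is one of the ColorBrewer-style palettes Seaborn accepts):

import pandas as pd
import seaborn as srs
import matplotlib.pyplot as plt

df = pd.read_csv("degrees_by_state.csv")                  # stand-in file name
# numeric values on x, categories on y -> horizontal bars
srs.barplot(x="degrees_per_100k", y="state", data=df, palette="Blues_d")
plt.show()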


I turned the chart sideways by providing the categories to the y-axis. Clever! If you feed Seaborn numerical data on one axis, it plots the bars on that axis. If both axes are non-numerical, it throws an error.

One of my .csv files had commas marking the thousands place for some reason, and Python imported these numbers as strings. Seaborn was very unhappy. If this happens, you can convert the strings back into numbers in Python, or you can fix your .csv manually.

Multiple columns

I had problems when I wanted to plot more than one category of data. The documentation on data structure for Seaborn is hard to find or doesn't exist, and I had to suss out what it was looking for. I first tried feeding it the following:

which generated this incorrect bar chart:


What happened here is that Seaborn thought I wanted the bars to correspond to the average of the columns, and the black lines to be the standard deviations. In the tutorial, the columns are supplied as values of a single category, which was not a feature of my data set. The solution was to "massage" the dataframe from the raw input:


into something that looked like this:

Here, I've used the "melt" function in Pandas to map the column names into values of the second column, effectively adding a new variable called "variable" whose values are in (degrees_per_100k, phys_deg_100k). I can now tell Seaborn that the "hue" of the data set is controlled by "variable" and that the bar heights are controlled by "value". The code now looks like this:
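
In sketch form (degrees_per_100k and phys_deg_100k are the real column names; "state" stands in for the category column):

import pandas as pd
import seaborn as srs
import matplotlib.pyplot as plt

df = pd.read_csv("degrees_by_state.csv")                  # stand-in file name
melted = pd.melt(df, id_vars=["state"],
                 value_vars=["degrees_per_100k", "phys_deg_100k"])
# melted now has three columns: state, variable, value
srs.barplot(x="value", y="state", hue="variable", data=melted)
plt.show()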


which results in this plot:

That's the extent of my limited experience with Seaborn, but I will surely continue using it. I'm pretty impressed so far.

Thursday, August 20, 2015

Secondary education data: analysis with SQL and Python/Seaborn

I recently looked at a collection of data from 2013 published by the Integrated Postsecondary Education Data System. Every year they take a census of postsecondary educational institutions in the United States, who report all kinds of information about their enrollment, costs, staff, and degrees awarded. My goal was to learn how to use SQL to interact with databases, and how to visualize data using Python.

Let's start with what I found out first, and then I'll talk about where I got the data, and how I imported it and used MySQL to analyze it. I used Python/Seaborn/Matplotlib to visualize it, but I'll go over that in a future post.

What I found:

  1. Each state awards degrees roughly in proportion to its population. The largest in each category is California, and there are two sort-of-outliers. Arizona awards disproportionately more degrees due to the presence of the University of Phoenix online, and Texas awards proportionally fewer for some reason I don't know (suggestions?). This chart uses the total state population, including those under 18 years.

    Degrees vs. population for each state
     

  2. Women greatly outnumber men in postsecondary education, representing about 59% of all awardees and 58% of awardees of master's degrees or higher. Here are the general fields of study that have the highest percentage of women and men.

    Greatest representation among women and men
     

  3. Some states are highly under- or over-represented among Physical Science PhD's, proportionally. After scaling to the population of each state, the trend is linear-ish. Outliers include MA, home of MIT, Harvard, and others, along with Florida, where apparently physics isn't so popular.

    Physical Science Phd's vs all degrees, scaled to population

    Here is another representation of the same data, where it becomes pretty clear that total degrees per population is not a great indicator for proportion of people with Physics PhD's. Also, this list reveals that D.C. has crazy amounts of degree holders and Physics PhD's.
     

    Physical Science Phd's and all degrees, scaled to population


     
  4. Out of curiosity, I wanted to see who was awarding Optics PhD's. I went to #3, The University of Rochester, for mine.

    Institutions ranked by number of Optics PhD's awarded in 2012-13

Analysis methods

If you're more interested in education than data science, this is probably a good stopping point. But I did this project to sharpen my teeth on database interaction and Python visualization, so here we go!

The data

I made a database containing five tables regarding the 2013 IPEDS surveys and supplementary information. Here are links to the data files, along with descriptions.
  1. Directory of institutions: Contains a unique ID number for each postsecondary institution, its name, address, state (in abbreviated form), and contact information.
  2. Completions: For each institution ID, this table contains the number of degrees or certificates awarded in each subject area to men and women of various ethnic groups at all levels up through professional/research doctorates. Subject areas are catalogued by a "CIP code," which is a taxonomy system.
  3. CIP codes dictionary: For each CIP code, this table contains a title ("Underwater Basket Weaving", e.g.) and a short description.
  4. Population by state: I wanted to compare the number of degrees to the total population of each state, so I pulled in this chart. Note that each state is indexed by its full name.
  5. State abbreviations: I could have manually changed all the state names in the population table to their postal abbreviations... but I'm lazy. So instead I found this table of abbreviations and let MySQL do it for me.
For the IPEDS data, there are corresponding dictionaries that describe the column names. I ended up needing only a few of them.

The setup

I wanted to use the language SQL to interact with my data. The steps to get there were roughly:
  1. Choose and install a program to host a database server on my local machine. This server will take instructions in SQL, either from the command line or from some kind of GUI. I chose MySQL Community Server, but Microsoft's SQL server is a viable alternative.
  2. (Optional) Find a program to interact with the database server. At first I just worked from the command line, but I ended up installing Sequel Pro, a free program that's quite easy to use. You can also use Python to interact with MySQL, which is convenient for sophisticated analyses. I'm currently set up to do that, but I didn't use it for this project. This tutorial shows you how to do it.
  3. Import the data. Here's how.

Importing the data from the command prompt

You've downloaded the data sets as .csv files, and it's time to create a database where each data file will end up being a table. Assuming you're working from the command prompt, log into MySQL by typing "mysql -u root -p" (where I use the root user to make sure I have full privileges. You can also create a new user with a password, but I'm not going to have anyone else accessing my server). 


Next, create a database using "create database db_name_here"

This database is empty, and we need to fill it with tables corresponding to our .csv files. From the command line, you can create a table like this:

where "...." stands for whatever other columns you need. If we want to make a table for the directory data, for example, column_name1 is the unit ID with type INT. Once the database is created, you can use the BULK INSERT command to import the .csv, but I didn't do that. Instead, I used Sequel Pro to do it, as shown below.


Importing the data using Sequel Pro

After connecting to the server with Sequel Pro, you can select a database to use, and then click on the + in the bottom-left corner to create a new table. The column names and data types get added one by one, and you can feel free to only add the ones you really need to use. Here's what the table structure looks like for the "directory" information:


Clicking File->Import brings up a file dialog where you can choose your .csv file. Next, an import window shows up:

The left column shows the column names in the CSV, and the middle column shows the columns you just created. You can have it match the fields in order, or by name if the names are identical. You can also manually select them by clicking on the CSV field entry. Make sure "First line contains field names" is checked before clicking Import. Switching over to the content tab, we see that the .csv imported correctly.

This process has to be repeated for each table you'd like to import.

Notes: I had to add a row to the cipcode table for "99 = Total," since the completions table uses that notation. I also chose to convert the codes in this table to floats instead of strings, since I wanted to round them. Rounded CIP codes give the broader subject field.

The analysis

I spent some time just exploring and poking at the data, iteratively refining my searches until I found a few interesting things to report on. Here are some examples of the kind of analysis I did.
  • Most popular areas of study across all award levels (not shown in charts, but healthcare crushes everything else)

"ctotalt" is the total number of degrees of a particular subject and level awarded at a particular institution. I use INNER JOIN to retrieve the name of the area of study from the cips table. I called the CIP code "ccode" in that table so it had a unique name. Otherwise, you can call it by "table.column_name" to avoid ambiguity.

  • Total degrees and physics degrees by state


Here I use a CASE to count degrees for two different CIP codes. This may not be best practice, but it works. I wanted to scale by the population of each state, so I had to join three tables. "abbrev" gives me the full state name so I can look it up in "population". I'm using a pre-filtered version of "cips" where I've eliminated all of the specific fields of study (any classification beyond the decimal place) to cut down on query time. I didn't actually need to join directory. Note that CIP code = 99 is used in "completions" to indicate the total among all CIP codes, while CIP = 40 indicates Physics-related studies.
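
A reconstructed sketch of just the CASE part (ctotalt and the CIP code values come from the description above; unitid and stabbr are stand-in column names, and the population/abbrev joins are left out for brevity). It would run through the same pymysql connection as the earlier sketch:

query = """
    SELECT directory.stabbr,
           SUM(CASE WHEN completions.cipcode = 99 THEN completions.ctotalt ELSE 0 END) AS all_degrees,
           SUM(CASE WHEN ROUND(completions.cipcode) = 40 THEN completions.ctotalt ELSE 0 END) AS physics_degrees
    FROM completions
    INNER JOIN directory ON completions.unitid = directory.unitid
    GROUP BY directory.stabbr
"""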

  • Representation of women across all degrees, and for English language and literature degrees


This is a straightforward adaptation of the above. I want the proportion of degrees awarded to women, so I take the ratio ctotalw/ctotalt, and only count it when my conditions are met.

  • Representation of women by general area of study


Here, I just need to avoid counting the "total" amounts for each institution (indicated by cipcode = 99). The symbol "<>" means "not equal to."

  • Population vs. degrees awarded for each state


Here I need four tables. The population requires "population" and "abbrev" to get state names, and total degrees by state requires "completions," and "directory" gives me the state in which each institution resides. I include the filtered version of "cips" here because it was easier than deleting it.

  • Population vs. professional doctorates for each state


Same, but with the requirement that "award_lvl = 17." The award levels can be found here. 17 excludes professional degrees like MD's.

  • Optics PhD's by institution

A subject near to my heart, and an easy query. Instead of searching for the CIP codes corresponding to optics, I search for the word "optical" in the CIP title. This is in general a terrible strategy, but I checked ahead of time that this gets everything I want and nothing I don't.
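
A sketch of that query, with the same stand-in names as before, which would run through the same connection:

query = """
    SELECT directory.instnm, SUM(completions.ctotalt) AS optics_phds
    FROM completions
    INNER JOIN cips ON completions.cipcode = cips.ccode
    INNER JOIN directory ON completions.unitid = directory.unitid
    WHERE cips.ciptitle LIKE '%optical%' AND completions.award_lvl = 17
    GROUP BY directory.instnm
    ORDER BY optics_phds DESC
"""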