Tuesday, May 26, 2015

Why logistic regressions?

The Island

On a mysterious island live two types of animals: yeebles and zooks, and they love to eat wild strawberries. A certain cave on the island is home to either a yeeble or a zook, and you'd like to know which it is. Unfortunately, it's very shy, and never comes out of its cave if you're nearby. You decide to find out by feeding it strawberries.

From previous experiments, you know that yeebles tend to eat around four pounds of strawberries in 24 hours if left alone, while zooks will eat about eight pounds. You put ten pounds of strawberries outside the cave, and come back a day later to find 6.3 pounds missing. Which kind of animal lives in the cave? How certain are you?

Classification

This is a classification problem. Given the result of a measurement, we want to figure out what class a member belongs to. In particular, this problem is binary, since the animal is either a yeeble or a zook, and can't be both. It's also univariate, since the only variable is the number of strawberries consumed in 24 hours.

Suppose our data look like this:
Fig. 1: Previous data for two classes of animal
Each circle on this plot represents one animal. The horizontal axis shows how much the animal ate, and the vertical axis shows whether it was a yeeble (labeled 0) or a zook (labeled 1). Note that there is no sharp transition between the two types of animals. Some yeebles eat more than some zooks and vice-versa, so a measurement above 6 lbs/day does not guarantee that it's a zook.

Our strategy will be to turn the previous data about strawberry consumption into a function that takes [lbs of strawberries] and converts it to [probability that the animal is a zook]. The way this is normally done in the machine learning community is through logistic regression. We assume that the probability of an animal being a zook based on the amount of strawberries eaten is of the form

$P\left(Z|S\right) = \frac{1}{1 + \exp\left(a - bS\right)}$,

where $a$ and $b$ are constants, and $S$ is the amount of strawberries eaten, in lbs. A regression is used to choose $a$ and $b$ so that the curve matches the data optimally. Just like in a linear regression, we choose the parameters that minimize a cost function, though for logistic regression the usual choice is the negative log-likelihood rather than the sum of squared errors.
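To make this concrete, here's a minimal sketch of such a fit in Python. Everything in it is made up to mirror the figures: I'm assuming yeeble and zook consumption are Gaussian around 4 and 8 lbs/day with a common spread of about 1.15 lbs, and I fit $a$ and $b$ by maximizing the log-likelihood with Newton's method rather than with any particular library routine.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(S, labels, steps=25):
    """Fit P(zook | S) = 1/(1 + exp(a - b*S)) by Newton's method on
    the log-likelihood. Returns (a, b)."""
    t0 = t1 = 0.0                      # t0 = -a (intercept), t1 = b (slope)
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for s, y in zip(S, labels):
            p = sigmoid(t0 + t1 * s)
            r = y - p                  # gradient of the log-likelihood
            g0 += r
            g1 += r * s
            w = p * (1.0 - p)          # curvature weight for the Hessian
            h00 += w
            h01 += w * s
            h11 += w * s * s
        det = h00 * h11 - h01 * h01    # invert the 2x2 Hessian by hand
        t0 += ( h11 * g0 - h01 * g1) / det
        t1 += (-h01 * g0 + h00 * g1) / det
    return -t0, t1

# Synthetic "previous data": yeebles ~ N(4, 1.15), zooks ~ N(8, 1.15).
random.seed(0)
S = [random.gauss(4, 1.15) for _ in range(200)] + \
    [random.gauss(8, 1.15) for _ in range(200)]
labels = [0] * 200 + [1] * 200

a, b = fit_logistic(S, labels)
boundary = a / b   # the 50% point of the fitted curve
```

On data like these, the 50% point of the fitted curve, $a/b$, lands near the midpoint of the two centers (6 lbs/day); the exact values of $a$ and $b$ depend on the random sample.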

Why the logistic curve?

Why do we assume a logistic curve is the correct model for the probability? Clearly, a line wouldn't work. At minimum, we need something with a range of $\left(0,1\right)$, since a probability can't be less than 0 or greater than 1. But lots of functions have this property, including the sigmoids, of which the logistic curve is one. Choosing the logistic curve implies some assumptions about how the data are generated.

In particular, the logistic curve is the correct probability distribution if each class exhibits a normally distributed feature with equal variance and different center values.

Here's an illustration of that assumption. Suppose we make a histogram of lbs of strawberries eaten for both yeebles and zooks:
Fig. 2: Histogram of previous data from which we construct our model. This plot shows 10k data points, whereas the scatter plot shows only 100 for clarity.
They're Gaussian-shaped (normally distributed), they have the same standard deviation, and they have different center values*. Again, it's clear that some yeebles eat more than some zooks, and vice versa.

These distributions are represented by

$P(S|Y) \propto \exp\left(-(S - S_Y)^2/2\sigma^2\right)$, 
and
$P(S|Z) \propto \exp\left(-(S - S_Z)^2/2\sigma^2\right)$,

where $S_Y$ and $S_Z$ are the centers of the distributions and $\sigma$ is the standard deviation. We want to know the probability that an animal is a zook given a measurement $S'$. By Bayes' rule, this is

$P(Z|S') = \frac{P(Z)P(S'|Z)}{P(Z)P(S'|Z) + P(Y)P(S'|Y)}$,

where $P(Z)$ and $P(Y)$ are the prior probabilities of finding a zook or a yeeble (independent of strawberry consumption), and $P(S'|Y)$ and $P(S'|Z)$ are called sampling distributions. They tell us how probable the measured data would be if each hypothesis were true. They're just the Gaussian distributions given above.

We're further going to assume that $P(Y)=P(Z)$, i.e. that yeebles and zooks are equally common. If we knew otherwise, we would modify this. I'll talk a lot more about priors in other blog posts. Substituting in the known distributions and reducing, we have

$P(Z|S') = \frac{1}{1 + \exp\left(\left(S_Z^2 - S_Y^2 - 2S'\left(S_Z - S_Y\right)\right)/2\sigma^2\right)}$.

This is a logistic curve. We have shown that under the assumption that the measured features are normally distributed with equal standard deviations, a logistic curve is the correct probability model to use. This analysis extends to multivariate distributions as well.
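This result is easy to sanity-check numerically. The sketch below (with illustrative values $S_Y = 4$, $S_Z = 8$, $\sigma = 1.15$, and equal priors) computes the posterior directly from the two Gaussians and compares it to a logistic curve with $a = (S_Z^2 - S_Y^2)/2\sigma^2$ and $b = (S_Z - S_Y)/\sigma^2$, which follows from matching terms in the exponent:

```python
import math

S_Y, S_Z, sigma = 4.0, 8.0, 1.15   # illustrative values

def posterior_from_gaussians(s):
    """P(zook | s) via Bayes' rule with equal priors; the Gaussian
    normalization constants cancel because the sigmas are equal."""
    pz = math.exp(-(s - S_Z) ** 2 / (2 * sigma ** 2))
    py = math.exp(-(s - S_Y) ** 2 / (2 * sigma ** 2))
    return pz / (pz + py)

def posterior_logistic(s):
    """The same posterior written as a logistic curve."""
    a = (S_Z ** 2 - S_Y ** 2) / (2 * sigma ** 2)
    b = (S_Z - S_Y) / sigma ** 2
    return 1.0 / (1.0 + math.exp(a - b * s))

# Compare the two expressions on a grid from 0 to 12 lbs/day.
worst = max(abs(posterior_from_gaussians(s / 10) - posterior_logistic(s / 10))
            for s in range(0, 121))
```

The two curves agree to floating-point precision across the whole axis, which is just the algebra above restated numerically.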

Mathematical convenience

Logistic curves have some convenient properties, so we're lucky everything turned out this way. The whole shape of the curve is controlled by the argument of the exponential, $(S_Z^2 - S_Y^2 - 2S'\left(S_Z - S_Y\right))/2\sigma^2$, which is linear in $S'$. Matching this to the form above gives $a = (S_Z^2 - S_Y^2)/2\sigma^2$ and $b = (S_Z - S_Y)/\sigma^2$. That linearity means we can use a generalized linear regression to find the parameters, which in this case turn out to be $a = 18.25$ and $b = 3.058$.

We can also show (by integrating the tails of the two Gaussians past the decision boundary at the midpoint) that the probability of making an incorrect classification decision in this case is given by

$P(\textrm{fail}) = \frac{1}{2}\textrm{erfc}\left(\frac{|S_Z - S_Y|}{2\sqrt{2}\,\sigma}\right)$,

where $\textrm{erfc} = 1 - \textrm{erf}$ is the complementary error function ($\textrm{erf}$ is another sigmoid!). Knowing this lets us address questions like "how close can the distributions be before we can't classify very effectively?" You tell me what error rate is acceptable, and I'll tell you how far apart the distributions have to be to achieve that error rate.
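Here's a sketch of that exchange in code. For two equal-variance Gaussians thresholded at the midpoint, the failure probability can be written $\frac{1}{2}\textrm{erfc}\left(d/2\sqrt{2}\sigma\right)$ with $d = |S_Z - S_Y|$, and the Python standard library's math.erfc plus a bisection gives the required separation for any target error rate (the value of $\sigma$ below is illustrative):

```python
import math

def p_fail(d, sigma):
    """Misclassification probability for two equal-variance Gaussians
    whose centers are d apart, thresholded at the midpoint."""
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0) * sigma))

def separation_for(error_rate, sigma, hi=100.0):
    """Bisect for the separation d that achieves a target error rate.
    p_fail decreases monotonically in d, from 0.5 at d = 0."""
    lo = 0.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_fail(mid, sigma) > error_rate:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

sigma = 1.15                      # illustrative spread
d = separation_for(0.01, sigma)   # separation needed for a 1% error rate
```

At $d = 0$ the classes are indistinguishable and the error rate is 50%, a coin flip; as the centers separate, the error rate falls off through the erfc tail.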

The answer

If we observe that 6.3 lbs of strawberries are missing after a day, all we have to do is check what probability it corresponds to on our logistic curve. Here is what the best fit curve looks like:
Fig. 3: The best fit logistic curve for the data allows us to classify future members.
In this case, the answer is $P(Z|S' = 6.3) = 0.625$: there is a 62.5% chance that the animal is a zook.

Mathematical inconvenience

I want to briefly break the model here to illustrate the limitations of the logistic approach. We assumed above that the standard deviation of each distribution was the same, but that's not always a valid assumption. In fact, it's quite common to find Gaussian-distributed (or approximately Gaussian) processes where the standard deviation is proportional to the center value. If the standard deviations are different, the argument of the exponent becomes quadratic in $S'$, so a linear regression doesn't work any more.

In fact, if we keep adding higher-order terms to the exponents of the distributions, the argument becomes a higher-order polynomial, and we can use a polynomial regression to do our classification.

What if the distributions are not Gaussian at all? Well, that sucks, as always. The sampling distributions won't be as nice, and $P(Z|S')$ may not be parameterizable as a polynomial at all. Then we would need to perform general nonlinear fitting. That can be done, of course, but not as efficiently.

I leave this as an exercise to the reader. ;)

- b


*: Gaussians actually extend over the entire real line, and I'm truncating at 0 lbs since it's hard to eat a negative weight of strawberries. The error incurred is small if the center of the distributions is a few times greater than the width.

Acknowledgement: Thanks to Thomas Stearns for helpful discussions.

Thursday, May 21, 2015

A study plan

Now that I have a better idea of what I need to learn, it's time to make a study plan. The point is to put things in an order that makes sense, knock off the highest priorities as early as I can, and set quantitative goals to meet.

I found a really lovely graphic (left) with an explanation of what a data scientist needs to know, and Naturejobs has a fairly good article from 2013 on the subject. There are also good discussions on Quora. Taking these sources together, here are the major tasks I need to accomplish, in no particular order.


*) Get good at math, statistics, and machine learning
By "math," they mean algebra, calculus, and linear algebra. I may need to brush up on the last of these. A friend recommended this book, which is free and has some overlap with machine learning.

*) Learn to code
Specifically, pick a first language. Python, I choose you! Learn it at Codecademy and Google Classroom. I've already gone through the Codecademy course.

*) Learn about databases
This is where they keep the data. I should at least learn SQL.

*) Learn about data munging, visualization, and reporting
Munging means putting the data into a digestible form, which I assume involves dimension reduction. I understand the principles, but need to look into it more. This category seems like it should come out organically when I start working on side projects for fun.

*) Start using Big Data
This happens after smaller projects succeed.

*) Get experience
Kaggle competitions, side projects, and the like. Has to happen once I'm comfy with the basics.

*) Internship, bootcamp, job
I'll apply for an internship, but I won't have a shot unless I make huge progress before then.

*) Engage with the community
I already read fivethirtyeight. I signed up for a couple of societies and followed some people on Twitter. That's a start. Not enough time in the day to consume the content created by popular data scientists.


Here's a plan that makes sense to me:

  1. Take the Coursera machine learning course. It's free, I have the required math chops, and I've already started it. There's about 43 hours worth of video in this course, plus assignments. They estimate 5-7 hours per week for an unknown number of weeks, so... that's not useful. Let's say three months, so I expect to be done by August.
  2. Simultaneous with (1), start programming. Python will be my default language, but I'll use something else if I have a good reason. I already have a project lined up, but it's in C++. I'll tell you about it soon.
  3. After the Machine Learning course, take more statistics. Intro to Statistics by Udacity, and/or OpenIntro Statistics.
  4. Simultaneously with (3), start entering Kaggle competitions and engaging the community there. Scale up to big data when possible.


During all of this, I'll keep up with the blog and try to post mini-projects. I think this is a good start. I have a quantitative deadline of finishing the ML course by August, and I have a project that must be done by the end of the summer that involves programming. Hopefully I have not succumbed to the planning fallacy.

- b

Monday, May 18, 2015

Bootcamps, fellowships, and headhunters

There's a large demand for data scientists right now, and many people are making similar career transitions to mine. A partial infrastructure exists to move people with advanced science degrees into data science by reorienting their skills a bit, and in this blog entry I want to describe the landscape as I understand it.
***

Problem statement:

A group of highly educated people with a strong quantitative background would like careers analyzing data, which requires a specialized skills set including programming and statistical analysis. Very little exists in the way of secondary or tertiary education in this exact skill set. Who can make money off this problem?

Solution:

Educator-headhunter hybrids. Headhunters are formally called recruiters. They freelance or work for independent firms that match job seekers with employers, and they get paid by the employer if and only if the applicant accepts a job offer. Essentially, they make money by wading through the slush pile to pick out the best candidates so the employers don't need to spend time on it.

In data science, there is a large worker pool that is not well matched to fill a recruiting demand, but could be retrained in a short time, say 6-12 weeks. The hiring companies can't retrain the workers themselves because they don't have the expertise. It would be like hiring a management consultant who had to be retrained in management theory before starting. If they could train the workers, they wouldn't need them.

The response to this demand has been two kinds of trainer/recruiter hybrids, loosely called fellowships and bootcamps. Both require an application process. Both attempt to match graduates with employers, and both reap recruiting fees from successful employment matches. Here are the differences.

Fellowships:
  • Are free. Anyone who gets accepted gets a full ride.
  • Are pretty short in duration. Typically about 6 weeks with full time attendance.
  • Are rare. There are only two I know of.
  • Have a highly competitive application process and require a PhD for admission.
  • Aim at people who are around 80-90% of the way to becoming a data scientist
  • Have ~100% placement rate after graduation

Bootcamps:
  • Cost about $12-16k. Partial refunds are sometimes offered if graduates accept offers from their corporate partners.
  • Last longer, around 12 weeks.
  • Are plentiful and widespread (in major cities globally). I've seen about a dozen.
  • Are still competitive, but do not require a PhD.
  • Aim at people who are about 40% of the way to becoming a data scientist.
  • Have ~90% placement rate after graduation.
In addition, there are stand-alone courses that don't match people with employers.

Online programs:

  • Pretty cheap: ~$4k per program, or a few hundred dollars a month.
  • Can be part time, and last from a month to a few months.
  • Always available and open to anyone.
  • Available at many levels of prior expertise.
  • No job assistance
And then there's always

Doing it your damn self:
  • Free minus opportunity costs
  • Minimum 12 weeks, probably closer to a year.
  • Tons of resources, harder to find peers for support.
  • No application process.
  • Can go from 0-90% pretty easily I think. Last 10% would be hard.
  • Placement rate unknown, but it's not a full time commitment.
I still have to answer some questions about these options before I decide which to pursue.

How do people support themselves while attending these programs? The fellowships only admit PhDs, who are old enough that they can be expected to have families (fun facts: the average new physical sciences PhD is 30.1 years old [source]. A 30-year old has on average something like 1 child [not quite applicable source]).

Can I find a job where I can learn these skills as I go? Maybe an internship? It probably wouldn't cover all of the bases, but it might be a start.

What are my chances of getting a fellowship? Most likely pretty bad if I were to apply right now. Deadlines are coming up in a month, so I might be able to build an impressive portfolio before then if I dedicate myself to it.

What do the reviews look like? Are there obvious scams? Do they not really teach you anything you couldn't learn just as easily by yourself?

Are there other options I'm not aware of? Cheaper or faster programs at a college or university? Online certifications with similar placement rates?

Need to keep digging.

***

A comprehensive list of options like this can be found here.

- b

Sunday, May 17, 2015

Simple linear inversion [code]

A couple of days ago I was asked to make a figure showing a simple linear mixing/inversion problem. It's a very straightforward task, but I thought this would be a good opportunity to share some MATLAB code online for the first time.

You can find the code here. First I'll explain what it produces, and then I'll talk a little about the code itself.

My task was to show the basics of spectral mixing. In spectroscopy, light is used to distinguish different chemicals by which wavelengths they absorb or reflect. A spectrum looks something like this:

The horizontal axis is in wavenumbers, which is just the inverse of the wavelength, $k = 1/\lambda$. The vertical axis gives the percentage of light absorbed if shone through the test sample. If there's a peak, it means that the molecules in the sample can be excited by that wavelength, which is a clue as to what kind of chemical it is.

Here I'm just making up spectra: there are no chemicals that correspond to these Gaussian lineshapes, but I needed something I could do quickly, and Gaussians are quick. I wanted to create a sample with several chemical contributions drawn from a large library of chemicals. Each member of the library is assigned a Gaussian spectrum with a randomly generated height, center, and width. The library is called $\mathbf{A}$ and looks like this:
I randomly select a small number of these, say five. Then I have to decide how much of each chemical is in the simulated sample. I do this by choosing a coefficient for each spectrum randomly, and then normalizing to make sure the coefficients add up to 1. That is, I choose a percent contribution for each spectrum in the sample. If we represent this so-called density vector by $D$, then the composite spectrum is just the product $S=\mathbf{A}D$. It looks like this:


The black line shows the composite spectrum, and the colored lines show the individual spectra multiplied by their percent contributions.

So far, we've looked at the forward model. That is, you tell me what the sample is made of, and I can tell you what the measurement should look like. This model is really simple. It just says that you take a weighted sum of the individual spectra. It's a linear system, basically meaning that if you double the inputs you get double the output.

Now we want to do the inverse problem: you tell me what the measurement looks like, and I'll tell you what the sample is made of. In general, inverse problems are close to impossible. Imagine if I told you that I weighed an object and discovered that it weighed one pound, and I ask you to identify the object. You might be able to make a good guess, but you don't have enough information to solve the problem exactly. I'll talk a lot more about making decisions under uncertainty, but the point for now is that you can't expect all inverse problems to have unique answers.

This problem is different. Not only do we have enough information to solve it, but inverting linear problems can be done with one line of simple code. The density is recovered by the equation
\begin{equation}
D = \mathbf{A}^{-1}S
\end{equation}
where $\mathbf{A}^{-1}$ is the inverse of $\mathbf{A}$. We could find it by hand, but MATLAB does it for us if we call the function pinv(), which returns the Moore-Penrose pseudoinverse (equal to the ordinary inverse when $\mathbf{A}$ is square and full rank).

The result is a density for each member of the spectral library, most of which are zero since they're not in the sample. In fact, we can quickly check which members are non-zero (actually, above a small threshold, to guard against round-off error). We can compare the recovered members to the members we chose to make sure the inversion worked.
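For readers without MATLAB, the same forward model and pseudoinverse recovery can be sketched in a few lines of Python with NumPy. All of the library parameters below are made up, just like the Gaussian lineshapes in the figures, and none of the names come from the linked repository:

```python
import numpy as np

rng = np.random.default_rng(0)

# Wavenumber grid and a library A of 20 made-up Gaussian spectra (columns).
k = np.linspace(0.0, 100.0, 400)
centers = np.linspace(5.0, 95.0, 20)
widths = rng.uniform(1.5, 4.0, size=20)
heights = rng.uniform(0.5, 1.5, size=20)
A = heights * np.exp(-(k[:, None] - centers) ** 2 / (2 * widths ** 2))

# Forward model: pick 5 chemicals and give them weights summing to 1.
D = np.zeros(20)
members = rng.choice(20, size=5, replace=False)
D[members] = rng.uniform(0.5, 1.0, size=5)
D /= D.sum()
S = A @ D                        # composite spectrum

# Inverse problem: recover the densities with the pseudoinverse.
D_recovered = np.linalg.pinv(A) @ S

# Entries above a small threshold (to absorb round-off) identify
# exactly the members we put into the sample.
found = set(np.flatnonzero(D_recovered > 1e-6))
```

Since the forward model is linear and noise-free, the recovered densities match the chosen ones essentially exactly; with measurement noise you'd swap pinv for a regularized least-squares solve.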

***
Notes on the code

The main code is spectral_mixing, and there are three supporting functions: create_library, plot_spectsum, and plot_composite_spectrum. These output nicer plots than MATLAB does natively.

If you want to export these graphics as .eps files, I recommend using the print2eps function included in the wonderful export_fig package by Yair Altman.

Thursday, May 14, 2015

But I regress...

I used to play poker with a mathematician who liked to make fun of physicists for being sloppy about math. He had a point, for the most part. In an effort to cover the necessary material fast enough, we skip over a lot of the proofs. Or, you know, nearly all the proofs. The idea is to develop a physical intuition that's right most of the time. After all, if you're wrong, nature will tell you. 

We have the luxury of letting nature check our math for us. If you find a violation of conservation of momentum in your homework problem, or if your equation for the energy of a system diverges, you're wrong. You can tell because the universe didn't evaporate. Feynman summarizes the idea in The Character of Physical Law: So far as we know, math is consistent. So far as we know, physics is consistent. If they agree anywhere, they agree everywhere (here too I'm being a little sloppy, but I hope my mathematician will forgive me). I'm trained to be relaxed about math.

So why did I start getting huffy about hand-waving math after my first six-minute video on machine learning at Coursera? Consider the following.

***

In the video, "Supervised Learning," we were taught the distinction between classification and regression tasks. In classification, we want to estimate a discrete parameter, and in regression we want to estimate a continuous parameter. Examples:

1) I bought a piece of fruit at the store. Guess what it was based on the fact that it cost $1.19. This is a classification task. 

2) I bought a dozen Krispy Kremes at the store and ate nine of them while driving home. Estimate how far the store is from my house (hint: not far). This is a regression task.

Easy. Now, it's clear that there are a few cases where the line is a little blurry. Consider the example problem given in the video:

3) Your business sells lots of identical widgets. Given a record of sales over the last year, estimate how many widgets will be sold in the next month. Is this a regression or classification task?

The parameter to be estimated is discrete (number of widgets), but it's big, and maps easily onto a monotonic continuous variable (money or value). We might as well treat it as a regression task, and allow answers like "1051.4 widgets." Then we can round it off, right?

Maybe. Maybe not. Here's what I posted in the discussion section:
If we ask "What is the most likely number of widgets to be sold next month? (minimizing mean squared error, e.g.)," the answer must be a whole number. If we allow the estimate to be fractional, we need to specify a way to map from the fraction back onto a whole number. Saying that another way, we want to map from an *impossible result* to a *possible result*. Formally, that's nonsense, and we need to supply prior information to make the jump.
If the estimate ends up being chaotic or highly oscillatory, we're going to have problems just rounding. In other words, we have to be very careful allowing a discrete vector space to melt into a continuous vector space. We have to make sure our measures are still measures in both spaces. 

Did I just say that out loud?

Obviously this almost never matters, but the distinction between these two types of problems gets me riled up because of what I know about parameter estimation. All estimation problems can be thought of as determining the probability distribution on a set of propositions.

Example:

Asking "How hot is it outside?" is the same as asking for (an estimate that minimizes some error metric of) a distribution on ["It is between 0 and 0.1 K outside.", "It is between 0.1 and 0.2 K outside.", ...], where the spans get arbitrarily small. This is clear for numerical propositions, but here's where it gets good!

Gödel showed (or at least used the result in his incompleteness theorem) that any logical proposition can be represented as a natural number. If I state any proposition, you can assign it a unique index. That sounds obvious, but it isn't. Ask Gödel. Anyway, that means that any classification problem can be mapped to a regression problem and vice versa. Possibly. I'll have to think about countability at some point.


So why am I uncomfortable with this? It's because I don't understand data science. Can we be as cavalier with math in DS as in physics? Is there some analog of nature that will put us back on the right track, or can we get ourselves into serious trouble? What's the worst that could happen?


[this]



***

Epilogue:

A nice man called Tom replied to my original comment:
Thanks for your post. I look at it like this: If the question calls for a quantity, then it's clearly linear regression. If the question calls for a classification using known labels, then it's logistic regression. This leads to selecting a classification by choosing the logistic value with the highest output, or (in the case of a true/false proposition) that which exceeds some fixed threshold. Both are forms of supervised learning.
My comment has been “unfollowed” by two people in a few hours, so it looks like it's either wrong, or so far outside the scope of the discussion that it's distracting. I don't really want to get in trouble on my first day, and I don't really want to be that guy. The question is academic anyway, so I gave it a rest. My response: "Hi Tom, thanks for the reply! I understand."

By the way readers, please feel free to educate me in the comments section. I'm often wrong, and I love to hear why.


Wednesday, May 13, 2015

Hello World!

I'm a physicist, and I want to be a data scientist. This blog documents my transition. I will now pretend that you are asking me questions.


Q: What is a data scientist?

A: Data scientists look at (typically large) collections of various kinds of data, subject them to statistical analysis, and draw descriptive, predictive, and prescriptive conclusions. Then they present those conclusions to other people. The job relies heavily on machine learning, statistics, probability, decision theory, and programming ability.


Q: Why do you want to be a data scientist?

A: I've been a data nerd for a long time, and have a... let's say "strong interest" in probability theory since about mid-grad-school. My partner says that every conversation with me ends in Bayes' Rule if it goes on long enough. I see the world mostly in mathematical models, and like explaining technical things to people, whether or not they're usually into technical things.


Q: Why do you want to leave physics?

A: I don't, in particular. But something that surprises people about me is that I've never cared much about physics. I do care about learning new things, working with smart people, and solving problems. Physics has been a wonderful place to do those things, but it's not the only place.

I'm also getting pretty discouraged about the driving motivations of academic research. Most of my research energy is directed at (1) what can get funded, and (2) what can be published, leading to more funding. It's how we can afford to eat. There was an episode of The West Wing (which I can't find right now) where someone comments that they spend all their time getting elected, hoping that they will accidentally do some good in the process. This is how I feel a lot of the time. I want to produce something that people need produced.


Q: Why are you blogging about this?

A: Two reasons. First, I work better with structure. This will hopefully force me to make progress, even if it's slow. Second, I'm going to need a portfolio to get a job. That includes projects on GitHub, competition entries at Kaggle, and it wouldn't hurt to have evidence of my progress along the way.


Q: What do you have to learn before you can be a data scientist?

A: I only sort of know the answer to this question. I definitely need either Python or C++ fluency, experience with data science problems, a good understanding of high level statistics, and a network of contacts. I may need to be familiar with SQL, R, or other specialized software.


Q: What do you have going for you?

A: I'm more than competent at MATLAB. Compared to the average physicist, I'm a good speaker, and a damn good technical writer. For whatever perverse reason, I love uncertainty and noise analysis more than almost everything.


Q: What's the plan?

A: The nice thing about starting a project is that there's a lot of low-hanging fruit. Today I made a GitHub account and did the tutorial. I'm going to take a Coursera course on machine learning, and brush up on my Python. If I know you, and you know about data, I'm going to talk to you.

That's it for now. See you next time.

- b