Thursday, May 14, 2015

But I regress...

I used to play poker with a mathematician who liked to make fun of physicists for being sloppy about math. He had a point, for the most part. In an effort to cover the necessary material fast enough, we skip over a lot of the proofs. Or, you know, nearly all the proofs. The idea is to develop a physical intuition that's right most of the time. After all, if you're wrong, nature will tell you. 

We have the luxury of letting nature check our math for us. If you find a violation of conservation of momentum in your homework problem, or if your equation for the energy of a system diverges, you're wrong. You can tell because the universe didn't evaporate. Feynman summarizes the idea in The Character of Physical Law: So far as we know, math is consistent. So far as we know, physics is consistent. If they agree anywhere, they agree everywhere (here too I'm being a little sloppy, but I hope my mathematician will forgive me). I'm trained to be relaxed about math.

So why did I start getting huffy about hand-waving math after my first six-minute video on machine learning at Coursera? Consider the following.

***

In the video, "Supervised Learning," we were taught the distinction between classification and regression tasks. In classification, we want to estimate a discrete parameter, and in regression we want to estimate a continuous parameter. Examples:

1) I bought a piece of fruit at the store. Guess what it was based on the fact that it cost $1.19. This is a classification task. 

2) I bought a dozen Krispy Kremes at the store and ate nine of them while driving home. Estimate how far the store is from my house (hint: not far). This is a regression task.
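Here's a minimal sketch of the distinction in scikit-learn, with made-up numbers (the prices, labels, and mileages below are invented purely for illustration):

    # Classification: the target is a discrete label (which fruit?).
    # Regression: the target is a continuous quantity (how many miles?).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LinearRegression

    # Toy training data: fruit label from price.
    prices = [[0.59], [1.19], [1.25], [2.49]]
    fruits = ["banana", "apple", "apple", "mango"]
    classifier = DecisionTreeClassifier().fit(prices, fruits)
    print(classifier.predict([[1.19]]))   # -> a discrete label, e.g. "apple"

    # Toy training data: miles from the store, given donuts eaten on the drive home.
    donuts_eaten = [[1], [3], [6], [9]]
    miles_home = [0.5, 1.5, 3.0, 4.5]
    regressor = LinearRegression().fit(donuts_eaten, miles_home)
    print(regressor.predict([[9]]))       # -> a continuous number, roughly 4.5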

Easy. Now, it's clear that there are a few cases where the line is a little blurry. Consider the example problem given in the video:

3) Your business sells lots of identical widgets. Given a record of sales over the last year, estimate how many widgets will be sold in the next month. Is this a regression or classification task?

The parameter to be estimated is discrete (number of widgets), but it's big and maps easily onto a monotonic continuous variable (money or value). We might as well treat it as a regression task, and allow answers like "1051.4" widgets. Then we can round it off, right?

Maybe. Maybe not. Here's what I posted in the discussion section:
If we ask "What is the most likely number of widgets to be sold next month? (minimizing mean squared error, e.g.)," the answer must be a whole number. If we allow the estimate to be fractional, we need to specify a way to map from the fraction back onto a whole number. Saying that another way, we want to map from an *impossible result* to a *possible result*. Formally, that's nonsense, and we need to supply prior information to make the jump.
If the estimate ends up being chaotic or highly oscillatory, we're going to have problems just rounding. In other words, we have to be very careful allowing a discrete vector space to melt into a continuous vector space. We have to make sure our measures are still measures in both spaces. 

Did I just say that out loud?
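To see the worry in actual numbers, here's a toy sketch (the distribution below is invented): the squared-error-optimal estimate is the mean, and rounding it need not land anywhere near the most probable whole number.

    # Invented posterior over next month's widget sales.
    counts = [0, 10, 11]          # possible whole-number outcomes
    probs = [0.4, 0.3, 0.3]       # their probabilities

    mean_estimate = sum(c * p for c, p in zip(counts, probs))   # 6.3, the regression-style answer
    mode_estimate = counts[probs.index(max(probs))]             # 0, the single most likely count

    print(round(mean_estimate))   # 6 -- minimizes expected squared error, then rounded
    print(mode_estimate)          # 0 -- the answer to "what count is most likely?"

Rounding looks innocent, but it quietly commits you to one particular way of mapping the fractional answer back onto the whole numbers, and to one particular error metric.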

Obviously this almost never matters, but the distinction between these two types of problems gets me riled up, and that's because of what I know about parameter estimation. All estimation problems can be thought of as determining the probability distribution on a set of propositions.

Example:

Asking "How hot is it outside?" is the same as asking for (an estimate that minimizes some error metric of) a distribution on ["It is between 0 and 0.1K outside." It is between 0.1 and 0.2K outside. ... ] where the spans get arbitrarily small. This is clear for numerical propositions, but here's where it gets good!

Goedel showed (or at least used the result in his incompleteness theorem) that any logical proposition can be represented by a natural number, and hence by a point on the real line. If I state any proposition, you can assign it a unique index. That sounds obvious, but it isn't. Ask Goedel. Anyway, that means that any classification problem can be mapped to a regression problem and vice versa. Possibly. Have to think about countability at some point.
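The poor man's version of that indexing is plain label encoding (the labels here are the fruit from earlier, chosen just for illustration, and this is of course far weaker than real Goedel numbering):

    # Assign each label a unique integer index; the mapping is invertible,
    # so formally the classification task can be posed as regression on the indices.
    labels = ["apple", "banana", "mango"]
    index_of = {label: i for i, label in enumerate(labels)}   # {"apple": 0, "banana": 1, "mango": 2}
    label_of = {i: label for label, i in index_of.items()}

    y_regression = [index_of[l] for l in ["mango", "apple"]]  # -> [2, 0]
    print(label_of[round(1.9)])                               # -> "mango"

    # The catch: the indices now carry an ordering and distances
    # ("mango" is twice as far from "apple" as "banana" is) that the
    # original labels never had.  That's the prior information I mentioned.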


So why am I uncomfortable with this? It's because I don't understand data science. Can we be as cavalier with math in DS as in physics? Is there some analog of nature that will put us back on the right track, or can we get ourselves into serious trouble? What's the worst that could happen?


[this]



***

Epilogue:

A nice man called Tom replied to my original comment:
Thanks for your post. I look at it like this: If the question calls for a quantity, then it's clearly linear regression. If the question calls for a classification using known labels, then it's logistic regression. This leads to selecting a classification by choosing the logistic value with the highest output, or (in the case of a true/false proposition) the one that exceeds some fixed threshold. Both are forms of supervised learning.
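In code, Tom's rule of thumb looks something like this (a sketch with invented toy data; scikit-learn assumed):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy data: recover a fruit label from its price.
    X = np.array([[0.59], [1.19], [1.25], [2.49], [2.60]])
    y = np.array(["banana", "apple", "apple", "mango", "mango"])

    clf = LogisticRegression().fit(X, y)
    scores = clf.predict_proba([[1.19]])[0]   # one logistic-style score per label
    print(clf.classes_[np.argmax(scores)])    # pick the label with the highest score
    # For a true/false proposition, you'd instead check whether the single
    # score clears a fixed threshold, e.g. 0.5.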
My comment has been “unfollowed” by two people in a few hours, so it looks like it's either wrong, or so far outside the scope of the discussion that it's distracting. I don't really want to get in trouble on my first day, and I don't really want to be that guy. The question is academic anyway, so I gave it a rest. My response: "Hi Tom, thanks for the reply! I understand."

By the way readers, please feel free to educate me in the comments section. I'm often wrong, and I love to hear why.


3 comments:

  1. "Anyway, that means that any classification problem can be mapped to a regression problem and vice versa."

    Hmmmm. I agree that as far as most supervised ML goes (especially non-parametric methods like decision tree ensembles or k-NN) all regression problems are really just classification problems. This is because a non-parametric model's output is either a vote by a set of predictors (classification) or an average over a set of predictors (regression) and averaging over a set of predictors is just a special kind of voting.

    However, to say that all classification problems can be turned into regression problems is incorrect, in my view. Godel's incompleteness result uses integer indexing to prove a theorem that roughly says "any consistent logical system strong enough to prove statements about integers contains a true but unprovable proposition". Suppose you have a classification problem with propositions like "given this genomic training data, x is a hermit crab" or "given this genomic training data, x is a king crab". These propositions are not strong enough to prove anything about integers, which is why I call shenanigans on invoking Godel to turn this into a regression problem via Godel numbering.

    Suppose you want to be cute and say "well, the hermit crab label corresponds to 0, and the king crab label corresponds to 1, and the spider crab label corresponds to 2, etc." and then do this as a regression problem. If you run new genomic data through the trained model and it returns 1.354 as its answer, it's really unclear if that result is meaningful at all, because crab species do not exist on a 1-dimensional continuum. You could argue it means something like "the average predictor thinks this new crab is most like a king crab", but I would be inclined to just give up and treat it as a classification problem and not a regression problem.

    Smart people who critique machine learning as black-box methodology are largely correct. If some ML model thinks a new crab is a king crab but doesn't say why and doesn't assign a probability distribution to the answer, we're right to treat it with a bit of suspicion.

    Replies
    1. Thanks Thomas. Yes, this is exactly the situation I was worried about with my original comment. It's clear that in some cases it makes sense to map a classification problem onto a regression problem, and the widget example is (probably) one of those cases. What are the requirements to be able to do this mapping?

      Conjecture: We can map a classification task in discrete space S1 to a regression problem in continuous space S2 if we can define a distance measure on S2, AND map each element of S1 onto S2.

    2. Yes, I agree the widget example is a regression problem.

      Re the conjecture: I'm not sure having a metric on S2 is relevant. For instance, even if S2 is something simple (like a 2-dimensional plane), there are infinitely many metrics on it (think of all the Lp metrics for p >= 1), and it's extremely difficult to figure out which one is best for the particular classification problem you have (in the crab example, you might be able to figure out a cogent metric between crab species, but it might be completely ill-suited to applied crab species classification).

      The thing that comes to my mind here isn't a metric on S2, but an ordering on S1. A classification problem with labels like (white, pink, red) could be reasonably well-mapped to (0,1,2) (which is to say the model *could* be improved by doing this) but a classification problem with labels like (hermit crab, king crab, spider crab) cannot be well-mapped to (0,1,2).

      My Conjecture: We can map a classification task in discrete space S1 to a regression problem in continuous space S2 if there is a total order on S1 and there is a partial order on S2.
