
Saturday, December 5, 2015

Digit recognition part 2: a validation pipeline

[Link to part 1]

I've been looking recently at the MNIST data set, which contains thousands of hand-written digits like this:
Example hand-written numerals from the MNIST data set

where we also have a label for each digit in $\{0, 1, \ldots, 9\}$. We would like to use these examples to assign labels to a set of unknown digits.

In part 1 of this series, I looked at the data set and did some preliminary analysis, concluding that:
  1. There's not much variance within each digit label, i.e. all 5's look pretty much the same.
  2. Most inter-numeral variance occurs near the center of the field, implying that we can probably throw away the pixels near the edge.
Rather than jumping right into optimizing a classifier in part 2, I'd like to build a validation pipeline. Any time we do machine learning, we want to try to quantify how well our regression or classification should perform on future data. To do otherwise is to leave ourselves prone to errors like overfitting. Validation in this case will apply the classifier to a new set of digits, and then compare the predicted labels to the actual labels.

The Methodology

Here is a pretty concise description of the usual validation methodology. Basically, we break the data into three chunks before we start: a training set, a validation set, and a test set. Every time we train a classifier we use the training set, and then evaluate its performance on the validation set. We do that iteratively while tuning hyperparameters until we're happy with the classifier, and then test it on the test set. Since we use the validation set to tune the classifier, the validation set gets "contaminated" with information about the model, which is why we need the pristine test set: it gives us a better indicator of how the classifier will perform on new data.
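
For concreteness, here is one way such a three-way split could be made with scikit-learn's train_test_split (this is a sketch of mine, not the code from the post; in current scikit-learn versions the function lives in sklearn.model_selection). The post's own splitting code, further down, shuffles with pandas instead.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 784)          # stand-in for the pixel data
y = np.random.randint(0, 10, 1000)     # stand-in for the digit labels

# 70% goes to training; the remaining 30% is split evenly into validation and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)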

The pipeline

What do we want our validation suite to look like? It might include:
  1. Standard goodness-of-fit scores, like precision, accuracy, or F1 scores.
  2. Confusion matrices, which illustrate what numerals are likely to be assigned which incorrect labels (e.g. "6" is likely to be labeled "8")
  3. Classifier-specific performance plots to evaluate hyperparameters, like regularization constants. These show the training and validation error vs. each hyperparameter. (A sketch combining the first two items into a single helper function follows below.)
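
Here's a rough sketch (mine, not from the post) of what the first two items might look like rolled into one reusable helper, using the same sklearn.metrics calls I'll use later; validation_report is a hypothetical name.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn import metrics

def validation_report(clf, X_val, y_val):
    # print precision, recall, and F1 score for each class
    predicted = clf.predict(X_val)
    print(metrics.classification_report(y_val, predicted))
    # plot the confusion matrix with the diagonal zeroed out,
    # so the misclassifications stand out
    conf = metrics.confusion_matrix(y_val, predicted)
    off_diag = conf * (1 - np.identity(conf.shape[0]))
    plt.imshow(off_diag, cmap=cm.Greys_r, interpolation='none')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    return conf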

Example: logistic classification

It will be helpful to have a classifier to train in order to build the validation pipeline, so let's choose a simple one. A logistic classifier is a logistic regression in which we threshold the predicted class probabilities (for multiple classes, we pick the most probable class) to assign a label to a data point. Besides being simple, it's also not going to work very well. For illustrative purposes, that's perfect: I'd like to look at how the performance changes with the hyperparameters, which won't be possible if the performance is close to perfect.
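
As a quick illustration of that thresholding step (this sketch is mine, not from the original post, and uses a toy data set): scikit-learn's LogisticRegression exposes class probabilities through predict_proba, and its predict method just returns the most probable class.

import numpy as np
from sklearn import linear_model

rng = np.random.RandomState(0)
X_toy = rng.randn(200, 2)                             # toy 2-D inputs
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)   # toy binary labels

clf = linear_model.LogisticRegression()
clf.fit(X_toy, y_toy)

probs = clf.predict_proba(X_toy)                # one probability per class, per sample
labels = clf.classes_[probs.argmax(axis=1)]     # most probable class (threshold at 0.5 for two classes)
print(np.array_equal(labels, clf.predict(X_toy)))   # prints True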

I'm using IPython Notebook again, and I've uploaded the notebook to GitHub so you can follow along, but I'll also paste in some code in case you just want to copy it (please copy away!).

We're just going to use the logistic regression functionality from scikit-learn. First I import the data and split it into three groups. 70% goes to training, and 15% each to validation and test sets.

Partitioning the data into training, validation, and test sets.


# Code adapted from http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html
# originally due to Gaël Varoquaux
# Modified by Brad Deutsch
# License: BSD 3 clause
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.image as mpimg
from sklearn import linear_model, datasets, svm, metrics
%matplotlib inline
# Import data
data = pd.read_csv("../input/train.csv")
data_shuffled = data.sample(len(data.index)).values
# create training, validation, test sets
# decide what proportion of samples we want in validation and test sets.
val_prop = 0.15
test_prop = 0.15
# convert to integer locations
val_index = int(val_prop*len(data.index))
test_index = val_index + int(test_prop*len(data.index))
# select the three sets by indexing the data.
val = data_shuffled[:val_index,:]
test = data_shuffled[val_index:test_index,:]
train = data_shuffled[test_index:,:]
# split each set into x and y, where y is the digit label.
train_x = train[:,1:]
train_y = train[:,0]
val_x = val[:,1:]
val_y = val[:,0]
test_x = test[:,1:]
test_y = test[:,0]

Here I set up scikit-learn's logistic regression, which is linear in the pixel values. To do some basic validation, I'll just choose a regularization parameter (C in this case) and train the classifier.

Then we can create a validation report, which includes precision, recall, and F1 score for each numeral. 


# for the classifier, set the X and Y values.
X = train_x
Y = train_y
# Set regularization parameter (C)
reg_param = 1e5
# Define the classifier
logreg = linear_model.LogisticRegression(C=reg_param)
# train the classifier
logreg.fit(X, Y)

# Predict digits for the validation set
predicted = logreg.predict(val_x)
conf_matrix = metrics.confusion_matrix(val_y , predicted)
print("Classification report for classifier %s:\n%s\n"
% (logreg, metrics.classification_report(val_y, predicted)))
print("Confusion matrix:\n%s" % conf_matrix)
# Let's plot the confusion matrix, but get rid of the diagonal to better show where the problems are.
off_diag = conf_matrix*(1 - np.identity(10))
plt.imshow(off_diag, cmap = cm.Greys_r, interpolation='none')
plt.grid(False)
plt.title('Digit confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

Classification report for classifier LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0):

             precision    recall  f1-score   support

          0       0.93      0.97      0.95       634
          1       0.94      0.95      0.94       701
          2       0.92      0.86      0.89       628
          3       0.89      0.87      0.88       675
          4       0.91      0.89      0.90       622
          5       0.88      0.86      0.87       545
          6       0.93      0.95      0.94       608
          7       0.93      0.94      0.93       645
          8       0.83      0.89      0.86       623
          9       0.88      0.87      0.87       619
avg / total       0.90      0.90      0.90      6300

Confusion matrix:
[[612   0   1   3   1   4   5   1   7   0]
 [  0 663   5   3   0   7   2   1  20   0]
 [  7   9 540  10   8   4  10  11  25   4]
 [ 10   5  16 587   5  18   5   6  17   6]
 [  3   6   6   0 551   2   9   4   8  33]
 [  7   2   7  18   2 471  11   1  21   5]
 [  6   3   3   1   4  10 576   0   4   1]
 [  4   2   4   4   7   1   0 604   0  19]
 [  3  13   5  18   1  14   4   3 553   9]
 [  3   1   2  12  28   3   0  21   8 541]]
It's a bit easier for me to parse things in visual format, so I also made an image out of the confusion matrix. I set the diagonal elements (the correctly classified counts) to zero to increase the contrast.


Whiter squares indicate more misclassifications. We see that the most frequent mistake is that "4" tends to get classified as "9", but we also tend to over-assign the label "8" to inputs of "1", "2", and "5". Interestingly, the matrix is not symmetric: many other digits get mislabeled as "8", but an input "8" itself is usually given the right label.


Hyperparameters

If we stick with models that are linear in each pixel value, the only hyperparameter we need to choose for logistic regression is the regularization constant, which controls how strongly we penalize large weights on the input pixels. The two common regularization choices I'll consider are l2 (ridge regression or Tikhonov regularization) and l1 (lasso). The former tends to produce a "smooth" weighting, shrinking all of the weights so the total overall weight is small without driving many of them to zero. The latter produces a "sparse" weighting, eliminating many of the inputs as uninformative.
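
For reference, here are (roughly) the objectives scikit-learn's LogisticRegression minimizes for the two penalties in the binary case with labels $y_i \in \{-1, +1\}$; this is my paraphrase of the scikit-learn documentation, and the multiclass problem is handled one-vs-rest:

\[
\text{l2:}\quad \min_{w,\,c}\; \tfrac{1}{2}\,w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(w^\top x_i + c)}\right)
\]
\[
\text{l1:}\quad \min_{w,\,c}\; \lVert w \rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(w^\top x_i + c)}\right)
\]

Note that $C$ multiplies the data-fit term rather than the penalty, so a larger $C$ means weaker regularization; keep that in mind when reading the plots below.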

If we regularize too little, we'll find that while we have low fit error on the training set, we have large errors on the validation set, which is called overfitting. If we regularize too much, we'll find that we're ignoring important information from the input, resulting in large errors for the training and validation sets. This is called underfitting, and the error is called bias.

It can be useful to plot the training and validation error as a function of the regularization constants to see where the regularization performs best. And since we have a pretty large data set, I'll take only a small fraction of the training set. This will make the training go faster, and will just give us an idea of the parameters we should use in the classifier. Let's look at l2 regularization first.

# gridsearch to find best regularization parameter
C_vect = np.logspace(-8, 5, 50)
# set the X and Y values.
X_tr = train_x[:1000,:]
Y_tr = train_y[:1000]
X_val = val_x[:1000,:]
Y_val = val_y[:1000]
f1_train = []
f1_val = []
# create a for loop over the regularization constant
for i in range(len(C_vect)):
    # define and train the regression
    regression = linear_model.LogisticRegression(penalty='l2', C=C_vect[i])
    regression.fit(X_tr, Y_tr)
    # find the f1 score for training set
    f1_train.append(metrics.f1_score(Y_tr, regression.predict(X_tr), average='micro'))
    # find the f1 score for validation set
    f1_val.append(metrics.f1_score(Y_val, regression.predict(X_val), average='micro'))
plt.plot(C_vect, f1_train, 'rs', C_vect, f1_val, 'bs')
# log scale on the x-axis
plt.xscale('log')



In this plot, larger values mean that the classifier is doing a better job, with 1.00 implying perfect classification. On the horizontal axis, larger values mean less regularization. The red squares show that as we weaken the regularization, the classifier does a better job on the training data. But the performance on the validation data improves for a while and then slowly degrades. So for very little regularization, we have overfitting: from a probabilistic point of view, the classifier has fit the quirks of this particular training sample rather than the ensemble from which the data are drawn.

The validation score peaks around $C \approx 10^{-2.5}$, so even though I've trained on a small subset of the data, I would use this value moving forward.

Now let's make the same graph using l1 regularization.
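
The grid-search loop is identical to the one above except for the penalty argument. Here is a minimal sketch of mine, reusing C_vect, X_tr, Y_tr, X_val, and Y_val from the previous block; I've also passed solver='liblinear' explicitly, since newer scikit-learn releases default to a solver that doesn't support the l1 penalty.

f1_train_l1 = []
f1_val_l1 = []
for i in range(len(C_vect)):
    # same grid search as before, but with an l1 penalty
    regression = linear_model.LogisticRegression(penalty='l1', C=C_vect[i], solver='liblinear')
    regression.fit(X_tr, Y_tr)
    f1_train_l1.append(metrics.f1_score(Y_tr, regression.predict(X_tr), average='micro'))
    f1_val_l1.append(metrics.f1_score(Y_val, regression.predict(X_val), average='micro'))
plt.plot(C_vect, f1_train_l1, 'rs', C_vect, f1_val_l1, 'bs')
plt.xscale('log')
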
The same trends are present here, but the exact value of the optimum is different, around $C \approx 10^{-5.5}$. As a nice illustration, we can run the classifier with this value and see which pixels it eliminates. To do that, we retrieve the coefficients from the classifier, of which we get one per pixel per numeral. Keeping only those pixels whose coefficients are nonzero for at least one of the numerals generates this map:

# L1 regularization for C = 0.003
X = train_x[1:1000,:]
Y = train_y[1:1000]
# Specify the classifier. L1 penalty.
logreg_l1 = linear_model.LogisticRegression(penalty='l1', C=.003)
# train the classifier
logreg_l1.fit(X, Y)
# get the coefficients (one per pixel per numeral)
coef_l1_LR = logreg_l1.coef_.ravel()
# put into matrix form (each row is for one numeral, each column for one pixel)
coef_l1_matrix = coef_l1_LR.reshape(10, 784)
# express the sparsity as a percentage. higher means more zero-valued coefficients.
sparsity_l1_LR = np.mean(coef_l1_LR == 0) * 100
print("Sparsity: %.1f%%" % sparsity_l1_LR)
# flag the pixels whose coefficient is nonzero for at least one numeral
coef_l1_nonzero = abs(coef_l1_matrix).sum(axis=0) > 0
# show which pixels were kept
plt.imshow(coef_l1_nonzero.reshape(28, 28), cmap=cm.Greys_r, interpolation='none')
plt.grid(False)
plt.title('Nonzero coefficients')
So to recap, the white pixels are the ones the classifier decides to keep when we tell it to discard the least informative inputs. Compare this to the map of the variance of each pixel from part 1.

We see that our hunch was correct: the classifier preferentially kept the high-variance pixels.

Now that we have this pipeline, we should be able to use it for other classifiers. The exact analysis will likely change, but at least we'll have a basis for comparison.
