Large Margin Classification
Using the Perceptron
Algorithm
Yoav Freund
Robert E. Schapire
Presented by
Amit Bose
March 23, 2006
Goals of the Paper
• Enhance Rosenblatt’s Perceptron algorithm so that it can make use of a large margin
• Analyze the error bounds of such an
algorithm
• Verify hypotheses using experimental
results
But Why Do It?
• Achieving a large margin is desirable
• Non-linear mapping to higher-
dimensional spaces is possible
– Improves separability
– Improves chances of widening the margin
• Kernel functions allow computational
tractability despite high dimensionality
• Good old Perceptron doesn’t care about
margins
• It is just happy to reduce the training
error as much as it can
…haven’t you heard of SVMs?
• Yeah, sure. But we love Perceptrons!
• SVM
– Upsides: linear, maximal margin, kernel
compatible
– Downside: optimization involves solving a
large quadratic program
• Enter the voted-perceptron
– A Perceptron-based linear classifier
– Retains the Perceptron’s simplicity
– Takes advantage of any margin that can be achieved
Perceptron Revisited
• Given: a sequence of m training samples (x_i, y_i); each x_i is an n-dimensional real vector and each y_i is either +1 or -1
• Maintain a prediction vector w, initialized to 0
• For each sample x presented, predict
y(pred) = sign(w · x)
• If y(pred) matches y, do nothing; else update w:
w = w + y x
• Modification to get the voted-perceptron (see the sketch after this slide):
– Don’t discard the intermediate prediction vectors
– Instead, maintain weights on the prediction vectors themselves, so that their predictions can be combined
– Weight of a prediction vector = number of samples it survives without making a mistake
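A minimal sketch (in Python with NumPy) of one training pass of the voted-perceptron as described on this slide; function and variable names are illustrative, not from the paper:

import numpy as np

def voted_perceptron_train(X, y, epochs=1):
    # X: (m, n) array of samples; y: (m,) array of labels in {-1, +1}.
    # Returns a list of (prediction_vector, weight) pairs, where the weight
    # counts how many samples that vector survived without a mistake.
    w = np.zeros(X.shape[1])   # current prediction vector, starts at 0
    c = 0                      # its survival count (its vote weight)
    vectors = []               # retired (w, c) pairs kept for voting
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:      # mistake (sign(0) counted as a mistake)
                vectors.append((w.copy(), c))  # retire the current vector with its weight
                w = w + y_i * x_i              # standard Perceptron update
                c = 1
            else:
                c += 1                         # correct: the current vector earns one more vote
    vectors.append((w.copy(), c))              # keep the final vector as well
    return vectors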
From Online to Batch
• The Perceptron is naturally an online algorithm
• There are several ways to convert an online algorithm to a batch one:
– Cycle through the data for a pre-defined number of epochs or until the algorithm converges
– Pocket algorithm – track which intermediate vector has the longest run of correct predictions (a sketch follows this slide)
– Test all generated prediction vectors against a validation set and pick the best
– Leave-one-out
• The voted-perceptron uses the deterministic version of leave-one-out
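A minimal sketch of the pocket idea mentioned above: keep "in the pocket" the intermediate vector with the longest run of consecutive correct predictions; names are illustrative:

import numpy as np

def pocket_perceptron(X, y, epochs=10):
    w = np.zeros(X.shape[1])
    best_w, best_run, run = w.copy(), 0, 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # mistake: update w and reset its run
                w = w + y_i * x_i
                run = 0
            else:
                run += 1
                if run > best_run:          # new longest run: pocket this vector
                    best_run, best_w = run, w.copy()
    return best_w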
Leave-One-Out Conversion
• Train only once, on a subset of the training samples, and make a prediction on a test instance
• Two ways of choosing subsets:
– Randomized – pick a subset size r at random and train on the first r samples
– Deterministic – for every possible subset size r, train separately on the first r samples to get a set of classifiers; the prediction is made by majority vote
• The Perceptron with the modifications suggested earlier, when run for exactly one epoch, is exactly the deterministic leave-one-out conversion
– Each presentation of a sample to the Perceptron is like training the Perceptron on a different subset size
– Maintaining weights on the prediction vectors is like aggregating the votes of a predictor in leave-one-out (see the voting sketch after this slide)
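A sketch of the voting step this correspondence relies on: each stored prediction vector casts its survival count as votes for its own prediction. Names follow the earlier training sketch and are illustrative:

import numpy as np

def voted_predict(vectors, x):
    # vectors: list of (w, c) pairs produced during training.
    # Each prediction vector w casts c votes for sign(w . x).
    total = sum(c * np.sign(np.dot(w, x)) for w, c in vectors)
    return 1 if total >= 0 else -1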
General Comments
• The algorithm is exceedingly simple – the original algorithm remains untouched except for saving the intermediate predictors and combining them (compare this to quadratic programming)
• Its modus operandi is similar in many respects to that of boosting
• It doesn’t make an explicit attempt to maximize any margin, yet, like boosting, it enjoys the benefits of a large margin
• Enhancements of the Perceptron that do maximize the margin exist, notably the AdaTron
– Converges asymptotically to the maximal-margin solution
– Rate of convergence follows an exponential law in the number of iterations
Analysis: Leave-one-out
• Suppose the online algorithm is expected to make P mistakes when given m+1 samples drawn i.i.d. from the underlying distribution
• Now convert the online algorithm to batch using leave-one-out and provide it with m random training samples
• The expected probability of making an error on a randomly selected test sample is then upper bounded by (restated as formulas after this slide):
P/(m+1) for the randomized version
2P/(m+1) for the deterministic version
• This is indeed a generalization error bound, but it is slightly different from the bounds we have seen – it is a bound on the instantaneous error, the probability of erring on a single random test instance
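The same statement written out in LaTeX, as a transcription of this slide (here P is the expected number of mistakes the online algorithm makes on m+1 i.i.d. samples):

\[
  \Pr[\text{randomized leave-one-out errs on the test point}] \;\le\; \frac{P}{m+1},
  \qquad
  \Pr[\text{deterministic leave-one-out errs on the test point}] \;\le\; \frac{2P}{m+1}.
\]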
Analysis: Perceptron
• Need to find P for the online Perceptron
• In the separable case, the number of mistakes at any stage is ≤ (R/γ)²
• In the inseparable case, define slack variables for the samples that violate the margin
• Add dimensions so as to reduce the problem to the separable case in higher dimensions
[Figure: training points inside a ball of radius R, separated with margin 2γ]
• The upper bound on the number of mistakes becomes ((R+D)/γ)², where D is the square root of the sum of the squared slack variables (spelled out after this slide)
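Written out in LaTeX, following the paper’s construction for a candidate unit vector u and margin γ > 0 (the slack d_i measures how far sample i falls short of margin γ):

\[
  \text{separable case: } \#\text{mistakes} \le \left(\frac{R}{\gamma}\right)^{2},
  \qquad
  \text{inseparable case: } \#\text{mistakes} \le \left(\frac{R + D_{u,\gamma}}{\gamma}\right)^{2},
\]
\[
  \text{where } R = \max_i \|x_i\|, \qquad
  d_i = \max\{0,\ \gamma - y_i\,(u \cdot x_i)\}, \qquad
  D_{u,\gamma} = \sqrt{\textstyle\sum_i d_i^{2}}.
\]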
Analysis: Putting it Together
• Given m samples, the probability of instantaneous error of the voted-perceptron is at most
(2/(m+1)) · E[ inf_{||u||=1, γ>0} ((R + D_{u,γ}) / γ)² ]
• Notice the inverse dependence on the margin γ (the bound scales with 1/γ²)
• Here the expectation E[·] is taken over the random draw of the m+1 samples from the underlying distribution
• A stronger version of the bound follows in which R and D are computed only over those samples that are misclassified
Using Kernels
• Kernels are useful for calculating inner products in high-dimensional mapped spaces
• The Perceptron algorithm can be formulated so that all computations involve only inner products between samples
• Training and prediction involve inner products between samples and prediction vectors
• Because of the update rule, the prediction vector itself is a sum of (misclassified) samples
• The inner products used can therefore be expanded into sums of inner products between samples – kernel compatible (see the sketch after this slide)
• Calculation of the final prediction vector can also be done in linear time
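A minimal sketch of the kernel trick applied to the plain (unvoted) Perceptron as described on this slide: the prediction vector is never formed explicitly; instead, the samples on which mistakes were made are stored, and w · Φ(x) expands into a sum of kernel evaluations. The polynomial kernel shown matches the form used in the paper’s experiments; all names are illustrative, and the vote counts of the earlier sketch can be tracked in the same way:

import numpy as np

def poly_kernel(a, b, d=4):
    # Polynomial kernel (1 + a . b)^d, standing in for the inner product
    # in the high-dimensional mapped space.
    return (1.0 + np.dot(a, b)) ** d

def kernel_perceptron_train(X, y, kernel=poly_kernel, epochs=1):
    # Store the misclassified samples (with labels) instead of an explicit w:
    # implicitly, w = sum_j y_j * Phi(x_j) over the stored mistakes.
    mistakes = []
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            score = sum(y_j * kernel(x_j, x_i) for x_j, y_j in mistakes)
            if y_i * score <= 0:            # mistake: "add y_i * Phi(x_i)" to w
                mistakes.append((x_i, y_i))
    return mistakes

def kernel_predict(mistakes, x, kernel=poly_kernel):
    score = sum(y_j * kernel(x_j, x) for x_j, y_j in mistakes)
    return 1 if score >= 0 else -1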
Experimental Results
• NIST OCR database
• Classification of hand-written digits
• Multi-class problem – reduce to several one-vs-rest 2-class problems; go with the class that has the highest prediction score
• Score calculated in 4 different ways:
– Using final prediction vector (normalized and unnormalized)
– Using voted prediction vector
– Using averaged prediction vector (normalized and unnormalized)
– Using randomized prediction vector (normalized and unnormalized)
• Polynomial kernels of degree up to 6
• Observations
– Moving to higher dimensions causes a large drop in error
– Voting and averaging the prediction vectors work better than the traditional Perceptron
– They are significantly better when fewer epochs are run
– The randomized prediction vectors performed worst in all cases
– The algorithm ran slowly for low-degree polynomial kernels
– Accuracy is slightly inferior to, but comparable with, that of SVM
Summary
• The paper described a simple extension to the Perceptron
• The extension allows the Perceptron to take advantage of any margin that may be achievable
• The kernel formulation can reduce computation
• The Perceptron is now ready for classification in non-linearly mapped higher-dimensional spaces
• Experiments verify this proposition
The (Kernel) AdaTron Algorithm