
Large Margin Classification
Using the Perceptron
Algorithm

Yoav Freund
Robert E. Schapire

Presented by
Amit Bose

March 23, 2006

Goals of the Paper

• Enhance Rosenblatt's Perceptron algorithm so that it can make use of a large margin

• Analyze the error bounds of such an
algorithm

• Verify hypotheses using experimental
results

But Why do It?

• Achieving a large margin is desirable
• Non-linear mapping to higher-dimensional spaces is possible

– Improves separability
– Improves chances of widening the margin

• Kernel functions allow computational
tractability despite high dimensionality

• Good old Perceptron doesn’t care about
margins

• It is just happy to reduce the training
error as much as it can

…haven’t you heard of SVMs?

• Yeah, sure. But we love Perceptrons!
• SVM

– Upsides: linear, maximal margin, kernel compatible
– Downside: optimization involves solving a large quadratic program

• Enter the voted-perceptron

– Perceptron-based linear classifier
– Retains simplicity
– Takes advantage of any margin that can be achieved

Perceptron Revisited

• Given: a sequence of m training samples (xi, yi); each xi is an n-dimensional real vector and each yi is either +1 or -1

• Maintain a prediction vector w initialized to 0
• For each sample x presented, predict

y(pred) = sign(w · x)
• If y(pred) matches y, do nothing; else update w :

w = w + yx
• Modification to get the voted-perceptron (a sketch follows below):

– Don't discard intermediate prediction vectors
– Instead, maintain weights on the prediction vectors themselves, so that different predictions can be combined
– Weight = number of samples that the prediction vector survives without making a mistake
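
As a minimal illustration (not code from the paper), one epoch of the voted-perceptron training just described might look as follows in Python/NumPy; the function name and data layout are assumptions made for this sketch.

    import numpy as np

    def train_voted_perceptron(X, y):
        """One epoch of voted-perceptron training.
        X: (m, n) array of samples; y: length-m array of labels in {+1, -1}.
        Returns a list of (prediction_vector, weight) pairs."""
        w = np.zeros(X.shape[1])          # current prediction vector
        c = 0                             # samples the current vector has survived
        voters = []                       # saved (vector, weight) pairs
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) == y_i:   # correct prediction: w survives
                c += 1
            else:                         # mistake: retire w with its weight, then update
                voters.append((w.copy(), c))
                w = w + y_i * x_i         # standard perceptron update
                c = 1
        voters.append((w.copy(), c))
        return voters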

From Online to Batch

• The Perceptron is naturally an online algorithm

• Several ways to convert an online algorithm
to batch

– Cycle through the data for a pre-defined number of epochs or until the algorithm converges

– Pocket algorithm – track which intermediate
vector has the longest run of correct predictions

– Test all generated prediction vectors against
validation set and pick the best

– Leave-one-out

• Voted-perceptron uses the deterministic
version of leave-one-out

Leave-One-Out Conversion

• Train only once, on a subset of the training samples, and make a prediction on a test instance

• Two ways of choosing subsets:

– Randomized – pick a subset size r at random and train on the first r samples
– Deterministic – for every possible subset size r, train separately on the first r samples to get a set of classifiers; the prediction is made by a majority-vote rule

• The Perceptron with the modifications suggested earlier, when run for exactly one epoch, is actually deterministic leave-one-out (see the prediction sketch below)

– Each presentation of a sample in the perceptron is like training the perceptron on a different subset size
– Maintaining weights on the prediction vectors is like aggregating the votes of a predictor in leave-one-out
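
Continuing the earlier sketch (again illustrative, not the paper's code), the deterministic/voted prediction is a weighted majority vote over the saved prediction vectors; breaking a tie toward +1 is an arbitrary choice here.

    import numpy as np

    def predict_voted(voters, x):
        """voters: list of (w, c) pairs from train_voted_perceptron; x: a sample."""
        # Each saved vector casts the vote sign(w . x), weighted by its count c.
        total = sum(c * np.sign(w @ x) for w, c in voters)
        return 1 if total >= 0 else -1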

General Comments

• Algorithm is exceedingly simple – original
algorithm remains untouched except for
saving intermediate predictors and
combining these (compare this to quadratic
programming)

• Modus operandi is similar in many respects
to that of boosting

• Doesn’t make an explicit attempt to
maximize any margin, yet like boosting,
enjoys benefits of a large margin

• Enhancements of the Perceptron that do so exist, notably the AdaTron

– Converges asymptotically to the correct solution

– Rate of convergence follows an exponential law in
the number of iterations

Analysis: Leave-one-out

• Suppose the online algorithm is expected to make P mistakes when given m+1 samples drawn i.i.d. from the underlying distribution

• Now convert the online algorithm to batch
using leave-one-out and provide it m random
training samples

• Expected probability of making an error on a
randomly selected test sample is upper
bounded by:

P/(m+1) for the randomized version

2P/(m+1) for the deterministic version

• This is indeed a generalization error bound, but it is slightly different from the bounds we have seen before – it is called the instantaneous error
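
Written out compactly (with P the expected number of mistakes the online algorithm makes on m+1 i.i.d. samples, as above):

\[
\Pr[\text{error}] \;\le\; \frac{P}{m+1} \ \text{(randomized)}, \qquad \Pr[\text{error}] \;\le\; \frac{2P}{m+1} \ \text{(deterministic)}
\]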

Analysis: Perceptron

• Need to find P for the online Perceptron

• In the separable case, the number of errors at any stage is at most (R/γ)², where R bounds the norm of the samples and γ is the separation margin

• For the inseparable case, define slack variables measuring by how much each sample falls short of the margin γ

• Add dimensions so as to reduce the problem to the separable case in higher dimensions

• Upper bound on the number of errors is ((R+D)/γ)², where D is the square root of the sum of the squared slack variables

[Figure: samples contained in a ball of radius R, separated by a margin band of width 2γ]
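
Making these quantities explicit (this restates the standard bounds from Freund and Schapire's analysis; u is any unit-norm vector and γ > 0 a target margin):

\[
\text{separable case:}\quad \#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2}, \qquad R = \max_i \|x_i\|
\]
\[
\text{inseparable case:}\quad d_i = \max\{0,\; \gamma - y_i\,(u \cdot x_i)\}, \quad D = \sqrt{\textstyle\sum_i d_i^{2}}, \quad \#\text{mistakes} \;\le\; \left(\frac{R+D}{\gamma}\right)^{2}
\]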

Analysis: Putting it Together

• Given m samples, the probability of an instantaneous error of the voted-perceptron is at most

\[
\frac{2}{m+1}\;\mathbb{E}\!\left[\;\inf_{\|u\|=1,\ \gamma>0}\left(\frac{R + D_{u,\gamma}}{\gamma}\right)^{2}\right]
\]

• Notice the inverse proportionality to margin γ

• Here the expectation E[·] is taken over all possible m+1 samples that can be drawn from the underlying distribution

• A stronger statement on the bound follows where
R and D are determined for only those samples
that are misclassified

Using Kernels

• Kernels are useful for calculating inner products in
high-dimensional mapped spaces

• The Perceptron algorithm can be formulated so that computations involve inner products between samples

• Training and prediction involve inner products
between samples and prediction vectors

• Because of the update rule, the prediction vector itself is calculated as a sum of (misclassified) samples

• The inner products used can therefore be expanded into sums of inner products between samples – kernel compatible

• Calculation of the final prediction vector can also be done in linear time (see the sketch below)
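
As a hedged illustration of that kernel formulation (the kernel choice and the names below are assumptions, not the paper's code), the prediction vector never needs to be formed explicitly:

    import numpy as np

    def poly_kernel(a, b, d=2):
        # Polynomial kernel (1 + a.b)^d; the degree chosen here is arbitrary.
        return (1.0 + a @ b) ** d

    def kernel_perceptron_epoch(X, y, kernel=poly_kernel):
        """One epoch; the prediction vector is kept implicitly as a signed sum
        of misclassified samples, so only kernel evaluations are needed."""
        mistakes = []                        # indices of misclassified samples
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            # w . x_i expands into a sum of kernels with the stored samples
            score = sum(y[j] * kernel(X[j], x_i) for j in mistakes)
            if y_i * score <= 0:             # a score of 0 counts as a mistake
                mistakes.append(i)
        return mistakes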

Experimental Results

• NIST OCR database
• Classification of hand-written digits
• Multi-class problem – reduce to several one-vs-rest 2-class problems; go with the class that has the highest prediction score (see the sketch after this list)
• Score calculated in 4 different ways:

– Using final prediction vector (normalized and unnormalized)
– Using voted prediction vector
– Using averaged prediction vector (normalized and unnormalized)
– Using randomized prediction vector (normalized and unnormalized)

• Polynomial kernels of up to degree 6
• Observations

– Moving to higher dimensions causes a large drop in error
– Voting and averaging prediction vectors work better than the traditional perceptron
– These are significantly better when fewer epochs are used
– Random vectors performed worst in all cases
– The algorithm ran slowly for low-degree polynomial kernels
– Accuracy is inferior, but comparable, to that of SVM
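
A hypothetical sketch of that one-vs-rest reduction (the helper name and the use of the unnormalized averaged score are assumptions for illustration): train one binary voted-perceptron per digit and pick the class with the highest score.

    def predict_digit(classifiers, x):
        """classifiers: dict mapping each digit to the list of (w, c) pairs
        produced by training that digit against the rest."""
        def avg_score(voters):
            # Unnormalized averaged prediction vector, dotted with x.
            return sum(c * (w @ x) for w, c in voters)
        return max(classifiers, key=lambda digit: avg_score(classifiers[digit]))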

Summary

• The paper described a simple extension to the Perceptron

• The extension allows the Perceptron to utilize any margin that may be achievable

• Kernel formulation can reduce
computation

• The Perceptron is now ready for
classification in non-linearly mapped
higher dimensions

• Experiments verify this proposition

The (Kernel) AdaTron Algorithm

