
2009 Ninth IEEE International Conference on Data Mining

Maximum Margin Clustering on Data Manifolds

Fei Wang, Xin Wang, Tao Li
School of Computing and Information Sciences, Florida International University, Miami, FL 33199

{feiwang,xwang009,taoli}@cs.fiu.edu

Abstract—Clustering is one of the most fundamental and important problems in the computer vision and pattern recognition communities. Maximum Margin Clustering (MMC) is a recently proposed clustering technique which has shown promising experimental results. The main theme behind MMC is to extend the standard maximum margin principle of the Support Vector Machine (SVM) to the unsupervised scenario. This paper considers the problem of maximum margin clustering on data manifolds. Specifically, we propose an approach called Manifold Regularized Maximum Margin Clustering (MRMMC), which combines maximum margin data discrimination and data manifold information in a unified clustering objective, and we propose an efficient algorithm to solve it. Finally, experimental results on several real world data sets are presented to show the effectiveness of our method.

I. INTRODUCTION

Maximum Margin Clustering (MMC) is a recently proposed clustering method which extends the standard theory of the Support Vector Machine (SVM) to the unsupervised scenario [14][16]. The maximum margin principle behind SVM is a general rule for discriminating data from different classes, and it has been widely used in the computer vision (e.g., object tracking [23] and stereo vision [9]) and pattern recognition (e.g., visual category recognition [19] and face recognition [5]) communities. Most maximum margin methods are supervised, because they can be formulated as convex problems and solved efficiently. In MMC, however, since we must solve for the clustering hyperplanes as well as the cluster assignments, the resulting problem is usually nonconvex and hard to solve. Originally, Xu et al. [16] proposed to first relax the MMC problem to a convex one and then apply Semi-Definite Programming to solve it. They showed promising empirical results, but the required computational complexity is too high to handle large scale data sets.

To make the MMC problem more practical, many efficient implementations have been proposed [14][20][21][22], among which Cutting Plane Maximum Margin Clustering (CPMMC) [21][22] has shown state-of-the-art performance in both accuracy and efficiency. However, the CPMMC algorithm directly applies the cutting plane approach [7], which is designed for convex nonsmooth problems, to the nonconvex MMC problem. Thus the convergence and optimality of the final solution may not be guaranteed.

In this paper, we propose a novel algorithm called Manifold Regularized Maximum Margin Clustering (MRMMC). Compared with existing MMC and conventional clustering approaches, our method has the following strengths:

(1) As indicated by [13][10], many real world data sets (especially the data sets used in pattern recognition and computer vision, such as faces, handwritten digits and images) exhibit low-dimensional manifold structures. However, traditional MMC approaches only consider how to discriminate different clusters and discard this manifold information. Our MRMMC method combines maximum margin discrimination and data manifold information in a principled way by incorporating cluster assignment smoothness as a prior regularization term, so that the clustering results take into account both the discriminative and the geometric information contained in the data sets.

(2) Different from CPMMC [21][22], we propose a theoretically more elegant approach to solve the MRMMC problem. It first applies the Constrained Concave-Convex Procedure (CCCP) [12] to decompose the original nonconvex problem into a series of convex problems, and then applies the cutting plane method to solve each of them. Thus the final solution is guaranteed to converge to a local optimum. The idea is borrowed from [6].

(3) Similar to existing MMC approaches, MRMMC obtains the cluster hyperplanes as well as the cluster assignments, which makes it easy to extend to out-of-sample data points.

The rest of this paper is organized as follows. Section 2 briefly reviews the basic knowledge of MMC and CCCP. The detailed procedure of MRMMC is introduced in Section 3. Section 4 presents the experimental results, followed by the conclusions in Section 5.

II. PRELIMINARIES AND BACKGROUND KNOWLEDGE

A. Maximum Margin Clustering

Mathematically, for two-class problems, given a point set X = {x_1, ..., x_n}¹ and their labels y = (y_1, ..., y_n) ∈ {−1, +1}^n, SVM seeks a hyperplane with normal vector w²,

f(x) = w^T x    (1)

by solving the following optimization problem

\min_{w,\xi_i} \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i    (2)
\text{s.t.} \;\; y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n

where {ξ_i} are slack variables and C > 0 is a constant.

¹In the following we will use X to denote the data set and the data space interchangeably without any confusion.
²Two remarks: (1) we do not consider the offset b of the hyperplane, since it can easily be incorporated into our framework by augmenting x̃_i = [x_i^T, 1]^T and w̃ = [w^T, b]^T; (2) we do not consider the kernel trick here. For nonlinear cases, x_i can be viewed as the low-dimensional embedding of the i-th data point obtained by performing Kernel PCA [11][3].
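As a concrete illustration of footnote 2, the following sketch (illustrative only; the toy data, the choice of scikit-learn's KernelPCA, and all parameter values are assumptions rather than details taken from this paper) shows the offset-absorbing augmentation and a kernel PCA preprocessing step:

```python
import numpy as np
from sklearn.decomposition import KernelPCA  # assumed preprocessing tool, not mandated by the paper

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))            # toy data: n = 100 points in d = 5 dimensions

# Remark (1): absorb the offset b by augmenting x_i -> [x_i; 1] and w -> [w; b],
# so that the augmented inner product equals w^T x_i + b.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Remark (2): for nonlinear cases, replace each x_i by a low-dimensional kernel PCA
# embedding (the paper cites [11][3]); the kernel and dimension here are illustrative.
X_emb = KernelPCA(n_components=3, kernel="rbf", gamma=0.5).fit_transform(X)
print(X_aug.shape, X_emb.shape)              # (100, 6) (100, 3)
```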


Following standard SVM, for two-class problems, MMC aims to find the optimal cluster hyperplane together with the cluster assignments by solving [16]:

\min_{y \in \{-1,+1\}^n} \; \min_{w,\xi_i} \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i    (3)
\text{s.t.} \;\; y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\qquad\; -l \le \mathbf{e}^T y \le l

where l ≥ 0 is a constant controlling the cluster imbalance and e is the all-one vector.

Zhao et al. [21] recently proposed to compute the cluster assignment of x_i by

y_i = sign(w^T x_i)

where sign(·) is the sign function. They then formulate the two-class MMC problem as

\min_{w,\xi_i} \; \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}\xi_i    (4)
\text{s.t.} \;\; |f(x_i)| \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\qquad\; -l \le \sum_{i=1}^{n} f(x_i) \le l

They proved that, without the cluster-balance constraint, the solution to problem (4) is identical to the solution to problem (3), and they used the cutting plane method [7] to solve the problem. However, it is easily observed that the first constraint in problem (4) is nonconvex with respect to w, while the cutting plane algorithm [7] was originally designed for convex problems; thus the convergence and optimality of directly applying the cutting plane method to problem (4) may not be guaranteed.

B. Concave-Convex Procedure

The concave-convex procedure (CCP) [18] is a method for solving nonconvex optimization problems whose objective function can be expressed as a difference of convex functions. While [18] only considered linear constraints, [12] proposed the constrained concave-convex procedure (CCCP), which can solve optimization problems with a concave-convex objective function under concave-convex constraints. Mathematically, assume we want to solve the following optimization problem [12]

\min_{z} \; f_0(z) - g_0(z)    (5)
\text{s.t.} \;\; f_i(z) - g_i(z) \le c_i, \quad i = 1, \dots, n

where f_i and g_i are real-valued convex functions on a vector space Z and c_i ∈ R for all i = 1, ..., n. Denote by T_1{f, z}(z') the first-order Taylor expansion of f at location z, that is, T_1{f, z}(z') = f(z) + ∂_z f(z)(z' − z), where ∂_z f(z) is the gradient of the function f at z³. Given an initial point z_0, the CCCP computes z_{t+1} from z_t by replacing g_i(z) with its first-order Taylor expansion at z_t, i.e., T_1{g_i, z_t}(z), and setting z_{t+1} to be the solution of the following relaxed optimization problem

\min_{z} \; f_0(z) - T_1\{g_0, z_t\}(z)    (6)
\text{s.t.} \;\; f_i(z) - T_1\{g_i, z_t\}(z) \le c_i, \quad i = 1, \dots, n

The procedure continues until z_t converges, and [12] proved that the CCCP finally converges to a local minimum of problem (5).

³For non-smooth functions, the gradient ∂_z f(z) can be replaced by the subgradient [4].
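To make the CCCP iteration (5)-(6) concrete, the toy sketch below (an illustrative example with assumed functions, not taken from the paper) minimizes the unconstrained difference of convex functions f_0(z) = z^4 and g_0(z) = 2z^2: at each step g_0 is replaced by its first-order Taylor expansion at z_t and the resulting convex subproblem is solved.

```python
from scipy.optimize import minimize_scalar

f0 = lambda z: z**4          # convex part
g0 = lambda z: 2.0 * z**2    # convex part entering with a minus sign; gradient is 4z
grad_g0 = lambda z: 4.0 * z

z = 2.0                       # initial point z_0
for t in range(50):
    # convex subproblem (6): minimize f0(z) - [g0(z_t) + grad_g0(z_t) (z - z_t)]
    taylor = lambda z_new, zt=z: g0(zt) + grad_g0(zt) * (z_new - zt)
    z_next = minimize_scalar(lambda z_new: f0(z_new) - taylor(z_new)).x
    if abs(z_next - z) < 1e-8:   # stop when z_t has converged
        break
    z = z_next

print(z)   # converges to a local minimum of z^4 - 2z^2, here z = 1 (z = -1 from a negative start)
```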


III. MANIFOLD REGULARIZED MAXIMUM MARGIN CLUSTERING (MRMMC)

In this section we introduce our Manifold Regularized Maximum Margin Clustering (MRMMC) algorithm in detail.

A. Motivation and Formulation

Although Equations (3) and (4) provide an elegant way of incorporating the maximum margin principle into unsupervised learning, they do not explore any prior knowledge in the data space. Specifically, much previous research shows that the data in computer vision and pattern recognition usually form low-dimensional manifolds [13][10]. In such cases, taking into account the geometric information contained in the data sets is of crucial importance to the performance of the algorithms.

Consider the statistical framework of learning from examples, where there is a joint probability distribution P on the product space X × R, such that the unlabeled examples x ∈ X are sampled from the marginal distribution P_X of P, obtained by integrating the label space out. Our goal is then to learn the conditional P(y|x). In general, we can make the specific assumption that there is a strong connection between the marginal P_X and the conditional P(y|x), so that we can exploit prior knowledge on P_X for better function learning. More concretely, we incorporate the smoothness assumption [1] from the semi-supervised learning field into the MMC framework, which says:

(Smoothness Assumption) [1][2]. If two points x_i, x_j ∈ X are close in the intrinsic geometry of P_X, then the conditional distributions P(y|x_i) and P(y|x_j) should be similar.

An equivalent statement of the above smoothness assumption is that the prediction function f should be sufficiently smooth with respect to the intrinsic geometry of P_X; therefore the question is how to estimate the intrinsic geometry of P_X. As suggested by [1], we are particularly interested in the case when the support of P_X is a compact submanifold M ⊂ X, in which case we can estimate the smoothness of f over P_X by

\|f\|_I^2 = \int_{\mathcal{M}} \langle \nabla_{\mathcal{M}} f, \nabla_{\mathcal{M}} f \rangle    (7)

where ∇_M is the manifold gradient. The smaller ||f||_I^2 is, the smoother f will be. [1] also pointed out that ||f||_I^2 can be approximated on the basis of the data points in X as

\|f\|_I^2 = \frac{1}{n} f^T L f    (8)

where f = [f(x_1), ..., f(x_n)]^T, and L = D − W ∈ R^{n×n} is the graph Laplacian defined on the data adjacency graph, where W_ij is the weight on the edge connecting x_i and x_j, and D is a diagonal matrix with D_ii = Σ_j W_ij.
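As an illustration of Eq. (8), the sketch below (a minimal example; the fully connected graph, the Gaussian weights and the bandwidth choice are assumptions, since the paper only specifies L = D − W on a data adjacency graph) builds a Gaussian-similarity graph and evaluates the smoothness penalty (1/n) f^T L f for a linear prediction function:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))                   # toy data, n = 50 points

# Gaussian similarities W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); sigma is an assumed bandwidth
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigma = np.median(np.sqrt(sq_dists))
W = np.exp(-sq_dists / (2 * sigma**2))
np.fill_diagonal(W, 0.0)                           # no self-loops

D = np.diag(W.sum(axis=1))                         # D_ii = sum_j W_ij
L = D - W                                          # graph Laplacian of Eq. (8)

w = rng.standard_normal(2)
f = X @ w                                          # f = [f(x_1), ..., f(x_n)]^T with f(x_i) = w^T x_i
smoothness = f @ L @ f / len(X)                    # (1/n) f^T L f: small value = smooth f over the graph
print(smoothness)
```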

Combining Eq. (4) and Eq. (8), we can derive our MRMMC problem as

\min_{w,\xi_i} \; \frac{1}{2}\|w\|^2 + \frac{C_1}{n} f^T L f + \frac{C_2}{n}\sum_{i=1}^{n}\xi_i    (9)
\text{s.t.} \;\; |f(x_i)| \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n
\qquad\; -l \le \sum_{i=1}^{n} f(x_i) \le l

In the following section we introduce an efficient algorithm to solve problem (9).

B. Problem Solving

In problem (9) we have n slack variables {ξ_i}. To solve it efficiently, we first derive the 1-slack form of problem (9), as in [21]. Specifically, we introduce a single slack variable ξ ≥ 0 and rewrite problem (9) as

\min_{w,\xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C_1}{n} f^T L f + C_2 \xi    (10)
\text{s.t.} \;\; \forall c_i \in \{0,1\}: \;\; \frac{1}{n}\sum_{i=1}^{n} c_i |f(x_i)| \ge \frac{1}{n}\sum_{i=1}^{n} c_i - \xi
\qquad\; -l \le \sum_{i=1}^{n} f(x_i) \le l

It can be proved that the solution to problem (10) is identical to the solution of problem (9) with ξ = (1/n) Σ_{i=1}^n ξ_i (similar to [21]). Clearly, the first constraint in problem (10) is nonconvex in w, and it is a difference of two convex functions; therefore we can resort to CCCP to solve it.

1) CCCP Decomposition: Given an initial point (w^(0), ξ^(0)), CCCP computes (w^(t+1), ξ^(t+1)) from (w^(t), ξ^(t)) by replacing |f(x_i)| with its first-order Taylor expansion at w^(t) and optimizing the resulting convex problem. Since |f(x_i)| is nonsmooth at w^(t), we replace its gradient with a subgradient when computing its tangent in CCCP. By Eq. (1), the tangent of |f(x_i)| at w^(t) is

\partial_w(|f(x_i)|)\big|_{w^{(t)}} = \mathrm{sign}(f^{(t)}(x_i))\,(w^T x_i)    (11)

where sign(·) is the sign function. Thus, by replacing |f(x_i)| in problem (10) with Eq. (11), we obtain the following relaxed convex optimization problem for each CCCP iteration

\min_{w,\xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C_1}{n} f^T L f + C_2 \xi    (12)
\text{s.t.} \;\; \forall \mathbf{c} \in \Omega: \;\; \frac{1}{n}\sum_{i=1}^{n} c_i\,\mathrm{sign}(f^{(t)}(x_i))\, f(x_i) \ge \frac{1}{n}\sum_{i=1}^{n} c_i - \xi
\qquad\; -l \le \sum_{i=1}^{n} f(x_i) \le l

where c = [c_1, c_2, ..., c_n] and Ω = {0, 1}^n. The above problem is a standard quadratic programming problem which can be solved in standard ways. After obtaining its solution w, we use it as w^(t+1) and continue the iterations until convergence.

2) Cutting Plane Method: The problem now becomes how to solve Eq. (12) efficiently, since it is convex but has an exponential number of constraints. In the following we employ an adaptation of the cutting plane algorithm [7] to solve problem (12), which aims to find a small subset of the constraints of problem (12) that ensures a sufficiently accurate solution. Using such an algorithm, we construct a nested sequence of successively tighter relaxations of problem (12). Similar to [21], we can generally find a polynomially sized subset of constraints with which the solution of the relaxed problem fulfills all constraints of problem (12) up to a precision ε, i.e., for all c_i ∈ {0, 1}:

\frac{1}{n}\sum_{i=1}^{n} c_i\,\mathrm{sign}(f^{(t)}(x_i))\, f(x_i) \ge \frac{1}{n}\sum_{i=1}^{n} c_i - (\xi + \varepsilon)    (13)

That is, the remaining exponentially many constraints are guaranteed to be violated by no more than ε, without the need to add them explicitly to the optimization problem.

Specifically, our algorithm starts with only the cluster balance constraint and solves the following problem

\min_{w} \; \frac{1}{2}\|w\|^2 + \frac{C_1}{n} f^T L f    (14)
\text{s.t.} \;\; -l \le \sum_{i=1}^{n} f(x_i) \le l

After getting the solution w^{t_0} to the above problem⁴, it computes the most violated constraint as

c_i^{t_0} = \begin{cases} 1, & \text{if } \mathrm{sign}(f^{(t)}(x_i))\, f^{t_0}(x_i) < 1 \\ 0, & \text{otherwise} \end{cases}    (15)

where f^{t_0}(x_i) = (w^{t_0})^T x_i.

⁴Here we use the superscript t_i to denote the i-th iteration of the cutting plane algorithm for solving the problem derived from the t-th iteration of CCCP.


Then the algorithm adds this constraint to problem (14) and solves

\min_{w,\xi \ge 0} \; \frac{1}{2}\|w\|^2 + \frac{C_1}{n} f^T L f + C_2 \xi    (16)
\text{s.t.} \;\; \frac{1}{n}\sum_{i=1}^{n} c_i^{t_0}\,\mathrm{sign}(f^{(t)}(x_i))\, f(x_i) \ge \frac{1}{n}\sum_{i=1}^{n} c_i^{t_0} - \xi
\qquad\; -l \le \sum_{i=1}^{n} f(x_i) \le l

The algorithm then obtains the solution w^{t_1} and computes the most violated constraint {c_i^{t_1}} in the same way as Eq. (15), with f^{t_0} replaced by f^{t_1}; this constraint is added to problem (16). The procedure is repeated until no constraint is violated by more than ε. In this way, we construct a successively strengthened series of approximations to problem (12) by a series of cutting planes that cut off the current optimal solution from the feasible set [7].
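Putting the CCCP decomposition (11)-(12) and the cutting plane loop (14)-(16) together, the following is a schematic NumPy/SciPy sketch of the two-class procedure. It is not the authors' implementation: the generic SLSQP call stands in for whatever QP solver is actually used, and the initialization, iteration caps, convergence test and default parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_relaxed_qp(X, L, cuts, signs, C1, C2, l, w_init):
    """Convex subproblem (14)/(16): the signs are fixed by the current CCCP iterate,
    and only the cutting-plane constraints collected so far (in `cuts`) are enforced."""
    n, d = X.shape

    def obj(z):                                   # z = [w, xi]
        w, xi = z[:d], z[d]
        f = X @ w
        return 0.5 * w @ w + (C1 / n) * f @ (L @ f) + C2 * xi

    cons = [
        {"type": "ineq", "fun": lambda z: z[d]},                   # xi >= 0
        {"type": "ineq", "fun": lambda z: l - np.sum(X @ z[:d])},  # sum_i f(x_i) <= l
        {"type": "ineq", "fun": lambda z: l + np.sum(X @ z[:d])},  # sum_i f(x_i) >= -l
    ]
    for c in cuts:  # (1/n) sum_i c_i sign(f^(t)(x_i)) f(x_i) >= (1/n) sum_i c_i - xi
        cons.append({"type": "ineq",
                     "fun": lambda z, c=c: (c * signs) @ (X @ z[:d]) / n - c.mean() + z[d]})

    res = minimize(obj, np.append(w_init, 0.0), method="SLSQP", constraints=cons)
    return res.x[:d], res.x[d]

def mrmmc_two_class(X, L, C1=1.0, C2=1.0, l=5.0, eps=0.01, max_cccp=20, max_cuts=50):
    n, d = X.shape
    w = 0.01 * np.random.randn(d)                 # random initialization w^(0)
    for _ in range(max_cccp):                     # CCCP outer loop
        signs = np.where(X @ w >= 0, 1.0, -1.0)   # sign(f^(t)(x_i)), fixed for this iteration
        cuts = []
        for _ in range(max_cuts):                 # cutting plane inner loop
            w_new, xi = solve_relaxed_qp(X, L, cuts, signs, C1, C2, l, w)
            f = X @ w_new
            c = (signs * f < 1).astype(float)     # most violated constraint, Eq. (15)
            if c.mean() - (c * signs) @ f / n - xi <= eps:
                break                             # every constraint satisfied up to eps
            cuts.append(c)
        if np.linalg.norm(w_new - w) < 1e-4:      # CCCP convergence test
            w = w_new
            break
        w = w_new
    return np.sign(X @ w), w                      # cluster labels and hyperplane normal
```

The multi-class procedure of Section III-C follows the same outer CCCP and inner cutting-plane structure, with the concatenated vectors of Eqs. (19)-(20) in place of w and x_i.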

C. Multi-Class MRMMC

Following [17], we can formulate the multi-class manifold regularized maximum margin clustering problem as

\min_{\{w_p\},\xi_i \ge 0} \; \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{C_1}{nk}\sum_{p=1}^{k} f_p^T L f_p + \frac{C_2}{n}\sum_{i=1}^{n}\xi_i    (17)
\text{s.t.} \;\; \forall i = 1, \dots, n, \; r = 1, \dots, k: \;\; w_{y_i}^T x_i + \delta_{y_i,r} - w_r^T x_i \ge 1 - \xi_i

where we assume the data set X = {x_1, ..., x_n} comes from k clusters, and a separate weight vector w_r is defined for each cluster r such that w_r^T x_i returns the confidence that x_i belongs to cluster r. Here f_p = [f_p(x_1), ..., f_p(x_n)]^T with f_p(x_i) = w_p^T x_i, y_i = arg max_r w_r^T x_i is the cluster membership of x_i, and δ_uv = 1 if u = v and 0 otherwise.

To further simplify problem (17), we propose to absorb δ_{y_i,r} into ξ_i and use a separate variable ξ_ir for each constraint. Problem (17) can then be relaxed to

\min_{\{w_p\},\xi_{ir} \ge 0} \; \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{C_1}{nk}\sum_{p=1}^{k} f_p^T L f_p + \frac{C_2}{nk}\sum_{i=1}^{n}\sum_{r=1}^{k}\xi_{ir}
\text{s.t.} \;\; \forall i = 1, \dots, n, \; r = 1, \dots, k: \;\; \max_{p \in \{1,\dots,k\}} w_p^T x_i - w_r^T x_i \ge 1 - \xi_{ir}

In order to avoid trivial solutions, we also enforce the class balance constraint of [22],

\forall p, q \in \{1, \dots, k\}: \;\; -l \le \sum_{i=1}^{n} w_p^T x_i - \sum_{i=1}^{n} w_q^T x_i \le l

Similar to the two-class MRMMC, the max-margin constraint in the above relaxed problem is nonconvex, and thus we can again resort to CCCP to solve it. We first rewrite it in a 1-slack variable formulation as

\min_{\{w_p\},\xi \ge 0} \; \frac{1}{2}\sum_{p=1}^{k}\|w_p\|^2 + \frac{C_1}{nk}\sum_{p=1}^{k} f_p^T L f_p + C_2 \xi    (18)
\text{s.t.} \;\; \forall i = 1, \dots, n, \; r = 1, \dots, k, \; c_{ir} \in \{0,1\}:
\qquad \frac{1}{nk}\sum_{i,r} c_{ir}\Big(\max_{p \in \{1,\dots,k\}} w_p^T x_i - w_r^T x_i\Big) \ge \frac{1}{nk}\sum_{i,r} c_{ir} - \xi
\qquad \forall p, q \in \{1, \dots, k\}: \;\; -l \le \sum_{i=1}^{n} w_p^T x_i - \sum_{i=1}^{n} w_q^T x_i \le l

Now we introduce two concatenated vectors,

\tilde{w} = [w_1^T, w_2^T, \dots, w_p^T, \dots, w_k^T]^T    (19)
\tilde{x}_{ip} = [\mathbf{0}, \mathbf{0}, \dots, x_i^T, \dots, \mathbf{0}]^T    (20)

where 0 is a 1 × d all-zero vector with d the dimension of x_i; that is, only the ((p−1)d+1)-th through (pd)-th elements of x̃_ip are nonzero and equal to x_i. Then w̃^T x̃_ip = w_p^T x_i, and problem (18) can be reformulated as

\min_{\tilde{w},\xi \ge 0} \; \frac{1}{2}\|\tilde{w}\|^2 + \frac{C_1}{nk}\tilde{f}^T \tilde{L} \tilde{f} + C_2 \xi    (21)
\text{s.t.} \;\; \forall i = 1, \dots, n, \; r = 1, \dots, k, \; c_{ir} \in \{0,1\}:
\qquad \frac{1}{nk}\sum_{i,r} c_{ir}\Big(\max_{p \in \{1,\dots,k\}} \tilde{w}^T \tilde{x}_{ip} - \tilde{w}^T \tilde{x}_{ir}\Big) \ge \frac{1}{nk}\sum_{i,r} c_{ir} - \xi
\qquad \forall p, q \in \{1, \dots, k\}: \;\; -l \le \sum_{i=1}^{n} \tilde{w}^T \tilde{x}_{ip} - \sum_{i=1}^{n} \tilde{w}^T \tilde{x}_{iq} \le l

where f̃ = [w̃^T x̃_11, w̃^T x̃_21, ..., w̃^T x̃_12, ..., w̃^T x̃_nk]^T and L̃ = L ⊗ I_k, with I_k the k × k identity matrix and ⊗ the Kronecker product.
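The point of the concatenation in Eqs. (19)-(21) is that the k per-cluster regularizers collapse into a single quadratic form in w̃. The sketch below (illustrative; it assumes the entries of f̃ are grouped per data point, which is the ordering under which L̃ = L ⊗ I_k reproduces Σ_p f_p^T L f_p) checks both identities numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 4, 3
X = rng.standard_normal((n, d))
W = [rng.standard_normal(d) for _ in range(k)]      # per-cluster weight vectors w_p
A = rng.standard_normal((n, n)); L = A @ A.T        # symmetric PSD stand-in for the graph Laplacian

w_tilde = np.concatenate(W)                          # Eq. (19)

def x_tilde(i, p):                                   # Eq. (20): x_i placed in the p-th block (0-indexed)
    v = np.zeros(k * d)
    v[p * d:(p + 1) * d] = X[i]
    return v

# w_tilde^T x_tilde_{ip} recovers w_p^T x_i
assert np.allclose(w_tilde @ x_tilde(2, 1), W[1] @ X[2])

F = X @ np.stack(W, axis=1)                          # F[i, p] = f_p(x_i)
lhs = sum(F[:, p] @ L @ F[:, p] for p in range(k))   # sum_p f_p^T L f_p
f_tilde = F.reshape(-1)                              # entries grouped per data point
rhs = f_tilde @ np.kron(L, np.eye(k)) @ f_tilde      # f~^T (L kron I_k) f~
assert np.allclose(lhs, rhs)
print("identities verified")
```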


Next, in order to apply CCCP, we need to compute the subgradient of max_{p∈{1,...,k}} w̃^T x̃_ip. For a finite pointwise maximum f(x) = max_{p=1}^k f_p(x), the subdifferential is simply the convex hull of the union of the gradients of the active functions⁵, i.e.,

\partial f(x) = \mathrm{conv}\{\partial f_p(x) \mid f_p(x) = f(x)\}

Note that in our case f_p = w̃^T x̃_ip and the variable to be solved for is w̃; therefore

\partial \max_{p \in \{1,\dots,k\}} \tilde{w}^T \tilde{x}_{ip} = \Big\{ \sum_{r=1}^{k} \beta_{ir}\,\tilde{x}_{ir} \;\Big|\; \sum_{r} \beta_{ir} = 1 \Big\}    (22)

where β_ir ≥ 0 if w̃^T x̃_ir = max_{p∈{1,...,k}} w̃^T x̃_ip, and β_ir = 0 otherwise. In multiclass clustering we usually expect to assign each data point to one unique cluster (we do not consider the multi-label case in this paper), i.e., we expect there to be only one active function when computing the subgradient in Eq. (22). So there is a unique p* satisfying⁶

\tilde{w}^T \tilde{x}_{ip^*} = \max_{p \in \{1,\dots,k\}} \tilde{w}^T \tilde{x}_{ip}    (23)

Then, at the t-th iteration of CCCP, we can compute the first-order Taylor expansion of max_{p∈{1,...,k}} w̃^T x̃_ip at w̃^(t) as

\max_{p \in \{1,\dots,k\}} (\tilde{w}^{(t)})^T \tilde{x}_{ip} + \sum_{p=1}^{k} \beta_{ip}^{(t)} (\tilde{w} - \tilde{w}^{(t)})^T \tilde{x}_{ip} = \tilde{w}^T \tilde{x}_{i p^{*(t)}}

where

p^{*(t)} = \arg\max_{p \in \{1,2,\dots,k\}} (\tilde{w}^{(t)})^T \tilde{x}_{ip}    (24)

Correspondingly, we solve the following problem at the t-th iteration of CCCP

\min_{\tilde{w},\xi \ge 0} \; \frac{1}{2}\|\tilde{w}\|^2 + \frac{C_1}{nk}\tilde{f}^T \tilde{L} \tilde{f} + C_2 \xi    (25)
\text{s.t.} \;\; \forall i = 1, \dots, n, \; r = 1, \dots, k, \; c_{ir} \in \{0,1\}:
\qquad \frac{1}{nk}\sum_{i,r} c_{ir}\big(\tilde{w}^T \tilde{x}_{i p^{*(t)}} - \tilde{w}^T \tilde{x}_{ir}\big) \ge \frac{1}{nk}\sum_{i,r} c_{ir} - \xi
\qquad \forall p, q \in \{1, \dots, k\}: \;\; -l \le \sum_{i=1}^{n} \tilde{w}^T \tilde{x}_{ip} - \sum_{i=1}^{n} \tilde{w}^T \tilde{x}_{iq} \le l

Then, similar to the two-class MRMMC, we can apply the cutting plane method to solve problem (25): at the s-th step, the relaxed solution is required to satisfy every constraint up to a precision ε,

\frac{1}{nk}\sum_{i,r} c_{ir}^{t_s}\big(\tilde{w}^T \tilde{x}_{i p^{*(t)}} - \tilde{w}^T \tilde{x}_{ir}\big) \ge \frac{1}{nk}\sum_{i,r} c_{ir}^{t_s} - (\xi + \varepsilon)    (26)

and the most violated constraint C^{t_s} is computed as

c_{ir}^{t_s} = \begin{cases} 1, & \text{if } \tilde{w}^T \tilde{x}_{i p^{*(t)}} - \tilde{w}^T \tilde{x}_{ir} < 1 \\ 0, & \text{otherwise} \end{cases}    (27)

⁵See http://www.ee.ucla.edu/ee236b/lectures/sg.pdf, page 8.
⁶If there are multiple p* satisfying Eq. (23), we randomly select one.
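For intuition, the snippet below (a toy illustration with random weights, not drawn from the paper's implementation) computes the active cluster index p*_i of Eq. (23) for every point, which is also its cluster label, and the most-violated-constraint selection of Eq. (27):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 4, 3
X = rng.standard_normal((n, d))
W = rng.standard_normal((k, d))            # rows are the per-cluster weight vectors w_r

scores = X @ W.T                           # scores[i, r] = w_r^T x_i
p_star = scores.argmax(axis=1)             # Eq. (23): active cluster index p*_i (the cluster label)

# Eq. (27): c_ir = 1 if the margin of the winning cluster over cluster r is below 1, else 0
margins = scores[np.arange(n), p_star][:, None] - scores
c = (margins < 1).astype(int)
print(p_star, c.sum(), sep="\n")           # cluster labels and number of violated (i, r) pairs
```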
IV. EXPERIMENTS

In this section we present a set of experiments to validate the effectiveness of our MRMMC method.

A. Datasets

We use three categories of data sets in our experiments, selected to cover a wide range of properties. Specifically, these data sets include:

(1) UCI data. We perform experiments on four UCI data sets: ionosphere, digits, letter and satellite⁷. For the digits data, we follow the experimental setup of [20] and focus on those pairs (3 vs 8, 1 vs 7, 2 vs 7, and 8 vs 9) that are difficult to differentiate. For the letter and satellite data sets, we use their first two classes only [20].

(2) Text data. We perform experiments on four text data sets derived from 20-newsgroup⁸ and RCV1 [8]. For 20-newsgroup, we choose the topic rec, which contains autos, motorcycles, baseball and hockey, from the 20-news-18828 version. For RCV1, we use the data samples with the highest four topic codes (CCAT, ECAT, GCAT, and MCAT) in the "Topic Codes" hierarchy of the training set.

(3) Digits data. We use images of digits 1, 2, 3 and 4 from the USPS⁹ handwritten 16 × 16 digit image set, which contain 1005, 731, 658 and 652 samples per class, for a total of 3046 samples. We also perform experiments on the MNIST digits data set, where we give a more thorough comparison by considering all 45 pairs of digits 0–9.

Table I
CLUSTERING ACCURACY (%) COMPARISONS FOR TWO-CLASS PROBLEMS.

Data          MMC     GMMC    IterSVR   CPMMC   MRMMC
Ionosphere    78.75   76.50   77.70     72.36   79.03
Letter        -       -       92.80     94.47   95.56
Satellite     -       -       96.82     98.48   99.79
Text-1        -       -       96.82     95.00   98.34
Text-2        -       -       93.99     96.28   98.21
Digits        -       -       98.18     99.38   100.00
MNIST         -       -       92.41     95.71   97.43
USPS          -       -       -         94.12   95.32
20Newsgroup   -       -       -         70.63   73.45
RCV1          -       -       -         61.97   65.02

B. Experimental Setups and Comparisons

We have conducted comprehensive performance evaluations by testing our method and comparing it with five other representative data clustering methods on the same data corpora. The algorithms that we evaluated are listed below.

(1) Maximum Margin Clustering (MMC) [16]. The implementation is the same as in [16]. For multi-class MMC, we follow the experimental settings in [17]. The width of the Gaussian kernel is set by grid search over {0.1σ_0, 0.2σ_0, ..., σ_0}, with σ_0 being the range of distances between any two data points in the data set.

(2) Generalized Maximum Margin Clustering (GMMC) [14]. The implementation is the same as in [14].

(3) Iterative Support Vector Regression (IterSVR)¹⁰ [20]. The initialization is based on k-means with randomly selected initial data centers, and the width of the Gaussian kernel is set in the same way as for MMC.

(4) k-means, with randomly selected initial cluster centers; its results are averaged over 10 independent runs.

(5) Cutting Plane Maximum Margin Clustering (CPMMC)¹¹ [21][22]. The implementation is the same as in [21] and [22].

For our MRMMC algorithm, we set ε = 0.01 and α = 0.01 in our experiments, and w_0 is randomly initialized. The class imbalance parameter l is set by grid search over the grid [0, 20] with granularity 1.

⁷http://mlearn.ics.uci.edu/MLRepository.html
⁸http://people.csail.mit.edu/jrennie/20Newsgroups/
⁹Available from http://www.kernel-machines.org/data.html
¹⁰The implementation code is downloaded from http://www.cse.ust.hk/~twinsen/itMMC code.zip
¹¹The implementation code is downloaded from http://binzhao02.googlepages.com/Code MMC v1.rar


The parameters C_1 and C_2 are searched over the exponential grid 2^[−8:1:6]. The graph Laplacian is constructed in the standard way, as in [15], using Gaussian similarities. Note that the GMMC and IterSVR methods can only handle two-class problems.

C. Clustering Results

As in previous papers [14][16][21][22], we use clustering accuracy to evaluate the final clustering performance.
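Clustering accuracy matches predicted cluster indices to ground-truth classes before counting agreements. The paper does not spell out its implementation, so the sketch below uses the standard construction: the Hungarian method (scipy's linear_sum_assignment) finds the cluster-to-class matching that maximizes the number of correctly assigned samples.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-matching agreement between cluster indices and class labels."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # count[i, j] = number of samples of true class i assigned to cluster j
    count = np.array([[np.sum((y_true == ci) & (y_pred == cj)) for cj in clusters]
                      for ci in classes])
    row, col = linear_sum_assignment(-count)      # maximize total matches
    return count[row, col].sum() / len(y_true)

# toy usage: a permuted labelling with one mistake gives 5/6 accuracy
print(clustering_accuracy(np.array([0, 0, 1, 1, 2, 2]),
                          np.array([1, 1, 2, 2, 0, 1])))
```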
The clustering results of all the algorithms are summarized in Table I, where the results of the k-means algorithm are averaged over 10 independent runs with random initializations. A "-" in Table I means that either the algorithm is unable to handle multi-class cases (e.g., GMMC and IterSVR on the digits and text data sets), or the data set is too large for the algorithm to finish. From the table we can clearly see that our algorithm works better than the other methods.

Besides accuracy, we also test the speed of our algorithm with respect to the scale of the data sets¹². Fig. 1 shows the log-log plots of speed vs. data set size on the relatively large scale data sets. From the figure we can see that the computational time of our algorithm scales approximately linearly with the data set size, which is similar to CPMMC but much faster than traditional methods.

Figure 1. CPU time vs. data set size (log-log plots). (a) Digits data sets (MNIST, USPS); (b) text data sets (RCV1, 20 Newsgroups). Both panels plot CPU time (sec.) against the number of samples, together with an O(n) reference line.

V. CONCLUSIONS

In this paper we propose a novel manifold regularized maximum margin clustering (MRMMC) method, which (1) extends the classical MMC framework by incorporating manifold information, and (2) employs an improved solution method with a better theoretical guarantee. The experimental results show that our algorithm achieves better performance without any loss in speed.

ACKNOWLEDGEMENT

This work is partially supported by NSF grants DMS-0844513 and CCF-0830659.

¹²All experiments are implemented in MATLAB under MS Windows with a 2.2 GHz CPU and 4 GB RAM.

REFERENCES

[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
[2] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[3] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of AISTATS, 2005.
[4] P. M. Cheung and J. T. Kwok. A regularization framework for multiple-instance learning. In Proceedings of ICML, 2006.
[5] B. Heisele, P. Ho, and T. Poggio. Face recognition with support vector machines: Global versus component-based approach. Pages 688–694, 2001.
[6] Y. Hu, J. Wang, N. Yu, and X.-S. Hua. Maximum margin clustering with pairwise constraints. In Proceedings of ICDM, 2008.
[7] J. E. Kelley. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8:703–712, 1960.
[8] D. D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
[9] Y. Li and D. Huttenlocher. Learning for stereo vision using the structured support vector machine. In Proceedings of CVPR, 2008.
[10] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[11] B. Schölkopf, A. J. Smola, and K.-R. Müller. Kernel principal component analysis. In Advances in Kernel Methods: Support Vector Learning, pages 327–352, 1999.
[12] A. J. Smola, S. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In Proceedings of AISTATS, 2005.
[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[14] H. Valizadegan and R. Jin. Generalized maximum margin clustering and unsupervised kernel learning. In Proceedings of NIPS, pages 1417–1424, 2007.
[15] F. Wang and C. Zhang. Label propagation through linear neighborhoods. In Proceedings of ICML, 2006.
[16] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Proceedings of NIPS, 2004.
[17] L. Xu and D. Schuurmans. Unsupervised and semi-supervised multi-class support vector machines. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2005.
[18] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.
[19] H. Zhang, A. C. Berg, M. Maire, and J. Malik. SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In Proceedings of CVPR, pages 2126–2136, 2006.
[20] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In Proceedings of ICML, pages 1119–1126, 2007.
[21] B. Zhao, F. Wang, and C. Zhang. Efficient maximum margin clustering via cutting plane algorithm. In Proceedings of the 8th SDM, pages 751–762, 2008.
[22] B. Zhao, F. Wang, and C. Zhang. Efficient multiclass maximum margin clustering. In Proceedings of ICML, pages 751–762, 2008.
[23] W. Zhu, S. Wang, R.-S. Lin, and S. Levinson. Tracking of object with SVM regression. In Proceedings of CVPR, pages 240–245, 2001.


