Vol.13, No.8 ©2002 Journal of Software 软件学报 1000-9825/2002/13(08)1382-13
Analyzing Popular Clustering Algorithms from Different
Viewpoints
QIAN Wei-ning, ZHOU Ao-ying
(Department of Computer Science, Fudan University, Shanghai 200433, China)
(Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, China)
E-mail: {wnqian,ayzhou}@fudan.edu.cn
http://www.cs.fudan.edu.cn/ch/third_web/WebDB/WebDB_English.htm
Received September 3, 2001; accepted February 25, 2002
Abstract: Clustering is widely studied in the data mining community. It is used to partition a data set into clusters so that intra-cluster data are similar and inter-cluster data are dissimilar. Different clustering methods use different similarity definitions and techniques. Several popular clustering algorithms are analyzed from three different viewpoints: (1) clustering criteria, (2) cluster representation, and (3) algorithm framework. Furthermore, some newly built algorithms, which mix or generalize other algorithms, are introduced. Since the analysis proceeds from several viewpoints, it can cover and distinguish most of the existing algorithms. It forms the basis for research on self-tuning algorithms and clustering benchmarks.
Key words: data mining; clustering; algorithm
Clustering is an important data-mining technique used to find data segmentation and pattern information. It is widely used in applications such as financial data classification, spatial data processing, satellite photo analysis, and medical image auto-detection. The problem of clustering is to partition a data set into segments (called clusters) so that intra-cluster data are similar and inter-cluster data are dissimilar. It can be formalized as follows:
Definition 1. Given a data set V = {v1, v2, …, vn}, in which the vi (i = 1, 2, …, n) are called data points, the process of partitioning V into {C1, C2, …, Ck}, with Ci ⊆ V (i = 1, 2, …, k) and ∪i=1..k Ci = V, based on the similarity between data points, is called clustering; the Ci (i = 1, 2, …, k) are called clusters.
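As a minimal sanity check of Definition 1 (the function and variable names here are ours, purely illustrative), a candidate clustering can be verified mechanically: every cluster must be a subset of V, and the clusters must jointly cover V.

```python
# Illustrative check of Definition 1: each cluster Ci must satisfy
# Ci ⊆ V, and the union of all clusters must equal V.
def is_valid_clustering(V, clusters):
    V = set(V)
    return all(C <= V for C in clusters) and set().union(*clusters) == V

V = {1, 2, 3, 4, 5}
print(is_valid_clustering(V, [{1, 2}, {3, 4, 5}]))  # True: covers V
print(is_valid_clustering(V, [{1, 2}, {3, 4}]))     # False: point 5 is uncovered
```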
The definition does not define the similarity between data points. In fact, different methods use different
criteria.
Clustering is also known as an unsupervised learning process, since there is no a priori knowledge about the data set. Therefore, clustering analysis usually acts as preprocessing for other KDD operations, and the quality of the clustering result is important for the whole KDD process. As with other data mining operations, high performance and scalability are two further requirements besides accuracy. Thus, a good clustering algorithm should satisfy the following
Supported by the National Grand Fundamental Research 973 Program of China under Grant No.G1998030414; the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038
QIAN Wei-ning was born in 1976. He is a Ph.D. candidate at the Department of Computer Science, Fudan University. His research
interests are clustering, data mining and Web data management. ZHOU Ao-ying was born in 1965. He is a professor and doctoral
supervisor at the Department of Computer Science, Fudan University. His current research interests include Web data management, data
mining, and object management over peer-to-peer networks.
QIAN Wei-ning, et al.: Analyzing popular clustering algorithms from different viewpoints (p.1383)
requirements: independence from in-advance knowledge; only easy-to-set parameters; accuracy; speed; and good scalability.
Much research work has been done on building clustering algorithms, each using novel techniques to improve the ability to handle data sets with certain characteristics. However, different algorithms use different criteria, as mentioned above. Since there is no benchmark for clustering methods, it is difficult to compare these algorithms using a common measurement. Nevertheless, a detailed comparison is necessary, because: (1) the advantages and disadvantages should be analyzed so that improvements can be developed on existing algorithms; (2) the user should be able to choose the right algorithm for a certain data set so that optimal results and performance can be obtained; and (3) a detailed comparison is the basis for building a clustering benchmark.
In this paper, we analyze several existing popular algorithms from different aspects. This differs from other survey work[1~3] in that we compare these algorithms universally from different viewpoints, while others either try to generalize methods into a certain framework, as in Refs.[1,2], which can only cover limited algorithms, or just introduce clustering algorithms one by one as a tutorial[3], so that no comparison among algorithms is given. Since different algorithms use different criteria and techniques, those surveys can only cover some of the algorithms. Furthermore, some algorithms cannot be distinguished, since they use the same technique and therefore fall into the same category in a given framework.
The rest of this paper is organized as follows: Sections 1 to 3 analyze the clustering algorithms from three different viewpoints, namely clustering criteria, cluster representation, and algorithm framework. Section 4 introduces some methods that are mixtures or generalizations of other algorithms. Section 5 introduces research focused on auto-detection of clusters. Finally, Section 6 gives concluding remarks.
It should be noted that, from each viewpoint, although we try to classify as many algorithms as we can, some are still missing, and some algorithms may fall into the same category. However, when we observe these algorithms from all these viewpoints, different algorithms can be distinguished. This is the motivation of our work.
1 Criteria
The basis of clustering analysis is the definition of similarity. Usually, this definition has two parts: (1) similarity between data points, and (2) similarity between sets of data points. Not all clustering methods need both; some algorithms use only one.
The clustering criteria can be classified into three categories: distance-based, density-based, and linkage-based. Distance-based and density-based clustering are usually applied to data in Euclidean space, while linkage-based clustering can be applied to data in an arbitrary metric space.
1.1 Distance-Based clustering
The basic idea of distance-based clustering is that a cluster consists of data points close to each other. The distance between two data points is easy to define in Euclidean space; widely used definitions include the Euclidean distance and the Manhattan distance.
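The two point-to-point distances named above can be sketched directly (a minimal illustration; the function names are ours), for points given as coordinate tuples:

```python
import math

# Euclidean distance: square root of the sum of squared coordinate differences.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Manhattan distance: sum of absolute coordinate differences.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```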
However, there are several choices for similarity definition between two sets of data points, as follows:
\[ Similarity_{\mathrm{rep}}(C_i, C_j) = distance(rep_i, rep_j) \tag{1} \]
or
\[ Similarity_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{v_i \in C_i,\, v_j \in C_j} distance(v_i, v_j) \tag{2} \]
or
\[ Similarity_{\max}(C_i, C_j) = \max\{distance(v_i, v_j) \mid v_i \in C_i,\, v_j \in C_j\} \tag{3} \]
or
\[ Similarity_{\min}(C_i, C_j) = \min\{distance(v_i, v_j) \mid v_i \in C_i,\, v_j \in C_j\} \tag{4} \]
where $n_i = |C_i|$ and $n_j = |C_j|$.
In (1), rep_i and rep_j are representatives of C_i and C_j, respectively. The representative of a data set is usually the mean, as in k-means[4]; single-representative methods usually employ Definition (1). It is obvious that the complexities of (2), (3), and (4) are all O(|C_i|·|C_j|), which is inefficient for large data sets. Although they are more global definitions, they are usually not directly applied to the similarity computation for sub-clusters or clusters. The only exception is BIRCH[5], in which the CF-vector and CF-tree are employed to accelerate the computation. Some trade-off approaches are taken, as will be discussed in Section 2.1, where a detailed analysis of single-representative methods is also given.
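As a sketch of Definitions (1)-(4), assuming Euclidean distance and the mean as the single representative (the function names are ours):

```python
import itertools
import math

def dist(p, q):  # Euclidean distance, as in Section 1.1
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(C):  # the mean point of a cluster, used as its representative
    n = len(C)
    return tuple(sum(x) / n for x in zip(*C))

# (1) single-representative similarity
def sim_rep(Ci, Cj):
    return dist(mean(Ci), mean(Cj))

# (2) average pairwise distance: O(|Ci|*|Cj|)
def sim_avg(Ci, Cj):
    return sum(dist(p, q) for p, q in itertools.product(Ci, Cj)) / (len(Ci) * len(Cj))

# (3) maximum pairwise distance: O(|Ci|*|Cj|)
def sim_max(Ci, Cj):
    return max(dist(p, q) for p, q in itertools.product(Ci, Cj))

# (4) minimum pairwise distance: O(|Ci|*|Cj|)
def sim_min(Ci, Cj):
    return min(dist(p, q) for p, q in itertools.product(Ci, Cj))

Ci, Cj = [(0, 0), (0, 2)], [(4, 0), (4, 2)]
print(sim_rep(Ci, Cj))  # 4.0: the means are (0,1) and (4,1)
print(sim_min(Ci, Cj))  # 4.0
```

The explicit pairwise loops in sim_avg, sim_max, and sim_min make the O(|C_i|·|C_j|) cost mentioned above concrete.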
The advantage of distance-based clustering is that distance is easy to compute and understand. Distance-based clustering algorithms usually need a parameter K, the number of final clusters the user wants, or the minimum distance to distinguish two clusters. However, their disadvantage is also distinct: they are noise-sensitive. Although some of them introduce techniques to deal with this, those techniques bring other serious problems. CURE[6] uses a representative-shrinking technique to reduce the impact of noise. However, this invites the problem that it fails to identify clusters of hollow shapes, as the result of our experiment in Fig.1 shows. This shortcoming counteracts the advantage of multi-representatives, namely that the algorithm can identify arbitrarily-shaped clusters. BIRCH, the first clustering algorithm to consider noise, introduces a new parameter T, which is substantially a parameter related to density. Furthermore, it is hard for the user to understand this parameter unless the page storage capacity of the CF-tree is known (Page_size/entry_size/T approximates the density in a page). In addition, it may cause the loss of small clusters and long-shaped clusters. For lack of space, the detailed discussion is omitted here.
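The CF-vector mentioned above summarizes a sub-cluster as CF = (N, LS, SS), where N is the point count, LS the per-dimension linear sum, and SS the sum of squared norms; CF-vectors are additive, which is what lets BIRCH merge sub-clusters without rescanning their points. A simplified sketch (the field layout and function names are ours, not BIRCH's):

```python
# Build the CF-vector (N, LS, SS) of a set of points.
def cf(points):
    N = len(points)
    LS = tuple(sum(x) for x in zip(*points))          # per-dimension linear sum
    SS = sum(sum(a * a for a in p) for p in points)   # sum of squared norms
    return (N, LS, SS)

# CF-vectors are additive: merging two sub-clusters is componentwise addition.
def cf_merge(cf1, cf2):
    return (cf1[0] + cf2[0],
            tuple(a + b for a, b in zip(cf1[1], cf2[1])),
            cf1[2] + cf2[2])

# The centroid is recoverable from the CF-vector alone: LS / N.
def cf_centroid(cf_vec):
    N, LS, _ = cf_vec
    return tuple(x / N for x in LS)

a = cf([(0, 0), (2, 0)])
b = cf([(4, 4)])
print(cf_centroid(cf_merge(a, b)))  # centroid of all three points
```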
1.2 Density-Based clustering
In contrast to distance-based clustering methods, density-based clustering holds that clusters are dense areas. Therefore, the similarity definition for data points is based on whether they belong to connected dense regions: data points belonging to the same connected dense region belong to the same cluster. Based on different computations of density, density-based clustering can be further classified into nearest-neighbor (called NN in the rest of this paper) methods and cell-based methods. The difference between them is that the former defines density based on the data set, while the latter defines it based on the data space. No matter which kind a density-based clustering algorithm belongs to, it always needs a minimum-density threshold parameter, which is the key to defining dense regions.
1.2.1 NN methods
NN methods treat as cluster points only those points that have more than k neighbors within a hypersphere of radius ε. Since the neighbors of each point must be counted, index structures that support region queries, such as the R*-tree or X-tree, are usually employed. Because of the curse of dimensionality[7], these methods
do not scale well with dimensionality. Furthermore, NN methods will cause frequent I/O when the data sets are very large. However, for most multi-dimensional data sets, these methods are efficient. In short, the shortcoming of this kind of method is the shortcoming of the index structures they are based on.

Fig.1 Hollow-shaped cluster identified by CURE
Traditional NN methods, such as DBSCAN and its descendants[8~10], need a density threshold and ε as parameters. More recently, OPTICS[11], whose basic idea is the same as DBSCAN's, focuses on automatic identification of cluster structures. Since the novel techniques in OPTICS do not belong to the topic of this subsection, we discuss them in Section 5.
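A naive sketch of the ε-neighborhood test described in this subsection (no index structure; names are ours; real implementations answer the region query with an R*-tree or X-tree):

```python
import math

# Keep only the points that have at least k other points within radius eps.
def dense_points(points, eps, k):
    return [p for p in points
            if sum(1 for q in points if q != p and math.dist(p, q) <= eps) >= k]

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(dense_points(pts, eps=1.5, k=2))  # the isolated point (10, 10) is dropped
```

The linear scan inside the comprehension is exactly the region query that the index structure would accelerate.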
1.2.2 Cell-Based methods
Cell-based methods count density information based on units (cells) of the data space. STING[12], WaveCluster[13], DBCLASD[14], CLIQUE[15], and OptiGrid[16] all fall into this category. Cell-based methods have the shortcoming that cells are only an approximation of dense areas. Some methods introduce techniques to mitigate this problem, as will be introduced in Section 2.3.
Density-based clustering methods all meet problems when data sets contain clusters or sub-clusters whose granularity is smaller than the granularity of the units used for computing density. A well-known example is dumbbell-shaped clusters, as shown in our experimental result in Fig.2. On the other hand, for density-based clustering methods it is easy to remove noise if the parameters are properly set; that is to say, they are robust to noise.
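A minimal sketch of cell-based density counting (the grid side length and names are ours): points are hashed into grid cells, and a cell is reported as dense when it holds at least min_pts points. It also illustrates why cells only approximate dense areas: the density is attributed to the whole cell, not to the actual shape of the point set inside it.

```python
from collections import Counter

# Divide the space into grid cells of side `width` and report dense cells.
def dense_cells(points, width, min_pts):
    counts = Counter(tuple(int(x // width) for x in p) for p in points)
    return {cell for cell, c in counts.items() if c >= min_pts}

pts = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.5), (5.0, 5.0)]
print(dense_cells(pts, width=1.0, min_pts=3))  # {(0, 0)}
```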
1.3 Linkage-Based clustering
Unlike distance-based or density-based clustering, linkage-based clustering can be applied in arbitrary metric spaces. Furthermore, since in high-dimensional spaces distance and density information is not sufficient for clustering, linkage-based clustering is often employed there. Algorithms of this kind include ROCK[17], CHAMELEON[18], ARHP[19,20], STIRR[21], CACTUS[22], etc.
Linkage-based methods are based on a graph or hyper-graph model. They usually map the data set into a graph/hyper-graph and then cluster the data points based on the edge/hyper-edge information, so that highly connected data points are assigned to the same cluster. The difference between the graph model and the hyper-graph model is that the former reflects the similarity of pairs of nodes, while the latter usually reflects co-occurrence information. ROCK and CHAMELEON use the graph model, while ARHP, PDDP, STIRR, and CACTUS use the hyper-graph model. Although the developers of CACTUS did not state that it is a hyper-graph-model-based algorithm, it belongs to that kind.
The quality of a linkage-based clustering result depends on the definition of the link or hyper-edge. Since it is impossible to handle a complete graph, the graph/hyper-graph model always eliminates the edges/hyper-edges whose weight is low, so that the graph/hyper-graph is sparse. However, gaining efficiency this way may reduce accuracy.
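The sparsification step described above can be sketched as follows (the edge format and threshold name are ours): edges whose weight falls below a threshold are dropped before the linkage-based algorithm runs, trading accuracy for a sparse graph.

```python
# Keep only the edges whose weight (similarity) reaches the threshold.
# Edges are (node, node, weight) triples.
def sparsify(edges, min_weight):
    return [(u, v, w) for (u, v, w) in edges if w >= min_weight]

edges = [("a", "b", 0.9), ("a", "c", 0.1), ("b", "c", 0.8)]
print(sparsify(edges, 0.5))  # the weak a-c edge is dropped
```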
The algorithms falling into this category use different frameworks. ROCK and CHAMELEON are hierarchical clustering methods, ARHP is a divisive method, and STIRR uses a dynamical-system model. Furthermore, since the co-occurrence problem is similar to the association-rule mining problem, ARHP and CACTUS both borrow the Apriori algorithm[23] to find the clusters. Another algorithm employing an Apriori-like algorithm is CLIQUE; there, however, the monotonicity lemma is used to find high-dimensional clusters based on clusters found in subspaces. CLIQUE is not a linkage-based clustering method, which distinguishes it from the other algorithms discussed in this subsection. A detailed discussion of algorithm frameworks is given in Section 3. And since CHAMELEON uses both link and distance information, it is discussed separately in Section 4.1.

Fig.2 Dumbbell-shaped clusters identified by a density-based algorithm (DBSCAN)
2 Cluster Representation
The purpose of clustering is to identify the data clusters, which are summaries of similar data. Each algorithm must represent clusters and sub-clusters in some form. Although labeling each data point with a cluster identity is a straightforward idea, most methods do not employ this approach. This may be because: (1) the summary, which should be easily understandable, is more than a set of (data-point, cluster-id) pairs; (2) it is time- and space-expensive to label all the data points during clustering; and (3) some methods employ accurate, compact cluster representatives, which make the time-consuming labeling process unnecessary. We classify cluster representation techniques into four kinds, discussed in the following:
2.1 Representative points
Most distance-based clustering methods use some points to represent clusters. These points are called representative points. The representatives may be data points, or other points that do not exist in the database, such as the means of sets of data points. The representation techniques falling into this category can be further classified into three classes:
2.1.1 Single representative
The simplest approach is to use one point as the representative of each cluster. Each data point is assigned to the cluster whose representative is the closest. The representative point may be the mean of the cluster, as in k-means[4] methods, or the data point in the database closest to the center, as in k-medoids methods. Other algorithms of this kind include BIRCH[5], CLARA[24], and CLARANS[25]. The different effects of k-means and k-medoids methods on clustering results are introduced in detail in Ref.[25]. Since this is not related to the motivation of this paper, we do not survey it here.
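The assignment rule shared by single-representative methods can be sketched as follows (a simplified illustration assuming Euclidean distance; the names are ours):

```python
import math

# Assign each point to the index of its closest representative.
def assign(points, reps):
    return {p: min(range(len(reps)), key=lambda i: math.dist(p, reps[i]))
            for p in points}

reps = [(0.0, 0.0), (10.0, 10.0)]
print(assign([(1, 1), (9, 8)], reps))  # {(1, 1): 0, (9, 8): 1}
```

In k-means the representatives are then recomputed as the means of the assigned points and the assignment is repeated until it stabilizes.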
The shortcoming of the single-representative approach is obvious: (1) only spherical clusters can be identified; and (2) a large cluster with a small cluster beside it will be split, with some data points of the large cluster assigned to the small one. These two conditions are shown in Fig.3 (the right part of this figure is borrowed from Ref.[6], Fig.1(b)). Therefore, this approach fails when processing data sets with arbitrarily-shaped clusters or clusters of greatly differing sizes.
2.1.2 All data points
Using all the data points in a cluster to represent it is another straightforward approach. However, it is expensive since: (1) the data sets are often so large that the label information cannot fit in memory, which leads to frequent disk access; and (2) computing intra- and inter-cluster information requires access to all data points. Furthermore, the label information is hard to understand. Therefore, no popular algorithm takes this approach.
2.1.3 Multi-Representatives
The multi-representatives approach was introduced in CURE as a trade-off between the single-point and all-points methods. The first representative is the data point farthest from the mean of the cluster. Then, each time, the data point whose distance to the nearest existing representative is largest is chosen, until the number of representatives is large enough. In Ref.[6], the experiments show that for most data sets, 10 representatives lead to satisfactory results. In the long version of Ref.[26], the authors of CURE also mention that for complex data sets, more representatives are needed.
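CURE's representative selection, as described above, can be sketched as a greedy farthest-point procedure (a simplified illustration; the shrinking of representatives toward the mean is omitted, and the names are ours):

```python
import math

# Pick c representatives: first the point farthest from the cluster mean,
# then repeatedly the point farthest from its nearest chosen representative.
def choose_reps(points, c):
    mean = tuple(sum(x) / len(points) for x in zip(*points))
    reps = [max(points, key=lambda p: math.dist(p, mean))]
    while len(reps) < min(c, len(points)):
        reps.append(max((p for p in points if p not in reps),
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
print(choose_reps(pts, 2))  # the two endpoints of the cluster
```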
However, before clustering, the complexity of the clusters is unknown, and the relationship between the complexity of clusters and the number of representatives is not clear. This forces the user to choose a large number of representatives. Since the time complexity of CURE is O(n² log n), where n is the initial number of data points, a large number of representatives in the initial sub-clusters affects efficiency (sub-clusters exist because a simple partitioning technique is used in CURE[6]; the time complexity with respect to the number c of representatives is O(c log c) if the number of initial sub-clusters is fixed), as shown in our experimental result, Fig.4. Furthermore, together with the technique used to handle outliers (the shrinking of representatives), CURE fails to identify clusters of hollow shapes, as already discussed in Section 1.1 and shown in Fig.1. Nevertheless, it outperforms the single-point and all-points approaches when both effectiveness and efficiency are considered.
2.2 Dense area
Some density-based clustering algorithms use dense areas to denote clusters and sub-clusters. DBSCAN[8], its descendants[9,10], and OPTICS[11] belong to this category. The dense-area representation is similar to the all-data-points method except that only core points are used. Core points are data points that have more neighbors within a certain region than a threshold. Therefore, only core points are used to expand a sub-cluster, and expansion stops when no further expansion can be applied to core points.
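The core-point expansion just described can be sketched as follows (a simplified, in-memory illustration; the names and parameters are ours):

```python
import math

# Grow a cluster from a starting point: neighborhoods of core points are
# absorbed and expanded; border (non-core) points are absorbed but not expanded.
def expand(points, start, eps, min_pts):
    def nbrs(p):
        return [q for q in points if math.dist(p, q) <= eps]
    cluster, frontier = set(), [start]
    while frontier:
        p = frontier.pop()
        if p in cluster:
            continue
        cluster.add(p)
        n = nbrs(p)
        if len(n) >= min_pts:  # p is a core point: expand through it
            frontier.extend(q for q in n if q not in cluster)
    return cluster

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(sorted(expand(pts, (0, 0), eps=1.5, min_pts=3)))  # (5, 5) is excluded
```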
Dense areas can describe arbitrarily-shaped clusters, apart from the dumbbell-shaped case discussed above. However, the cost of computing core points is exp