Vol.13, No.8 ©2002 Journal of Software 软 件 学 报 1000-9825/2002/13(08)1382-13 Analyzing Popular Clustering Algorithms from Different Viewpoints á � QIAN Wei-ning, ZHOU Ao-ying (Department of Computer Science, Fudan University, Shanghai 200433, China) (Laboratory for Intelligent Information Processing, Fudan University, Shanghai 200433, China) E-mail: {wnqian,ayzhou}@fudan.edu.cn http://www.cs.fudan.edu.cn/ch/third_web/WebDB/WebDB_English.htm Received September 3, 2001; accepted February 25, 2002 Abstract: Clustering is widely studied in data mining community. It is used to partition data set into clusters so that intra-cluster data are similar and inter-cluster data are dissimilar. Different clustering methods use different similarity definition and techniques. Several popular clustering algorithms are analyzed from three different viewpoints: (1) clustering criteria, (2) cluster representation, and (3) algorithm framework. Furthermore, some new built algorithms, which mix or generalize some other algorithms, are introduced. Since the analysis is from several viewpoints, it can cover and distinguish most of the existing algorithms. It is the basis of the research of self-tuning algorithm and clustering benchmark. Key words: data mining; clustering; algorithm Clustering is an important data-mining technique used to find data segmentation and pattern information. Clustering technique is widely used in applications of financial data classification, spatial data processing, satellite photo analysis, and medical figure auto-detection etc.. The problem of clustering is to partition the data set into segments (called clusters) so that intra-cluster data are similar and inter-cluster data are dissimilar. It can be formalized as follows: Definition 1. Given a data set V{v1,v2, … ,vn}, in which vi’s (i=1,2, … ,n) are called data points. The process of partitioning V into {C1,C2, … ,Ck}, CiÍV (i =1,2, … ,k), and ∪i=1k Ci = V, based on the similarity between data points are called clustering, Ci’s (i =1,2,…,k) are called clusters. The definition does not define the similarity between data points. In fact, different methods use different criteria. Clustering is also known as unsupervised learning process, since there is no priori knowledge about the data set. Therefore, clustering analysis usually acts as the preprocessing of other KDD operations. The quality of the clustering result is important for the whole KDD process. As other data mining operations, high performance and scalability are other two requests beside the accuracy. Thus, a good clustering algorithm should satisfy the following á Supported by the National Grand Fundamental Research 973 Program of China under Grant No.G1998030414 (国家重点基础研究 发展规划 973项目); the National Research Foundation for the Doctoral Program of Higher Education of China under Grant No.99038 (国 家教育部博士点基金) QIAN Wei-ning was born in 1976. He is a Ph.D. candidate at the Department of Computer Science, Fudan University. His research interests are clustering, data mining and Web data management. ZHOU Ao-ying was born in 1965. He is a professor and doctoral supervisor at the Department of Computer Science, Fudan University. His current research interests include Web data management, data mining, and object management over peer-to-peer networks. 钱卫宁 等:从多角度现有聚类算法 1383 requests: Independent of in-advance knowledge; Only need easy-to-set parameters ; Accurate; Fast; Having good scalability. Much research work has been done on building clustering algorithms. 
Clustering is also known as an unsupervised learning process, since there is no a priori knowledge about the data set. Therefore, clustering analysis usually acts as a preprocessing step for other KDD operations, and the quality of the clustering result is important for the whole KDD process. As with other data mining operations, high performance and scalability are two further requirements besides accuracy. Thus, a good clustering algorithm should satisfy the following requirements: (1) independence of in-advance knowledge; (2) only easy-to-set parameters; (3) accuracy; (4) speed; and (5) good scalability.

Much research work has been done on building clustering algorithms. Each uses novel techniques to improve its ability to handle data sets with certain characteristics. However, different algorithms use different criteria, as mentioned above. Since there is no benchmark for clustering methods, it is difficult to compare these algorithms with a common measurement. Nevertheless, a detailed comparison is necessary, because: (1) the advantages and disadvantages should be analyzed, so that improvements can be developed on existing algorithms; (2) the user should be able to choose the right algorithm for a certain data set, so that the optimal result and performance can be obtained; and (3) a detailed comparison is the basis for building a clustering benchmark.

In this paper, we analyze several existing popular algorithms from different aspects. This differs from other survey work [1~3] in that we compare these algorithms universally from different viewpoints, while others either try to generalize some methods to a certain framework, as in Refs.[1,2], which can only cover limited algorithms, or simply introduce clustering algorithms one by one as a tutorial [3], so that no comparison among the algorithms is given. Since different algorithms use different criteria and techniques, those surveys can only cover some of the algorithms. Furthermore, some algorithms cannot be distinguished, since they use the same technique and therefore fall into the same category under a certain framework.

The rest of this paper is organized as follows: Sections 1 to 3 analyze the clustering algorithms from three different viewpoints, namely clustering criteria, cluster representation, and algorithm framework. Section 4 introduces some methods that are mixtures or generalizations of other algorithms. Section 5 introduces research focusing on automatic detection of cluster structures. Finally, Section 6 gives concluding remarks. It should be noted that, from each single viewpoint, although we try to classify as many algorithms as we can, some are still missing, and some algorithms may fall into the same category. However, when these algorithms are observed from all of the viewpoints together, different algorithms can be distinguished. This is the motivation of our work.

1 Criteria

The basis of clustering analysis is the definition of similarity. Usually, the definition of similarity contains two parts: (1) the similarity between data points, and (2) the similarity between sets of data points. Not all clustering methods need both of them; some algorithms use only one. The clustering criteria can be classified into three categories: distance-based, density-based, and linkage-based. Distance-based and density-based clustering is usually applied to data in Euclidean space, while linkage-based clustering can be applied to data in an arbitrary metric space.

1.1 Distance-Based clustering

The basic idea of distance-based clustering is that a cluster consists of data points close to each other. The distance between two data points is easy to define in Euclidean space; widely used definitions include the Euclidean distance and the Manhattan distance. However, there are several choices for the similarity definition between two sets of data points:

    Similarity_rep(C_i, C_j) = distance(rep_i, rep_j)                                        (1)
    Similarity_avg(C_i, C_j) = (1/(n_i·n_j)) · Σ_{v_i∈C_i, v_j∈C_j} distance(v_i, v_j)        (2)
    Similarity_max(C_i, C_j) = max{ distance(v_i, v_j) | v_i∈C_i, v_j∈C_j }                   (3)
    Similarity_min(C_i, C_j) = min{ distance(v_i, v_j) | v_i∈C_i, v_j∈C_j }                   (4)

In (1), rep_i and rep_j are the representatives of C_i and C_j, respectively; in (2), n_i and n_j are the sizes of C_i and C_j. The representative of a data set is usually its mean, as in k-means [4], and single-representative methods usually employ definition (1).
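As a concrete illustration of definitions (1)-(4), the following sketch (helper functions of my own, not code from the paper) computes the four inter-cluster similarities for sets of points in Euclidean space:

```python
# Illustrative sketch of similarity definitions (1)-(4); points are coordinate tuples.
from itertools import product
import math

def distance(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sim_rep(rep_i, rep_j):
    # definition (1): distance between the two cluster representatives
    return distance(rep_i, rep_j)

def sim_avg(Ci, Cj):
    # definition (2): average pairwise distance, O(|Ci|*|Cj|) pairs
    return sum(distance(u, v) for u, v in product(Ci, Cj)) / (len(Ci) * len(Cj))

def sim_max(Ci, Cj):
    # definition (3): maximum pairwise distance
    return max(distance(u, v) for u, v in product(Ci, Cj))

def sim_min(Ci, Cj):
    # definition (4): minimum pairwise distance
    return min(distance(u, v) for u, v in product(Ci, Cj))
```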
It is obvious that the complexity of (2), (3), and (4) is O(|C_i|·|C_j|), which is inefficient for large data sets. Although they are more global definitions, they are usually not directly applied as similarity definitions for sub-clusters or clusters. The only exception is BIRCH [5], in which the CF-vector and CF-tree are employed to accelerate the computation. Some trade-off approaches are taken instead, as will be discussed in Section 2.1, where a detailed analysis of single-representative methods is also given.

The advantage of distance-based clustering is that distance is easy to compute and to understand. Distance-based clustering algorithms usually need a parameter K, the number of final clusters the user wants, or the minimum distance used to distinguish two clusters. However, their disadvantage is also clear: they are sensitive to noise. Although some of them introduce techniques to deal with this, those techniques cause other serious problems. CURE [6] uses a representative-shrinking technique to reduce the impact of noise; however, this introduces the problem that it fails to identify clusters with hollow shapes, as the result of our experiment in Fig.1 shows. This shortcoming counteracts the advantage of multi-representatives, namely that the algorithm can identify arbitrarily shaped clusters.

[Fig.1 Hollow-shaped cluster identified by CURE]

BIRCH, the first clustering algorithm to consider noise, introduces a new parameter T, which is essentially a parameter related to density. Furthermore, it is hard for the user to understand this parameter unless the page storage capacity of the CF-tree is known (Page_size/entry_size/T is an approximation of the density in that page). In addition, it may cause the loss of small clusters and elongated clusters. For lack of space, the detailed discussion is omitted here.

1.2 Density-Based clustering

In contrast to distance-based clustering methods, density-based clustering takes the view that clusters are dense areas. Therefore, the similarity definition for data points is based on whether they belong to connected dense regions: data points belonging to the same connected dense region belong to the same cluster. Based on how density is computed, density-based clustering can be further classified into nearest-neighbor (called NN in the rest of this paper) methods and cell-based methods. The difference between them is that the former define density based on the data set, while the latter define it based on the data space. No matter which kind a density-based clustering algorithm belongs to, it always needs a minimum-density threshold parameter, which is the key to defining dense regions.

1.2.1 NN methods

NN methods treat only those points that have more than k neighbors within a hyper-sphere of radius ε as data points in clusters. Since the neighbors of each point have to be counted, index structures that support region queries, such as the R*-tree or X-tree, are usually employed. Because of the curse of dimensionality [7], these methods do not scale well with dimensionality. Furthermore, NN methods result in frequent I/O when the data sets are very large. However, for most multi-dimensional data sets, these methods are efficient. In short, the shortcoming of this kind of method is the shortcoming of the index structures they are based on.
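The neighbor-counting test at the core of NN methods can be sketched as follows (an illustrative fragment assuming Euclidean distance; a real implementation would answer the region query with an R*-tree or X-tree rather than this O(n) linear scan):

```python
# Sketch of the NN density criterion: a point lies in a dense region
# if it has more than k neighbors within a hyper-sphere of radius eps.
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def eps_neighbors(data, p, eps):
    """All points of the data set (other than p) lying within radius eps of p."""
    return [q for q in data if q != p and euclidean(p, q) <= eps]

def is_dense(data, p, eps, k):
    """True if p has more than k neighbors within radius eps."""
    return len(eps_neighbors(data, p, eps)) > k
```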
Traditional NN methods, such as DBSCAN and its descendants [8~10], need a density threshold and ε as parameters. Recently, OPTICS [11], whose basic idea is the same as DBSCAN's, has focused on the automatic identification of cluster structures. Since the novel techniques in OPTICS do not belong to the topic of this subsection, we discuss them in Section 5.

1.2.2 Cell-Based methods

Cell-based methods compute density information based on units (cells). STING [12], WaveCluster [13], DBCLASD [14], CLIQUE [15], and OptiGrid [16] all fall into this category. Cell-based methods have the shortcoming that cells are only approximations of dense areas. Some methods introduce techniques to solve this problem, as will be introduced in Section 2.3.

All density-based clustering methods run into problems when data sets contain clusters or sub-clusters whose granularity is smaller than the granularity of the units used for computing density. A well-known example is the dumbbell-shaped clusters shown in our experimental result in Fig.2. However, density-based clustering methods can easily remove noise if the parameters are properly set; that is to say, they are robust to noise.

[Fig.2 Dumbbell-shaped clusters identified by a density-based algorithm (DBSCAN)]

1.3 Linkage-Based clustering

Unlike distance-based or density-based clustering, linkage-based clustering can be applied to arbitrary metric spaces. Furthermore, since in high-dimensional space the distance and density information are not sufficient for clustering, linkage-based clustering is often employed there. Algorithms of this kind include ROCK [17], CHAMELEON [18], ARHP [19,20], STIRR [21], CACTUS [22], etc.

Linkage-based methods are based on a graph or hyper-graph model. They usually map the data set into a graph/hyper-graph and then cluster the data points based on the edge/hyper-edge information, so that highly connected data points are assigned to the same cluster. The difference between the graph model and the hyper-graph model is that the former reflects the similarity of pairs of nodes, while the latter usually reflects co-occurrence information. ROCK and CHAMELEON use the graph model, while ARHP, PDDP, STIRR, and CACTUS use the hyper-graph model. Although the developers of CACTUS did not state that it is a hyper-graph-model-based algorithm, it belongs to that kind.

The quality of a linkage-based clustering result depends on the definition of the link or hyper-edge. Since it is impossible to handle a complete graph, the graph/hyper-graph model always eliminates the edges/hyper-edges whose weight is low, so that the graph/hyper-graph is sparse (a toy sketch of this idea is given at the end of this subsection). However, gaining efficiency in this way may reduce accuracy.

The algorithms falling into this category use different frameworks: ROCK and CHAMELEON are hierarchical clustering methods, ARHP is a divisive method, and STIRR uses a dynamical-system model. Furthermore, since the co-occurrence problem is similar to the association rule mining problem, ARHP and CACTUS both borrow the Apriori algorithm [23] to find the clusters. Another algorithm employing an Apriori-like algorithm is CLIQUE; there, however, the monotonicity lemma is used to find high-dimensional clusters based on clusters found in subspaces, and CLIQUE is not a linkage-based clustering method, which distinguishes it from the other algorithms discussed in this subsection. The detailed discussion of algorithm frameworks will be given in Section 3. And since CHAMELEON uses both link and distance information, it will be discussed separately in Section 4.1.
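The sparse-graph idea mentioned above can be illustrated with the following toy sketch (not any particular algorithm from this subsection): edges below a weight threshold are dropped so the graph stays sparse, and the connected components of the remaining graph are taken as clusters.

```python
# Toy sketch of linkage-based clustering on a weighted graph (illustration only).
from collections import defaultdict

def linkage_clusters(nodes, weighted_edges, min_weight):
    """weighted_edges: iterable of (u, v, w); returns lists of connected nodes."""
    adj = defaultdict(set)
    for u, v, w in weighted_edges:
        if w >= min_weight:          # drop low-weight edges to keep the graph sparse
            adj[u].add(v)
            adj[v].add(u)
    seen, clusters = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                 # depth-first traversal of one connected component
            u = stack.pop()
            component.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        clusters.append(component)
    return clusters
```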
2 Cluster Representation

The purpose of clustering is to identify the data clusters, which are a summary of the similar data. Each algorithm has to represent the clusters and sub-clusters in some form. Although labeling each data point with a cluster identity is a straightforward idea, most methods do not employ this approach. This may be because: (1) the summary, which should be easily understandable, is more than a set of (data-point, cluster-id) pairs; (2) it is time- and space-expensive to label all the data points during the clustering process; and (3) some methods employ accurate, compact cluster representations, which make the time-consuming labeling process unnecessary. We classify the cluster representation techniques into four kinds, discussed in the following.

2.1 Representative points

Most distance-based clustering methods use some points to represent clusters. These points are called representative points. The representatives may be data points, or other points that do not exist in the database, such as the means of sets of data points. The representation techniques falling into this category can be further classified into three classes.

2.1.1 Single representative

The simplest approach is to use one point as the representative of each cluster; each data point is assigned to the cluster whose representative is closest. The representative point may be the mean of the cluster, as in k-means [4] methods, or the data point in the database that is closest to the center, as in k-medoids methods. Other algorithms of this kind include BIRCH [5], CLARA [24], and CLARANS [25]. The different effects of k-means and k-medoids methods on the clustering result are introduced in detail in Ref.[25]; since this is not related to the motivation of this paper, we do not survey it here. The shortcomings of the single-representative approach are obvious: (1) only spherical clusters can be identified; and (2) a large cluster with a small cluster beside it will be split, with some data points of the large cluster assigned to the small one. These two cases are shown in Fig.3 (the right part of that figure is borrowed from Ref.[6], Fig.1(b)). Therefore, this approach fails when processing data sets with arbitrarily shaped clusters or clusters of very different sizes.

2.1.2 All data points

Using all the data points in a cluster to represent it is another straightforward approach. However, it is time-expensive, since: (1) the data sets are usually so large that the label information cannot fit in memory, which leads to frequent disk access; and (2) computing intra- and inter-cluster information requires access to all data points. Furthermore, the label information is hard to understand. Therefore, no popular algorithm takes this approach.

2.1.3 Multi-Representatives

The multi-representative approach is introduced in CURE and is a trade-off between the single-point and all-points methods. The first representative is the data point farthest from the mean of the cluster; then, each time, the data point whose distance to the nearest existing representative is largest is chosen, until the number of representatives is large enough. In Ref.[6], experiments show that for most data sets, 10 representatives lead to satisfactory results.
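The selection procedure just described can be sketched as follows (my reading of it, with a hypothetical shrink factor alpha for the representative shrinking mentioned in Section 1.1; this is not CURE's actual implementation):

```python
# Rough sketch of multi-representative selection: farthest-first picking of c
# representatives, then shrinking each one toward the cluster mean by factor alpha.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def choose_representatives(cluster, c=10, alpha=0.3):
    """cluster: non-empty list of coordinate tuples; returns c shrunken representatives."""
    d = len(cluster[0])
    center = tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))
    # first representative: the point farthest from the mean
    reps = [max(cluster, key=lambda p: euclidean(p, center))]
    # next representatives: the point farthest from its nearest chosen representative
    while len(reps) < min(c, len(cluster)):
        reps.append(max(cluster, key=lambda p: min(euclidean(p, r) for r in reps)))
    # shrink toward the center, which dampens the influence of outliers
    return [tuple(r[i] + alpha * (center[i] - r[i]) for i in range(d)) for r in reps]
```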
In the long version of Ref.[26], the authors who developed CURE also mention that for complex data sets more representatives are needed. However, before clustering, the complexity of the clusters is unknown, and the relationship between the complexity of the clusters and the number of representatives is not clear. This forces the user to choose a large number of representatives. Since the time complexity of CURE is O(n^2 log n), where n is the number of data points at the beginning, a large number of representatives in the initial sub-clusters affects efficiency (sub-clusters exist because a simple partitioning technique is used in CURE [6]; with respect to the number of representatives c, the time complexity is O(c log c) if the number of initial sub-clusters is fixed), as shown in our experimental result in Fig.4. Furthermore, together with the technique CURE uses to handle outliers (the shrinking of representatives), it fails to identify clusters of hollow shape, as already discussed in Section 1.1 and shown in Fig.1. Nevertheless, it outperforms the single-point and all-points approaches when both effectiveness and efficiency are considered.

2.2 Dense area

Some density-based clustering algorithms use dense areas to denote clusters and sub-clusters. DBSCAN [8], its descendants [9,10], and OPTICS [11] belong to this category. The dense-area representation method is similar to the all-data-points method, except that only core points are used. Core points are those data points that have more neighbors within a certain region than the threshold. Therefore, only core points are used to expand a sub-cluster, and the expansion stops when no further expansion can be applied to the core points. Dense areas can describe arbitrarily shaped clusters, apart from the dumbbell-shaped clusters discussed earlier. However, the cost of computing the core points is expensive …
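For illustration, the dense-area expansion described above might look like the following sketch (hypothetical parameter names eps and min_pts; points are assumed to be coordinate tuples). The repeated neighbor queries it performs for every visited point are exactly what makes computing core points costly.

```python
# Illustrative sketch of dense-area expansion: only core points (points with at
# least min_pts neighbors within radius eps) are used to grow the cluster further.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def expand_dense_area(data, seed, eps, min_pts):
    """Grow one dense area from a seed point; returns the points reached, or [] if
    the seed is not a core point."""
    neighbors = lambda p: [q for q in data if q != p and euclidean(p, q) <= eps]
    is_core = lambda p: len(neighbors(p)) >= min_pts
    if not is_core(seed):
        return []
    cluster, frontier = {seed}, [seed]
    while frontier:
        p = frontier.pop()
        for q in neighbors(p):
            if q not in cluster:
                cluster.add(q)
                if is_core(q):      # only core points are expanded further
                    frontier.append(q)
    return list(cluster)
```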