Journal of
Engineering and Technology Research

  • Abbreviation: J. Eng. Technol. Res.
  • Language: English
  • ISSN: 2006-9790
  • DOI: 10.5897/JETR
  • Start Year: 2009
  • Published Articles: 181

Full Length Research Paper

An improved frequency based agglomerative clustering algorithm for detecting distinct clusters on two dimensional dataset

Madheswaran M.
  • Madheswaran M.
  • Department of Electronics and Communication Engineering (ECE), Mahendra Engineering College, Mallasamudram-637503, Tamilnadu, India.
  • Google Scholar
Sreedhar Kumar S.
  • Sreedhar Kumar S.
  • Department of Computer Science and Engineering (CSE), KS School of Engineering and Management, Bangalore-560062, India.
  • Google Scholar


  •  Received: 12 July 2017
  •  Accepted: 11 October 2017
  •  Published: 31 December 2017

 ABSTRACT

In this study, a frequency based Dynamic Automatic Agglomerative Clustering (DAAC) is developed and presented. The DAAC scheme aims to automatically identify the appropriate number of divergent clusters over the two dimensional dataset based on count of distinct representative objects with higher intra thickness and lesser intra separation. The Distinct Representative Object Count (DROC) is introduced to automatically trace the count of distinct representative objects based on frequency of object occurrences. It also identifies the distinct number of highly comparative clusters based on the count of distinct representative objects through sequence of merging process. Experimental result shows that the DAAC is suitable for instinctively identifying the K distinct clusters over the different two dimensional datasets with higher intra thickness and lesser intra separation than existing techniques.
 
Key words: Dynamic automatic agglomerative clustering, clusters, intra thickness, intra separation, distinct representative object count.


 INTRODUCTION

Agglomerative hierarchical clustering is an unsupervised clustering technique to cluster the dataset into a hierarchical tree structure form through a sequence of merging based on similarity metrics (Han and Kamber, 2006). In recent years, this clustering approach is applied to Machine Learning, Pattern Recognition, Data Mining, Text Mining, Spatial Data Base Application, Web Application, Dig Data, Image Analysis, Information Retrieval and Bioinformatics (Douglass et al., 1992; Martin et al., 2000; Cadez et al., 2001; Fogs et al., 2001). In general, the agglomerative hierarchical clustering scheme is classified into divisive and agglomerative categories (Pakhira, 2009; Jain, 2010; Jain et al., 1999; Frigui and Krishnapuram, 1997). The divisive method continuously divides the dataset into smaller clusters until each cluster consists of a single object.
 
The agglomerative technique starts with  clusters, each containing exactly one data object. Afterward, it follows a series of merging operations that ultimately forces all clusters into the same single cluster. The limitation in the existing agglomerative clustering techniques is the identification of the predetermined number of distinct clusters over the large dataset and the entire result quality is based on the number of clusters which is predetermined by user. In this paper, a Dynamic Automatic Agglomerative Clustering (DAAC) is proposed to automatically identify appropriate number of discrete clusters in the two dimensional dataset based on count of distinct representative objects in the dataset without user input.


Here, literatures related to the present clustering scheme are presented. Some of the popular traditional agglomerative clustering techniques UPGMA, WARDS, SLINK, CLINK and PNN were designed to identify the distinct number of clusters over the dataset based on similarity measures. A simple agglomerative hierarchical clustering scheme called Unweighted Pair Group Method with Arithmetic Mean (UPGMA) was reported by Murtagh (1984). This method constructs a rooted tree to reflect the structure present in a pair wise similarity matrix. At each step, the nearest two clusters are combined into a higher level cluster. The distance between any two clusters is taken to be the average of all distances between pairs of, that is, the mean distance between elements of each cluster.
 
Fionn and Legendre (2014), reported a general agglomerative clustering technique with minimum variance method. In this method, each step finds a pair of clusters that can lead to minimum increase in total within-cluster variance after merging. This increase is weighted square distance between cluster centers. Another technique namely Ward p was reported by De Amorim (2015), as an improved version of Ward’s method. This method uses subspace feature weighting to take into consideration the different degrees of relevance of each feature. Sibson (1973) reported a single linkage (SLINK) method for grouping clusters in bottom-up fashion, which at each step combines two clusters that enclose the closest pair of objects not yet belonging to the same cluster as each other.
 
Defays (1977) reported an agglomerative clustering technique complete linkage (CLINK) method. In this method, initially, each object is considered to be a cluster of its own and the clusters are serially combined into larger clusters until all objects end up within the same cluster. At each step, two clusters that are separated by the shortest distance are combined. Franti et al. (2000) reported a fast and memory efficient implementation of the exact Pair-wise Nearest Neighbor (PNN) technique. It is claimed that this technique could improve the results with reduced memory and computational complexity of exact PNN technique. The fast agglomerative clustering using k-nearest neighbor graph scheme was reported by Chih-Tang et al. (2010). This scheme is intended to reduce the number of distance calculation and time complexity for identifying the distinct number of clusters in the dataset.
 
Recently, some popular agglomerative clustering techniques called DKNNA, KnA, NNB, etc., identify the distinct number of clusters over the dataset and reduce the computational complexity. Lai and Tsung-Jen (2011) presented a hierarchical clustering technique called Dynamic K-Nearest Neighbor Algorithm (DKNNA). This scheme is used to identify the distinct number of clusters based on k-nearest neighbor graph to reduce the number of distance calculations and time complexity. The advantage of this approach is that it is faster and simultaneously produces better clustering result than Double Linked Algorithm (DLA) and Fast Pair-wise Nearest Neighbor (FPNN) techniques. Qi et al. (2015) reported an agglomerative hierarchical clustering to construct a cluster hierarchy based on a group of centroids. It followed a group of centroids instead of raw data points to build cluster hierarchies, where centroid was indicated as a group of adjacent points in the data space. The authors claimed that this approach reduced the computational cost without compromising clustering performance.
 
Another approach, Nearest Neighbor Boundary (NNB) to reduce the time and space complexity of standard agglomerative hierarchical clustering based on nearest neighbor search was designed by Wei et al. (2015). First, it divided the dataset into independent subsets and then groups the closest data points together among each of the individual subset based on nearest neighbor search. Afterward, it joins the closest subsets based on nearest data points in the boundary between the subsets. The authors declared that the merit of their method was that it consumed lower space and computational complexity for grouping the nearest data points. Lin and Chen (2005) reported a two phase clustering algorithm called Cohesion-based Self Merging (CSM). The first phase, it partitioned the input dataset into several small sub- clusters and in the second phase, it continuously merged the sub-clusters based on cohesion in a hierarchical way. This CSM approach is claimed to be robust and possesses excellent tolerance to outlier in various datasets. The detail of the DAAC algorithm is presented in the next section.


 PROPOSED APPROACH

Here, a detail of the DAAC approach is presented. It consists of two stages DROC and clustering. In the Distinct Representative Object Count (DROC) stage, the approach traces the count of distinct representative objects over the input dataset based on occurrence of each individual object in the dataset. In the clustering stage, it partitions the input dataset into maximum number of discrete clusters based on count of distinct representative objects. The stages are involved in the DAAC approach as shown in Figure 1.
 
 
 
 
 
 
 
 
 


 RESULTS AND DISCUSSION

The DAAC scheme experimented with more than 100 2D UCI datasets of different sizes is presented here. Among these 100 2D UCI datasets, a subset of nine sample benchmark datasets (http://www.archive.ics.uci.edu/ml/), viz. White-Wine, Image-Seg, Heart-Diseases, Red-Wine, WBDC, Wisconsin, Iris and Wine including its size and dimensional are presented in Table 1. The DROC method traces count of distinct representative objects of nine datasets with three different MO’s as 5, 10, and 15, respectively and the computed results are obtained in Table 2. For the MO value of 5, that the DROC is identified K distinct representative objects over the nine UCI datasets of 43, 13, 17, 19, 23, 12, 5, and 12, respectively and the results are presented in Table 2. Similarly, the DROC found count of K distinct objects of MO’s values 10 and 15 in same UCI datasets as 40, 6, 8, 17, 17, 9, 5, 7 and 36, 3, 7, 16, 10, 7, 2, 2.
 
 
Then the clustering process is followed and partitions the datasets into K discrete clusters based on sequence of merging process with distance metric. The DAAC clustering scheme has produced three different clustering results on nine UCI datasets based on count of representative objects of three MO’s {5, 10, 15} which are obtained in Table 2. The results of DAAC scheme with three MO’s are incorporated in Table 3. Next, the three different results of these nine sample UCI datasets are validated based on ECVM scheme. Initially, the intra closeness  and intra separation are computed among the each individual cluster in the results of UCI datasets in percentage as expressed in Equations 8 and 11. The estimated measures of these three clustering results of DAAC scheme with three MO’s are presented in Tables 4 to 6, respectively.
 
 
 
Thereafter, the overall intra closeness measure  in percentage is estimated over the three different results of nine UCI datasets as 99.0, 91.7, 93.9, 96.9, 92.5, 92.5, 99.92, 82.46; 98.79, 82.22, 84.28, 95.76, 90.33, 85.09, 99.72, 66.70 and 98.79, 77.06, 73.10, 93.19, 86.75, 80.72, 88.0, 53.67, respectively. The estimated results are incorporated in Tables 7 to 9, respectively. Similarly, the overall intra separation  in percentage is calculated on three different results of sample eight UCI datasets Image_Seg, Wine, Red_Wine, White_Wine, WBDC, Wisconsin, Heart-Diseases and Iris as 0.99, 8.20, 6.009, 3.02, 17.49, 7.34, 0.27, 17.53; 1.20, 17.7, 15.71, 4.2, 9.66, 14.90, 0.27, 33.29 and 1.20, 22.93, 26.89, 6.80, 14.24, 19.27, 12.0, 46.32, respectively. The estimated measures of these three clustering results of eight UCI datasets are presented in Tables 7, 8, and 9.
 
 
The experiments are conducted for nine UCI sample datasets with different MO’s values and the validation results are obtained with the proposed DAAC scheme as illustrated in Figure 2a, b and c respectively. It is clearly indicated in Tables 3, 4 and 5, respectively that the DAAC scheme automatically identified the K distinct highly relative clusters with higher ICS and lower ICD based on K distinct representative objects without user input. It is demonstrated in the experimental results, that the MO acts as a major key element in the proposed clustering scheme and directly affects the performance of the proposed scheme.
 
 
Comparison with existing schemes
 
 
Here, the result of the DAAC approach is compared to existing schemes DKNNA (Lai and Tsung-Jen, 2011) and KnA (Qi et al., 2015). For comparison purposes, these existing schemes are implemented and tested over the same seven UCI datasets. The existing schemes are tested over the seven UCI datasets and subsequently these results are incorporated in Table 10. Similarly, the performance measures intra cluster similarity and intra cluster dissimilarity are estimated over the results of existing schemes based on ECVM technique. The estimated results are incorporated in the Tables 11 and 12. The overall performance measures shown in Tables 10 to 12 reveal that the DAAC scheme has produced better results with higher intra cluster similarity, lower intra cluster dissimilarity and limited number of iterations compared to existing techniques DKNNA and KnA.
 
 
Based on the experimental results and performance measures, it is found that the existing techniques identified predetermined number of distinct clusters. The technique DKNNA is identified number of dissimilar clusters around the dataset, where  is the number of clusters which determined by user. Similarly, the KnA scheme follows two predetermined parameters and respectively, where  is the number of predetermined distinct partitions. The comparison results reveal that the DAAC scheme produced much better results with higher intra cluster similarity, lower intra cluster dissimilarity and maximum number of clusters identified, compared to existing cluster techniques.


 CONCLUSION

A simple two stage Dynamic Automatic Agglomerative Clustering scheme that could robotically produce clusters for two dimensional dataset is proposed in this paper. In the first stage, the DAAC scheme traces the count of distinct representative objects over the input dataset based on DROC method. In the second stage, a distance based clustering process instinctively partitions the input dataset into K discrete clusters based on count of distinct representative objects. The novelty of the DAAC is the automatic production of distinct number of dissimilar clusters, which is a contradiction to the existing schemes, where it is a user input. The DAAC can be better utilized as a pre-process to determine the maximum number of discrete clusters with higher intra similarity and be augmented compared to existing works with outstanding results.


 CONFLICT OF INTERESTS

The authors have not declared any conflict of interests.



 REFERENCES

Cadez I, Smyth P, Mannik H (2001). Probabilistic modeling of transactional data with applications to profiling, visualization and prediction, Proceedings of the Seventh ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. pp. 37-46.

 

Chih-Tang C, Lai JZC, Jeng MD (2010). Fast agglomerative clustering using information of k-nearest neighbors. Pattern Recogn. 43(12):3958-3968.
Crossref

 
 

De Amorim RC (2015). Feature Relevance in Ward's Hierarchical Clustering Using the Lp Norm. J. Classif. 32(1):46-62.
Crossref

 
 

Defays D (1977). An efficient algorithm for a complete link method. Comput. J. (British Computer Society) 20(4):364-366.
Crossref

 
 

Douglass RC, David RK, Jan OP, John WT (1992). Scatter / Gather: A Cluster-based approach to Browsing Large Document Collections, Proceedings of the 15th annual international ACM SIGIR Conference on Research and Development in Information Retrieval pp. 318-329.

 
 

Fionn M, Legendre P (2014). Wards hierarchical agglomerative clustering method: which algorithms implement wards criterion. J. Classif. 31(3):274-295.
Crossref

 
 

Fogs A, Warg W, Zaane O (2001). A non-parametric approach to web log analysis, First SAMICDM Workshop on Web Mining, Chicago pp. 41-50.

 
 

Franti P, Kaukoranta T, Sen DF, Chang KS (2000). Fast and memory efficient implementation of the exact PNN, IEEE Trans. Image Process. 9(5):773-777.
Crossref

 
 

Frigui H, Krishnapuram R (1997). Clustering by competitive agglomeration, Pattern Recogn. 30(7):1109-1119.
Crossref

 
 

Han J, Kamber M (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA.

 
 

Jain AK (2010). Data clustering: 50 Years beyond K-means. Pattern Recogn.Lett. 31(8):651-666.
Crossref

 
 

Jain AK, Murty MN, Flynn PJ (1999). Data Clustering: A Review, ACM Computer Surveys 31(3):264-323.
Crossref

 
 

Krishnamoorthy K, Sreedhar KS (2016). An Improved Agglomerative Clustering Algorithm for Outlier Detection. Appl. Math. Inform. Sci. 10(3):1125-1138.

 
 

Lai JZC, Tsung-Jen H (2011). An agglomerative clustering algorithm using a dynamic k-nearest-neighbor list. Inform. Sci. 181(9):1722-1734.
Crossref

 
 

Lin CR, Chen MS (2005). Combining partitional and hierarchical algorithms for robust and efficient data clustering with chesion self-merging, IEEE Trans. Knowl. Data Eng. 17(2):145-159.
Crossref

 
 

Pakhira KAM (2009). Modified k- means Algorithm to avoid empty clusters. Int. J. Recent Trends Eng. 1:1-8.

 
 

Martin E, Alexander F, Hans-Peter K, Jörg S (2000). Spatial Data Mining: Database primitives, Algorithms and Efficient DBMS Support. Data Min. Knowl. Discov. 4(2-3):193-216.

 
 

Murtagh F (1984). Complexities of Hierarchic Clustering Algorithms: the state of the art. Comput. Stat. Q. 1:101-113.

 
 

Qi Y, Xumin L, Xiangmin Z, Andy S (2015). Efficient agglomerative hierarchical clustering. Expert Syst. Appl. 42(5):2785-2797.
Crossref

 
 

Sibson R (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. Comput. J. (British Computer Society) 16:30-34.
Crossref

 
 

Wei Z, Gongxuan Z, Yongli W, Zhaomeng Z, Tao L (2015). NNB: An Efficient Nearest Neighbor Search Method for Hierarchical Clustering on Large Datasets. IEEE International Conference on Semantic Computing (ICSC). pp. 405-412.

 

 




          */?>