An improved frequency based agglomerative clustering algorithm for detecting distinct clusters on two dimensional dataset

In this study, a frequency based Dynamic Automatic Agglomerative Clustering (DAAC) scheme is developed and presented. The DAAC scheme aims to automatically identify the appropriate number of divergent clusters in a two dimensional dataset based on the count of distinct representative objects, with higher intra tightness and lower intra separation. The Distinct Representative Object Count (DROC) is introduced to automatically trace the count of distinct representative objects based on the frequency of object occurrences. The scheme then identifies the distinct number of highly comparable clusters, based on the count of distinct representative objects, through a sequence of merging operations. Experimental results show that DAAC is suitable for automatically identifying the K distinct clusters over different two dimensional datasets with higher intra tightness and lower intra separation than existing techniques.


INTRODUCTION
Agglomerative hierarchical clustering is an unsupervised clustering technique that groups a dataset into a hierarchical tree structure through a sequence of merging operations based on similarity metrics (Han and Kamber, 2006). In recent years, this clustering approach has been applied in machine learning, pattern recognition, and data mining. The agglomerative technique starts with n clusters, each containing exactly one data object. It then follows a series of merging operations that ultimately force all objects into a single cluster.
A limitation of existing agglomerative clustering techniques is that the number of distinct clusters over a large dataset must be predetermined, so the overall result quality depends on the number of clusters supplied by the user. In this paper, a Dynamic Automatic Agglomerative Clustering (DAAC) scheme is proposed to automatically identify the appropriate number of discrete clusters in a two dimensional dataset, based on the count of distinct representative objects, without user input.

RELATED WORK
Here, literature related to the present clustering scheme is presented. Some of the popular traditional agglomerative clustering techniques, UPGMA, WARDS, SLINK, CLINK and PNN, were designed to identify the distinct number of clusters over a dataset based on similarity measures. A simple agglomerative hierarchical clustering scheme called Unweighted Pair Group Method with Arithmetic Mean (UPGMA) was reported by Murtagh (1984). This method constructs a rooted tree that reflects the structure present in a pairwise similarity matrix. At each step, the nearest two clusters are combined into a higher level cluster, where the distance between any two clusters is taken to be the average of all distances between pairs of objects, one from each cluster, that is, the mean distance between elements of each cluster. Fionn and Legendre (2014) reported a general agglomerative clustering technique with the minimum variance method. At each step, this method finds the pair of clusters whose merging leads to the minimum increase in total within-cluster variance; this increase is a weighted squared distance between cluster centers. Another technique, namely Ward p, was reported by De Amorim (2015) as an improved version of Ward's method; it uses subspace feature weighting to take into account the different degrees of relevance of each feature. Sibson (1973) reported the single linkage (SLINK) method for grouping clusters in a bottom-up fashion, which at each step combines the two clusters that enclose the closest pair of objects not yet belonging to the same cluster. Defays (1977) reported the agglomerative complete linkage (CLINK) method. Initially, each object is considered a cluster of its own, and clusters are serially combined into larger clusters until all objects end up within the same cluster; at each step, the two clusters separated by the shortest distance are combined. Franti et al. (2000) reported a fast and memory efficient implementation of the exact Pair-wise Nearest Neighbor (PNN) technique, claimed to improve results while reducing the memory and computational complexity of the exact PNN technique. A fast agglomerative clustering scheme using a k-nearest neighbor graph was reported by Chih-Tang et al. (2010); it is intended to reduce the number of distance calculations and the time complexity of identifying the distinct number of clusters in a dataset.
Recently, some popular agglomerative clustering techniques, called DKNNA, KnA, NNB, etc., identify the distinct number of clusters over a dataset while reducing computational complexity. Lai and Tsung-Jen (2011) presented a hierarchical clustering technique called the Dynamic K-Nearest Neighbor Algorithm (DKNNA). This scheme identifies the distinct number of clusters based on a k-nearest neighbor graph to reduce the number of distance calculations and the time complexity. The advantage of this approach is that it is faster and simultaneously produces better clustering results than the Double Linked Algorithm (DLA) and Fast Pair-wise Nearest Neighbor (FPNN) techniques. Qi et al. (2015) reported an agglomerative hierarchical clustering approach that constructs a cluster hierarchy from a group of centroids. It follows a group of centroids instead of raw data points to build cluster hierarchies, where a centroid represents a group of adjacent points in the data space. The authors claimed that this approach reduced the computational cost without compromising clustering performance.
Another approach, Nearest Neighbor Boundary (NNB), designed by Wei et al. (2015), reduces the time and space complexity of standard agglomerative hierarchical clustering based on nearest neighbor search. It first divides the dataset into independent subsets and then groups the closest data points together within each individual subset based on nearest neighbor search. Afterward, it joins the closest subsets based on the nearest data points in the boundary between the subsets. The authors declared that the merit of their method is its lower space and computational complexity for grouping the nearest data points. Lin and Chen (2005) reported a two phase clustering algorithm called Cohesion-based Self Merging (CSM). In the first phase, it partitions the input dataset into several small sub-clusters, and in the second phase, it continuously merges the sub-clusters in a hierarchical way based on cohesion. This CSM approach is claimed to be robust and to possess excellent tolerance to outliers in various datasets. The detail of the DAAC algorithm is presented in the next section.

PROPOSED APPROACH
Here, the DAAC approach is presented in detail. It consists of two stages: DROC and clustering. In the Distinct Representative Object Count (DROC) stage, the approach traces the count of distinct representative objects over the input dataset based on the occurrence of each individual object in the dataset. In the clustering stage, it partitions the input dataset into the maximum number of discrete clusters based on the count of distinct representative objects. The stages involved in the DAAC approach are shown in Figure 1.

DROC Stage
This stage aims to trace the count of distinct representative objects over the two dimensional dataset. It consists of three steps. In the first step, each object $x_i \in X$ is represented by the statistical mean of its feature values, defined in Equation 1 as:

$$x_i = \frac{1}{D} \sum_{f=1}^{D} x_{if} \qquad (1)$$

where $x_{if}$ represents the $f$th feature of the $i$th object belonging to the input dataset $X$ and $D$ is the number of features. In the second step, the proposed DROC scheme measures the occurrence count $O_i$ of each object, defined in Equation 2 as:

$$O_i = \sum_{j=1}^{n} \begin{cases} 1 & \text{if } |x_i - x_j| < T \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $x_i$ and $x_j$ represent the $i$th and $j$th objects belonging to the input dataset $X$, $n$ denotes the size of $X$, and $T$ is an external parameter (threshold), predetermined by the user, that limits the dissimilarity difference between the $i$th and $j$th objects. If the difference between the $i$th and $j$th objects is less than $T$, the $j$th object is counted as similar to the $i$th object. The suitable value of $T$ can vary with the nature of the dataset. In the final step, the count $K$ of distinct representative objects over the dataset $X$ is estimated from the maximum occurrence of objects in $X$, defined in Equation 3 as:

$$K = \sum_{i=1}^{n} \begin{cases} 1 & \text{if } O_i \geq MO \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $O_i$ is the occurrence count of the $i$th object in $X$ and $MO$ is the maximum occurrence threshold. The second stage then partitions the dataset $X$ into $K$ discrete clusters, where $C$ denotes the resulting cluster set; the time requirements of both stages are discussed under complexity analysis.
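The three DROC steps can be sketched as follows. This is a minimal illustration of Equations 1 to 3 on a toy dataset; the function name droc_count and the reading of Equation 3 as a simple threshold count are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the DROC stage on a toy two dimensional dataset.
def droc_count(X, T, MO):
    """Return K, the count of distinct representative objects in X.

    X  : list of objects, each a list of D feature values
    T  : user-predetermined dissimilarity threshold (Equation 2)
    MO : maximum occurrence threshold (Equation 3)
    """
    # Step 1 (Equation 1): represent each object by its feature mean.
    reps = [sum(obj) / len(obj) for obj in X]

    # Step 2 (Equation 2): occurrence count of each object, i.e. how many
    # objects have a representation within T of it.
    occ = [sum(1 for rj in reps if abs(ri - rj) < T) for ri in reps]

    # Step 3 (Equation 3): count the objects whose occurrence reaches MO.
    return sum(1 for o in occ if o >= MO)

# Two tight groups and one isolated object.
data = [[1.0, 1.2], [1.1, 1.1], [0.9, 1.3],
        [5.0, 5.2], [5.1, 4.9],
        [9.0, 9.4]]
# A smaller MO admits more representative objects, hence more clusters.
print(droc_count(data, T=0.5, MO=2), droc_count(data, T=0.5, MO=3))  # → 5 3
```

The example also shows the MO trade-off discussed later: lowering MO from 3 to 2 raises K from 3 to 5.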

Cluster validation
Here, the results of the Dynamic Automatic Agglomerative Clustering scheme are validated using the Effective Cluster Validation Method (ECVM) reported by Krishnamoorthy and Sreedhar (2016). The ECVM scheme is slightly modified to estimate the intra tightness and intra separation among objects with D features within each individual cluster of the DAAC cluster set. It contains two measures, intra cluster similarity and intra cluster dissimilarity, which are described subsequently.

Intra cluster similarity measure
This measure computes the intra similarity within each individual cluster $c_l \in C$ of the DAAC approach, for $l = 0, 1, \ldots, K$. It is defined in Equation 8 as:

$$It = \frac{1}{N} \sum_{l=1}^{K} It_l \qquad (8)$$

where $N$ represents the number of clusters in the cluster set $C$ and $It_l$ denotes the intra tightness measure of the $l$th individual cluster in $C$, as defined in Equation 9.
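Since the exact per-cluster forms (Equations 9 and 12) are not fully recoverable here, the following is a hedged sketch of the two ECVM-style measures. The mean-pairwise-distance separation and its 1/(1+distance) tightness counterpart, along with all function names, are assumptions rather than the paper's definitions:

```python
# Hedged sketch of ECVM-style validation: per-cluster forms assumed to be
# mean pairwise distance (separation) and its inverse (tightness).
import math
from itertools import combinations

def intra_separation(cluster):
    """Assumed form of Equation 12: mean pairwise distance in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def intra_tightness(cluster):
    """Assumed form of Equation 9: inverse of the mean separation."""
    return 1.0 / (1.0 + intra_separation(cluster))

def ecvm(clusters):
    """Assumed Equations 8 and 11: averages over the cluster set, in %."""
    n = len(clusters)
    it = 100.0 * sum(intra_tightness(c) for c in clusters) / n
    sep = 100.0 * sum(intra_separation(c) for c in clusters) / n
    return it, sep

tight = [[[0, 0], [0, 1]], [[5, 5], [5, 6]]]   # two compact clusters
loose = [[[0, 0], [9, 9]], [[5, 5], [20, 0]]]  # two spread-out clusters
print(ecvm(tight))  # → (50.0, 100.0)
print(ecvm(loose))  # lower tightness, higher separation
```

Whatever the exact per-cluster forms, the intended reading matches the text: compact clusters score higher intra tightness and lower intra separation than spread-out ones.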

RESULTS AND DISCUSSION
The DAAC scheme was experimented with more than 100 two dimensional UCI datasets of different sizes. Among these, a subset of nine sample benchmark datasets (http://www.archive.ics.uci.edu/ml/), viz. White-Wine, Image-Seg, Heart-Diseases, Red-Wine, WBDC, Wisconsin, Iris and Wine, is presented in Table 1 together with their sizes and dimensions.
The DROC method traces the count of distinct representative objects for the nine datasets with three different MO values, 5, 10 and 15, and the computed results are presented in Table 2. For an MO value of 5, DROC identified the K distinct representative objects over the UCI datasets as 43, 13, 17, 19, 23, 12, 5, and 12, respectively. Similarly, for MO values of 10 and 15 on the same UCI datasets, DROC found the counts of K distinct objects as 40, 6, 8, 17, 17, 9, 5, 7 and 36, 3, 7, 16, 10, 7, 2, 2, respectively. The clustering process then partitions each dataset into K discrete clusters through a sequence of merging operations with a distance metric. The DAAC clustering scheme thus produced three different clustering results on the nine UCI datasets based on the counts of representative objects for the three MO values {5, 10, 15} given in Table 2; the results of the DAAC scheme with the three MO values are incorporated in Table 3.

Comparison with existing schemes
Here, the result of the DAAC approach is compared with the existing schemes DKNNA (Lai and Tsung-Jen, 2011) and KnA (Qi et al., 2015). For comparison purposes, these existing schemes were implemented and tested over the same seven UCI datasets, and their results are incorporated in Table 10. Similarly, the performance measures intra cluster similarity and intra cluster dissimilarity were estimated over the results of the existing schemes based on the ECVM technique; the estimated results are incorporated in Tables 11 and 12. The overall performance measures shown in Tables 10 to 12 reveal that the DAAC scheme produced better results, with higher intra cluster similarity, lower intra cluster dissimilarity and a limited number of iterations, compared with the existing techniques DKNNA and KnA. Based on the experimental results and performance measures, it is found that the existing techniques identify a predetermined number of distinct clusters. The DKNNA technique identifies M dissimilar clusters over the dataset, where M is the number of clusters determined by the user. Similarly, the KnA scheme relies on two predetermined parameters, k and M, where k is the number of predetermined distinct partitions. The comparison results reveal that the DAAC scheme requires no such user-determined cluster count.

Conclusion
A simple two stage Dynamic Automatic Agglomerative Clustering scheme that can automatically produce clusters for a two dimensional dataset is proposed in this paper. In the first stage, the DAAC scheme traces the count of distinct representative objects over the input dataset using the DROC method. In the second stage, a distance based clustering process automatically partitions the input dataset into K discrete clusters based on the count of distinct representative objects. The novelty of DAAC is the automatic determination of the distinct number of dissimilar clusters, in contrast to existing schemes, where it is a user input. DAAC can also be utilized as a pre-process to determine the maximum number of discrete clusters with higher intra similarity and lower intra dissimilarity.

Figure 1. Functional diagram of the proposed approach.
The maximum occurrence threshold MO limits the count of K distinct representative objects with maximum occurrence over X. For instance, if MO is too small, a large number of clusters is generated as the final result; on the other hand, if MO is too large, only a small number of clusters is generated.

Clustering stage

In the clustering stage, the approach first computes the upper triangular distance matrix $Ud_{ij}$ for the input cluster set using a distance metric, and then compares the current number of clusters with the count of representative objects described earlier. While the number of clusters exceeds K, the two closest clusters $x_i$ and $x_j$ are merged into a single cluster $x_{ij}$, and the centroid of the new cluster is subsequently computed using Equation 1.
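The merging loop described above can be sketched as follows. For simplicity this sketch rescans all pairwise centroid distances at each step instead of maintaining the upper triangular distance matrix, and all names are hypothetical:

```python
# Sketch of the clustering stage: start with one cluster per object and
# merge the two closest clusters (by centroid distance) until K remain.
import math

def agglomerate(X, K):
    """Merge the objects of X (lists of feature values) into K clusters."""
    clusters = [[list(obj)] for obj in X]  # n singleton clusters

    def centroid(cluster):
        d = len(cluster[0])
        return [sum(obj[f] for obj in cluster) / len(cluster) for f in range(d)]

    def dist(a, b):
        return math.dist(centroid(a), centroid(b))

    while len(clusters) > K:
        # Locate the closest pair of clusters; a stored distance matrix
        # would avoid recomputing this full scan at every step.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # cluster set size shrinks by one
    return clusters

result = agglomerate([[1, 1], [1, 2], [8, 8], [9, 8], [20, 20]], K=3)
print(sorted(len(c) for c in result))  # → [1, 2, 2]
```

Deleting the merged-away cluster and shrinking the set by one at each step mirrors the bookkeeping described in the text; the loop terminates exactly when the cluster set size reaches K.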

After merging, the scheme updates the status of the $i$th cluster and modifies the size of the merged cluster to the number of objects in the $i$th and $j$th clusters combined. It then deletes the $j$th cluster from the input cluster set X, including its status $C_j$ and size $N_j$, and reduces the input cluster set size by one. This process is repeated until the size of the cluster set is equal to K, after which the result with K distinct clusters is obtained. This stage of the proposed DAAC technique is presented as an algorithm hereunder.
(14) Repeat steps 6 to 13 until the size of the cluster set n is equal to K.
(15) Obtain the final clustering result in C.
End

Complexity analysis

Complexity analysis discusses in detail the computational complexity of the proposed approach. The first stage requires time proportional to the pairwise occurrence counting over the n objects of X, and the second stage consumes time proportional to the sequence of merging operations that produces the resulting cluster set C.

Intra cluster dissimilarity measure

This measure calculates the intra separation within each individual cluster in the cluster set of the DAAC approach. It is defined in Equation 11 as:

$$Is = \frac{1}{N} \sum_{l=1}^{K} Is_l \qquad (11)$$

where $N$ represents the number of clusters in the cluster set $C$, for $l = 1, \ldots, K$, and $Is_l$ denotes the intra separation measure of the $l$th individual cluster in $C$, as defined in Equation 12.

The intra cluster similarity and intra cluster dissimilarity measures are computed, in percentage, for each individual cluster in the results of the UCI datasets, as expressed in Equations 8 and 11. These measures are estimated on the three clustering results of the DAAC scheme with the three MO values over the eight sample UCI datasets Image_Seg, Wine, Red_Wine, White_Wine, WBDC, Wisconsin, Heart-Diseases and Iris.

Figure 2. Comparison results of DAAC scheme with different MO's tested on UCI datasets. (a) Comparison of resulting clusters of DAAC with various MO's. (b) Comparison of intra similarity measure on results of DAAC with various MO's. (c) Comparison of intra dissimilarity measure on results of DAAC with various MO's.

Table 1. Description of sample UCI datasets.

Table 2. DROC scheme tested on UCI datasets with different MO's.

Table 3. DAAC scheme tested on UCI datasets with different MO's.

Table 4. Result of intra cluster validation obtained with the ECVM scheme on results of the DAAC scheme (MO=5).

Table 5. Result of intra cluster validation obtained with the ECVM scheme on results of the DAAC scheme (MO=10).


Table 6. Result of intra cluster validation obtained with the ECVM scheme on results of the DAAC scheme (MO=15).

Table 7. Performance measures of the result of the DAAC scheme (MO=5).

Table 8. Performance measures of the result of the DAAC scheme (MO=10).

Table 9. Performance measures of the result of the DAAC scheme (MO=15).