Scientific Research and Essays

  • Abbreviation: Sci. Res. Essays
  • Language: English
  • ISSN: 1992-2248
  • DOI: 10.5897/SRE
  • Start Year: 2006
  • Published Articles: 2768

Full Length Research Paper

An efficient hybrid distributed document clustering algorithm

J. E. Judith
  • Department of CSE, Noorul Islam Centre for Higher Education, Kumaracoil, India.
J. Jayakumari
  • Department of ECE, Noorul Islam Centre for Higher Education, Kumaracoil, India


  •  Received: 03 September 2014
  •  Accepted: 23 December 2014
  •  Published: 15 January 2015

Abstract

Recent advances in information technology have driven data volumes beyond the petabyte scale. Clustering distributed document sets from a central location is difficult because of the heavy demand on computational resources, so distributed document clustering algorithms are needed to cluster documents using distributed resources. The main challenges in distributed document clustering are maintaining clustering quality and achieving speedup as document sets grow. The proposed hybrid algorithm (PKMeansLSI) combines Particle Swarm Optimization (PSO), K-Means clustering and Latent Semantic Indexing (LSI), and uses the MapReduce framework for distributed computation. This combination improves the clustering quality of the algorithm. MapReduce and its implementation Hadoop serve as the distributed programming model, with the aim of improving the speedup of the algorithm. Execution time is dramatically reduced as the dimensionality of the documents is reduced. Experimental results show improved quality and effectiveness of the hybrid algorithm as the document set size increases.

Key words: Distributed document clustering, Hadoop, K-Means, particle swarm optimization (PSO), latent semantic indexing (LSI), MapReduce.
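
To make the MapReduce formulation of K-Means mentioned in the abstract more concrete, the following is a minimal single-process sketch, not the authors' Hadoop implementation. It assumes the document vectors have already been reduced in dimension (the LSI step) and that initial centroids are given (the PSO step); neither step is shown. The function names `map_assign`, `reduce_mean` and `kmeans_iteration`, and the toy data, are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_assign(doc_vec, centroids):
    """Map step: emit (index of nearest centroid, document vector)."""
    nearest = min(range(len(centroids)), key=lambda k: dist(doc_vec, centroids[k]))
    return nearest, doc_vec


def reduce_mean(vectors):
    """Reduce step: average all vectors that share one cluster key."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def kmeans_iteration(docs, centroids):
    """One K-Means pass written as map -> group by key -> reduce."""
    groups = defaultdict(list)
    for doc_vec in docs:
        key, value = map_assign(doc_vec, centroids)
        groups[key].append(value)
    # A cluster that received no documents keeps its previous centroid.
    return [
        reduce_mean(groups[k]) if k in groups else list(centroids[k])
        for k in range(len(centroids))
    ]


# Toy, already dimension-reduced document vectors (hypothetical data).
docs = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
# Assumed starting centroids; in the paper's setting PSO would supply these.
centroids = [[1.0, 0.0], [0.0, 2.0]]

for _ in range(5):
    centroids = kmeans_iteration(docs, centroids)

labels = [map_assign(d, centroids)[0] for d in docs]
```

In a real Hadoop job the map calls would run in parallel over document partitions and the grouping would be done by the framework's shuffle phase; the loop above only mimics that data flow on one machine.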