A hybrid approach for text categorization by using χ² statistic, principal component analysis and particle swarm optimization

Harun Uğuz

doi:10.5897/SRE10.1214

Scientific Research and Essays

Abbreviation: Sci. Res. Essays
Language: English
ISSN: 1992-2248
DOI: 10.5897/SRE
Start Year: 2006
Published Articles: 2768

Full Length Research Paper

Harun Uğuz

Department of Computer Engineering, Selçuk University, Konya, Turkey.

Article Number - D6EC93B32462
Vol.8(37), pp. 1818-1828 , October 2013
https://doi.org/10.5897/SRE10.1214

Accepted: 03 October 2013
Published: 04 October 2013

Copyright © 2024 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0.

Abstract

The number of text documents in digital form is steadily increasing, and text categorization has become a key technology for organizing text data. A major problem in text categorization is the very large number of features, most of which are irrelevant or redundant and can therefore degrade classification performance. To address this problem, feature selection is often used to reduce the dimensionality of the feature space and improve categorization performance. This study proposes a hybrid approach based on the χ² statistic, particle swarm optimization (PSO) and principal component analysis (PCA). First, each term in the documents is ranked by its importance for classification using the χ² statistic. Then PSO, a feature selection method, and PCA, a feature extraction method, are applied separately to the terms ranked in decreasing order of importance, and dimension reduction is carried out. In this way, terms of low importance are ignored, feature selection and feature extraction are applied only to the most important terms, and the computational time and complexity of the process are reduced. To evaluate the effectiveness of the proposed model, experiments were conducted using the K-nearest neighbor (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 dataset collections. The experimental evaluation showed that the proposed model was effective for text categorization.
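The first stage described in the abstract, ranking terms by the χ² statistic, can be illustrated with a minimal sketch. The abstract does not give the formula, so the code assumes the common 2×2 contingency formulation, χ²(t, c) = N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)), and assumes that a term's overall score is its maximum over categories. The function names `chi2_score` and `rank_terms` and the toy documents are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Chi-square statistic for one term/category pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels):
    """Rank terms by their maximum chi-square score over all categories.

    docs is a list of token lists, labels the matching category names.
    Taking the maximum over categories is an assumption of this sketch.
    """
    n_docs = len(docs)
    cat_sizes = Counter(labels)
    doc_freq = Counter()  # term -> number of documents containing it
    joint = Counter()     # (term, category) -> documents in category with term
    for tokens, cat in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            joint[(term, cat)] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for cat, size in cat_sizes.items():
            a = joint[(term, cat)]
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi2_score(a, b, c, d))
        scores[term] = best
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical): two categories, two documents each.
docs = [["oil", "price", "rise"], ["oil", "export"],
        ["wheat", "crop"], ["wheat", "price"]]
labels = ["crude", "crude", "grain", "grain"]

ranked = rank_terms(docs, labels)
top_terms = [term for term, _ in ranked[:2]]  # keep only the highest-ranked terms
```

In this toy corpus, "oil" and "wheat" each occur in exactly one category and rank highest, while "price" occurs equally in both and scores zero. In the approach the abstract describes, only the top-ranked terms would then be passed on to PSO or PCA for further dimension reduction.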

Key words: Text categorization, feature selection, particle swarm optimization, principal component analysis, χ² statistic.
