The number of text documents in digital form is steadily increasing, and text categorization has become a key technology for organizing text data. A major problem in text categorization is the very large number of features, most of which are irrelevant or redundant and can therefore degrade classification performance. To address this, feature selection is often used to reduce the dimensionality of the feature space and improve categorization performance. In this study, a hybrid approach based on the χ² (chi-square) statistic, particle swarm optimization (PSO) and principal component analysis (PCA) is proposed to improve the performance of text categorization. First, each term in the documents is ranked by its importance for classification using the χ² statistic. Then PSO (feature selection) and PCA (feature extraction) are applied separately to the terms, ranked in decreasing order of importance, to carry out dimension reduction. In this way, less important terms are ignored, feature selection and feature extraction are applied only to the most important terms, and the computational time and complexity of the process are reduced. To evaluate the effectiveness of the proposed model, experiments were conducted using the K-nearest neighbor (KNN) and C4.5 decision tree algorithms on the Reuters-21578 and Classic3 datasets. The experimental evaluation showed that the proposed model was effective for text categorization.
Key words: Text categorization, feature selection, particle swarm optimization, principal component analysis, χ² statistic.
Copyright © 2021 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0
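
The term-ranking stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes documents are given as lists of tokens, that the chi-square score of a term for a class is computed from the usual 2x2 document-frequency contingency table, and that a term's overall score is its maximum over classes (the abstract does not say how per-class scores are combined). The function names `chi2_score`, `rank_terms` and `select_top_k` are hypothetical. The PSO and PCA stages, which would operate on the retained terms, are not shown.

```python
from collections import Counter


def chi2_score(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no information
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels):
    """Rank terms by decreasing chi-square score.

    Assumption: a term's score is its maximum score over all classes.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        # Document frequency of the term inside each class.
        in_class = Counter(lbl for s, lbl in zip(doc_sets, labels) if term in s)
        df = sum(in_class.values())
        best = 0.0
        for cls, size in class_sizes.items():
            a = in_class.get(cls, 0)
            b = df - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi2_score(a, b, c, d))
        scores[term] = best
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked, scores


def select_top_k(docs, labels, k):
    """Keep only the k highest-ranked terms; lower-ranked terms are ignored."""
    ranked, _ = rank_terms(docs, labels)
    return ranked[:k]


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [["oil", "price"], ["oil", "market"],
            ["wheat", "price"], ["wheat", "farm"]]
    labels = ["energy", "energy", "grain", "grain"]
    print(select_top_k(docs, labels, 2))  # ['oil', 'wheat']
```

In this toy corpus, "oil" and "wheat" each occur in every document of one class and in none of the other, so they receive the highest score, while "price" is spread evenly across both classes and scores zero. Only the retained terms would then be passed to a feature selection or feature extraction step.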