Intrusion detection using feature subset selection based on MLP

1 Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750 Tronoh, Perak, Malaysia.
2 Department of Software Engineering, College of Computer and Information Sciences, P.O. Box 51178, Riyadh 11543, King Saud University, Saudi Arabia.
3 Department of Computer Science, College of Computer and Information Sciences, P.O. Box 51178, Riyadh 11543, King Saud University, Saudi Arabia.


INTRODUCTION
Previous intrusion detection techniques have concentrated on the problems of feature extraction and classification. However, somewhat less attention has been given to the important subject of feature selection. The prevailing trend in feature extraction has been to represent the data in another feature space (the PCA space) using principal component analysis (PCA). Selecting features on the basis of the highest eigenvalues is not appropriate, because the features corresponding to the highest eigenvalues may not have the best sensitivity for the classifier, and many sensitive features are ignored (Ahmad et al., 2010, 2011a, 2011b). Consequently, there must be an efficient method to select a suitable set of features in the PCA space. This will lead the classifier to work efficiently and enhance the overall performance of the intrusion analysis engine, since redundant and irrelevant features increase overheads and confuse the classifier. Therefore, in this paper, we argue that feature selection is an important problem in intrusion detection and show that genetic algorithms (GAs) provide a simple, general, and powerful framework for selecting good subsets of features that improve detection rates (Sun et al., 2002).
*Corresponding author. E-mail: wattoohu@gmail.com.
Furthermore, we use PCA for feature transformation and MLP for classification. The goal is to search the PCA space using a GA to select a subset of principal components. This is in contrast to conventional methods, which select some percentage of the top principal components to represent the target concept independently of the classification task. We have tested the proposed framework on intrusion detection. Our experimental results demonstrate important performance improvements.
A number of approaches have been described in the area of intrusion detection. In one existing research work, Liu and his colleagues applied PCA for classification and used neural networks for online computing. They selected 22 principal components as the feature subset to obtain the best performance. However, many vital principal components carrying sensitive information for intrusion detection may be missed during the selection phase (Liu et al., 2007).
Horng and his co-workers identified the important features based on the accuracy and the number of false positives of the system with and without each feature. In other words, the feature selection is "leave-one-out": remove one feature from the original dataset, redo the experiment, and then compare the new results with the original result. If any of the described cases occurs, the feature is regarded as significant; otherwise, it is regarded as insignificant. Since there are 41 features in the KDD-cup99 dataset, the experiment is repeated 41 times to establish whether each feature is significant or insignificant. This process involves complexity as well as overheads on a massive dataset (Horng et al., 2010).
One of the most important works is by Tong and his associates, who employed the radial basis function (RBF) network as a real-time pattern classifier and applied the Elman network to restore the memory of past events. They used the full-featured KDD-cup dataset, which increases the training and testing overheads on the system (Tong et al., 2009). Zargar and his colleagues used the PCA method to determine an optimal feature set. An appropriate feature set helps to build an efficient decision model as well as to reduce the size of the feature set. Feature reduction considerably speeds up the training and testing process of the attack identification system, but it is a compromise between training efficiency (few PCA components) and accurate results (a large number of PCA components) (Zargar et al., 2010).
In one of the research works by Kim and his team, the fusion of genetic algorithm (GA) and support vector machines (SVM) is described for optimizing both the features and the parameters of detection models. This method was able to minimize the number of features and maximize the detection rates, but the problem is feature uniformity. The features in their original form are not consistent, so they must be transformed into a new feature space to obtain a well-organized form (Kim et al., 2005).

PROPOSED MODEL
The model consists of several parts: the dataset used for experiments, feature transformation and organization, optimal feature subset selection, the MLP classification architecture, training and testing, and results. The block diagram of the model is shown in Figure 1.

Dataset used for experiments
We used the kddcup99 dataset for our experiments. This dataset was selected because of its standardization and content richness, and because it allows us to compare our results with existing research in the area of intrusion detection. The raw dataset consists of 41 features:

$F = \{f_1, f_2, \ldots, f_n\}$ (1)

where n = 41. After selecting the dataset, we pre-processed the raw data so that it can be given to the selected classifier, the MLP. First, we discarded the three symbolic features (with values such as udp, private and SF) out of the 41 features of the dataset. The resultant features are:

$F' = \{f_1, f_2, \ldots, f_m\}$ (2)

where m = 38.
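The pre-processing step described above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the helper name `drop_symbolic` is hypothetical, the column positions 1, 2 and 3 (protocol type, service, flag) are assumed from the commonly published KDD Cup 99 field order, and the sample record is made up.

```python
# Assumed 0-based positions of the symbolic fields (protocol_type, service, flag)
SYMBOLIC_COLUMNS = (1, 2, 3)

def drop_symbolic(record, symbolic=SYMBOLIC_COLUMNS):
    """Return the non-symbolic features of one connection record as floats."""
    return [float(v) for i, v in enumerate(record) if i not in symbolic]

# Illustrative 41-feature record: duration, three symbolic values, 37 numeric fields
raw = ["0", "udp", "private", "SF"] + ["0"] * 37
numeric = drop_symbolic(raw)
print(len(raw), len(numeric))
```

Dropping the three symbolic columns leaves 38 numeric values per connection, which matches m = 38 in the text.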

Feature transformation and organization
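We applied PCA to the 38 features of the dataset. The PCA flow chart is shown in Figure 2.
Step 1: Find the mean: $\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$
Step 2: Calculate the deviation from the mean (subtract the mean): $\Phi_i = x_i - \bar{x}$, where i = 1, 2, …, M.
Step 3: Compute the covariance matrix: $C = \frac{1}{M}\sum_{i=1}^{M} \Phi_i \Phi_i^T$
Step 4: Compute the eigenvalues of C: $\lambda_1, \lambda_2, \ldots, \lambda_m$
Step 5: Compute the eigenvectors of C: $u_1, u_2, \ldots, u_m$. Since C is symmetric, the eigenvectors form a basis (that is, any vector x, or actually $x - \bar{x}$, can be written as a linear combination of the eigenvectors): $x - \bar{x} = \sum_{j=1}^{m} b_j u_j$
Step 6: Arrange the eigenvalues and eigenvectors in descending order.
Step 7: The dimensionality reduction step (based on the largest eigenvalues) is skipped, as the selection of principal components is done using the GA. PCA is mostly used for data reduction, but here we used it to transform the features into the principal components feature space and then organized the principal components in descending order:

$PC = \{pc_1, pc_2, \ldots, pc_l\}$ (3)

where l = 38.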
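The PCA steps above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it relies on NumPy, the function name `pca_transform` is hypothetical, and the input is synthetic random data standing in for the 38 numeric kddcup99 features. In line with Step 7, no principal component is discarded.

```python
import numpy as np

def pca_transform(X):
    """Project X (M samples x m features) onto all principal components,
    ordered by descending eigenvalue. No components are discarded."""
    mean = X.mean(axis=0)                    # Step 1: mean of each feature
    Phi = X - mean                           # Step 2: deviation from the mean
    C = (Phi.T @ Phi) / X.shape[0]           # covariance matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # Step 6: descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Phi @ eigvecs, eigvals, eigvecs

# Synthetic stand-in for the pre-processed dataset (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 38))
Z, eigvals, eigvecs = pca_transform(X)
print(Z.shape)
```

Each column of `Z` is one principal component, and its variance equals the corresponding eigenvalue, so the columns are already organized from most to least variance for the subsequent search.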

Feature subset selection
We applied a genetic algorithm (GA) for optimal feature subset selection from the principal components search space.

GA Algorithm
Step 1. (Start) Generate a random population of n chromosomes.
Step 2. (Fitness) Evaluate the fitness f (x) of each chromosome x in the population.

Step 3. (Replace) Use the newly generated population for a further run of the algorithm.
Step 4. (Test) If the end condition is satisfied, stop and return the best solution in the current population.
Step 5. (Loop) Go to Step 2.
The working flow of the GA is shown in Figure 3. We used the following fitness function, which combines two terms:

$\text{fitness} = 10^4 \times \text{Accuracy} + \text{Zeros}$

where Accuracy is the classification accuracy on a validation set for a particular subset of principal components and Zeros is the number of principal components not selected.
The accuracy ranges roughly from 0.50 to 0.99; thus, the first term takes values from 5000 to 9900. The number of zeros ranges from 0 to L − 1, where L is the length of the chromosome; thus, the second term takes values from 0 to 37 (L = 38).
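As a small worked example of the two terms, the sketch below assumes the fitness has the form 10^4 × Accuracy + Zeros, which is the form implied by the ranges stated above (5000 to 9900 for the first term, 0 to 37 for the second). The function name `fitness`, the binary chromosome encoding (1 = component selected) and the accuracy values are illustrative assumptions.

```python
def fitness(accuracy, chromosome):
    """Fitness of a binary chromosome (1 = principal component selected).

    Assumed form: 10^4 * accuracy + number of unselected components.
    """
    zeros = chromosome.count(0)
    return 1e4 * accuracy + zeros

# Two hypothetical subsets with the same validation accuracy
subset_12 = [1] * 12 + [0] * 26   # 12 components selected, 26 zeros
subset_27 = [1] * 27 + [0] * 11   # 27 components selected, 11 zeros
print(fitness(0.99, subset_12), fitness(0.99, subset_27))
```

With equal accuracy, the smaller subset scores higher (9900 + 26 against 9900 + 11). A difference of 0.01 in accuracy is worth 100 fitness units, more than the largest possible zeros term of 37, so accuracy dominates and the zeros term mainly breaks ties in favor of fewer components.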

Classification architecture
A multilayer perceptron (MLP) is a feedforward neural network that maps sets of input data onto a set of appropriate outputs. Here, we used an MLP architecture that consists of three layers: input, hidden and output. In this architecture, the hidden layer and the output layer consist of neurons (processing elements), and each neuron has a nonlinear activation function. The layers are fully connected from one layer to the next. The MLP is a modification of the standard linear perceptron and can discriminate data that is not linearly separable. The architecture we used here is shown in Figure 4. The overall performance of the MLP with 12, 20 and 27 features is shown in Table 4.
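The forward pass of such a three-layer, fully connected network can be sketched in plain Python. Everything specific in this sketch is an assumption made for illustration: the hidden layer size (8), the tanh and sigmoid activations, the two output neurons (normal and intrusive), the random weight initialization and the function names. The paper specifies only the three-layer structure and nonlinear activations.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Weights and biases for one fully connected layer (small random weights)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def layer_forward(x, weights, biases, activation):
    """Each neuron applies a nonlinear activation to its weighted input sum."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, hidden, output):
    """Input layer -> hidden layer (tanh) -> output layer (sigmoid)."""
    h = layer_forward(x, hidden[0], hidden[1], math.tanh)
    return layer_forward(h, output[0], output[1], sigmoid)

rng = random.Random(0)
hidden = init_layer(12, 8, rng)   # 12 inputs (selected components), 8 hidden neurons
output = init_layer(8, 2, rng)    # 2 outputs: normal, intrusive
scores = mlp_forward([0.1] * 12, hidden, output)
print(len(scores))
```

The input width would change with the number of selected principal components (for example 12, 20 or 27), while the rest of the structure stays the same.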

Training and testing of the system
The aim of training is to adjust the network weights on the basis of the difference between the output produced by the system and the desired output. The training dataset consists of five thousand (5000) labelled connections (network packets labelled as normal or intrusive) that are randomly selected from 20,000 connections. Further, we divided the training dataset (5000) into three parts: (i) a cross-validation dataset (1000), (ii) a test dataset (1500) and (iii) a training dataset (2500).
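The split described above can be sketched with index sets. The function name `split_dataset`, the fixed seed and the order in which the three parts are cut from the random sample are assumptions; only the sizes come from the text.

```python
import random

def split_dataset(n_total=20000, n_sample=5000, n_cv=1000, n_test=1500, seed=0):
    """Randomly pick n_sample connection indices and cut them into three parts."""
    rng = random.Random(seed)
    sample = rng.sample(range(n_total), n_sample)   # random, without repetition
    cv = sample[:n_cv]                              # cross-validation part
    test = sample[n_cv:n_cv + n_test]               # test part
    train = sample[n_cv + n_test:]                  # remaining part for training
    return train, cv, test

train_idx, cv_idx, test_idx = split_dataset()
print(len(train_idx), len(cv_idx), len(test_idx))
```

Because the indices are sampled without repetition, the three parts are disjoint and together cover the 5000 selected connections.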
We used a confusion matrix to verify the training. When training is completed, the weights of the system are frozen and the performance of the system is evaluated. Testing the system involves two steps: (i) a verification step and (ii) a generalization step (Ahmad et al., 2011c). In the verification step, the system is tested against the data used in training. The aim of the verification step is to test how well the trained system has learned the training patterns in the training dataset. In the generalization step, testing is conducted with data that was not used in training. The aim of the generalization step is to measure the generalization ability of the trained network. We used a dataset of fifteen thousand (15,000) connections as a production dataset. We also tested our system performance on the total dataset (20,000), which consists of both the training dataset and the production dataset.
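A two-class confusion matrix of the kind used to verify training can be computed as in the following sketch. The label names, the helper names and the five sample predictions are made up for illustration and are not results from the experiments.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels=("normal", "intrusive")):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for five connections
y_true = ["normal", "normal", "intrusive", "intrusive", "intrusive"]
y_pred = ["normal", "intrusive", "intrusive", "intrusive", "normal"]
cm = confusion_matrix(y_true, y_pred)
print(cm, accuracy(cm))
```

The same function can be applied to the training data (verification step) and to unseen data (generalization step) to compare how the trained network behaves on each.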
GA new population sub-steps (performed after the fitness evaluation in Step 2 of the GA algorithm):
a. (New population) Create a new population by repeating the following steps:
b. (Selection) Select two parent chromosomes from the population.
c. (Crossover) With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring is an exact copy of the parents.
d. (Mutation) With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
e. (Accepting) Place the new offspring in the new population.
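The GA steps (start, fitness, new population, replace, test, loop) can be put together in a short sketch. Many details here are assumptions that the text does not fix: fitness-proportional selection, single-point crossover producing one child per parent pair, the probabilities, the population size, a fixed number of generations as the end condition, and keeping track of the best chromosome seen so far. The function `toy_fitness` is a made-up stand-in for the validation accuracy of a trained MLP and pretends that only the first 10 components are useful.

```python
import random

def run_ga(evaluate, length=38, pop_size=20, generations=30,
           p_crossover=0.8, p_mutation=0.02, seed=0):
    rng = random.Random(seed)
    # Start: random population of binary chromosomes
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=evaluate)
    for _ in range(generations):                      # Test: fixed generation budget
        scores = [evaluate(c) for c in population]    # Fitness
        new_population = []
        while len(new_population) < pop_size:         # New population
            # Selection: two parents, chosen with probability proportional to fitness
            p1, p2 = rng.choices(population, weights=scores, k=2)
            # Crossover: single cut point, otherwise copy the first parent
            if rng.random() < p_crossover:
                cut = rng.randrange(1, length)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each locus with a small probability
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            new_population.append(child)              # Accepting
        population = new_population                   # Replace
        best = max(population + [best], key=evaluate)
    return best

def toy_fitness(chromosome):
    # Stand-in for validation accuracy (assumption): 0.50 to 0.99
    accuracy = 0.5 + 0.049 * sum(chromosome[:10])
    return 1e4 * accuracy + chromosome.count(0)

best = run_ga(toy_fitness)
print(sum(best), "components selected")
```

In the actual framework, `evaluate` would train or score the MLP on the principal components marked with 1 and return the fitness computed from the validation accuracy.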