Prediction of bacterial toxins by an improved feature extraction and IB1 algorithm fusion

Correctly identifying bacterial toxins is of great benefit to cell biology and medical research. To improve predictive accuracy, this paper proposes a new method for predicting bacterial toxins that is based on the concept of pseudo amino acid composition and combines approximate entropy with the IB1 algorithm. The improved method takes into account the amino acid composition, the side-chain mass of the amino acids, and the hydrophilic and hydrophobic characteristics of a protein sequence. The total prediction accuracy of our method was 97.52% for discriminating bacterial toxins from non-toxins and 97.33% for discriminating endotoxins from exotoxins, both higher than those of previous methods.


INTRODUCTION
The whole world can be considered to consist of various systems, such as economic and biological systems (Backlund, 2000). All systems in physical reality, from cosmological phenomena to immunology and the behavior of subatomic particles, seem to be characterized by the presence of various patterns (Steen, 1988) in their structure (Pullan and Bhadeshia, 2000) and behavior (Dusenbery, 2009). Physical reality is not as complicated as is mathematically possible; on the contrary, the diversity of systems is restricted to a very small number of the mathematically possible variations, and this holds in biology too. Various research methods and mathematical instruments, such as "thinking machines" (Turing, 1950), have been developed for understanding this "reduced complexity".
Pattern recognition (Gibson, 2003) is a major problem today; for example, converting a scanned image of text directly into editable text is now commonplace. Other problems, from the screening of human biometric characteristics in heavily frequented places for security purposes to the recognition of biological molecules such as proteins, DNA and RNA, are developing rapidly, driven by the otherwise unmanageable quantity of data produced by measurement instruments (such as DNA recognition chips). Physicians today use new and efficient methods for patients, such as the selection and determination of tumor markers directly on the mRNA. The otherwise unmanageable quantity of data from measurement instruments equipped with molecule-recognition chips is used for the direct examination of biopsy material from tumors and metastases (or tumor cell lines developed from them), via software based on mathematical algorithms (Ochs et al., 2009).
Thus, it is possible, for example, to make quick, individualized (a trend in future medicine) and repeated changes to the drug combination scheme in cancer therapy, weekly if necessary, always using the most efficient combination and thereby preventing the tumor from developing drug resistance. The ultimate aim is greater therapeutic success, or longer survival of the patient, than with the other, empirically proposed combinations. Pattern recognition methods draw on important mathematical tools, such as Bayesian algorithms (Howson and Urbach, 1989), Support Vector Machines (Cortes and Vapnik, 1995), the IB1 algorithm (Aha et al., 1991), and others, such as Artificial Learning Systems (Akerkar and Sajja, 2009), Artificial Intelligence and Neural Networks (Russell and Norvig, 2003). In turn, developments in pattern recognition positively influence these areas, which are also important for applications outside pattern recognition (for example, neurobiology and physics).
Bacterial toxins are a major cause of disease during infection (Böhnel and Gessler, 2005) and can be classified into exotoxins and endotoxins. These two types of toxins have different roles and mechanisms in the body, and correctly identifying bacterial toxins is of great benefit to mankind. In fact, some of these powerful disease-causing toxins have been exploited to further basic knowledge of cell biology or for medical purposes. For example, cholera toxin and the related labile toxin of E. coli, as well as B. pertussis toxin, have been used as biological tools to understand the mechanism of adenylate cyclase activation (Harnett, 1994; Bokoch et al., 1983; Neer, 1995), and as strong mucosal adjuvants in experimental models (Bagley et al., 2002). Although bacterial toxins can be identified by experimental methods, doing so is costly and time-consuming. How to identify bacterial toxins economically, rapidly and accurately has therefore become a very important problem.
Recently, some research has been carried out in this field and has achieved encouraging results using support vector machines (SVM) and dipeptide composition. Saha and Raghava (2007) achieved accuracies of 96.07 and 92.50% for discriminating bacterial toxins from non-toxins, and accuracies of 95.71 and 92.86% for discriminating endotoxins from exotoxins. Yang and Li (2009) achieved a higher MCC on the same dataset by using increment of diversity and support vector machines. Encouraged by their research, in this study we attempted to develop a new method to predict bacterial toxins and their class (exotoxin or endotoxin).

MATERIALS AND METHODS
The software used to process the data was MATLAB (Gilat, 2004), a high-level language and interactive environment that enables computational tasks to be performed faster than with traditional programming languages such as C, C++ and FORTRAN. It has been widely used in application areas such as computational biology and pattern recognition. All calculations in this paper were programmed in MATLAB 2007.

Dataset
The data used in this paper were collected from the Swiss-Prot database (Boeckmann et al., 2003) and from the dataset used by Saha and Raghava (2007), which we downloaded freely from http://www.imtech.res.in/raghava/btxpred/supplementary.html. Using the cd-hit software (Li, 2006) to remove sequences with more than 90% sequence identity and to delete the sequences whose length is 100, we obtained two datasets. One contained 141 bacterial toxins and 303 non-toxins, while the other contained 73 exotoxins and 77 endotoxins.

Schemes of sequence feature
The pseudo amino acid composition of a sequence carries a great deal of information about the sequence, including its amino acid composition and its sequence-order correlation (Chou, 2001). In this paper, we therefore constructed the feature vector of a protein sequence using the concept of Chou's pseudo amino acid composition.
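As an illustration of the amino acid composition part of such a feature vector, the following minimal Python sketch computes the frequencies of the 20 native amino acids in a sequence. The function name `amino_acid_composition` and the ordering by one-letter code are assumptions made for this example; the calculations in this paper were programmed in MATLAB.

```python
from collections import Counter

# Assumption: the 20 native amino acids ordered alphabetically by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the frequency of each of the 20 amino acids in `sequence`.

    Characters outside the 20 standard one-letter codes are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

For a sequence such as "AACD", the sketch returns 0.5 for A, 0.25 each for C and D, and 0 for the remaining residues.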
Suppose a protein chain X has a length of l amino acid residues. We denote the protein sequence as a feature vector built from three groups of components: the frequencies $f_t$ of the 20 amino acids, the approximate entropies $\mathrm{ApEn}_i$ and the sequence correlation factors $\theta_j$. Here $\varpi_1$ and $\varpi_2$ are the weight factors for the sequence order effect; $\mathrm{ApEn}_i$ is the approximate entropy of the protein sequence (Pincus, 1991), which describes the complexity of the sequence; and $\theta_j$ is the j-tier sequence correlation factor, which reflects the sequence order correlation between all the jth most contiguous residues. $\mathrm{ApEn}_i$ is computed from the quantities $C_i^m(r)$, which count the similar protein subsequences of length m that begin at component i within X.
N is the number of components of the given X, while r and m are the filter parameter and the mode dimension, respectively. In computing, we select m = 2, 3, 4 and r = 0.1, 0.15, 0.2, 0.25, and thereby obtain 12 approximate entropies, that is, s = 12. $\theta_j$ is computed from $H_1(i)$, $H_2(i)$ and $H_3(i)$, where $H_1^0(i)$ and $H_2^0(i)$ are the original hydrophobicity and hydrophilicity values of the ith amino acid (Argos et al., 1982; Hopp and Woods, 1981), respectively, and $H_3^0(i)$ is the side-chain mass of the ith amino acid, which can be obtained easily from any biochemistry textbook. We numbered the 20 native amino acids from 1 to 20 according to their alphabetical order.
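The two sequence-order quantities described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB code used in this paper: `approximate_entropy` follows the standard definition of Pincus (1991) with the Chebyshev distance and an absolute tolerance r (rescaling r by the standard deviation of the series is a common convention left to the caller), and `sequence_correlation` assumes the mean squared property difference used in Chou's (2001) pseudo amino acid composition. All function names are hypothetical, and the property tables must be supplied by the caller.

```python
import math


def _phi(series, m, r):
    """Mean log-fraction of length-m windows lying within tolerance r of each window."""
    n = len(series) - m + 1
    windows = [series[i:i + m] for i in range(n)]
    total = 0.0
    for w in windows:
        # Chebyshev distance; the self-match is counted, so the fraction is never zero.
        matches = sum(
            1 for v in windows
            if max(abs(a - b) for a, b in zip(w, v)) <= r
        )
        total += math.log(matches / n)
    return total / n


def approximate_entropy(series, m, r):
    """ApEn(m, r) = phi_m(r) - phi_{m+1}(r) for a numeric series."""
    if len(series) < m + 1:
        raise ValueError("series is too short for the chosen m")
    return _phi(series, m, r) - _phi(series, m + 1, r)


def apen_features(series, ms=(2, 3, 4), rs=(0.1, 0.15, 0.2, 0.25)):
    """Return the 12 entropies, ordered by m and then by r (ordering is an assumption)."""
    return [approximate_entropy(series, m, r) for m in ms for r in rs]


def sequence_correlation(sequence, properties, j):
    """j-tier correlation factor: mean squared property difference at lag j.

    `properties` is a list of dicts mapping a residue letter to a
    (standardized) property value, supplied by the caller.
    """
    n = len(sequence) - j
    if n <= 0:
        raise ValueError("lag j must be smaller than the sequence length")
    total = 0.0
    for i in range(n):
        a, b = sequence[i], sequence[i + j]
        total += sum((p[b] - p[a]) ** 2 for p in properties) / len(properties)
    return total / n
```

In use, the protein would first be converted to a numeric series (for example by hydrophobicity value) before calling `apen_features`, while `sequence_correlation` is evaluated once per tier j.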

IB1 algorithm
The IB1 algorithm is an incremental, supervised-learning classification algorithm (Aha, 1990). It typically proceeds in three steps: normalization, similarity calculation and prediction. Given numeric protein feature vectors, we first normalize each attribute by min-max scaling, $\mathrm{norm}(x_a) = (x_a - \min_a)/(\max_a - \min_a)$, where $\min_a$ and $\max_a$ are the lowest and highest values of attribute a, respectively, and $x_a$ is the value of attribute a for sequence x.
Then, the similarity between a new sequence and every known (training) sequence is calculated with the similarity function, which describes how similar the new sequence is to each of the known sequences. The similarity function is built from the per-attribute term $f(x_i, y_i) = (x_i - y_i)^2$, where x and y are two protein sequences. When y is the sequence closest to x under this function, we assign x to the same class as y.
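The three steps (normalization, similarity and prediction) can be sketched in Python as below. This is a minimal illustration, assuming min-max normalization and a Euclidean distance built from the squared attribute differences, with the nearest sequence determining the class; the function names are hypothetical and the sketch is not the MATLAB implementation used in this paper.

```python
import math


def min_max_normalize(rows):
    """Scale every attribute (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant attribute carries no information and is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def distance(x, y):
    """Euclidean distance: square root of the summed squared attribute differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def ib1_predict(train_x, train_y, query):
    """Return the label of the training vector nearest to `query`."""
    nearest = min(range(len(train_x)), key=lambda i: distance(train_x[i], query))
    return train_y[nearest]
```

Note that the query is normalized together with the training vectors here, so that all vectors share the same attribute ranges.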

Evaluation of the performance
To allow easy comparison with other methods, we also use sensitivity (Sn), specificity (Sp), Matthews correlation coefficient (MCC) and overall prediction accuracy (Ac) as indicators (Baldi et al., 2000; Carugo, 2007) of the correct prediction rate and reliability of our method. Here, M is the total number of protein sequences, TP denotes the number of correctly recognized positives, FN the number of positives recognized as negatives, FP the number of negatives recognized as positives, and TN the number of correctly recognized negatives.
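A minimal Python sketch of these four indicators is given below. It assumes the commonly used definitions Sn = TP/(TP+FN), Sp = TN/(TN+FP), Ac = (TP+TN)/M and the standard MCC formula; some papers in this field instead define Sp as TP/(TP+FP), so the Sp line in particular should be read as an assumption. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Return Sn, Sp, MCC and Ac computed from the four confusion counts."""
    m = tp + fn + fp + tn                     # total number of sequences
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity (assumed definition)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    ac = (tp + tn) / m                        # overall accuracy
    return {"Sn": sn, "Sp": sp, "MCC": mcc, "Ac": ac}
```

A perfect classifier gives 1.0 for all four indicators, while MCC falls toward 0 as predictions approach chance level.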
To illustrate our method, consider the protein sequence 2ABA_DROME: MGRWGRQSPVLEPPDPQ……AATNNLFIFQDKF. We first describe how to extract the features of this protein.
Step 1. Calculate $f_t$, the frequencies of the 20 amino acids in the protein sequence:
(0.0501, 0.0581, 0.0681, 0.0160, 0.0701, 0.0481, 0.0561, 0.0180, 0.0842, 0.0601, 0.0220, 0.0301, 0.0401, 0.0240, 0.0681, 0.0541, 0.0741, 0.0641, 0.0681, 0.0261)
Step 2. Calculate $\mathrm{ApEn}_i$, the approximate entropies of the protein sequence. First, represent the protein sequence as a time series X by replacing every amino acid with its hydrophobicity value; then, count the similar subsequences of length m that begin at component i within X, which gives $C_i^m(r)$. Finally, calculate $\mathrm{ApEn}_i$ following the formula in "Schemes of sequence feature". The approximate entropies form the following vector:
(1.3942, 1.4781, 1.4530, 1.4349, 0.6445, 0.8158, 0.9543, 0.9704, 0.1157, 0.1966, 0.3658, 0.4216)
Step 3. Calculate $\theta_j$, the j-tier sequence correlation factor. First calculate $H_k(i)$, k = 1, 2, 3 (the values of hydrophobicity, hydrophilicity and side-chain mass of each amino acid); then obtain $\theta_j$ through the correlation function. The sequence correlation factors form the following vector:
(0.0040, 0.0043, 0.0040, 0.0043, 0.0040, 0.0041, 0.0040, 0.0042, 0.0037, 0.0040, 0.0040, 0.0038, 0.0040, 0.0041, 0.0039, 0.0040, 0.0040, 0.0040, 0.0039, 0.0037)
Step 4. Merge the three vectors into a single vector according to the formula in "Schemes of sequence feature", and standardize it. In this research, $\varpi_1$ and $\varpi_2$ were varied over a certain range; the values 0.022 and 0.34, respectively, gave the best prediction result. This yields the feature vector of the protein sequence.
The second problem is how to predict the class of a sequence. For the dataset, we first extract the features of all protein sequences using the steps above, and then classify them with the IB1 method. Specifically, we select one sequence as the tested object and use all the others as the training set, and then use the IB1 algorithm to find the sequence with the minimum value of the similarity function relative to the tested object. We assign the tested object to the same type as that sequence. Each sequence in the dataset is predicted in turn in this way, which gives the values of TP, FN, FP and TN, from which Sn, Sp and MCC are calculated.
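The leave-one-out procedure described above can be sketched in Python as follows. This is a minimal illustration, assuming that the similarity function behaves as a Euclidean distance (so that the minimum value identifies the nearest sequence) and that the feature vectors have already been extracted and normalized; the function names are hypothetical.

```python
import math


def nearest_label(train_x, train_y, query):
    """Label of the training vector with the smallest Euclidean distance to `query`."""
    def dist(x):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query)))
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i]))
    return train_y[best]


def jackknife_counts(features, labels, positive):
    """Predict each sequence from all the others and count TP, FN, FP and TN."""
    tp = fn = fp = tn = 0
    for i, (x, y) in enumerate(zip(features, labels)):
        # Hold out sequence i; every other sequence forms the training set.
        rest_x = features[:i] + features[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        predicted = nearest_label(rest_x, rest_y, x)
        if y == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn
```

The four counts returned by `jackknife_counts` are the inputs needed to calculate Sn, Sp, MCC and Ac.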

RESULTS AND DISCUSSION
For consistency of comparison, the jackknife test was used on the dataset in this paper. The performance of our method in discriminating bacterial toxins from non-toxins is shown in Table 1, together with the performances of previous methods. Our method, which fuses the improved feature extraction with the IB1 algorithm, predicted toxins with a total accuracy of 97.52% and an MCC of 0.9437, both higher than the previous results (Table 1). The method was also used to predict whether a bacterial toxin was an exotoxin or an endotoxin. Its total accuracy and MCC reached 97.33% and 0.9469, respectively, which were also higher than any other existing results (Table 2).
To further analyze the effectiveness of the algorithm and of the feature extraction proposed in this paper, we used the IB1 algorithm to predict bacterial toxins based on amino acid composition alone; the results are listed in Table 3. Table 3 shows the difference in performance between the two feature extraction methods. The improved feature extraction proposed in this paper was better than amino acid composition alone, which shows that our feature extraction method reflects the characteristics of bacterial toxins more effectively and is more suitable for predicting them. Comparing Tables 1 and 2, we can see that with amino acid composition alone, the IB1 algorithm performs much better than SVM for bacterial toxins and non-toxins. Although the IB1 algorithm was poor at discriminating exotoxins from endotoxins with amino acid composition alone, it performed very well when combined with the improved feature extraction. This shows that the combination of the IB1 algorithm and the improved feature extraction method proposed in this paper could significantly improve the prediction accuracy for bacterial toxins.

Table 1. Performances of various methods in the prediction of bacterial toxins.

Table 3. Comparison of two kinds of feature extraction methods for IB1 algorithm.