Prediction of presynaptic and postsynaptic neurotoxins by bilayer support vector machine with multi-features

Because it benefits biological research and drug design, the prediction of neurotoxins has gradually become a necessary and popular task in recent years. In this paper, based on multi-feature extraction from primary sequences and the support vector machine, a novel multi-classifier system named bi-layer support vector machine was proposed to predict presynaptic and postsynaptic neurotoxins. It obtained prediction accuracies of 98.5% for presynaptic neurotoxins and 99.18% for postsynaptic neurotoxins, with a Matthew's correlation coefficient of 0.9767. These results suggest that the current method might play a complementary role to other existing methods for predicting presynaptic and postsynaptic neurotoxins.


INTRODUCTION
There are nearly 3000 species of spiders and 2340 species of snakes living in the world, and hundreds of them are poisonous. Studies have found that the main components of the venoms of these animals are proteins with various molecular masses that consist mainly of enzymes and toxins (Dolimbek et al., 1998). Although the roles of most of these enzymes are still unclear, the toxins are the active principles of the venoms, and their activity is expressed in their binding to elements of the presynaptic or postsynaptic membranes (Afifiyan et al., 1998). Presynaptic neurotoxins act on nerve endings; they inhibit neurotransmission by blocking the release of acetylcholine or damaging the cell membrane. Postsynaptic neurotoxins bind specifically to the nicotinic acetylcholine receptor, which prevents nerve transmission and leads to death from asphyxiation. Nearly 100 postsynaptic neurotoxins have been found. Some of these neurotoxins are very important in biological research and drug design; for example, presynaptic neurotoxins have been used for the treatment of migraine headache and cerebral palsy. Information about these neurotoxins helps to clarify their function and to extend their applications. Although such information can be obtained experimentally, computer-aided prediction is less time-consuming and less costly, so computer-aided prediction of presynaptic and postsynaptic neurotoxins would be very helpful in obtaining this information.
In fact, computer-aided methods have produced many encouraging results in the prediction of various toxins. For example, Saha and Raghava (2007) achieved accuracies of 96.07% and 92.50% for predicting bacterial toxins and non-toxins by using support vector machines (SVM). Song (2011) enhanced the predictive accuracy by means of an improved feature extraction and IB1 algorithm fusion method. Using the pseudo amino acid composition, Lin and Li (2007) provided a new algorithm, the increment of diversity combined with a modified Mahalanobis discriminant, to predict five conotoxin superfamilies. By fusing different kinds of sequential features with modified one-versus-rest SVMs, Fan et al. (2011) developed a novel approach called PredCSF for predicting the conotoxin superfamily and obtained an overall accuracy of 90.65%. Zaki et al. (2011) proposed an SVM-Freescore method, which improved sensitivity and specificity by approximately 5.864% and 3.76%, respectively.
For presynaptic and postsynaptic neurotoxins, Yang and Li (2009) used the increment of diversity algorithm and obtained encouraging prediction accuracies of 90.23% for presynaptic neurotoxins and 89.40% for postsynaptic neurotoxins; their Matthew's correlation coefficient was 0.7963.
In general, the success of a computer-aided method is decided by two factors: the choice of classifier and the method used to extract features from the protein sequence. There are many studies in these fields. As an effective feature extraction method, the pseudo amino acid composition (PseAA) is often used to represent a protein sequence with a discrete model without completely losing its sequence-order information (Chou, 2001). Depending on the components used, there is a variety of pseudo amino acid compositions (Chou, 2005; Chou and Cai, 2002; Yang and Li, 2009), which have been used to enhance the prediction quality of protein attributes.
The support vector machine (SVM) is an effective tool for classification and prediction, and it has been used in various fields related to protein function prediction (Huang and Shi, 2005; Zhang et al., 2006; Zhou et al., 2008; Shi et al., 2008; Lin et al., 2009). However, methods that use only a single classifier have some limitations in prediction (Chou and Shen, 2006a, c). Recently, multi-classifier systems have been proposed to enhance the prediction quality, and they have obtained satisfactory results (Park and Kanehisa, 2003; Chou and Cai, 2004; Yu et al., 2004; Chou and Shen, 2006a, b, c, 2007; Zhou et al., 2008). Combining the advantages of SVM with those of a multi-classifier system, that is, constructing multiple SVM classifiers to enhance the prediction quality, is a fresh idea for neurotoxin prediction.
In this study, a multi-classifier named "bi-layer SVM" was built to further improve the prediction accuracy of presynaptic and postsynaptic neurotoxins, and a relatively good predictive result was obtained.

Dataset
The presynaptic and postsynaptic neurotoxins used in this study were downloaded from the database described by Boeckmann et al. (2003). In order to ensure enough protein sequences in the experimental data, we selected those presynaptic and postsynaptic neurotoxins with no more than 90% identity. Finally, we obtained 132 presynaptic neurotoxin sequences and 241 postsynaptic neurotoxin sequences.

Feature extraction
In this paper, besides the basic amino acid composition, two kinds of sequence features were used to construct the pseudo amino acid composition of neurotoxin sequences: the approximate entropy of protein sequences and the dipeptide composition of protein sequences.

The approximate entropy representation of protein sequences
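The approximate entropy is generally a measure of system complexity (Pincus, 1991); it has been widely used in physiological signal analysis (Richman and Moorman, 2001) and protein prediction (Song, 2011). The algorithm for computing the approximate entropy of a protein sequence can be briefly described as follows. First, represent the protein sequence as a time series $X_N$ by replacing every amino acid with its hydrophobicity value. Suppose the sequence $X_N$ consists of $N$ components, namely $X_N = (x_1, x_2, \ldots, x_N)$. To compute the approximate entropy $ApEn(X_N, m, r)$ of the sequence, we choose values for two input parameters: the pattern length $m$ and the similarity criterion $r$. Denote a subsequence with $m$ components, beginning at component $i$ within $X_N$, by the vector $P_m(i) = (x_i, x_{i+1}, \ldots, x_{i+m-1})$. Two subsequences $P_m(i)$ and $P_m(j)$ are considered similar if the difference between every pair of corresponding components is less than $r$, that is, if

$$|x_{i+k} - x_{j+k}| < r, \quad 0 \le k < m.$$

Let $P_m$ be the set of all subsequences of length $m$ within $X_N$, and let $n_{im}(r)$ be the number of subsequences in $P_m$ that are similar to $P_m(i)$ for the given similarity criterion $r$; then

$$C_{im}(r) = \frac{n_{im}(r)}{N - m + 1},$$

and $C_m(r)$ denotes the mean of $C_{im}(r)$ over all $i$. For the given similarity criterion $r$ and the length $m$, we define the approximate entropy of $X_N$ as

$$ApEn(X_N, m, r) = \ln\left[\frac{C_m(r)}{C_{m+1}(r)}\right].$$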
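The computation described above can be illustrated with a minimal Python sketch. It assumes the ratio form of the approximate entropy, ln[C_m(r) / C_{m+1}(r)], with self-matches counted; the function names and the `scale` argument (a residue-to-hydrophobicity mapping) are hypothetical, and neither the hydrophobicity scale nor the values of m and r are specified in the text.

```python
import math


def sequence_to_series(sequence, scale):
    """Replace each residue by its value in `scale` (a dict supplied by the caller)."""
    return [scale[residue] for residue in sequence]


def approximate_entropy(x, m, r):
    """Return ln(C_m(r) / C_{m+1}(r)) for the numeric series x.

    Two patterns are similar when every pair of corresponding
    components differs by less than r. Self-matches are counted,
    so both averages are strictly positive.
    """
    def mean_similar_fraction(length):
        n_patterns = len(x) - length + 1
        if n_patterns <= 0:
            raise ValueError("series too short for the requested pattern length")
        patterns = [x[i:i + length] for i in range(n_patterns)]
        total = 0.0
        for p in patterns:
            similar = sum(
                1 for q in patterns
                if all(abs(a - b) < r for a, b in zip(p, q))
            )
            total += similar / n_patterns  # C_im(r) for this pattern
        return total / n_patterns          # mean over all patterns

    return math.log(mean_similar_fraction(m) / mean_similar_fraction(m + 1))
```

A perfectly regular series gives a value at or near zero, while less predictable series give larger values. The pairwise comparison is quadratic in the sequence length, which is acceptable for sequences of typical toxin length.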


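Given a set of observations $(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, where each observation is an m-dimensional real vector, the aim of k-means clustering is to partition these n observations into k clusters $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{\mathbf{x}_j \in S_i} \|\mathbf{x}_j - u_i\|^2,$$

where $u_i$ is the mean of the points in $S_i$.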
In simple terms, the steps of k-means clustering can be described as follows: first, randomly select k observations as the initial centers of the clusters; then assign each of the other observations to the center at the closest distance. Afterwards, compute a new center for each cluster by averaging the feature vectors of all observations assigned to it, and repeat this process until the centers no longer change.
The dipeptide components are important parameters for protein structure and function, and they have been widely used in protein bioinformatics (Lin et al., 2007, 2008; Lin and Ding, 2011; Lin and Li, 2011). In this study, we first extracted the dipeptide features of a neurotoxin sequence according to the hydrophilicity values of the corresponding residues and obtained 400 dipeptide vectors. To simplify the calculation, we grouped these 400 vectors into k clusters by k-means clustering, with the Euclidean distance as the distance measure. The value of k was selected with the help of a histogram: we first obtained the histogram of these 400 vectors by the soft linkage, estimated the range of k from the histogram, and then, among all candidate values, selected the k that gave the best prediction performance; here $k = 16$.
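The steps above can be sketched in plain Python as follows. This is a minimal illustration with the squared Euclidean distance, not the implementation used in this study; the function name, the `seed` and `max_iter` arguments, and the handling of empty clusters are assumptions.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length numeric tuples) into k groups.

    Returns (centers, labels). Initial centers are k randomly chosen
    observations; iteration stops when the centers no longer change
    or after max_iter rounds.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        labels = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = dists.index(min(dists))
            clusters[j].append(p)
            labels.append(j)
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(p[d] for p in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # assumption: keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

Because the initial centers are random, different seeds can lead to different partitions; in practice the procedure is usually repeated from several starting points and the partition with the smallest within-cluster sum of squares is kept.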
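One possible reading of this feature can be sketched as follows, under loudly stated assumptions: each of the 400 dipeptides is represented by the pair of property values of its two residues, the dipeptides are then grouped into k clusters, and a sequence is described by the frequency of each cluster among its overlapping dipeptides. The function names, the `scale` argument and the `dipeptide_to_cluster` mapping are hypothetical; the text does not specify the hydrophilicity scale or the exact vector representation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_vectors(scale):
    """Map each of the 400 dipeptides to the pair of residue values in `scale`.

    Assumption: a dipeptide is represented by the two property values
    of its residues; these vectors are what would be clustered.
    """
    return {a + b: (scale[a], scale[b]) for a, b in product(AMINO_ACIDS, repeat=2)}


def clustered_dipeptide_composition(sequence, dipeptide_to_cluster, k):
    """Return the k-dimensional frequency of dipeptide clusters in `sequence`.

    `dipeptide_to_cluster` maps a two-letter string to a cluster index
    in range(k); dipeptides missing from the mapping are skipped.
    """
    counts = [0] * k
    total = 0
    for i in range(len(sequence) - 1):
        cluster = dipeptide_to_cluster.get(sequence[i:i + 2])
        if cluster is None:
            continue  # e.g. non-standard residues
        counts[cluster] += 1
        total += 1
    return [c / total for c in counts] if total else [0.0] * k
```

With k = 16 this yields a 16-dimensional vector in place of the 400-dimensional dipeptide composition, which matches the dimension reported for the dipeptide feature in this study.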

Bi-layer support vector machines (SVM) classifier
SVM is a popular algorithm for pattern recognition and protein prediction (Chen et al., 2010; Lin and Chen, 2011). However, the prediction performance of a single classifier is often affected by noise and complex datasets. To reduce these effects, in this study we built a bi-layer SVM classifier, which consists of four SVM classifiers: three in the first layer, denoted SVM1, SVM2 and SVM3, and one in the second layer, denoted SVM4. The sequence feature used in SVM1 is only the occurrence frequencies of the 20 amino acids (denoted $f_i$, $i = 1, 2, \ldots, 20$). In SVM2 and SVM3, we selected pseudo amino acid compositions as the sequence features: in SVM2, the occurrence frequencies of the 20 amino acids combined with the approximate entropy (denoted $e_i$, $i = 1, 2, \ldots, 12$) with the weight 0.12; in SVM3, the approximate entropy combined with the 16 dipeptide compositions (denoted $d_i$, $i = 1, 2, \ldots, 16$) with the weight 0.02. The training parameters are $\gamma = 2$, $c = 4$ in SVM1 and $\gamma = 1$, $c = 2$ in SVM2 and SVM3; these are the values that gave the best performance in training. The corresponding dimensions of the input vectors of these three SVM classifiers are therefore 20, 32 and 28, respectively. The second layer was trained on the outputs (denoted $r_i$, $i = 1, 2, 3$) generated by the three classifiers in the first layer (here we also set $\gamma = 1$, $c = 2$) (Table 1). The classifiers used here are OSU-SVM (http://www.ece.osu.edu/~maj/osu_svm).
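The wiring of the two layers can be illustrated with the following Python sketch. It only shows how the outputs of the first layer feed the second layer; it is not the OSU-SVM implementation used here. The class names are hypothetical, the use of real-valued decision scores as the outputs $r_i$ is an assumption, and `CentroidScorer` is a toy stand-in for an SVM so that the example runs with the standard library alone. Any classifier exposing `fit`, `decision_function` and `predict` (for example an RBF-kernel SVM from an SVM library) could be plugged in instead.

```python
class CentroidScorer:
    """Toy stand-in for an SVM (NOT an SVM): scores by distance to class centroids."""

    def fit(self, X, y):
        def centroid(rows):
            return [sum(col) / len(rows) for col in zip(*rows)]
        self.pos = centroid([x for x, t in zip(X, y) if t == 1])
        self.neg = centroid([x for x, t in zip(X, y) if t == -1])
        return self

    def decision_function(self, X):
        def d2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        # Positive score: closer to the positive centroid.
        return [d2(x, self.neg) - d2(x, self.pos) for x in X]

    def predict(self, X):
        return [1 if s >= 0 else -1 for s in self.decision_function(X)]


class BiLayerClassifier:
    """Two-layer scheme: first-layer scores become second-layer features."""

    def __init__(self, first_layer, second_layer):
        # first_layer: list of (classifier, feature_extractor) pairs,
        # one per feature set; second_layer: a single classifier.
        self.first_layer = first_layer
        self.second_layer = second_layer

    def _first_layer_outputs(self, samples):
        columns = [
            clf.decision_function([extract(s) for s in samples])
            for clf, extract in self.first_layer
        ]
        return [list(row) for row in zip(*columns)]  # one score per first-layer classifier

    def fit(self, samples, labels):
        for clf, extract in self.first_layer:
            clf.fit([extract(s) for s in samples], labels)
        self.second_layer.fit(self._first_layer_outputs(samples), labels)
        return self

    def predict(self, samples):
        return list(self.second_layer.predict(self._first_layer_outputs(samples)))
```

In the setting of this study, the three feature extractors would return the 20-, 32- and 28-dimensional vectors of SVM1, SVM2 and SVM3, and the second-layer input would be three-dimensional. Training the second layer on first-layer outputs for the same training samples, as done in this sketch, can be optimistic; using out-of-fold outputs is a common alternative.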

Evaluation of the performance
In order to compare our results with those of other prediction methods, we also adopted the sensitivity (Sn), specificity (Sp), Matthew's correlation coefficient (Mcc) and accuracy (Acc) as measures of the performance of our method. Following Yang and Li (2009), these measures are calculated from four counts: TP denotes the number of correctly recognized positives, FN denotes the number of positives recognized as negatives, FP denotes the number of negatives recognized as positives, and TN denotes the number of correctly recognized negatives.
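For illustration, the four measures can be computed from these counts as in the sketch below. It assumes the common definitions Sn = TP/(TP+FN), Sp = TN/(TN+FP), Acc = (TP+TN)/(TP+FN+FP+TN) and the standard Matthew's correlation coefficient; the exact formulae used in this study are not reproduced here, and some works in this field instead define Sp as TP/(TP+FP). The function name is hypothetical.

```python
import math


def evaluate(tp, fn, fp, tn):
    """Return Sn, Sp, Acc and Mcc from the four confusion-matrix counts.

    Assumes at least one positive and one negative sample.
    Sp is taken as TN / (TN + FP), which is an assumption (see lead-in).
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: Mcc is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Mcc": mcc}
```

Unlike the accuracy, Mcc accounts for all four counts at once, which makes it informative for unbalanced datasets such as the 132 presynaptic and 241 postsynaptic sequences used here.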

RESULTS AND DISCUSSION
In this paper, we provide a rough comparison of the performance of our method and the other method (Table 2). Table 2 shows that, for presynaptic neurotoxins under 10-fold cross-validation, the sensitivity, specificity and MCC value were appreciably improved, with increments of 11.54, 4.97 and 18.33%, respectively. For postsynaptic neurotoxins, these three measures were also improved, with increments of 7.07, 12.5 and 18.33%, respectively. These results indicate that our method is effective for predicting presynaptic and postsynaptic neurotoxins.
In order to further study the prediction performance of the bi-layer support vector machine, we also predicted presynaptic and postsynaptic neurotoxins with a single-layer SVM based on the same features as those used in SVM2 and SVM3; the results are listed in Table 3. Table 3 shows that the prediction performance of the single-layer SVM was not satisfactory, whereas the bi-layer SVM obtained fairly good results from the same features. A possible explanation is that the bi-layer SVM built in this paper can make full use of the information in multiple features and can take better advantage of the sequence information of a protein than a single-layer SVM based on an individual feature.
The successful prediction showed that the bi-layer support vector machine could make full use of the information in multiple features and appreciably improved the sensitivity, specificity and MCC value; it is quite suitable for predicting presynaptic and postsynaptic neurotoxins. The results also indicate that the bi-layer SVM is a promising classifier, and we hope that this method will be helpful for the analysis of possible functions of new neurotoxins.


Table 1. Selection of the feature and parameters for each classifier.

Table 2. Comparison of prediction performance for presynaptic and postsynaptic neurotoxins.

Table 3. Prediction performance for presynaptic and postsynaptic neurotoxins by single-layer SVM.