Prediction of substituent types and positions on skeleton of eudesmane-type sesquiterpenes using generalized regression neural network (GRNN)

Sesquiterpenes are formed from countless biogenetic pathways and are therefore a constant challenge to spectroscopists in structure elucidation. In this study, we explore the ability of the generalized regression neural network (GRNN), an architecture of artificial neural networks (ANNs), to predict the substituent types on eudesmanes, one of the most representative skeletons of sesquiterpenes. Carbon-13 (13C) nuclear magnetic resonance (NMR) chemical shift values of the skeletons of 291 eudesmane sesquiterpenes were used as the input data for the network. Each substituent type on the skeleton of the different compounds was coded and used as the output data for the network. These data were used to train the network. After training, the network was simulated using 34 test compounds. The results showed that the GRNN achieved recognition rates of between 73.33 and 100% on the test compounds. The GRNN could therefore be a powerful aid in the structural elucidation of organic compounds.


INTRODUCTION
Many phytochemical research efforts are directed at the isolation of the compounds responsible for the activities displayed by plants. Elucidation of the structures of the isolated compounds from their proton nuclear magnetic resonance (1H NMR) and Carbon-13 (13C) NMR spectra is often a difficult task. Computer-assisted structure elucidation (CASE) methods have been developed to help in this regard. CASE seeks to find, within a given solution space, the single structure that best fits a set of chemical and spectral boundary conditions. Structural elucidation involves finding, from structural information of an unknown compound derived from chemical and/or spectral evidence, the fittest structural formula that satisfies all the constraints (Yongquan, 2003). The input information consists of the molecular formula, derived from mass spectrometry or elemental analysis, and routine 1D and 2D NMR spectra.
The starting point for structure elucidation is the molecular formula derived from mass spectrometry (MS), together with 1D and 2D NMR spectra. The collective spectral information is interpreted as a set of substructures predicted to be present or absent in the unknown. The deduced information, together with the molecular formula, is the usual input for structure generation. A high-quality reference library, containing both structures and complete spectra or substructures and subspectra representative of the types of compounds encountered in the laboratory, is an invaluable component of a CASE system (Elyashberg et al., 2002; Strokov and Lebedev, 1999). The premise implicit in spectrum interpretation is that if the spectrum of the unknown and a reference library spectrum have a subspectrum in common, then the corresponding reference substructure is also present in the unknown. The components generated by spectra interpretation are fed into the structure generator, which exhaustively generates all possible structures from these components. Examples of structure generators include MOLGEN, GENIUS and COCON. Their applications are described elsewhere (Meiler and Kock, 2004).
A structure elucidation problem is equivalent to a combinatorial optimization problem if the spectra-based structural information of the unknown is treated as constraints to be satisfied. The central task is thus to prune the size of the search space to a computationally acceptable extent. The methods mentioned above attempt to reduce the size of the search by taking advantage of problem-specific information. Nevertheless, pruning heuristics are not always enough, because the incompleteness of chemical and/or spectroscopic evidence, as well as the existence of vague information, makes the actual search space expand drastically (Yongquan, 2003).
Artificial neural networks (ANNs) are defined as computational models with structures derived from a simplified concept of the brain, in which a number of nodes, called neurons, are interconnected in a network-like structure (Scotti et al., 2012). Due to their parallel nature, ANNs could speed up the process of structural elucidation, as the time-consuming sequential search (especially for a large spectra library) and matching procedures (sequential comparison of an unknown target spectrum with the set of library spectra) employed by conventional databases are avoided (Rufino et al., 2005). ANNs are employed in pattern recognition problems, especially those associated with prediction, classification or control. The technique has been applied to the prediction of the biological activity of natural products or congeneric compounds (Wrede et al., 1998; Fernandes et al., 2008), the identification, distribution and recognition of patterns of chemical shifts from 1H-NMR spectra (Aires-de-Sousa et al., 2002; Binev and Aires-de-Sousa, 2004) and the identification of chemical classes through 13C-NMR spectra (Fraser and Mulholland, 1999).
Neural networks are nonlinear processes that perform learning and classification. ANNs consist of a large number of interconnected processing elements, known as neurons, that act as microprocessors. Each neuron accepts a weighted set of inputs and responds with an output. Figure 1 shows the model of a single neuron, whose net input is

n = \sum_{i=1}^{P} w_i x_i + b

where P and w_i are the number of elements and the interconnection weight of the input vector x_i, respectively, and b is the bias for the neuron. The knowledge is stored as a set of connection weights and biases. The sum of the weighted inputs with the bias is processed through an activation function, represented by f, and the output that it computes is

a = f\left( \sum_{i=1}^{P} w_i x_i + b \right)

There are many ways to define the activation function, such as the threshold function, the sigmoid function and the hyperbolic tangent function. The choice of activation function depends on the type of neural network to be designed. A neural network can be trained to perform a particular function by adjusting the values of the connections (that is, the weighting coefficients) between the processing elements.
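The single-neuron computation above can be sketched as follows (a minimal illustration using a sigmoid activation; the input, weight and bias values are arbitrary examples, not data from this study):

```python
import math

def neuron_output(x, w, b):
    """Compute a = f(sum_i w_i * x_i + b) for a single neuron,
    using the sigmoid activation function f(n) = 1 / (1 + exp(-n))."""
    n = sum(wi * xi for wi, xi in zip(w, x)) + b  # weighted net input plus bias
    return 1.0 / (1.0 + math.exp(-n))             # sigmoid activation

# Example: a neuron with P = 3 inputs (arbitrary illustrative values)
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1
a = neuron_output(x, w, b)  # net input n = 0.2 - 0.3 - 0.4 + 0.1 = -0.4
```

With the threshold or hyperbolic tangent functions mentioned in the text, only the final activation line would change; the weighted-sum step is identical.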
In general, neural networks are adjusted (trained) so that a particular input leads to a specific target output, the training continuing until the network output matches the target. In this way, the neural network can learn the system; this type of learning is known as supervised learning. The learning ability of a neural network depends on its architecture and on the algorithmic method applied during training. The training procedure ceases when the difference between the network output and the desired/actual output is less than a certain tolerance value. Thereafter, the network is ready to produce outputs for new input parameters that were not used during the learning procedure. A neural network is usually divided into three parts: the input layer, the hidden layer and the output layer. The information contained in the input layer is mapped to the output layer through the hidden layers. Each unit can send its output only to units in the higher layer and receive its input from the lower layer. This structure is known as the multilayer perceptron. Rufino et al. (2005) showed that ANN methods give fast and accurate results for identifying skeletons and for assigning unknown compounds among distinct fingerprints (skeletons) of aporphine alkaloids. The computational method is much faster than traditional methods of skeleton prediction, which makes neural networks ideal for selecting results for structure generators or for checking the entries of a database. If a large number of skeletons have to be predicted, or a fast and easy check of a structure is necessary, this approach is advantageous.
In the present work, we show that where the skeleton of a class of compounds has been identified, the substituent positions and types on the skeleton can be predicted using the generalized regression neural network (GRNN), one of the architectures of ANNs. We focus on eudesmane-type compounds, one of the most representative skeletons of sesquiterpenes. Sesquiterpenes are formed from countless biogenetic pathways and therefore produce several types of carbon skeletons (Oliveira et al., 2000; Ferreira et al., 2004). This makes elucidation of their structures very challenging. In a previous work, Oliveira et al. (2000) described the use of the expert system SISTEMAT as an auxiliary tool in the process of structure elucidation of eudesmanes. Eudesmane-type sesquiterpenoids and their biological activities have been the focus of numerous phytochemical, pharmacological and synthetic studies. Since sesquiterpenes exhibit a wide range of biological activities, and include compounds that are plant growth regulators, insect antifeedants, antifungals, anti-tumour compounds and antibacterials, much effort has been directed at relating their structures to function (Wu et al., 2006).
A GRNN is based on kernel regression networks (Celikoglu and Cigizoglu, 2007; Cigizoglu and Alp, 2005; Kim et al., 2004; Hannan et al., 2010). A GRNN does not require an iterative training procedure. It approximates any arbitrary function between input and output vectors, drawing the function estimate directly from the training data. In addition, it is consistent: as the training set size becomes large, the estimation error approaches zero, with only mild restrictions on the function (Kim et al., 2004; Hannan et al., 2010).
A GRNN consists of four layers: the input layer, pattern layer, summation layer and output layer, as shown in Figure 2. The number of units in the input layer depends on the total number of observation parameters. The input layer is connected to the pattern layer, in which each neuron presents a training pattern and its output. The pattern layer is connected to the summation layer. The summation layer has two different types of units: a single division unit and the summation units. The summation and output layers together perform a normalization of the output set. In training the network, radial basis and linear activation functions are used in the hidden and output layers, respectively. Each pattern layer unit is connected to the two neurons in the summation layer, the S- and D-summation neurons. The S-summation neuron computes the sum of the weighted responses of the pattern layer, while the D-summation neuron calculates the unweighted outputs of the pattern neurons. The output layer merely divides the output of each S-summation neuron by that of each D-summation neuron, yielding the predicted value \hat{y}(x) for an unknown input vector x as (Jang et al., 1997; Hannan et al., 2010):

\hat{y}(x) = \frac{\sum_{i=1}^{n} y_i \exp\left(-D(x, x_i)\right)}{\sum_{i=1}^{n} \exp\left(-D(x, x_i)\right)}

D(x, x_i) = \sum_{j=1}^{m} \left( \frac{x_j - x_{ij}}{\sigma} \right)^2

where y_i is the weight connection between the i-th neuron in the pattern layer and the S-summation neuron, n is the number of training patterns, D is the Gaussian function, m is the number of elements of an input vector, x_j and x_{ij} are the j-th elements of x and x_i, respectively, and \sigma is the spread parameter, whose optimal value is determined experimentally.
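The prediction equation above can be sketched in code. This is an illustrative NumPy implementation of the standard GRNN estimator, not the MATLAB toolbox routine used in this study, and the training pairs below are arbitrary toy values:

```python
import numpy as np

def grnn_predict(X_train, y_train, x, sigma):
    """GRNN estimate y_hat(x) = sum_i y_i * exp(-D_i) / sum_i exp(-D_i),
    with D_i = sum_j ((x_j - x_ij) / sigma)**2 (Gaussian kernel)."""
    D = np.sum(((x - X_train) / sigma) ** 2, axis=1)  # distance to each stored pattern
    w = np.exp(-D)                                    # pattern-layer responses
    return np.dot(w, y_train) / np.sum(w)             # S-summation / D-summation

# Toy example: training pairs sampled from y = 2x, then one query point
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 2.0, 4.0, 6.0])
y_hat = grnn_predict(X_train, y_train, np.array([1.5]), sigma=0.5)  # ~3.0 by symmetry
```

Note that "training" here is just storing the patterns; the only free parameter is the spread sigma, consistent with the one-parameter tuning described below.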
Compared to other ANN models such as the backpropagation (BP) neural network, the GRNN needs only a fraction of the training samples a BP neural network would need. It therefore has the advantage of being able to converge to the underlying function of the data with only a few training samples available (Specht, 1991). Furthermore, whereas determining the best values for the several network parameters of other models is difficult and often involves trial and error, GRNN models require only one parameter (the spread constant) to be adjusted experimentally. This makes the GRNN a very useful tool for performing predictions and comparisons of system performance in practice. Previous works comparing the predictive capability of the GRNN with BP neural networks and other nonlinear regression techniques highlighted the advantages of the GRNN, including excellent approximation ability, fast training time and exceptional stability during the prediction stage (Sun et al., 2008; Mahesh et al., 2014).

MATERIALS AND METHODS
The structure of any natural product is conventionally divisible into three sub-units: (i) the skeletal atoms; (ii) heteroatoms directly bonded to the skeletal atoms, or unsaturations between them; and (iii) secondary carbon chains, usually bound to a skeletal atom through an ester or ether linkage (Rodrigues et al., 1997). For identification purposes and for the structural elucidation of new compounds, it is necessary to have access to extensive lists of their structural data. In the present study, we made use of the structural (skeletal) 13C data, substituents and stereochemical information of 325 (out of the total 350) eudesmane compounds published by Oliveira et al. (2000). This information can be extracted from data of eudesmane sesquiterpenes published in the literature by isolating the 13C values of the skeletal carbons from those of the substituents. The compounds left out were those whose substituents were not stated explicitly due to structural complexity. ANNs work through a learning method; their training must therefore be done with well-detailed and correct data to avoid an erroneous learning process. Of the 325 compounds used, 34 were reserved for use as test cases (these were not used in training the neural network). The structure of the eudesmane skeleton, with the numbering of each carbon atom, is shown in Figure 3.
Three Excel worksheets containing coded information on the input and target data for the training and test compounds were prepared. In the first row of the first sheet, the compounds were assigned codes 1-291. In the first column of the same sheet, the positions of the carbon atoms on the skeleton (as shown in Figure 3) were coded as 1-15. The 13C chemical shift value for the carbon at each of the 15 positions was recorded for each compound. These represent the input data subsequently used in training the network. Another Excel sheet in the format just described was prepared, except that it contained the 13C chemical shift data for the test compounds (coded 1-34). The 13C chemical shift data for the skeletons of the test compounds are presented in Table 1. Since ANNs learn through examples, the test compounds were selected based on the representativeness of their substitution patterns in the table of structural information published by Oliveira et al. (2000). This was done largely by visual inspection. These represent the input data for the test compounds.
In preparing the target data, each substituent type was, on first encounter, assigned three number codes. These codes serve to identify the substituent, while also taking into account its possible stereochemistry (α or β) at the various positions of the skeletons of other compounds. Carbon positions without substituents were assigned a code of 0, while α and β positions without substituent(s) were assigned codes of 1 and 2, respectively. For example, the OH group was given a code of 3, an α-OH a code of 4 and a β-OH a code of 5.
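As a minimal sketch, the coding scheme above can be represented as a lookup table. Only the OH codes (3, 4, 5) are given explicitly in the text; treating each substituent's triplet as consecutive integers from a base code is an assumption made here for illustration:

```python
# Codes for positions without substituents, as described in the text
UNSUBSTITUTED = 0      # carbon position with no substituent
ALPHA_FREE = 1         # alpha position without substituent
BETA_FREE = 2          # beta position without substituent

# Base code per substituent type, assigned on first encounter.
# Only OH is stated in the text; other entries would follow analogously.
SUBSTITUENT_CODES = {
    "OH": 3,           # plain OH -> 3, alpha-OH -> 4, beta-OH -> 5
}

def code_for(substituent, stereo=None):
    """Return the target code for a substituent, optionally with
    alpha/beta stereochemistry (assumes consecutive code triplets)."""
    base = SUBSTITUENT_CODES[substituent]
    if stereo == "alpha":
        return base + 1
    if stereo == "beta":
        return base + 2
    return base
```

Encoding the targets this way yields, for each compound, a vector of 15 integer codes (one per skeletal position) that serves as the network's output pattern.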
After the construction of the worksheets, the data were transferred into the Neural Network Toolbox of MATLAB 7.8.0 (MATLAB and Statistics Toolbox Release 2009a). From the command window, the 'nntool' command was used to designate the imported data appropriately as 'input' or 'target' and to select the appropriate network for training. The trained networks were then simulated with the test data (which had not been used in training and were therefore unknown to the network). The aim was to ascertain whether the neural network would be able to correctly predict the substituents and their positions on the eudesmane skeleton. After trying several neural network types and network parameters, the GRNN with a spread constant of 1.0 was found to give the best results.

RESULTS AND DISCUSSION
Eudesmanes may or may not be oxygenated. Oxygenated eudesmanes may be alcohols, ethers, epoxides, peroxides, aldehydes, ketones, carboxylic acids or lactones. These different functional group substituents are important in determining the individual biological activities of the various sesquiterpenoids; hence the need to correctly predict the substituent types and their positions on the skeleton. The results obtained after training the neural network and simulating with the test data using the GRNN are presented in Table 2. The percentage (%) recognition of each compound was calculated as the number of correctly predicted points relative to the total number of positions on each compound (15). This ranged between 73.33 and 100%, except for test compounds 8 and 12, where 33.33 and 40%, respectively, were obtained. Results for test compounds 10, 14, 15, 16, 17, 18, 19, 31, 32 and 34 are not shown because the network presented all the positions on the skeleton as unsubstituted. This may be due to the non-existence of precise rules for these compounds. From the results presented in Table 2, there is 100% recognition of the unsubstituted positions (designated as '-') on the eudesmane skeleton in all the compounds tested. The results obtained when perceptron and feed-forward BP neural networks (employing varying network parameters) were used are not presented, since the substituents predicted to be on the eudesmane skeleton for all the test compounds were largely inaccurate.
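The percentage recognition described above (correctly predicted codes out of the 15 skeletal positions) can be computed as in this sketch; the code vectors below are hypothetical examples, not the study's actual predictions:

```python
def recognition_rate(predicted, actual):
    """Percentage of the 15 skeletal positions of a eudesmane compound
    whose substituent code was predicted correctly."""
    assert len(predicted) == len(actual) == 15
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical example: 11 of 15 positions predicted correctly -> 73.33%
actual    = [0, 0, 3, 0, 1, 2, 0, 4, 0, 0, 5, 0, 0, 0, 0]
predicted = [0, 0, 3, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0]
rate = recognition_rate(predicted, actual)
```

A position counts as correct only if the exact code matches, so a predicted α-OH (code 4) at a position actually bearing a β-OH (code 5) is scored as an error.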

Conclusion
Neural networks learn from examples and acquire their 'knowledge' by induction. They can generalize, provide flexible non-linear models of input/output relationships, cope with noisy data and are fault-tolerant (Schneider and Wrede, 1998). From this study, it can be seen that the predictions obtained using the GRNN were in good agreement with the actual substituents on the skeletons of the test compounds. This is despite the large variations in the nature of the substituents on the eudesmane skeletons of the various compounds used in the study. Where the skeleton type of a natural product has been ascertained, by sequential comparison of the unknown target spectrum with a set of library spectra or by using ANNs, the GRNN could be an excellent complementary tool for predicting the nature of the substituents attached to eudesmane skeletons. Moreover, it would also be possible to perform the training of the networks interactively, so that every researcher dealing with the identification of substituents on skeletons of natural products could create a network specialized in groups of such complex substances.

Table 1. 13C NMR chemical shift data for test compounds.