The rapid growth in the availability of whole genome sequences of Streptococcus pyogenes M Group A Streptococcus strains has created a need to analyze these sequences. S. pyogenes is a spherical, Gram-positive bacterium that causes important human diseases ranging from mild superficial skin infections to life-threatening systemic diseases. The aim of this paper is to adopt a content-based gene prediction method, together with the machine learning technique of an Artificial Neural Network trained by back-propagation (Back Propagation Network), specifically for predicting Lac genes of S. pyogenes M Group A Streptococcus strains. We first obtained Lac genes from the genome sequences of S. pyogenes M Group A Streptococcus strains and calculated the mean gene content. The mean gene content has 70 parameters: the means of the percentage frequencies of occurrence of the 64 possible codons, the 4 nucleotides, purines and pyrimidines. We constructed a three-layer feed-forward neural network with 70 input units, 20 hidden units and 1 output unit. After the network was trained in a supervised manner on the mean gene content using the Error Back-Propagation Algorithm, it was tested on the mean gene content vector and 99 sample Lac gene vectors to obtain the range of output values within which Lac gene vectors fall. The values obtained ranged from 0.9857 to 0.9901, and this range is used to classify whether a given sequence is a Lac gene or not. We developed a tool, SpyMGASLacGenePred, that accepts a DNA sequence, finds all possible ORFs in the 6 reading frames, calculates the gene content of each ORF and runs the testing algorithm of the network on every ORF to determine whether it is a Lac gene. For each ORF that the neural network classifies as a Lac gene, the tool determines and displays the position, length, frame information, GC content and translated sequence.
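To make the 70-parameter gene content concrete, the following is a minimal Python sketch of how such a vector might be computed. It is not the code of SpyMGASLacGenePred. The function names, the ordering of the parameters (64 codons, then A, C, G, T, then purines and pyrimidines), the reading of codons in frame from the first base, and the inclusive treatment of the output range are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 64 possible codons, in a fixed (assumed) order.
CODONS = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]

# Output range reported in the abstract for Lac gene vectors.
LAC_OUTPUT_RANGE = (0.9857, 0.9901)


def gene_content(seq):
    """Return a 70-element vector of percentage frequencies for one sequence."""
    seq = seq.upper()
    codon_counts = dict.fromkeys(CODONS, 0)
    # Assumption: codons are read in frame, non-overlapping, from position 0.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in codon_counts:  # skip codons with ambiguous bases such as N
            codon_counts[codon] += 1
    n_codons = sum(codon_counts.values()) or 1  # avoid division by zero
    nt_counts = {n: seq.count(n) for n in NUCLEOTIDES}
    n_nt = sum(nt_counts.values()) or 1

    codon_pct = [100.0 * codon_counts[c] / n_codons for c in CODONS]
    nt_pct = [100.0 * nt_counts[n] / n_nt for n in NUCLEOTIDES]
    purine_pct = 100.0 * (nt_counts["A"] + nt_counts["G"]) / n_nt
    pyrimidine_pct = 100.0 * (nt_counts["C"] + nt_counts["T"]) / n_nt
    return codon_pct + nt_pct + [purine_pct, pyrimidine_pct]


def mean_gene_content(seqs):
    """Average the gene content vectors of several sequences, element by element."""
    vectors = [gene_content(s) for s in seqs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def in_lac_range(network_output):
    """Assumed decision rule: the output lies inside the reported range."""
    low, high = LAC_OUTPUT_RANGE
    return low <= network_output <= high
```

In this sketch, `gene_content` would be applied to each candidate ORF, and the resulting vector would be passed to the trained 70-20-1 network, whose output is then checked with `in_lac_range`.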
Evaluation of the developed tool SpyMGASLacGenePred showed a sensitivity of 100% and a specificity of 76.9%. The sensitivity is 100% because every Lac gene used for training was also used by the Back Propagation Neural Network program for testing. However, if Lac genes from other strains of S. pyogenes that were not used for training are tested, the sensitivity might drop to some extent. The specificity of 76.9% indicates that the tool is above an acceptable threshold level for predicting the correct Lac genes out of the total Lac genes. The tool also showed a correlation coefficient of 0.733; on a scale where +1 denotes perfect prediction, this indicates strong agreement between the predicted and actual classes. Thus the adopted Back Propagation Algorithm of the Artificial Neural Network method has been useful for the development of the SpyMGASLacGenePred tool to identify Lac gene structures in S. pyogenes M Group A Streptococcus strains.
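The three performance measures can be illustrated with a short Python sketch. The abstract gives neither the confusion-matrix counts nor the exact formulas, so the sketch assumes the common definitions: sensitivity as TP/(TP+FN), specificity as TN/(TN+FP), and the correlation coefficient as the Matthews correlation coefficient. Some gene-prediction literature instead defines specificity as TP/(TP+FP). The function names and the counts in the usage note are hypothetical and are not the paper's data.

```python
import math


def sensitivity(tp, fn):
    """Fraction of actual positives that are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are predicted negative (assumed definition)."""
    return tn / (tn + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient; ranges from -1 to +1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator
```

For example, with the hypothetical counts TP = 8, FN = 2, FP = 1 and TN = 9, `sensitivity(8, 2)` gives 0.8, `specificity(9, 1)` gives 0.9, and `correlation_coefficient(8, 9, 1, 2)` gives approximately 0.70.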
Key words: Back Propagation Algorithm, Artificial Neural Network, Lac gene prediction, Streptococcus pyogenes M Group A Streptococcus strains.
Copyright © 2021 Author(s) retain the copyright of this article.
This article is published under the terms of the Creative Commons Attribution License 4.0