BengaSaVex: A new computational genetic sequence extraction tool for DNA repeats

Infectious diseases remain one of the major problems confronting humanity, and all of them are caused by pathogens. A major problem in biological research is the creation of enormous and redundant amounts of genomic data. From this large volume of data, biologists select subsets of each sequence, known as DNA nucleotide subsequences or "words", for extended scientific analysis. Computational biology aids this pruning process by providing computerized tools that extract biologically significant information from these data. This research aimed to develop a new tool for extracting DNA repeats from gene sequences and to compare it with existing tools that have similar or closely related functions. We developed BengaSaVex (GBenga Samuel Victor genetic sequence extraction tool), which provides sequential in-silico genetic sequence-filtering functionality to identify repeated DNA nucleotide subsequences within the genes of some microorganisms. We also evaluated the potential benefits and applications of identifying such repeated sequences and performed an in-silico comparative analysis between BengaSaVex and Tandem Repeat Finder.


INTRODUCTION
Over the years, biologists and computational biologists have conducted experiments related to the sequences of some pathogens and other microorganisms. One of the major problems in biological research is the creation of enormous and redundant amounts of genomic data from DNA sequencing projects (Baxevanis, 2003; Wang and Zhang, 2005; Myers et al., 2006; Lathe et al., 2008; Oluwagbemi and Omonhinmin, 2008; Oluwagbemi, 2012). Biologists select a subset of each sequence, also known as DNA nucleotide subsequences or "words", for extended scientific analysis. Computational biology complements this pruning process by providing repeat-finding programs that help analyze interesting words and provide useful information about them, on the assumption that under- or over-represented words have significant biological functions.
The biological significance of DNA repeats should not be underestimated. DNA repeats play a significant role in the biological sciences (Jurka, 1998). Transposable elements are hidden in many repetitive DNA sequences. Experimental research and analysis on these repetitive sequences can help reveal transposable elements that are associated with genomic evolution.
The aim of this research was to develop a useful extraction tool (BengaSaVex) for in-silico analysis of the gene sequences of some microorganisms. Some pathogens are used only as examples of how the program works. The objectives of this research were: (i) to develop in-silico simultaneous genetic sequence-filtering tools using the object-oriented programming language C++, (ii) to identify repeated DNA nucleotide subsequences within the genes of some microorganisms, (iii) to evaluate the potential benefits of (ii) and (iv) to conduct a comparative analysis between the C++ version of BengaSaVex and Tandem Repeat Finder (Benson, 1999).
The biological rationale for undertaking this research stems from the fact that prominent features of DNA can be identified by the frequency with which repeated substrings occur. For instance, this seems to be true for eukaryotes (Lander et al., 2001). Some repeats have been found to play a structural role (Huang et al., 1998), while others have been found to affect bacterial virulence among microbes that tend to cause human infections (van Belkum et al., 1998). This makes the study of repeats both promising and interesting.
In this paper, we devised a genetic subsequence extraction tool and implemented it in the C++ programming language. We named this tool BengaSaVex.
The tool can extract repetitive DNA sequences from a collection of multiple gene sequences of microorganisms, including infectious disease-causing organisms, and then estimate the relationship between the lengths of the extracted repeated sequences and the computational time taken to extract them. Insight gained from the analysis of these duplicated sequences could help accelerate research in this domain by motivating the development of more efficient tools, especially given the huge volume of sequence data available.
Several traditional repeat-finding programs have been developed and applied to different gene sequences. They are described in Table 1.
In summary, this paper details the algorithm underlying the development of BengaSaVex, describes the mechanism of data collection, explores the potential benefits of identifying DNA repeats in gene sequences for computational biology research, presents the results generated by the new tool, and compares it with some existing tools that have similar or closely related functions (Saha et al., 2008).

Data collection
Data for this research were sourced from the National Center for Biotechnology Information (NCBI) (www.ncbi.nlm.nih.gov/) and from the Sanger Institute (ftp://ftp.sanger.ac.uk/pub/pathogens/spn/). The sequence data of some microorganisms were sourced from various gene banks. Table 2 shows the sources of the data used in the analysis. The genome sequence data for the respective organisms were placed together in the input file of BengaSaVex.

Implementation
The C++ programming language was used to implement BengaSaVex. The sequence data for the different pathogens were stored in a single input file for in-silico analysis. The input file (many.in.txt) contains the multiple gene sequences of infectious disease-causing organisms to be analyzed, while the output file (many.out.txt) contains the results generated by BengaSaVex after running the executable version of the software (BengaSaVex.exe). BengaSaVex was developed using algorithms that compare substrings of gene sequences to find those that are identical within the genome sequences of pathogens (List 1). List 1 shows the algorithm's operations on repeat sequences.
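The substring comparison described above can be illustrated with a minimal sketch. It is not the BengaSaVex source: it assumes fixed-length words of length k and counts overlapping occurrences, and the names find_repeats and k are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper (not from the paper): counts every length-k substring
// ("word") of `sequence` and keeps only the words that occur at least twice.
std::map<std::string, std::size_t> find_repeats(const std::string& sequence,
                                                std::size_t k) {
    assert(k > 0 && "word length must be positive");
    std::map<std::string, std::size_t> counts;
    if (sequence.size() < k) {
        return counts;  // sequence too short to contain any word of length k
    }
    // Slide a window of width k over the sequence and tally each word.
    for (std::size_t i = 0; i + k <= sequence.size(); ++i) {
        ++counts[sequence.substr(i, k)];
    }
    // Discard words seen only once; what remains are the repeats.
    for (auto it = counts.begin(); it != counts.end();) {
        if (it->second < 2) {
            it = counts.erase(it);
        } else {
            ++it;
        }
    }
    return counts;
}
```

For example, find_repeats("ACGTACGT", 4) returns a single entry: the word ACGT with a frequency of 2. A std::map is used here so that the repeats come back in sorted order; a hash map would likely be faster on whole genomes.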

RESULTS
BengaSaVex can perform sequential in-silico analysis on hundreds to thousands of large genome sequences. However, for the purpose of this manuscript, we analyzed only close to 15 large genome sequences. We present the results for eight of them as produced by BengaSaVex, based on in-silico analysis of the gene sequences of some organisms, as shown in Table 2. Some of the repeats were found to be intergenic. We also provide the results of a comparative analysis of BengaSaVex with the tandem repeat finding program (Table 3).
Table 1. Tabulated literature review of some traditional repeat-finding programs.

RepeatMasker
RepeatMasker, a prominent software tool, was developed to identify, classify and mask repetitive gene sequences. RepeatMasker finds repetitive sequences by aligning the input sequence against a library of known repeats (Smit and Green, 2002; Tarailo-Graovac and Chen, 2009).

RepeatScout
RepeatScout was another program developed to identify repetitive sequence in large genomic sequence (Price et al., 2005).

SAGRI
SAGRI (Spectrum Assisted Genomic Repeat Identifier) was developed as a novel approach to detecting repeats in genomic sequences. SAGRI performs a double scan on the genome sequence (Do et al., 2008). A tool developed to efficiently locate possible ancient repeats in genomic sequences also produced encouraging results (Singh et al., 2007).

RECON
RECON, an automated software for identifying repetitive sequences of newly sequenced genomes, was also developed (Bao and Eddy, 2002).

WindowMasker
WindowMasker was developed to identify and mask highly repetitive subsequences in the DNA sequence of a genome (Morgulis et al., 2006).

RepeatFinder
Algorithms such as RepeatFinder (Volfovsky et al., 2001) are also useful in in-silico analyses.

PILER
More recently, PILER (Edgar and Myers, 2005) has further automated the identification of repeat families from genomic sequences.

ReAs
ReAs algorithm was applied in recovering ancestral sequences from transposable elements (Li et al., 2005).

OMWSA
The OMWSA is another online tool for repeat finding and visualization (Du, 2007).

Tandem Repeat Finder
Tandem Repeat Finder is yet another repeat finding program (Benson, 1999).
The C++ version of BengaSaVex was used for this analysis because it reports the time (in milliseconds) taken to extract each direct repeated sequence together with its frequency. Analysis was performed on whole genome sequences of organisms including Pseudomonas fluorescens (Von Graevenitz and Weinstein, 1971; Picot et al., 2001), Hippea maritima DSM 10411 (Miroshnichenko et al., 1999), Bartonella tribocorum CIP 105476 (Heller et al., 1998), Sinorhizobium meliloti BL225C (Audic et al., 2009) and Brucella pinnipedialis B2/94 (Whatmore, 2009; Audic et al., 2011). Their respective accession numbers are provided in the following section. These results (Tables 2 and 3) show that BengaSaVex can be used as a complementary tool alongside other existing repeat-finding programs. REFIND did not work on long sequences, and so was not included in Table 3.
The BengaSaVex GUI shows the functionalities of the tool for inputting data, analyzing it, and outputting the extracted repeats, their frequencies, and the time taken to extract them (Figure 1).

DISCUSSION
The results show that BengaSaVex can be used as a complementary tool for repeat-finding research. Research on repeated sequences can lead to interesting discoveries in the study of polymorphic patterns. Understanding the relationship between redundant-gene filtering algorithms and programs and the genetic sequences they process can provide insight into developing programs that carry out this pruning process more efficiently. This, in turn, will help speed up research on DNA repeats, duplicated regions, sequence alignments and redundant genetic sequences of organisms and useful medicinal plants.
BengaSaVex has the added advantage of extracting repeat sequences from multiple gene sequences of organisms, of which pathogens are just one example of sample data.
BengaSaVex also reports the frequency of each extracted sequence and the time taken to extract it.
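Reporting a frequency together with the time taken could look like the following sketch. It is an illustration under stated assumptions rather than the BengaSaVex implementation: overlapping occurrences are counted, timing uses std::chrono::steady_clock, and the names TimedCount and timed_frequency are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical result type: how often a word occurs and how long the
// search took, in milliseconds.
struct TimedCount {
    std::size_t frequency;
    double milliseconds;
};

// Hypothetical helper: counts (possibly overlapping) occurrences of `word`
// in `sequence` and measures the elapsed wall-clock time of the search.
TimedCount timed_frequency(const std::string& sequence,
                           const std::string& word) {
    const auto start = std::chrono::steady_clock::now();
    std::size_t frequency = 0;
    if (!word.empty()) {
        // Restart each search one position after the previous hit so that
        // overlapping occurrences are also counted.
        for (std::size_t pos = sequence.find(word); pos != std::string::npos;
             pos = sequence.find(word, pos + 1)) {
            ++frequency;
        }
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    // steady_clock is monotonic, so the elapsed time is never negative.
    assert(elapsed.count() >= 0.0);
    return {frequency, elapsed.count()};
}
```

For instance, timed_frequency("AAAA", "AA").frequency is 3, because the occurrences at positions 0, 1 and 2 overlap. The measured time depends on the machine, so no specific value should be expected.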

Multifaceted applications of repeat analysis
Computational analysis is widely applied to the processing of DNA repeats. Research has found that DNA repeats help enhance flexibility in the genetic and phenotypic features of pathogens and microorganisms (van Belkum et al., 1998). Variability in DNA repeats can provide functional and evolutionary information on the genetic diversity of such organisms (van Belkum, 1999a). Van Belkum and Delihas (van Belkum et al., 1999; Delihas, 2011) revealed the vital role that sequence repeats play in the regulation of microbial gene expression. The significance of sequence repeats in epidemiologic typing should not be underestimated (van Belkum, 1999b). Sequence repeats were also detected in the Escherichia coli sequence (Gur-Arie et al., 2000). Other scientists identified the potential of DNA repeats for detecting certain virulence genes in pathogenic bacteria such as H. influenzae (Hood et al., 1996; Power et al., 2009). Jansen and colleagues conducted in-depth research on prokaryotes by detecting genes that are related to DNA repeats (Jansen et al., 2002; Treangen et al., 2009). Other scientists, such as Godde and Bickerton, conducted similar experiments (Godde and Bickerton, 2006). Other related works in this regard are those of Cui and Bolotin (Cui et al., 2008; Bolotin et al., 2005). The application of DNA repeats has been emphasized in various infectious disease research over the years. Several functions of repeated sequences in Mycoplasma genomes have been highlighted in some studies (Ruland et al., 1990; Himmelreich et al., 1996; Himmelreich et al., 1997; Altshuler et al., 2000; Chambaud et al., 2001; Jaffe et al., 2004; Minion et al., 2004; Mrázek, 2006; Kassai-Jáger et al., 2008; Ma et al., 2008; Ma et al., 2012). DNA sequence repeats have also been found in enteric pathogens that are responsible for bacillary dysentery in humans (Jin et al., 2002; Wei et al., 2003; Yang et al., 2003; Phalipon and Sansonetti, 2007; Saurabh et al., 2011; Sun et al., 2011).
Other studies have also revealed the significance of comparative analyses and repeats in the genomes of various organisms (Powell et al., 1996; Chen et al., 2003; Ju et al., 2005; Rahim, 2008; Shikano et al., 2010; Labbe et al., 2011; Saker et al., 2011; Tyagi et al., 2011). Another study characterized repeats within sequences of exclusively prokaryotic genomes (Coenye and Vandamme, 2005).
A study has also shown the significance of repeated sequences in proteins and their relevance in network evolution (Hancock and Simon, 2005). Repeated sequences tend to modify the genes with which they are associated, and so may play a role in generating the genetic variation that underlies adaptive evolution (Kashi et al., 1997; Kashi and King, 2006). As stated above, genetic disorders do not cause disease; disease is defined here as being caused by an infectious agent (Clancy and Shaw, 2008). Research related to duplicated regions within gene sequences of microorganisms is of paramount interest in the field of computational biology and bioinformatics (Petes and Hill, 1988; Andersson and Hughes, 2009). Gene duplication has been found to be responsible for evolutionary mechanisms (Zhang, 2003). Duplicated regions in some organisms' chromosomes have also been found to host essential genes (Hillyard and Redd, 2007). Duplicated regions within the sequences of microorganisms such as bacteria play a significant role in their adaptation (Anderson and Roth, 1977). Scientists have also highlighted the relevance of duplicated regions within the sequences of certain pathogens (Larsson et al., 2005).

Conclusion
We developed BengaSaVex (a computational biology/bioinformatics tool) for identifying and extracting repeats in gene sequences. This tool will complement other existing repeat-finding tools in supporting biological research. Future work on BengaSaVex will aim to improve its efficiency and to develop an online version.

List 1: BengaSaVex algorithm
Begin
    Input S1, ..., Sm: the set of m pathogen gene sequences
    While (!end of file) do
        Get next set of gene sequences
        for all i = 1 to n do

Table 3. In-silico comparative analysis between BengaSaVex and some repeat-finding programs (with respect to time only).