Prediction and analysis of the secretome of Corynebacterium glutamicum ATCC 13032

Corynebacterium glutamicum is an outstanding organism used for amino acid production. Its ability to secrete L-glutamate has been known for almost fifty years. The complete nucleotide sequence of the C. glutamicum ATCC 13032 genome was determined previously and allowed the reliable computational prediction of 3056 protein-coding genes. These 3056 open reading frames (ORFs) were used to predict secreted proteins with bioinformatics tools such as SignalP 3.0 and Proteome Analyst. In total, 167 proteins were predicted to be secreted and to contain signal peptides, whose amino acid residues were relatively conserved. Among them, 10 have an RR-motif signal peptide and 46 have a signal peptidase II (SPase II) signal peptide. All 167 secreted proteins have functional descriptions, and many of them are enzymes involved in metabolism. This prediction gives good insight into the whole secreted proteome of C. glutamicum and provides a basis for further studies of its secretome at the genome level.


INTRODUCTION
Corynebacterium glutamicum is a Gram-positive, nonsporulating bacterium that was isolated by Kinoshita and co-workers in a screen for bacteria that secrete L-glutamate. It is used for the industrial production of amino acids such as glutamate and lysine, which have been used in human food, animal feed and pharmaceutical products for several decades. In addition, recent studies have indicated the potential of C. glutamicum for the production of other commercially relevant compounds, such as succinate or ethanol (Bott, 2007; Leuchtenberger, 1996).
Because of the importance of C. glutamicum in industrial biotechnology, its 3.3-Mb genome sequence has been determined several times independently. The establishment of a completely annotated C. glutamicum genome sequence is a big leap forward in the understanding of the biology of this organism (Kalinowski et al., 2003). The complete genome sequence is the basis for extensive expression analyses by proteome and transcriptome technologies, which will lead to a comprehensive systemic understanding of gene expression and regulatory networks (Becker et al., 2007).
*Corresponding author. E-mail: yanming@njut.edu.cn. Tel: 86-25-8358-7355.
It is well established that C. glutamicum can secrete certain proteins to high concentrations in the medium. However, until recently it was difficult to estimate the number of exported proteins belonging to the secretome of C. glutamicum. The completion of the C. glutamicum genome sequencing project and the availability of programs for the identification of signal peptides and transmembrane segments in large collections of protein sequences through world wide web servers have now made it possible to predict the most likely location of all 3,056 annotated proteins (that is, the proteome) of this organism. Computer-assisted studies have indicated that approximately 15% of the proteome of a given organism, such as C. glutamicum, contains membrane sorting signals in the form of hydrophobic stretches of amino acids that can integrate in and span the membrane (Saleh et al., 2001). Some of these putative membrane proteins contain amino-terminal signal peptides (SPs) and may in fact be exported proteins.

Prediction of SPs using SignalP 3.0
The SignalP 3.0 server (Bendtsen et al., 2004; Emanuelsson et al., 2000) predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models. Because of the input length restrictions of the server, we divided the sequence set into 6 submissions to classify the proteins as containing or lacking a signal peptide.
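The division of the proteome into 6 submissions can be illustrated with a minimal Python sketch. It assumes the proteome is available as FASTA text and that splitting by number of records is sufficient; the actual server limits and the way the original submissions were split are not stated in the text, and the function names `parse_fasta`, `split_into_batches` and `to_fasta` are hypothetical helpers introduced only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(records: Iterable[Record], n_batches: int) -> List[List[Record]]:
    """Split records into n_batches contiguous batches of near-equal size."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    records = list(records)
    base, extra = divmod(len(records), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional record.
        size = base + (1 if i < extra else 0)
        batches.append(records[start:start + size])
        start += size
    return batches


def to_fasta(records: Iterable[Record]) -> str:
    """Serialize records back to FASTA text, one sequence per line."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Each batch returned by `split_into_batches(parse_fasta(text), 6)` could then be written out with `to_fasta` and submitted separately. Splitting by record count is an assumption of this sketch; a limit expressed in total residues would need a different batching rule.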

Prediction of secreted proteins using Proteome Analyst 2.5
Proteome Analyst is a publicly available, high-throughput, Web-based system for predicting various properties of each protein in an entire proteome (Szafron et al., 2004). We uploaded a FASTA-format file containing the 454 SP-containing sequences identified by SignalP 3.0 for classification. After removal of the transmembrane proteins, a set of 167 secreted proteins was obtained and functionally classified. A motif search of the uncharacterized secreted proteins was carried out using MYHITS (http://myhits.isb-sib.ch/cgi-bin/motif_scan).

Prediction of lipoprotein signal peptides using LipoP 1.0
The hidden Markov model (HMM) was able to distinguish between lipoproteins (SPase II-cleaved proteins), SPase I-cleaved proteins, cytoplasmic proteins and transmembrane proteins. The HMM was able to identify 92.9% of the lipoproteins included in a Gram-positive test set. The results obtained were significantly better than those of previously developed methods (Juncker et al., 2003).

Prediction of twin-arginine signal peptides using TatP 1.0
The method is able to discriminate Tat signal peptides from cytoplasmic proteins carrying a similar motif, as well as from Sec signal peptides, with high accuracy (Nielsen et al., 1999).

Length distribution of predicted signal peptides
The 167 predicted signal peptides (Figure 1) had a length varying from 20 to 55 residues, with an average of 31 residues.
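The summary statistics above (minimum, maximum and mean length, plus the distribution plotted in Figure 1) can be computed with a short Python sketch. It assumes the predicted signal peptides are available as plain amino acid strings; `length_summary` is a hypothetical helper name, and the toy input in the usage note is illustrative, not data from this study.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Sequence


def length_summary(signal_peptides: Sequence[str]) -> Dict[str, object]:
    """Summarize the length distribution of a set of signal peptides."""
    if not signal_peptides:
        raise ValueError("at least one signal peptide is required")
    lengths = [len(sp) for sp in signal_peptides]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        # Number of peptides observed at each length, sorted by length.
        "histogram": dict(sorted(Counter(lengths).items())),
    }
```

For example, `length_summary(["A" * 20, "A" * 31, "A" * 42])` reports a minimum of 20, a maximum of 42 and a mean of 31 for three toy peptides.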

Frequency of the 20 amino acid residues in signal peptides
The frequency of the 20 amino acid residues in the predicted signal peptides is shown in Figure 2. Nonpolar residues are the most abundant, with alanine (Ala) the most frequent (18%). The frequencies of the positively charged residues arginine (Arg), lysine (Lys) and histidine (His) are 5.0, 3.7 and 0.8%, respectively, while those of the negatively charged residues aspartate (Asp) and glutamate (Glu) are 1.2 and 1.6%, respectively. Among the uncharged residues, serine (Ser) is the most frequent (9.7%) and tyrosine (Tyr) the least (0.5%). The fact that the residues with a frequency above 5% are mostly aliphatic amino acids suggests that such residues are involved in the targeting of secreted proteins to specific membrane locations in C. glutamicum.

Figure 2. The frequency of the 20 amino acid residues of signal peptides in secreted proteins.
Table 1. The frequency of the 20 amino acid residues around the signal peptidase cleavage sites (%).
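The residue composition described in this section can be sketched in Python as follows. The sketch assumes the signal peptides are given as one-letter amino acid strings and pools all residues before computing percentages; whether the original analysis pooled residues or averaged per peptide is not stated, and `residue_frequencies` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def residue_frequencies(peptides: Iterable[str]) -> Dict[str, float]:
    """Return the pooled percentage of each standard residue across all peptides."""
    counts = Counter()
    for peptide in peptides:
        # Ignore anything that is not one of the 20 standard residues.
        counts.update(ch for ch in peptide.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acid residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}
```

With the toy input `["MKAA", "ARLA"]`, alanine accounts for 4 of 8 residues, so `residue_frequencies` returns 50.0 for `"A"`.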

Change of amino-acid residues in C-domain
Three distinct regions comprise the N-terminal signal sequence: the charged N-terminus (N-domain), the hydrophobic core (H-domain), and the C-terminal cleavage domain (C-domain) (Akita, 1990; Paetzel et al., 1998).
The C-domain of the predicted signal peptides carries a type I SPase cleavage site, with the consensus sequence A-X-A at positions -3 to -1 relative to the SPase I cleavage site (Table 1). The frequency of Ala at positions -3, -1 and +1 is 51.5, 74.9 and 21.0%, respectively. It is important to note that the C-domain must have an extended (β-sheet) structure for effective interaction with the active site of type I SPases. Based on the crystal structure of the type I SPase of Escherichia coli, the side chains of the residues at positions -1 and -3 are thought to be bound in two shallow hydrophobic substrate-binding pockets (S1 and S3) of the active site, whereas the side chain of the residue at position -2 points outwards from the enzyme. It is presumably for this reason that residues tolerated at positions -3 and -1 of the signal peptide are generally small and uncharged, while almost all residues seem to be allowed at position -2. Nevertheless, a preference for Ser (17%) at position -2 of the signal peptide seems to exist in C. glutamicum.
According to the predictions, an Ala residue is most abundant (21%) at position +1 of the mature protein, but all other residues, with the exception of tryptophan (Trp), seem to be allowed at this position.
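Position-specific frequencies of the kind reported in Table 1 can be tabulated with the following Python sketch. It assumes each precursor is given together with its predicted signal peptide length, so that position -1 is the last residue of the signal peptide and position +1 is the first residue of the mature protein; the function name `position_frequencies` and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def position_frequencies(
    precursors: Iterable[Tuple[str, int]],
    positions: Sequence[int] = (-3, -2, -1, 1),
) -> Dict[int, Dict[str, float]]:
    """Percentage of each residue at positions relative to the cleavage site.

    Each precursor is (sequence, signal_peptide_length). There is no
    position 0: the cleavage site lies between positions -1 and +1.
    """
    if 0 in positions:
        raise ValueError("position 0 is not defined")
    tables = {pos: Counter() for pos in positions}
    for seq, sp_len in precursors:
        for pos in positions:
            # Convert the cleavage-relative position to a 0-based index.
            idx = sp_len + pos if pos < 0 else sp_len + pos - 1
            if 0 <= idx < len(seq):
                tables[pos][seq[idx]] += 1
    result = {}
    for pos, counts in tables.items():
        total = sum(counts.values())
        result[pos] = {aa: 100.0 * n / total for aa, n in counts.items()} if total else {}
    return result
```

For the two toy precursors `("MKLLLVASAQE", 9)` and `("MRTLLAGAAT", 8)`, both carry Ala at -3 and -1 (an A-X-A pattern), so the sketch returns 100.0 for `"A"` at those positions, while position -2 is split evenly between Ser and Gly.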

Lipoprotein signal peptides
Putative lipoprotein signal peptides were identified through similarity searches in the LipoP 1.0 database. Putative lipoprotein sorting signals identified by this method were combined with those identified by SignalP, resulting in a total of 46. Signal peptides of lipoproteins differ in several respects from those of secretory proteins.

Twin-arginine signal peptides
Proteins containing a signal peptide with the RR-motif (R-R-X-#-#, where # is a hydrophobic residue) may be transported via the Tat pathway. Through a TatP 1.0 database search for the presence of this motif in amino-terminal protein sequences, a total of 10 putative RR-signal peptides were identified (Table 2). Notably, the RR-motif was also found in the signal peptides of two putative lipoproteins, suggesting that these proteins might also be substrates for the Tat pathway.
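The R-R-X-#-# pattern can be illustrated with a simple regular-expression scan in Python. This is only a sketch of the motif definition given above, not a re-implementation of TatP 1.0. The set of residues treated as hydrophobic (`HYDROPHOBIC`) and the size of the amino-terminal window are assumptions chosen for illustration, and `find_rr_motif` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: residues counted as hydrophobic ('#') for this illustration.
HYDROPHOBIC = "AVLIMFW"

# R-R-X-#-#: two arginines, any residue, then two hydrophobic residues.
RR_MOTIF = re.compile(rf"RR.[{HYDROPHOBIC}]{{2}}")


def find_rr_motif(sequence: str, n_terminal_window: int = 40) -> Optional[int]:
    """Return the 0-based start of the first RR-motif in the N-terminal window.

    Returns None when no motif lies entirely within the window. The default
    window of 40 residues is an assumption, not a value from the text.
    """
    match = RR_MOTIF.search(sequence.upper()[:n_terminal_window])
    return match.start() if match else None
```

For the toy sequence `"MSNRRDFLKAAGLA"`, the motif `RRDFL` starts at index 3, whereas `"MKKRRSEEA"` contains two arginines but no hydrophobic pair after them and yields `None`. A pattern match alone cannot separate true Tat substrates from cytoplasmic proteins that happen to carry a similar motif, which is why a dedicated predictor was used in the study.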

Functional classification of predicted secreted proteins
A total of 167 secreted proteins have functional descriptions, of which 51 are annotated only as secreted proteins, 35 are transport system proteins, 43 are enzymes, and 18 are binding proteins and precursors (Figure 3).
In order to identify potential functions of the 51 uncharacterized secreted proteins, motif analysis was carried out using MYHITS. As a result, five sequences were found to contain motifs related to known active sites (Table 3).

Secretion by C. glutamicum of heterologous proteins
C. glutamicum has been used for the industrial production of amino acids for several decades. However, there have been only a few reports concerning heterologous protein secretion in C. glutamicum. Recently, many studies have demonstrated that Streptomyces mobaraensis transglutaminase (Date et al., 2004; Kikuchi et al., 2003), an enzyme used in the food industry, and human epidermal growth factor can be efficiently secreted in active form by C. glutamicum, so this strain is a potential host for industrial-scale protein production. In addition, the Tat pathway in C. glutamicum has been demonstrated to specifically mediate the secretion of Arthrobacter globiformis isomaltodextranase and green fluorescent protein (GFP) carrying an E. coli TorA signal peptide. Furthermore, the Tat-pathway-dependent secretion of GFP has been shown to be far superior in C. glutamicum compared with two other Gram-positive bacteria, Bacillus subtilis and Staphylococcus carnosus (Meissner et al., 2007). More recently, secretion of Streptococcus bovis α-amylase by C. glutamicum using the cspB promoter and signal sequence was reported for the efficient utilization of raw starch, identifying C. glutamicum as a very useful host for the expression of heterologous proteins (Kikuchi et al., 2006; Tateno and Akihiko, 2007; Kikuchi et al., 2007).
In the present prediction, about 21% of secreted proteins were involved in transport systems, while 10 putative RR-signal peptides were identified. Thus, the observation that C. glutamicum can efficiently release different heterologous proteins into the culture medium subsequent to their Tat-dependent translocation across the plasma membrane is somewhat unexpected, particularly in view of the fact that the proteins translocated via the Tat pathway usually arrive at the trans side of the cytoplasmic membrane in a fully folded state.
Hao et al. 1565

Figure 3. Classification of predicted secreted proteins. Categories and counts shown in the figure: Binding protein and precursor, 18; Others, 20; Transferase, 7; Permease of the major facilitator superfamily, 3; Hydrolase, 3; Isomerase, 3; Protease, 3; Transport system, 35.

Table 3. Sequences with key motifs in secreted proteins with unknown function.
ID of protein | Description of the motif | Possible function
gi_62389680 | Serine proteases, trypsin family, histidine active site | TonB-dependent receptor
gi_62391027 | Transcription factor TFIIB repeat signature | A target of gene-specific transcriptional activators
gi_62391174 | Aldehyde dehydrogenases glutamic acid active site | Aldehyde dehydrogenase

Identified enzymes in the predicted secreted proteins
Extensive work has been done to investigate functions of C. glutamicum other than amino acid production that involve secreted enzymes, such as aromatic compound degradation. In those studies, many secreted proteins were identified, some of which were also identified by our prediction (Table 4).

DISCUSSION
In the present work, we set out to identify all genes in the C. glutamicum genome database whose deduced proteins are likely to be soluble secreted proteins (the secretome). While certain C. glutamicum secretory proteins have been studied in detail, such as the six mycolyltransferase genes and their gene products, more data on the entire secretome are needed. One approach to rapidly predict the functions of an entire proteome is to utilize genomic database information and prediction algorithms. The use of computer-based prediction algorithms is a powerful, systematic and rapid way to obtain preliminary functional information on the gene products of an entire genome. The information can then be analyzed globally, to organize functional groupings of predicted proteins, or individually, to identify genes of particular interest for future experimental study. The C. glutamicum genome database was queried to identify all genes whose deduced proteins are likely to be secreted in order to: (a) obtain a global perspective on secreted proteins in C. glutamicum; and (b) identify previously uncharacterized genes for further experimental study. A series of prediction algorithms available on internet-based servers was therefore used to analyze the C. glutamicum genome database. Among the genes identified in this study are those whose proteins have signal peptides and are known to be secreted extracellularly, including cop1, cmt1, cmt2, cmt3, cmt4 and cmt5. Interestingly, 42 of the predicted secreted proteins are of unknown function. To gain additional insight into the functional properties of these potential C. glutamicum secretory proteins, we carried out an extensive motif search and found several related active sites, such as those of serine proteases and aldehyde dehydrogenases.
Important limitations of this approach are that it relies on prediction algorithms with a defined error rate, which could potentially be greater in specific organisms. Furthermore, these prediction algorithms are useful for rapid preliminary analyses of large amounts of genomic data, but it must be emphasized that these are only predictions, which require experimental validation. The present approach was to be inclusive rather than exclusive, so overall these results probably represent an overestimation of the actual C. glutamicum secretome, especially since many open reading frames (ORFs) in the genome database have not been confirmed experimentally and some ORFs may not be expressed.
In conclusion, in further experimental studies we would like to examine the novel secreted proteins and identify their functions using proteomics-based approaches.

Figure 1. Length distribution of the predicted signal peptides.

Table 4. Identified enzymes in the predicted secreted proteins.
ID of protein | Description | Reference
 | Trehalose corynomycolyl transferase 2.3.1.122 | Brand et al., 2003
gi_62391716 | Trehalose corynomycolyl transferase 2.3.1.122 | Brand et al., 2003
gi_62391453 | Putative secreted protein, hypothetical endoglucanase | Brand et al., 2003
gi_62390420 | Secreted cell wall-associated hydrolase | Brand et al., 2003
gi_62389427 | Secreted protein | Brand et al., 2003
gi_62390276 | Putative secreted hydrolase | Brand et al., 2003
gi_62389528 | ABC-type amino acid transport system, secreted component | Schluesener et al., 2007
gi_62390348 | NADH dehydrogenase 1.6.99.3 | Schluesener et al., 2007
gi_62390845 | Putative secreted or membrane protein | Schluesener et al., 2007
gi_62389441 | Probable short-chain dehydrogenase, secreted | Huang et al., 2008