Computational identification of putative cytochrome P450 genes in soybean (Glycine max) using expressed sequence tags (ESTs)

Cytochrome P450s are a group of monooxygenases that exist as a gene superfamily and play an important role in metabolizing physiologically important compounds in plants. However, to date only a limited number of P450s have been identified and characterized in soybean (Glycine max). In this work, a computational study of soybean expressed sequence tags (ESTs) was performed using data mining methods and bioinformatics tools; as a result, 78 putative P450 genes were identified, including 57 new ones. These genes were classified into five clans and 20 families by sequence similarity, and among the 57 new genes, 18 new subfamilies were found that have not been observed previously in soybean. This work may provide a basis for further functional dissection of P450 genes in soybean and other legumes.

*Corresponding author. E-mail: dyyu@njau.edu.cn. Tel/Fax: 86 25 84396410.
Plant P450s are also responsible for catabolizing some endogenous signaling molecules as well as enhancing exogenous compounds in the environment (Chaudry et al., 2002; Harvey et al., 2002). Due to the usefulness of oxygen in building complex molecules, cytochrome P450 enzymes are abundant in these pathways and comprise approximately 1% of the genes in plant genomes such as Arabidopsis, rice and poplar (Nelson, 2006). The complexity of the biochemical reactions and genomes of the more diverse phyla, including the 230,000 named species of angiosperms (Margulis et al., 1998), suggests that there is still much to be discovered, particularly given the proliferation of P450-mediated reactions already characterized in plants (Kahn and Durst, 2000; Werck-Reichhart et al., 2002).
P450s are widespread in the plant kingdom and constitute a gene superfamily. There are two main classes of P450s in plants, containing 10 clans and 62 families; however, only clans 71, 85, 86, 74 and 97, containing a total of 22 families, have been identified in Fabales (Nelson et al., 2004). The Fabaceae (legumes) are the third largest family of flowering plants, comprising more than 650 genera and 18,000 species. Economically, legumes represent the second most important family of crop plants after the Poaceae (grass family), accounting for around 27% of world crop production (Graham and Vance, 2003).
Soybean (Glycine max) is one of the most economically important legume species (VandenBosch and Stacey, 2003). The large soybean expressed sequence tag (EST) database at NCBI contained over 1,386,618 sequence entries from 73 non-normalized complementary deoxyribonucleic acid (cDNA) libraries, representing various vegetative and reproductive organs (as of 20 February 2008). These resources are very helpful for acquiring a greater knowledge of genomes and their roles. Such databases now offer an opportunity to identify previously uncharacterized genes and to assess the frequency and tissue specificity of their expression in silico.
Plant genomics is moving from genome sequencing into the post-genomic era, yet large numbers of genes remain to be annotated. This study used ESTs, which are well suited to computational analysis. In silico resources and bioinformatics tools were applied to detect, identify and annotate putatively functional P450-encoding sequences in soybean. Phylogenetic analysis of the deduced amino acid sequences allowed identification of paralogous genes and clusters of orthologous groups, allowing further characterization of P450 genes with both known and unknown functions.

Collection of putative P450 sequences from G. max
The National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/) was used to retrieve soybean nucleotide sequences, and 70 nucleotide sequences were found. The strategy used to discover the soybean P450 family at NCBI was as follows. Each nucleotide sequence was retrieved at NCBI with the tblastn option. ESTs with unregistered accession numbers were BLASTed using blastn with the nr/nt option. ESTs returning 'no significant similarity found' were BLASTed using blastn with the ESTs option. These ESTs were then BLASTed at Phytozome (http://www.phytozome.net) with soybean as the target genome. Genomic regions with an E-value less than 0.5 were selected, and 5 to 10 kb of genome sequence was obtained for each accession number using the download sequence file function under reports and analysis, with soybean as the data source. These genomic sequences were analyzed at Softberry (http://www.softberry.com) with the online gene-finding software for eukaryotes, FGENESH (HMM-based gene structure prediction). The sequences were pasted in and Medicago (a legume) was selected as the organism to obtain the ORFs and corresponding protein sequences. The protein sequences were then BLASTed at NCBI using blastp to detect P450 conserved domains. In this way, 78 new P450 protein sequences were discovered. In total, 133 soybean protein sequences were collected, of which 55 were already known and termed P450.
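
The E-value cutoff used to select genomic regions can be illustrated with a minimal sketch. The searches described above were run through web interfaces, so the 12-column tab-separated hit format assumed here (the standard BLAST tabular layout, with the E-value in the eleventh column) and the identifiers `EST_A`, `EST_B`, `Gm01` and `Gm07` are hypothetical and serve only to show the filtering step.

```python
def filter_hits(lines, max_evalue=0.5):
    """Keep hits whose E-value is below max_evalue.

    Assumes tab-separated BLAST tabular lines (12 standard columns),
    where column 11 (index 10) holds the E-value. This input format is
    an assumption for illustration, not the format used in the study.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Hypothetical hits: one strong match and one that fails the cutoff.
example = [
    "EST_A\tGm01\t98.5\t400\t6\t0\t1\t400\t1000\t1399\t1e-150\t700",
    "EST_B\tGm07\t35.0\t60\t39\t0\t1\t60\t500\t559\t2.3\t28.1",
]
selected = filter_hits(example)
print(selected)
```

Only the first hypothetical hit passes the 0.5 threshold; the second is discarded before any genomic sequence would be downloaded.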

Phylogenetic analysis of putative P450s
Predicted P450 protein sequences from soybean and representative members of known P450 families from other plants were used for alignment and phylogenetic analysis. Multiple sequence alignment of the P450 proteins was performed using the CLUSTALX program (version 1.81) (Thompson et al., 1997). The phylogenetic analysis was carried out by the neighbor-joining (N-J) method (Saitou and Nei, 1987), and a neighbor-joining tree was constructed from the protein sequences in CLUSTALX. The significance level of the neighbor-joining analysis was examined by bootstrap testing with 1000 repeats. The tree was displayed using N-J plot in the CLUSTALX program.
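
For readers unfamiliar with the neighbor-joining method, the following is a minimal sketch of the algorithm of Saitou and Nei (1987) applied to a small toy distance matrix. It is not the CLUSTALX implementation used in this study; the function name `neighbor_joining`, the taxon labels and the distances are illustrative assumptions, and bootstrapping is not shown.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the joins made by neighbor-joining, in order.

    names: list of taxon labels.
    dist:  dict mapping frozenset({x, y}) -> pairwise distance.
    Each join is (node_a, node_b, branch_len_a, branch_len_b, new_node).
    The final tuple records the single branch linking the last two nodes
    (its length is in the third slot; new_node is None).
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i
        counter += 1
        u = f"node{counter}"
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append((i, j, len_i, len_j, u))
    a, b = nodes
    joins.append((a, b, d[frozenset((a, b))], 0.0, None))
    return joins


# Toy five-taxon distance matrix (hypothetical values).
taxa = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(p): float(v) for p, v in pairs.items()}
for join in neighbor_joining(taxa, toy_dist):
    print(join)
```

In practice the distance matrix would be derived from the protein alignment, and the procedure would be repeated on resampled alignment columns to obtain bootstrap support values.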

Identification of putative P450 genes in soybean
We identified 78 putative P450 genes in soybean. All the putative P450 genes presented here were identified by sequence similarity searches. Query proteins were taken from representative members of each plant P450 family, with functionally characterized members preferred. The similarity-based search was refined by multiple alignment and by searching for conserved domains, as described in the methods. All the sequences analyzed possessed the structures typical of the P450 family. Thus, we annotated 57 P450 genes according to the standardized system of P450 nomenclature (Nelson et al., 1996). These P450 sequences were distributed among 20 P450 families and five clans (clans 71, 72, 74, 85 and 86), which suggests that P450 genes exist as a superfamily in soybean. Approximately 73% (57 out of 78) of the putative P450s identified had been annotated as P450, P450-like, or related to P450, without any indication of similarity to particular families, and had been given names related to P450s (Table 1). They therefore could not be assigned to particular families solely on the basis of the description given by the database. The annotation of these "anonymous" P450s was improved in this study by identifying their similarities to characterized P450s that were not available when the sequences were first annotated. The original EST annotations had been made by similarity comparisons to previously identified genes and were improved in this study.
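
As an illustration of how names in the standardized nomenclature encode family and subfamily, the sketch below splits a CYP name into its family number, subfamily letter and member number, and groups names by family. The example names are those reported in the Discussion; the helper names `parse_cyp` and `group_by_family`, and the handling of trailing characters (as in CYP74C16X or CYP94D24V2) as a free-form suffix, are assumptions made for this sketch.

```python
import re
from collections import defaultdict

# CYP + family number + subfamily letter(s) + member number + optional suffix.
CYP_NAME = re.compile(r"^CYP(\d+)([A-Z]+)(\d+)(\w*)$")


def parse_cyp(name):
    """Split a CYP name into family, subfamily, member and suffix."""
    match = CYP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognizable CYP name: {name!r}")
    family, subfamily, member, suffix = match.groups()
    return {
        "family": int(family),
        "subfamily": subfamily,
        "member": int(member),
        "suffix": suffix,
    }


def group_by_family(names):
    """Map each family number to the sorted list of names in that family."""
    groups = defaultdict(list)
    for name in names:
        groups[parse_cyp(name)["family"]].append(name)
    return {family: sorted(members) for family, members in groups.items()}


names = ["CYP75B40", "CYP706A10", "CYP706K1", "CYP736A28", "CYP74C16X", "CYP94D24V2"]
for name in names:
    print(name, parse_cyp(name))
print(group_by_family(names))
```

For example, CYP706A10 and CYP706K1 share family 706 but fall into different subfamilies (A and K).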

Phylogenetic analysis of predicted P450 families
An N-J phylogenetic tree for the identified and predicted sequences from soybean and representative members of P450 families was constructed using MEGA4.1 (Kumar et al., 2008) (Figure 1). As can be seen from Figure 1, plant P450 genes were classified into two branches, A-type and non-A-type. All the soybean putative P450 sequences clustered into their corresponding branches. The proteins encoded by the A-type genes form a single clan containing 75% of the soybean sequences (43 out of 57) and represent many plant-specific enzymes functioning in the synthesis of secondary products (phenylpropanoids, glucosinolates, isoprenoids, alkaloids, etc.). The proteins encoded by the non-A-type genes represent multi-kingdom enzymes functioning in the synthesis of more general compounds (sterols, oxygenated fatty acids, etc.) and in the synthesis of hormones and other molecules.

[Table 1 column headings: Clan; Family; Name*; EST; Previous annotation]

DISCUSSION

P450 genes exist as a superfamily in soybean
In this study, 11 families of clan 71 were discovered in soybean, including eight new subfamilies and three families of clan 71 not previously reported in soybean (families 75, 706 and 736); the new members were designated CYP75B40, CYP706A10, CYP706K1, CYP736A28, CYP736A29, CYP736A30, CYP736A31, CYP736A32, CYP736A33 and CYP736A34. Consistently across the angiosperms, clan 71 is by far the largest, with one third of the plant P450s in any genome. CYP71 clan sequences are found in other plant genomes and seem to have arisen early in land plants (Durst and Nelson, 1995). Four further clans were identified in this study, namely clans 72, 74, 85 and 86. In clan 72, three families had not been explored previously in soybean: one contained a new subfamily and the other two were new families, annotated as CYP714A9 and CYP734A17. Members of clan CYP72, particularly the CYP734A subfamily, inactivate brassinolide (Turk et al., 2005). In clan 74, only one new subfamily, not previously reported in soybean, was found; it was named CYP74C16X and may be an incomplete sequence or a pseudogene.
It has been noted that the CYP74 family is important in modifying unsaturated fatty acid hydroperoxides derived from linoleic acid or α-linolenic acid and includes allene oxide synthases, hydroperoxide lyases and divinyl ether synthases (Nelson, 2006).
In clan 85, four families were discovered, of which two contained new subfamilies and the other two were new families not previously reported in soybean. The two new families were annotated as CYP716G1 and CYP722A1. The CYP85 clan has several functions and includes kaurenoic acid oxidase (Winkler and Helentjaris, 1995; Helliwell et al., 2000). In clan 86, four new subfamilies of family CYP94 (named CYP94C18, CYP94C19, CYP94C20 and CYP94D24V2) were found that had not previously been explored in soybean. The CYP86 clan has three families: CYP86, CYP94 and CYP704. The functions of these families appear to have been established very early in plant evolution, when land plants needed to protect themselves against water loss (Nelson, 2006). Further studies using multiple approaches, such as cloning, single nucleotide polymorphisms (SNPs) and functional association, will be required to resolve function and divergence in this gene superfamily. Large knowledge gaps still surround plant P450s in spite of the model plants Arabidopsis and rice, as these can hardly be representative of their 170,000 dicot and 65,000 monocot relatives.

Phylogenetic analysis revealed new putative P450 families in soybean
This study found seven new families and 18 new subfamilies not previously explored in soybean (Table 1 and Figure 1). The five clans (clans 71, 72, 74, 85 and 86) correspond to the dicot plant clans, each with members in soybean. With the exception of the CYP71 clan, the other four clans seem to be involved in conserved functions that relate to sterol and isoprenoid biosynthesis (clan 85), fatty acid metabolism (clan 86), biosynthesis of oxylipins (some subfamilies of clan 74) and plant hormone homeostasis (clan 72) (Werck-Reichhart et al., 2002). Many of the families of clan 71 and some particular subfamilies appear to be species specific and represent success in recruiting P450s for evolutionary novelty (Li et al., 2007). With the completion of the soybean genome project, more P450s are expected to be discovered. This study may provide a basis for further functional dissection of P450s in soybean and other legumes.

Figure 1. Phylogenetic analysis of predicted and annotated P450s. Phylogenetic tree of the collected G. max P450s and representative members of P450 families. The unrooted phylogenetic tree of P450s was produced with the CLUSTAL X (version 1.81) program and the neighbor-joining (N-J) method. An N-J tree was constructed using MEGA 4.1; the significance level of the N-J analysis was examined by bootstrap testing with 1,000 repeats. The numbers beside the branches represent bootstrap values based on 1,000 replications. At, Arabidopsis thaliana; Cj, Camellia japonica; Cr, Catharanthus roseus; Eg, Eustoma grandiflorum; Lj, Lotus japonicus; Mt, Medicago truncatula; Nt, Nicotiana tabacum; Os, Oryza sativa; Ph, Petunia hybrida; Pi, Petunia inflata; Pr, Pinus radiata; Ps, Pisum sativum; Sm, Solanum melongena; St, Solanum tuberosum; Zm, Zea mays.