Expressed sequence tags ( EST ) analysis of a normalized full-length cDNA library from the pinewood nematode ( Bursaphelenchus xylophilus )

The pinewood nematode (Bursaphelenchus xylophilus) infects pine trees and causes pine wilt disease. To clarify the functions and subcellular localization of B. xylophilus genes/proteins transcribed and predicted from mixed stages (egg, J1, J2, J3 J4 and adult), we prepared a normalized full-length B. xylophilus cDNA library and analyzed expressed sequence tags (ESTs) using the Pendant-Pro sequence analysis suite. Most cDNAs inserted into the library ranged from 0.9 to 1.8 kb (average 1.5 kb). The 1,902 ESTs from B. xylophilus consisted of 286 clusters and 1,273 singletons. EST sizes ranged from 9 to 743 bp with a mean of 336 bp. The predicted protein length from B. xylophilus ESTs revealed that most proteins ranged from 50 to 149 amino acids. Enzyme nomenclature (EC) numbers were classified into 133 (8.5%) of 1,559 contigs using UniProt database hits by the EC numbers method. Transmembrane regions of 1559 clusters were predicted using the TMpred algorithm. The 1,559 contigs with transmembrane regions were annotated; 481 (30.8%) contigs were assigned ‘above one’ domain and 1,078 (69.1%) were assigned ‘none.’ Additionally, taxonomy was classified for 672 (43.1%) of 1,559 contigs. Of the 1,559 contigs, 685 (43.9%) were assigned gene ontology terms using the gomerger method of contigs, including singletons. Thirty-one (31) contigs of predicted proteins grouped by BLASTP identity values had significant homology to genes expressed in subcellular structures (for example, mitochondrion, plasma membrane, endoplasmic reticulum, nucleus and golgi). B. xylophilus ESTs provide the foundation for research information on related plant parasite nematodes and contribute to finding an important novel parasite control strategy.


INTRODUCTION
The pine wood nematode (PWN), Bursaphelenchus xylophilus (Steiner and Buhrer, 1934) Nickle, a plant parasitic nematode, was first described in 1934 in Louisiana, and thus originated in North America.From there it was introduced into Japan, then spread to neighboring East Asian countries, such as China andKorea in 1982 and1988, respectively, and it was found in Portugal in 1999 (Mota et al., 1999;Mota and Vieira, 2008) and in Spain in 2008 (Abelleira et al., 2011).PWN is the causal agent of pine wilt disease (PWD) in pine trees (Linit, 1988;Yan et al., 2012).PWD typically occurs in mature pine trees 20 or more years old.Nematodes feed on the cells surrounding resin ducts, which causes resin to leak into tracheids, resulting in tracheid cavitation or air pockets in the water transport system.Next, transpiration of the tree cannot be sustained, which eventually leads to pine death (Myers, 1988;Donald et al., 2003).The beetle Monochamus alternatus (Hope) was immediately designed as a principal vector of the causal agent and spread nematodes from tree to tree (Shibata, 1987;Sakai and Yamasaki, 1990;Fan et al., 2007).Species of Monochamus from conifers are the principal vectors of B. xylophilus, and of these M. alternatus is the major vector in Japan, whereas M. carolinensis and M. scutellatus are the major vectors in North America and in Europe it is Monochamus galloprovincialis.The dispersal fourth-stage dauber larvae of the pinewood nematode cause primary transmission to occur during maturation feedings by the pine sawyer (Fielding and Evans, 1996).Also, nematode activity in pine trees is similar to natural infection by dispersal fourth-stage dauber larvae transmitted from pine sawyers when a pine tree is infected with nematodes isolated from fungal cultures (Linit 1988;Sriwati et al., 2007).
In recent years, the GenBank entries for B. xylophilus and B. mucronatus revealed 13,327 and 3,193 ESTs, respectively (Kikuchi et al., 2007).Additionally, the complete B. xylophilus genome (genome size of 74.5 Mb with a GC content of 40.4%) sequence was annotated by Kikuchi and colleagues (Kikuchi et al., 2011), as well as the complete mitochondrial genome of B. xylophilus (14,778 bp) (Sultana et al., 2013).Also, plant parasitic nematode ESTs have recently been generated for analyzing the function of genes from several plantparasitic nematode species (Scholl et al., 2003;Dubreuil et al., 2007;Kikuchi et al., 2007;Nagaraj et al., 2008;Rosso et al., 2008;Sultana et al., 2013).However, since the majority of ESTs were generated from nematode total RNA without normalization of a cDNA library, these EST libraries may contain multiplicative ESTs of the nematode.
In this study, we report an analysis of 1,902 ESTs from a normalized full-length cDNA library of B. xylophilus mixed stages (egg, J1, J2, J3 J4, and adult).Firstly, the ESTs from the B. xylophilus library were grouped into clusters that were analyzed by the most conserved nematode genes between B. xylophilus and other nematodes, classification of EC number, transmembrane regions, taxonomy, gene ontology (GO), and the identification of gene-related subcellular localization were done.These results provide the foundation for information studies focusing on understanding the gene function of B.
xylophilus nematode, as well as the development of a novel parasite control strategy.

Nematode propagation
B. xylophilus (Bx90) used in this study were isolated from the Gangneung area, Korea and was propagated on a lawn of Botrytis cinerea cultured on a potato dextrose agar plate at 28°C.Mixed stages (egg, J1, J2, J3, J4 and adult) were collected on 25 µm sieves.Eggs and nematodes were sterilized with 1% NaOCl and suspended in sterile deionized water.The samples were frozen in liquid nitrogen and ground with a mortar and pestle.

RNA preparation and cDNA library construction
Total RNAs from mixed stages were isolated using TRIzol reagent according to the manufacturer's instructions (Invitrogen, The Netherlands).A full-length cDNA library of B. xylophilus was prepared from total RNAs of mixed stages as described previously (Oh et al., 2003).
In brief, the pretreated total RNA with bacterial alkaline phosphatase (Roche Diagnostics, Switzerland) and tobacco acid pyrophosphatase (Wako, Japan) was ligated with a 5'oligoribonucleotide using RNA ligase (TaKaRa, Japan).
First-strand cDNA synthesis and amplification from purified mRNA were performed as described by Maruyama and Sugano (1994).Amplified PCR products were then digested as described by Kang et al. (2010).The ligated cDNA was then transformed into Escherichia coli Top10F' (Invitrogen, USA) by electroporation (Gene Pulser II; Bio-Rad, USA).The B. xylophilus cDNA library consisted of 4.8 × 10 6 colonies.To construct a normalized B. xylophilus cDNA library, a single-stranded DNA library was prepared as described previously (Vieira and Messing, 1987).Finally, we obtained a normalized library of 2 × 10 6 colonies.

Sequencing of plasmids
Each colony from the B. xylophilus cDNA libraries was picked into 96-well plates containing 0.5 mL of Luria Bertani (LB) medium containing 75 µg/mL ampicillin.Plates were incubated overnight at 37°C.Plasmids were purified using FB glass-fiber plates (Millipore, USA) to remove protein and cellular debris.Plasmid inserts were sequenced from the 5' end using the primer in the pCNS vector (5'-GGT CTA TAT AAG CAG AGC TC-3') and the BigDye terminator ver.3.1 kit (Applied Biosystems, USA) on an ABI 3730xl DNA sequencer (Applied Biosystems).

EST sequence clustering
Vector trimming and trimmed sequence cleaning for automated trimming, and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low complexity sequences were performed using the cross_match, SeqClean, and Lucy programs (Xie et al., 2010;Tae et al., 2012;Yang et al., 2012), respectively.The cleanup process included quality assessment, confidence reassurance, vector trimming, and vector removal.ESTs of at least 200 bp after both vector and low-quality trimming were regarded as "high-quality" ESTs.Clustering was performed using TGI clustering tools (TGICL), a software pipeline designed to automate clustering and assembly of a large EST/mRNA data set (Rensing et al., 2003;Menon et al., 2012).Sequence clustering was Figure 1.Evaluation of the cDNA library insert size on agarose gel.Colonies were picked at random and colony PCR was performed using T7 and SP6 promoter primers (lanes 1-48).M, DNA size marker (Invitrogen, USA) was performed with a slightly modified version of NCBI's megablast, and the resulting clusters were then assembled using the CAP3 assembly program.TGICL began with a large multi-FASTA file (an optional peer quality values file) and the output assembly files, as produced by CAP3.Both clustering and assembly phases could be parallelized by distributing the searches and assembly jobs across multiple CPUs since TGICL can take advantage of either SMP machines or parallel virtual machine (PVM) clusters (Lee et al., 2005).

Contig analysis
Each contig sequence was evaluated for similarity and annotation analysis of nucleotide and protein sequences using the Pendant-Pro Sequence Analysis Suite.The Pendant-Pro genome database provides an exhaustive pre-computed analysis using a large variety of established bioinformatics tools.To investigate protein function, BLAST similarity was searched against the complete nonredundant protein sequence database (Altschul et al., 1997).Motifs were searched against Pfam (Bateman et al., 2002), BLOCKS (Henikoff et al., 1999), PROSITE (Falquet et al., 2002), and InterPro (Apweiler et al., 2001) databases.Predictions of cellular roles and functions based on high-stringency BLAST were searched against protein sequences with manually assigned functional categories according to the Functional Catalogue (FunCat), developed by MIPS and Biomax Informatics AG.Enzyme nomenclature (EC) numbers were predicted based on similarity.Keywords and superfamilies were extracted by similarity-based assignments from the PIR-International sequence database (Barker et al., 2000).Sequences were assigned to known clusters of orthologous groups (COGS) (Tatusov et al., 2001).To investigate protein structure, transmembrane regions were predicted using the TMpred algorithm against TMbase (Hofmann and Stoffel, 1993).Local low similarity regions and entire regions were identified using non-globular domains based on the SEG algorithm (Wootton and Federhen 1993).Coiled-coil motifs were also predicted (Lupas et al., 1991).In addition, sequence similarities were searched on the NCBI GenBank and RefSeq databases using BLASTN.BLAST similarity hits were classified based on their taxonomic origin.

Evaluation of B. xylophilus cDNA library quality
To confirm and evaluate the insertion and quality of the B. xylophilus cDNA library, the range and average plasmid insert lengths were determined by PCR amplification.Forty-eight (48) cDNA clones were randomly picked from the library and colony PCR was performed with T7 and Sp6 promoter primers.Most cDNA inserts in the B. xylophilus libraries ranged from 0.9 to 1.8 kb with an average cDNA insert size of 1.5 kb (Figure 1) (1-48).These results are similar to those from our previous study in Meloidogyne incognita, as well as results from the library constructed from mixed-stage B. xylophilus vigorously growing on fungi (Kikuchi et al., 2007;Kang et al., 2010).2).Most clusters consisted of two to four ESTs, demonstrating the high quality of the normalized cDNA library.Also, these results show that abundant transcripts in total RNA during cDNA library preparation were removed and that the full-length cDNA library was successfully normalized.The quality of the cDNA library also showed more efficiency than the results from Kikuchi and colleagues (2011) due to gain probability of the contigs and singletons from analyzed ESTs of each library (our experiment and Kikuchi colleagues study were 82 and 50%, respectively; (Kikuchi et al., 2007;Kang et al., 2010).The sequence distribution of ESTs ranged from 9 to 743 bp with an average size of 336 bp (Figure 2A).The high frequency contigs ranged from 351 to 400 bp and represented 335 (21.5%) of 1,559 contigs.These results matched the consistent quality of the constructed B. xylophilus cDNA library in Figure 1.
In addition, protein length frequency values showed high-quality ESTs produced from B. xylophilus.The length of predicted proteins from B. xylophilus contigs revealed that most proteins ranged from 50 to 149 amino acids (88.1%).The longest proteins ranged from 150 to 299 amino acids and represented 113 (7.2%) of 1,559 total proteins (Figure 2B).In addition, the protein isoelectric point distribution indicated the high quality of ESTs produced from B. xylophilus.Putative proteins predicted from contig sequences were classified into nine groups with distinct isoelectric points: very strong basic proteins (>10.5 pI) represented 185 (11.9%) of 1,559 total proteins (Figure 2C); neutral proteins (6.5 < pI < 7.5) represented 153 (9.8%) of the total proteins; very strong acidic proteins (pI < 3.5) represented three (0.2%) of the total proteins (Figure 2C).However, these results only reflected isoelectric points predicted from part of the protein sequences.Furthermore, long-range sequencing of ESTs will need to be performed to determine isoelectric points of B. xylophilus full-length proteins.

Identification of highly conserved genes
Highly conserved genes between B. xylophilus and other nematodes were annotated by the BLASTP program using the UniProt database.Predicted proteins were classified by BLASTP identity values.The highlyconserved proteins (>90% identity) represented 1.2% of the total predicted proteins.Proteins without homology (0% identity) represented 56.3% of the total predicted proteins.

Functional classification of proteins based on GO assignments
Gene ontology (GO) has been widely used to address the need for consistent descriptions of gene products in different databases.It has presented three-structured and controlled ontologies that describe gene products in terms of their associated biological processes, cellular components, and molecular functions in a speciesindependent manner.Three individual memo methods, gobysimilarity, gofromeckw and gofrominterpro, were calculated to extract GO categories for a protein.The gomerger method integrates the results of the three algorithms (Martin et al., 2004).The gomerger method takes (as input) the GO term assignments for a given genetic element that are the outputs of the three predecessor methods: gofromeckw, gobysimilarity and gofrominterpro.Each GO assignment consists of the GO term number and quality score (BLAST E value or Interpro score).The latter is converted to a negative decimal logarithm.Each GO term number corresponds to a node on the GO tree.The gomerger method then  The sequence similarities of contigs were searched using the UniProt database.
searches the GO tree (which is kept in memory) upward and finds all ancestors of all GO nodes assigned.Scores from all preceding nodes are then added together, assigning GO categories to a protein sequence (Ashburner and Lewis, 2002;Martin et al., 2004).Using the gomerger method of contigs, we functionally assigned GO terms to 685 (43.9%) of 1,559 contigs (of 1,559 contigs, 1,105 align to InterPro domains and 58 to the keyword method).The 685 contigs with GO annotations from the B. xylophilus data set were further annotated, with 621 hits assigned a biological process (BP), 620 hits assigned molecular functions (MF), and 503 hits assigned a cellular component (CC) in GO terms.A summary of GO annotations by biological process, cellular component and molecular function is provided in Figure 3.Among the most common GO categories representing biological processes were cellular process (GO: 0009987) and metabolic process (GO: 0008152).
The largest number of GO terms in cellular component was 433 contigs in the cell (GO: 0005623), 433 contigs in a cell part (GO: 0044464), and 399 intracellular contigs (GO: 0005622).The largest number of GO terms in molecular function was 467 contigs in binding (GO: 0005488), 348 contigs in catalytic activity (GO: 0003824), and 328 contigs in protein binding (GO: 0005515).The largest number of GO terms in biological process was 465 contigs in cellular process (GO: 0009987), 415 contigs in metabolic processes (GO: 0008152), and 374 contigs in cellular metabolic processes (GO: 00044237).
These results were highly similar to a previous study analyzing ESTs in Meloidogyne incognita (Kang et al., 2010).However, in terms of reproduction, reproductive process, and response stimulate in biological processes, a higher frequency was found than in a previous study (Kang et al., 2009).The total transcriptome analysis of B. xylophilus will be needed for the complete GO analysis.

Enzymatic classification based on EC number
EC numbers were extracted by the EC numbers method from UniProt database hits using the BLASTPGP algorithm.EC numbers were classified for 133 (8.3%) of 1,599 contigs.For the B. xylophilus data set, 133 contigs with EC numbers were further annotated, with 44 assigned as oxidoreductases, 35 as transferases, 33 as hydrolases, 9 as lyases, 4 as isomerases, and 8 as ligases (Table 3).

Classification of membrane proteins based on the number of transmembrane domains
The TMpred algorithm predicts membrane-spanning regions and their orientation.The prediction is made  Acting on NADH or NADPH 5 1.6.5 With a quinone or similar compound as acceptor 3 1.6.5.3 NADH 2 dehydrogenase (ubiquinone) 3 1.13 Acting on single donors with incorporation of molecular oxygen (oxygenases) 3 1.13.11With incorporation of two atoms of oxygen 3 1.14 Acting on paired donors, with incorporation or reduction of molecular oxygen 4  The TMpred algorithm made a prediction of transmembrane helices in protein sequences.
using a combination of several weight matrices for scoring (Hofmann and Stoffel, 1993).Transmembrane regions of 1,599 clusters were grouped.For the B. xylophilus data set, 1,599 contigs with transmembrane regions could be further annotated, with 481 (30.9%) assigned to the 'above one' domain and 1,078 (69.1%) assigned as 'none' (Table 4).In the future, the longranged sequencing of insert cDNA will be performed to determine complete transmembrane regions of the fulllength protein.

Figure 2 .
Figure 2. Length and isoelectric point distributions of Bursaphelenchus xylophilus cDNA inserts and proteins.Frequency versus sequence length of B. xylophilus cDNA inserts (A).The distributions of protein length (B) and isoelectric point (C) were analyzed from a total of 1,559 EST sequences of B. xylophilus.

Figure 3 .
Figure 3. Percentage representation of gene ontology (GO) mappings for Bursaphelenchus xylophilus proteins.Biological process (A); Molecular function (B); Cellular component (C).Individual GO categories can have multiple mappings.Percentages shown represent the total categories annotated (not the total sequences annotated under each component).

Figure 4 .
Figure 4. Putative localization of Bursaphelenchus xylophilus proteins in the cell.B. xylophilus subcellular structures and components were depicted by a single animal cell.The respective thicknesses and size of subcellular components were not accurate.Putative B. xylophilus proteins were replaced by annotation of the corresponding protein in the UniProt database.

Table 3 .
Classification of selected EC numbers.

Table 3 .
Contd.EC numbers were extracted from UniProt database hits using the EC numbers method.Six representative EC classes were classified from B. xylophilus proteins.Each EC class is illustrated by a representative member.The number of genes associated with each EC class is indicated in the sorting column.The selected EC numbers have at least three hits.

Table 4 .
Classification of membrane proteins based on the number of transmembrane domains.