De novo assembly and characterization of transcriptome and microsatellite marker development for Taro (Colocasia esculenta (L.) Schott.)

Taro (Colocasia esculenta. (L.) Schott) is a genus of perennial plants that is widely distributed in the tropics or subtropics of Asia, Africa and America, which is the fourteenth most consumed vegetable of the world. However, molecular genetic research of Colocasia has been hindered by the insufficient genomic and transcriptome information. Here, the transcriptome of taro variety ‘Jingjiang Xiangsha’ from Jiangsu, China, was sequenced using the Illumina HiSeq TM 2000 platform in 2015. A total of 58,263,364 reads were generated, and assembly resolved into 65,878 unigenes with a N50 length of 1,357 bp. A total of 40,375 unigene sequences were successfully annotated based on searches against six public databases. Among the annotated unigenes, 14,753 were identified by gene Ontologyterms, 16,643 were classified to Clusters of Orthologous Groups categories, and 25,401 were mapped to 127 pathways in the Kyoto Encyclopedia of genes and genomes database. Also, 11,363 potential microsatellite loci were identified in 5,671 unigenes, and 150 primer pairs were randomly selected and amplified in 18 accessions of C. esculenta. A total of 100 primer pairs showed polymorphisms in repeat length. The number of alleles per locus ranged from 2 to 8. Across the 100 microsatellite loci, the polymorphism information content values ranged from 0.042 to 0.778. The transcriptomic data and microsatellite markers will play important roles in future functional gene analyses and genetic map construction of taro.


INTRODUCTION
Taro (Colocasia esculenta L. Schott) is a major root crop belonging to the family Araceae.This plant originates from tropical Asia and America, and has been cultivated and utilized as food source and a medicinal herb in China for 2,000 years, which is also the fourteenth most consumed vegetable worldwide as a staple source of diets with high yield and nutritional value for people around the world (Ramanatha et al., 2010).Until now, researches on C. esculenta mainly focus on the biological characteristics, morphological variations, resistance and economic performance (Ahmed et al., 2013;Hunt et al., 2013;Nath et al., 2014;Das and Das, 2014;Das et al., 2015;Doungous et al., 2015;Soulard et al., 2016;Oliveira et al., 2017).
Microsatellites are a special class of repetitive DNA sequences that are distributed throughout the genome, and gradually became preferred markers for many applications in genetics and genomics (Chaïr et al., 2016;Dai et al., 2016;Vandenbroucke et al., 2016).However, the applications of microsatellite markers require reference genome and transcriptome data, thus, the development of microsatellite markers for non-model species, such as C. esculenta, are blocked by high cost and technical difficulties.
RNA sequencing (RNA-Seq) is a high-throughput method to obtain large amounts of transcriptome sequence information, which is also effective for nonmodel organisms that lack a reference genome (Ellegren 2014;Waples et al., 2016).Transcriptome sequences include only encoding sequences, from which a high quality of functional information can help to reveal the molecular mechanisms and genetic maps (Fu et al., 2013;Hause et al., 2016).In addition, transcriptome data is feasible for a large-scale development of microsatellite markers.Compared with genomic simple sequence repeat (SSR) markers, genetic markers developed based on RNA-Seq technologies provide a high efficiency method to identify candidate functional genes.Moreover, the continuous improvements in next-generation sequencing (NGS) technologies have made it a simple, economical, and reliable approach for novel gene discovery, molecular mechanisms analysis and molecular marker-assisted selection in many non-model organisms (He et al., 2014).
Here, the Illumina HiSeq™ 2000 platform was used to characterize the transcriptome of C. esculenta.The transcriptome sequences were assembled and annotated based on searches against public databases.Microsatellites in the transcript sequences were detected in silico and SSR markers were randomly selected for validation experiment.The transcript sequence information and available SSR markers will provide valuable resources for further genetic and breeding research of C. esculenta.

Plant material and RNA extraction
For the RNA required for the transcriptome sequencing, leaves, stems and corms of taro variety "Jingjiang Xiangsha" (Figure 1) were sampled from three 150-day-old plants, which were planted in the greenhouse under a 14/10 h photoperiod at 25°C (day) and 20°C (night) in Institute of Industrial Crops, Jiangsu Academy of Agricultural Sciences.Fresh leaves of 18 accessions of C. esculenta were collected for the validation of polymorphic SSR markers (Table 1).All samples were snap-frozen in liquid nitrogen and stored at −70°C.A Total RNA isolation system (Takara, Japan) was employed to extract RNA from tissues of the 150-day-old plant, following the manufacturer"s instructions.The quality of the RNA Integrity Number (RIN) was verified using a 2100 Bioanalyzer RNA Nanochip (Agilent, Santa Clara, CA) and its concentration ascertained using a ND-1000 Spectrophotometer (NanoDrop, Wilmington, DE).The standards applied were 1.8≤ OD260/280≤2.2and OD260/230≥2.0.At least 20 µg of RNA was pooled in an equal amount from each leaves, stems and corms used.

cDNA library construction and sequencing
Illumina (San Diego, CA) sequencing was performed at the Genomics Institute (Wuhan, China; http://www.genomics.cn/index.php),following the manufacturer"s protocols.Poly (A) mRNA from the total RNA was isolated from generated fragments in the size range of 100-400 bp.The resulting fragments served as a template for the synthesis of the first strand cDNA.And then, second strand cDNA was synthesized and purified.The products were ligated through sequencing adapters, and sequenced using an Illumina HiSeq TM 2000 device.

De novo assembly and gene annotation
Image data output from the sequencing device were transformed into raw reads and stored in FASTQ format.After filtered, the adapter and low quality sequences and the assembly of the

Microsatellite marker development and primer design
Microsatellite loci were identified by the simple sequence repeat identification software MISA (MIcroSAtellite identification tool) (http://pgrc.ipk-gatersleben.de/misa/),applying the following parameters: a minimum of six repeats for dinucleotide motifs, of five for tri, of four for tetra, and of three for penta-and hexa-nucleotides.Appropriate primers of SSRs were designed through Primer 3.0 software (http://sourceforge.net/projects/primer3), based on the following criteria: primer length 18 to 22 bp (optimally 20 bp), Tm of 50 to 60°C (no more than a 4°C difference between the Tms of the forward and reverse primers) and an amplicon length in the range 100 to 400 bp.All primers were synthesized by Genscript (Nanjing, China).

SSR polymorphism validation and data analysis
Eighteen accessions of C. esculenta were selected for polymorphism validation of microsatellite loci.100 primer pairs were randomly selected.Genomic DNA was isolated from leaves using the standard phenol-chloroform protocol (Hunt et al., 2013).PCR amplification was performed on a gradient thermal cycler (Bio-Rad) with the following protocol: denaturation for 3 min at 95°C; 36 cycles of 94°C for 30 s, 60°C for 30 s, 72°C for 30 s; and finally 72°C for 7 min as an extension step.Finally, the PCR products were initially assessed for size polymorphisms on 6% denaturing polyacrylamide gels and then visualized by silver nitrate staining.The genetic information and indexing of polymorphic microsatellite loci were calculated using POPGENE 1.31 (Yeh et al., 2000) and PowerMarker v3.25 (Liu and Muse, 2005).Polymorphism information content (PIC) was derived using the following formula: Where qi and qj represent the frequency of the ith and jth alleles, and n is the total number of alleles detected for a given SSR marker (Shete et al., 2000).

RESULTS
Sequencing and de novo assembly RNA-Seq technology was used to sequence a pooled cDNA library of taro variety "Jingjiang Xiangsha".The sequencing yielded approximately 58,263,364 raw pairended reads with a length of 100 bp.The raw read files were deposited in the NCBI Sequences Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/) with project number PRJNA387094.In all, 58,263,364 sequence reads were generated, of which 54,180,410 were of acceptable quality.The de novo data assembly yielded 134,044 contigs of mean length 347 bp (Figure 2A), and these were resolved into 65,878 unigenes, of which ).Among them, 33,547 (50.9%) uni-genes were 301 to 500 bp long, 13,793 (20.9%) were 500-to 1,000 bp long, 12,918 (19.6%) were 1,000 to 2,000 bp long, and 5,620 (8.6%) were longer than 2,000 bp (Figure 2B).

SSR markers validation
To evaluate the applicability and polymorphisms of the potential SSR markers, 150 primer pairs were randomly selected and validated through 18 accessions of C. esculenta.Also, 112 primer pairs were successfully amplified and 100 exhibited polymorphisms.A total of 316 alleles were identified, with an average of 3.16 alleles per locus.The number of alleles per locus ranged from 2 to 8.For example, the polymorphism in the Ce0040 locus (Table 3) is shown in Figure 7. Across the 100 microsatellite loci, the PIC values ranged from 0.042 to 0.778 (mean 0.245).

DISCUSSION
A comprehensive transcriptome of C. esculenta was obtained by sequencing a mixed cDNA library of leaf, stem and corm samples.The de novo data assembly yielded 134,044 contigs and resolved into 65,878 unigenes.The N50 length (1,357 bp) of the 65,878   assembled unigene sequences was much larger than those in previous studies (Liu et al., 2015).The longer N50 length meant a better quality of assembly in de novo RNA-Seq, which could benefit the identification of protein-coding genes (Namiki et al., 2012).It was attributed that the fine C. esculenta transcriptome     assembly might be due to the application of 100 bp paired-end modes for RNA-Seq, which greatly improved transcript construction and scaffolding effects (Trapnell et al., 2013;Chen et al., 2016).
To predict the functions of the transcriptome sequences, the unigenes were blasted in six public databases.Total 40,375 unigenes (61.29%) were assigned at least one functional annotation with the distribution and composition of the assigned GO terms indicating the functional distribution and evolution of conserved genes.What"s more, a large percentage of the unigenes were mapped into KEGG pathways, and most were involved in folding, sorting and degradation, transcription, translation and signal transduction pathways.
Compared with previous reports, which only identified about 32.20% in data against the four public databases, the results indicated a greater number in the annotated unigenes.In addition, results showed that although a comprehensive transcriptome of one species could be obtained by NGS, genetic and functional resources for taro are still insufficient (Nguluta et al., 2016).In the study, the large number of the C. esculenta unigenes (approximately 40%) failed to find hits in any databases, perhaps due to the lack of public sequence resources for taro or presence of non-coding transcripts among unigenes.After all, many specific functional genes in taro or that displayed low similarity to homologous genes in non-model organisms increased the difficulty to find matches in public databases (Ellegren, 2014).
Traditional methods for microsatellite development mainly depended with the use of public resources, such as genetic/genomic information and EST data (Cloutier et al., 2009), and the utilization of transferable microsatellites from related species (Mathithumilan et al., 2013).The application of NGS supplies a new and easier shortcut for SSR markers development directly from transcript sequences (Edwards et al., 2013).Here, we predicted microsatellite loci among the 65,878 assembled unigenes and 11,363 potential microsatellites had been detected among 5,671 non-redundant unigenes.The percentage of genes possessing SSR markers was about 8.61% (5,671/65,878), and the di-nucleotides repeat motifs showed the most abundance.In addition, a large number of mono-nucleotide repeat motifs were detected, which had not been discovered in the previous study of taro (You et al., 2015), while the percentage of this type in the results was about 14.3%.Although mono-nucleotide repeats were discarded for difficulty to distinguish genuine mono-nucleotide repeats from polyadenylation products, it could be an important resource for further research (Fu et al., 2013).
Among the 150 potential SSR markers, 112 loci were successfully amplified and 100 exhibited polymorphisms.This success rate (66.7%) was higher than that reported in taro EST-SSR development (You et al., 2015).Thus, the results showed that more than half of the in silico identified microsatellites and was able to be validated and provide enough number of markers for future genetic studies in taro.The unamplified loci in the study could be caused by exists of chimeric primers, primer location across splice sites, or sequences missing (Hause et al., 2016).
In summary, the results provided valuable resources for future research of genetic diversity, linkage mapping, germplasm characterization and marker assisted selection in taro, which could be beneficial to breeders/geneticists and taro farmers.The transcriptomic data and microsatellite markers of taro could also be applied to the genetic researches in other species and genera of Araceae as the high transferability.

Figure 2 .
Figure 2. The distribution of contig and unigene sequence lengths.A. the length distribution of contig, B. the length distribution of unigene.

Figure 3 .
Figure 3. Functional classification of the C. esculenta unigenes according to COG criteria.

Figure 4 .
Figure 4.The distribution of C. esculenta unigenes among the GO functional classes.

Figure 5 .
Figure 5.The characteristics of SSR markers. A. the unit size, B. the number of microsatellite repeats, C. the microsatellite loci length.

Figure 6 .
Figure 6.Frequencies of the various repeat motifs present in the C. esculenta SSRs.

Figure 7 .
Figure 7. Representative example of the validation of in silico identified Ce0040 microsatellite loci (see Table 4) among 18 accessionsof C. esculenta.Each lane on the gel represents the individual genotype M, DNA marker.

Table 1 .
Summary of 18 accessions of C. esculenta.

Table 2 .
Top 20 KEGG pathways mapped to the transcriptome unigenes.

Table 3 .
PCR amplification of polymorphic SSR markers and polymorphisms in 18 accessions of C. esculenta.
Notes: Tm, the annealing temperature; Na, number of alleles; PIC, polymorphism information content; Ho, observed heterozygosity; He, expected heterozygosity.