Frequency and distribution of genome-based microsatellites in Verticillium dahliae

A total of 5,418 simple sequence repeats (SSRs) were identified in the 33.8 Mb genomic DNA sequence of Verticillium dahliae. SSR loci were classified by repeat type and by frequency in different genomic regions. The results show that SSRs with different repeat units were distributed non-randomly across genomic locations. Whole-genome analyses showed that the tri-nucleotide (nt) repeat was the most abundant microsatellite type. There were 1,677 tri-nt SSRs, comprising 31.0% of the total number of SSRs, followed by hexa-nt, mono-nt, di-nt, tetra-nt and penta-nt SSRs in that order. In the exonic regions of the genome, the tri-nt SSRs occurred more frequently than the other SSR types. A total of 1,037 (61.8%) tri-nt SSRs were located in the exonic regions, an approximately two-fold higher relative abundance than in the intergenic regions (66.1 per Mb versus 32.3 per Mb, respectively). Nearly half of the hexa-nt SSRs were also located in the coding regions, whereas the mono-nt, di-nt, tetra-nt and penta-nt SSRs were predominantly present in the intronic and intergenic regions. The biased distribution of the SSRs may reflect their functional significance in the V. dahliae genome.


INTRODUCTION
Verticillium dahliae Kleb. is the causal agent of vascular wilt diseases on plants worldwide, resulting in enormous economic losses. It attacks a wide range of plant hosts, including agriculturally important crops such as alfalfa, tomato, cotton, potato, eggplant and pepper, as well as ornamental flowers, fruit trees and shrubs (Pegg and Brady, 2002; Fradin and Thomma, 2006; Klosterman et al., 2009). In the Verticillium genus, usually only the isolates that produce microsclerotia were classified as V. dahliae (Barbara and Clewes, 2003); however, sometimes the host specificity and its pathogenicity varies in different [...] multi-allelic, have good genome coverage and can be multiplexed on semi-automated systems (Varshney and Graner, 2005; Zwart et al., 2008). SSRs are useful tools not only for studies of genetic biodiversity, comparisons among relatives and phylogenetic analysis, but also for comparing isolates from different host plants (Giraud et al., 2002), strains or races (Barve et al., 2001). Therefore, microsatellite markers in phytopathogenic fungi can be used in research on populations, evolution, biodiversity, mapping, pathogenicity-related genes, races and ecology. The development and use of SSR markers for V. dahliae lags behind the achievements in other major plant pathogens such as Fusarium oxysporum, Microbotryum violaceum, Magnaporthe grisea, Microcyclus ulei and Plasmopara viticola (Barve et al., 2001; Giraud et al., 2002; Kaye et al., 2003; Le Guen et al., 2004; Delmotte et al., 2006).
With the rapid development of sequencing technology, the genome of the V. dahliae strain VdLs.17 from lettuce has been sequenced and a detailed physical map has been obtained (Klosterman et al., 2008). The current genome assembly of VdLs.17 comprises 52 sequence scaffolds with a total length of 33.8 Mb and an N50 scaffold length of 1.27 Mb, indicating that 50% of all bases are contained in scaffolds of at least 1.27 Mb (Klosterman et al., 2011). This high-quality draft V. dahliae genome sequence allows SSR markers to be developed in a high-throughput manner. These markers can be readily detected computationally and can be used for the screening of candidate genes, functional research and genetic diversity analysis. Thus, the mining of these resources provides an opportunity to greatly expand the database of molecular markers for V. dahliae at minimal cost.
In this study, we explore SSR markers based on the genome information of V. dahliae. The objectives of this research were: (1) to characterize the frequency and distribution of SSRs in the V. dahliae genome; and (2) to investigate the relative abundance of SSR types in different regions of the genome.

SSR scanning
The genome was scanned for SSR loci with the program SciRoKo 3.4 (SSR Classification and Investigation by Robert Kofler) (Kofler et al., 2007). The parameters were set to detect mono-nt, di-nt, tri-nt, tetra-nt, penta-nt and hexa-nt motifs with a minimum of 15, 7, 5, 3, 3 and 3 repeats, respectively (under the perfect MISA search mode) (Kofler et al., 2007). We defined two categories of genomic location: genic (exon and intron) and intergenic regions. Initially, each SSR was considered to be unique; SSRs were subsequently classified according to the theoretically possible combinations. For example, a poly-A repeat is equivalent to a poly-T repeat on the complementary strand, and AAG is equivalent to AGA and GAA in different reading frames and to CTT, TCT and TTC on the complementary strand. Thus, there are two possible combinations for mono-nt repeats, four for di-nt repeats, ten for tri-nt repeats, 33 for tetra-nt repeats, 102 for penta-nt repeats, and 350 for hexa-nt repeats. To localize SSRs in different genomic regions, the positions of the SSRs were compared with the genome annotation using Perl scripts. To describe the abundance of SSRs in different genomic regions, we calculated the "relative abundance" (RA) by dividing the number of SSRs by the length, in megabase pairs (Mb), of the sequence analyzed.
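The scanning thresholds and the motif equivalence rule described above can be illustrated with a short Python sketch. This is not SciRoKo's algorithm or the authors' Perl code; it is a simplified stand-in that assumes an uppercase A/C/G/T sequence and perfect repeats only. The function names (is_primitive, canonical, count_motif_classes, find_perfect_ssrs) are hypothetical. Grouping each motif with its rotations and with the rotations of its reverse complement reproduces the 2, 4, 10, 33, 102 and 350 combinations quoted in the text.

```python
from itertools import product

# Minimum repeat counts per unit length, as stated in the text.
MIN_REPEATS = {1: 15, 2: 7, 3: 5, 4: 3, 5: 3, 6: 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ACAC)."""
    n = len(motif)
    return not any(n % k == 0 and motif[:k] * (n // k) == motif for k in range(1, n))


def canonical(motif):
    """Smallest string among all rotations of the motif and of its reverse complement."""
    rev_comp = motif.translate(COMPLEMENT)[::-1]
    return min(m[i:] + m[:i] for m in (motif, rev_comp) for i in range(len(m)))


def count_motif_classes(unit_len):
    """Number of distinct motif classes for a given unit length."""
    motifs = ("".join(p) for p in product("ACGT", repeat=unit_len))
    return len({canonical(m) for m in motifs if is_primitive(m)})


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, n_repeats) for perfect SSRs; end is exclusive."""
    hits = []
    for unit_len, threshold in min_repeats.items():
        i = 0
        while i + unit_len <= len(seq):
            motif = seq[i:i + unit_len]
            if set(motif) - set("ACGT") or not is_primitive(motif):
                i += 1
                continue
            j = i + unit_len
            while seq[j:j + unit_len] == motif:
                j += unit_len
            n_repeats = (j - i) // unit_len
            if n_repeats >= threshold:
                hits.append((i, j, motif, n_repeats))
                i = j  # continue after the reported repeat
            else:
                i += 1
    return sorted(hits)


if __name__ == "__main__":
    print([count_motif_classes(k) for k in range(1, 7)])
    demo = "CG" + "A" * 15 + "CT" + "AAG" * 5 + "TG" + "AC" * 6 + "GT"
    for start, end, motif, n in find_perfect_ssrs(demo):
        print(start, end, canonical(motif), n)
```

In the demo sequence the run of six AC units is deliberately below the di-nt threshold of seven repeats, so only the poly-A and AAG repeats are reported.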

Genome-wide distribution of microsatellites in V. dahliae
A total of 5,418 microsatellites were identified in the 33.8 Mb genomic DNA sequence of V. dahliae using the SciRoKo program. SSRs were abundant in the V. dahliae genome, with about one SSR every 6.24 kb. The SSR loci were classified by repeat type and frequency of repeats per motif (Table 1). The most commonly occurring number of repeats per motif ranged from 3 to 7. The most abundant microsatellites were the tri-nt repeats, of which 1,677 (31.0% of the SSRs) were identified, followed by the hexa-nt repeats (930, 17.2%) and mono-nt repeats (811, 15.0%). The numbers of di-nt, tetra-nt and penta-nt repeats were similar: 659, 633 and 678, respectively (Table 2).
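The headline figures in this paragraph can be reproduced with simple arithmetic. The Python sketch below uses only numbers quoted in the text; the function and variable names are illustrative. Shares are computed against the stated total of 5,418 loci, because the six per-class counts as quoted add up to 5,388.

```python
GENOME_MB = 33.8    # assembly length stated in the text
TOTAL_SSRS = 5418   # total SSR loci stated in the text

# Per-class counts as quoted in the text (they sum to 5,388, not 5,418).
CLASS_COUNTS = {"mono": 811, "di": 659, "tri": 1677,
                "tetra": 633, "penta": 678, "hexa": 930}


def density_per_mb(n_loci, length_mb):
    """Relative abundance: loci per megabase of sequence."""
    return n_loci / length_mb


def kb_per_locus(n_loci, length_mb):
    """Average spacing between loci in kilobases."""
    return length_mb * 1000 / n_loci


def share_percent(n_loci, total=TOTAL_SSRS):
    """Percentage of all SSR loci contributed by one class."""
    return 100 * n_loci / total


if __name__ == "__main__":
    print(f"{density_per_mb(TOTAL_SSRS, GENOME_MB):.0f} SSRs per Mb")
    print(f"one SSR every {kb_per_locus(TOTAL_SSRS, GENOME_MB):.2f} kb")
    for name, count in CLASS_COUNTS.items():
        print(f"{name}-nt: {count} loci, {share_percent(count):.1f}%")
```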
In the V. dahliae genome, more SSRs were found in the intergenic regions (64.1%) than in the genic regions (35.9%) (Figure 1 and Table 2). The different SSR repeat units showed clearly non-random distributions across the genomic locations. The tri-nt repeats were the most abundant SSR type in the genic regions, whereas mono-nt repeats were the most abundant SSR type in the intergenic regions (Figure 1 and Table 2). To analyze the differential distribution of SSRs more clearly, we characterized the distribution of the SSR types in each repeat unit across the different genomic locations.

Mono-nt SSRs
The mono-nt SSRs were the third largest class of SSRs, representing 15.0% of the SSRs in the V. dahliae genome. The mono-nt SSRs were distributed preferentially in the intergenic (690 loci) and intronic regions (112 loci) but were rare in the exonic regions (only 9 loci). The relative abundances of the mono-nt SSRs in the intergenic, exonic and intronic regions were 40.4, 0.6 and 112 per Mb, respectively (Table 2 and Figure 1). Of the two possible types of mono-nt SSRs, poly-G/C was the predominant form with 554 loci and a relative abundance of 16.3 poly-G/C repeats per Mb in the genome. The poly-A/T SSR type had 257 loci with a relative abundance of 7.6 per Mb (Table 3).

Di-nt SSRs
The total number of di-nt SSRs in the V. dahliae genome was 659. Most of the di-nt SSRs (571) were distributed in the intergenic regions, 81 were in the intronic regions, and only seven were in the exonic regions. Overall, 98% of the di-nt SSRs were found in the noncoding regions (intergenic and intronic) (Table 2). Among the di-nt motifs, the AG/TC SSRs (363) were the most abundant, with a relative abundance of 10.7 SSRs per Mb. The AC/TG SSRs (234) also had a high relative abundance, with about 6.9 SSRs per Mb in the genome.

Tri-nt SSRs
The tri-nt SSRs were the most abundant class, accounting for 31.0% of the total number of SSRs in the genome (1,677 of 5,418) (Table 2). Of these, 1,037 (61.8%) were distributed in the exonic regions, 552 were in the intergenic regions, and only 88 were found in the intronic regions. Thus, the relative abundance of tri-nt SSRs in the exonic regions was approximately two-fold higher than in the intergenic regions (66.1 per Mb versus 32.3 per Mb) (Table 2 and Figure 1). The distribution of some of the tri-nt SSR types in the genome was non-random. For example, the AGC/TCG, CCG/GGC, ACG/TGC, AAC/TTG and AGG/TCC SSRs were preferentially located in the exonic regions, the ACC/TGG and ATC/TAG SSRs were exclusively located in the intergenic regions, and the AAG/TTC SSRs were almost equally distributed in the exonic and intergenic regions (Table 3).
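The relative abundances quoted here follow from RA = number of loci / region length in Mb. As a rough cross-check, the Python sketch below back-calculates the exonic and intergenic lengths implied by the tri-nt figures and reuses them for the mono-nt counts. These region lengths are not stated in the text; they are inferred here for illustration only, and the names are hypothetical.

```python
def relative_abundance(n_loci, region_mb):
    """Relative abundance (RA): loci per Mb of a given region."""
    return n_loci / region_mb


def implied_region_mb(n_loci, ra_per_mb):
    """Region length implied by a locus count and its reported RA (an inference)."""
    return n_loci / ra_per_mb


# Tri-nt counts and RA values quoted in the text.
TRI_EXONIC, TRI_EXONIC_RA = 1037, 66.1
TRI_INTERGENIC, TRI_INTERGENIC_RA = 552, 32.3

# Assumed region lengths, derived from the tri-nt figures above.
EXONIC_MB = implied_region_mb(TRI_EXONIC, TRI_EXONIC_RA)
INTERGENIC_MB = implied_region_mb(TRI_INTERGENIC, TRI_INTERGENIC_RA)

FOLD_ENRICHMENT = TRI_EXONIC_RA / TRI_INTERGENIC_RA

if __name__ == "__main__":
    print(f"implied exonic length: {EXONIC_MB:.1f} Mb")
    print(f"implied intergenic length: {INTERGENIC_MB:.1f} Mb")
    print(f"exonic vs intergenic tri-nt RA: {FOLD_ENRICHMENT:.2f}-fold")
    # Mono-nt counts from the text (690 intergenic, 9 exonic) under the same lengths.
    print(f"mono-nt intergenic RA: {relative_abundance(690, INTERGENIC_MB):.1f} per Mb")
    print(f"mono-nt exonic RA: {relative_abundance(9, EXONIC_MB):.1f} per Mb")
```

Under these inferred lengths, the mono-nt values come out close to the 40.4 and 0.6 per Mb reported in the mono-nt section, which suggests the figures are internally consistent.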

Tetra-nt and penta-nt SSRs
The tetra-nt and penta-nt SSRs, with 33 and 102 SSR types respectively, were predominantly distributed in the intergenic and intronic regions (Table 2 and Figure 1). The most frequent tetra-nt SSR type was ACCT/TGGA, with 6.6 per Mb in the genome. Of the ACCT/TGGA SSRs, 97% were located in the intergenic regions. The next most frequent type was AGGC/TCCG, 88.9% of which were distributed in the intergenic regions and 11.1% in the intronic regions. No AGGC/TCCG SSRs were found in the exonic regions (Table 3). The distribution of the penta-nt SSRs was similar to that of the tetra-nt SSRs. The most abundant penta-nt SSR type was AAGGT/TTCCA, 98.3% of which were in the intergenic regions. The relative abundance of the penta-nt SSRs in the intergenic regions was close to 14-fold higher than in the exonic regions (Table 2).

Hexa-nt SSRs
Hexa-nt SSRs were the second largest class, representing 17.2% of the total SSRs in the V. dahliae genome, with a relative abundance of 27.5 per Mb. The hexa-nt SSRs were found preferentially in the intergenic (459) and exonic regions (443), and were rare in the intronic regions (28) (Table 2). The hexa-nt SSRs can form 350 SSR types in all. The AGCGGC/TCGCCG SSRs, with 35 loci, were the most common type; 77.1% of them were found in the exonic regions of the genome (Table 3).

DISCUSSION
Because V. dahliae is the primary destructive phytopathogenic species in the genus, its population structure and genetic diversity have been areas of intensive study (Klosterman et al., 2009).
Although random amplified polymorphic DNA (RAPD) (Perez-Artes et al., 2000) and amplified fragment length polymorphism (AFLP) markers (Collins et al., 2003) have been used to characterize molecular diversity in populations of V. dahliae, the dominant nature of RAPD and AFLP markers prevents the detection of heterozygotes in diploid species. The development of co-dominant, multi-allelic and sequence-specific microsatellite markers may enable the molecular genotypes at a number of loci to be determined, and markers of this type could be more widely used for genetic diversity studies and genetic mapping of V. dahliae. An array of 22 simple sequence repeat markers of V. dahliae has been developed to detect recombination, transcontinental gene flow and genetic drift (Almany et al., 2009; Atallah et al., 2010). However, many more potential SSR loci linked with other genetic characteristics remain to be explored across the whole genome.
The completed genome sequence of V. dahliae has greatly assisted the understanding of SSRs at the genome-wide level. We identified 5,418 SSR loci in the V. dahliae genome. The relative abundance was 160 SSRs per Mb (Table 1), and the numbers of mono-nt, di-nt, tri-nt, tetra-nt, penta-nt and hexa-nt SSRs were 811, 659, 1,677, 633, 678 and 930, respectively (Table 2). The tri-nt SSRs were the major class, accounting for 31.0% of the total. This result is consistent with the findings for Neurospora crassa, in which 39.4% of the SSRs were tri-nt repeats (Kim et al., 2008). However, the Aspergillus nidulans, Cryptococcus neoformans, Magnaporthe grisea, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Ustilago maydis genomes contained more mono-nt SSR loci than the other types, while di-nt SSRs were predominant in the genomes of Encephalitozoon cuniculi and Fusarium graminearum (Karaoglu et al., 2005).
In this study, we found that the distribution of the SSRs in the V. dahliae genome is not random. This finding is consistent with the results from previous studies (Li et al., 2004; Kim et al., 2008). The abundance of SSRs differed across the exonic, intergenic and intronic regions of the genome. Mono-nt repeats were found at a relatively high frequency in the V. dahliae genome; most (690 out of 811) were distributed in the intergenic regions and only nine were in the exonic regions (Table 2). We found a strong over-representation of poly-G/C compared with poly-A/T SSRs in the V. dahliae genome, unlike the findings reported in other eukaryotes (Karaoglu et al., 2005; Lawson and Zhang, 2006; Chistiakov et al., 2006). Di-nt SSRs were preferentially distributed in the intergenic regions and rare in genic regions, as reported in other organisms (Li et al., 2002). Among the di-nt SSR motifs, AG/TC and AC/TG were the most abundant, in agreement with the reported differences in SSR composition between fungi and other organisms (Ellegren, 2004). The GC/CG SSRs were the least abundant di-nt motif, with only nine repeats, as described in many other fungi (Karaoglu et al., 2005).
In the exonic regions, tri-nt and hexa-nt SSRs were the dominant types. The tri-nt repeats were approximately two-fold more frequent in the exonic regions than in the intronic and intergenic regions combined (Table 2, Figure 1). Enrichment of tri-nt SSRs in exonic regions has been observed in a variety of other eukaryotic organisms (Li et al., 2004; Kim et al., 2008). The tri-nt and hexa-nt SSRs in the exonic regions are translated into amino acid repeats in the proteins that they encode. These amino acid repeats may contribute to the biological functions of the proteins. The dominance of SSR triplets over the other repeats in coding regions may be explained by specific selection operating against frameshift mutations and a tight negative selection on the other SSRs that would perturb the reading frames in coding regions (Metzgar et al., 2000; Young et al., 2000). Of the tri-nt SSRs, the AGC/TCG repeat that encodes an Asp amino acid repeat was the most abundant. The most common hexa-nt SSR was the AGCGGC/TCGCCG repeat that encodes an Asp-Gly dipeptide repeat (Table 3). This finding suggests that the differential appearance of particular tri-nt and hexa-nt SSRs within the exonic regions may be the result of functional selection on amino acid repeats in the encoded proteins (Katti et al., 2001; Li et al., 2002, 2004). Furthermore, the tri-nt and hexa-nt repeats in exonic regions are variable across the V. dahliae genome (data not shown), perhaps indicating that these SSRs play important roles in the evolution of gene functions that may help promote adaptation to new environments (Li et al., 2004; Kashi and King, 2006).
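The frameshift argument can be made concrete with a small calculation. Gaining or losing repeat units changes the length of a coding sequence by a multiple of the unit length, and the downstream reading frame is preserved only when that change is a multiple of three. The Python sketch below is an illustration of this arithmetic, not an analysis from the study; the function names are hypothetical.

```python
def frame_shift(unit_len, repeat_change):
    """Downstream reading-frame offset (0, 1 or 2 nt) after gaining or
    losing `repeat_change` copies of a repeat unit of length `unit_len`."""
    return (unit_len * repeat_change) % 3


def always_preserves_frame(unit_len):
    """True if any change in repeat number leaves the reading frame intact."""
    return unit_len % 3 == 0


if __name__ == "__main__":
    names = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}
    for unit_len, name in names.items():
        shift = frame_shift(unit_len, 1)
        status = "in frame" if always_preserves_frame(unit_len) else "frameshift"
        print(f"{name}-nt: +1 repeat shifts the frame by {shift} nt ({status})")
```

Only the tri-nt and hexa-nt classes keep the frame for every change in repeat number, which matches the two classes found to dominate the exonic regions.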
In the V. dahliae genome, the occurrence of tetra-nt and penta-nt SSRs was similar, and most of these SSRs were distributed in the intergenic regions. The most frequent tetra-nt SSRs were the ACCT/TGGA repeats, in agreement with the findings in N. crassa and M. grisea (Karaoglu et al., 2005). The most common penta-nt SSRs were found much less frequently than the tetra-nt SSRs; for example, 59 AAGGT/TTCCA repeats were found in the V. dahliae genome, of which 58 were in the intergenic regions and the remaining one was in an intronic region.
In conclusion, this study of SSRs in the completely sequenced plant pathogenic fungus V. dahliae is a small step towards a better understanding of the role of these sequences. The microsatellite analysis showed that the distribution of SSRs in the exonic, intronic and intergenic regions of the genome was non-random and strongly biased, probably reflecting the functional significance of SSRs. The enhanced frequency of tri-nt SSRs in the coding regions might indicate the effects of selection against possible frameshift mutations. The SSRs in V. dahliae could be explored as evolutionarily neutral DNA markers, and the data that we have obtained could be used to select highly polymorphic SSR markers to study the population genetic diversity and evolution of V. dahliae. Furthermore, the SSR loci identified in this study could also be used to find markers linked with significant genetic characteristics, such as pathogenicity-related loci, key loci for differentiation between defoliating and nondefoliating pathotypes, and microsclerotia-producing loci.

Figure 1. Genome-wide distribution and relative abundance of SSR types by their unit size. Each bar represents the relative abundance of the SSR types in different genome locations.

Table 1. Frequency and distribution of the 5,418 SSRs identified in the V. dahliae genome.

Table 2. Relative abundance of SSR types in different regions of the V. dahliae genome.

Table 3. Distribution of SSR types in different regions of the V. dahliae genome.