Verification of genetic identity of introduced cacao germplasm in Ghana using single nucleotide polymorphism (SNP) markers

Accurate identification of individual genotypes is important for cacao (Theobroma cacao L.) breeding, germplasm conservation and seed propagation. The development of single nucleotide polymorphism (SNP) markers in cacao offers an effective way to use a high-throughput genotyping system for cacao genotype verification. In the present study, high-throughput genotyping with SNP markers was used to fingerprint 160 cacao trees in the germplasm collection at the Cocoa Research Institute of Ghana (CRIG). These accessions had been originally introduced from international germplasm collections. The multilocus SNP profiles, generated by the Sequenom Mass Spectrometry platform, were compared with the SNP profiles of reference trees maintained in the international cacao collections. The comparison unambiguously identified mislabeled trees. For materials introduced as hybrid seeds without an available reference genotype, parentage analysis and model-based assignment were applied to verify their recorded parentage and genetic background. Our study shows that a small set of polymorphic SNP markers can provide a robust and accurate result for cacao genotype identification. This protocol can be applied for large-scale genotyping of cacao as well as for many other crops.


INTRODUCTION
Cacao (Theobroma cacao L.) is an important tropical tree crop that provides raw ingredients for the chocolate confectionery industries. This global commodity has an annual production that exceeded 4 million tons in 2010, of which 75% was produced in West Africa. Ghana alone produced 850,000 tons of cacao, accounting for 21% of the world's total output in 2010 (FAOSTAT, http://faostat3.fao.org/home/index.html). Cacao originated in the Amazon rainforest in South America and was domesticated by the Maya and Olmec peoples at least 3000 years ago (Cuatrecasas, 1964; Wood and Lass, 1985; Bartley, 2005; Powis et al., 2011). Beginning in the late 1800s and continuing into recent times, cacao has been repeatedly introduced into Ghana. Germplasm was ultimately deposited in an in situ germplasm bank at the Cocoa Research Institute of Ghana (CRIG) in Tafo, which currently houses over 1200 clones of various genetic origins (Edwin and Masters, 2005; Adu-Ampomah et al., 2006).
Cacao is an outcrossing species (Wood and Lass, 1985) and germplasm is conserved as clonally propagated trees in field genebanks. Cacao germplasm collections have been shown to contain mislabeled individuals, with mislabeling estimated at 15 to 44% in global cacao collections (Motilal and Butler, 2003; Motilal, 2004; Sounigo et al., 2006; Takrama et al., 2005). Misidentifications can be attributed to the multiplicity of introductions and transfers of plants from point-of-collection to establishment in early holding sites, and to subsequent recollection of budwood and repropagation of material for establishment. The potential for human error during plot demarcation and planting also contributes to this problem.
Molecular markers have been used to characterize cacao germplasm since the 1980s (Guiltinan et al., 2008). Mislabeled accessions were identified by using dominant markers (Figueira et al., 1994; Whitkus et al., 1998; Sounigo et al., 2005) as well as codominant DNA markers such as restriction fragment length polymorphisms (Lerceteau et al., 1997; N'Goran et al., 2000). The development of microsatellite markers (Lanaud et al., 1999) greatly increased the efficiency and capacity for cacao fingerprinting and resulted in a wide application of cacao genotype identification (Aikpokpodion et al., 2005; Motilal and Butler, 2003; Efombagan et al., 2008; Motilal et al., 2010).
Recent progress in the development of cacao genomic resources has led to the use of single nucleotide polymorphisms (SNPs) as markers for cacao DNA fingerprinting, since SNPs are the most abundant class of polymorphisms in plant genomes (Buckler and Thornsberry, 2002). Compared with SSR markers, SNP assays do not require separation of DNA by size, and therefore can be automated in an assay-plate format or on microchips. The diallelic nature of SNPs results in a much lower error rate in allele calling, and the genotyping can be multiplexed, allowing quicker completion at a lower cost than with SSRs. In recent years, SNP markers have been developed to assist cacao breeding and germplasm management (Allegre et al., 2012; Kuhn et al., 2012). TaqMan-based SNP assays have been developed for cacao genotyping under field conditions (Livingstone et al., 2012; Takrama et al., 2012). Using a set of SNP markers derived from expressed sequence tag (EST) databases, Ji et al. (2013) characterized farmer selections of cacao from Nicaragua and Honduras and demonstrated that the SNP markers constitute a cost-effective marker resource suitable for cacao germplasm characterization. Results for genotyping with SNPs can be compared across different genotyping platforms and laboratories, facilitating the integration and interpretation of SNP data across different genebanks in various cacao-producing countries. The objective of the present study was to test the efficacy of using high-throughput SNP genotyping for molecular characterization of cacao and to assess the extent of mislabeling, or off-types, in the CRIG cacao germplasm collection.

Sample preparation and SNP genotyping
One hundred and sixty (160) trees from the CRIG germplasm collection, representing 39 cacao accessions (each accession included one to five trees), were sampled for this experiment. Samples were collected from eight plots in the germplasm collection: D8 (2), L6 (34), M6 (32), M6 Ext. (5), Q6 (67), Q6 Ext.2 (8), Q6 Ext.4 (9), and V3 (3) (Table 1). Two young leaves were collected from each individual cacao tree and each sampled branch was tagged for potential revisiting. Both accession name and DNA extraction number were used to label each sample. DNA was extracted from the CRIG samples using the CTAB DNA Extraction Protocol (Doyle and Doyle, 1990). In addition, one hundred international clones were used as references. Preparation of DNA samples for the reference international clones was described in Zhang et al. (2009a, b). DNA concentration was determined with a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE). Based on the level of polymorphism and on their distribution across the ten chromosomes in cacao, 54 SNP markers were selected from 1560 candidate SNPs that had been developed using cDNA sequences from a wide range of cacao tissues (Argout et al., 2008). SNP genotyping was performed at the Human Genetics Division Genotyping Core facility, Washington University, St. Louis, using MALDI-TOF mass spectrometry (Sequenom, Inc., San Diego, CA). The heterozygosity and polymorphism information content (PIC) of these SNP markers have been reported by Ji et al. (2013).

Data analysis
Key descriptive statistics for measuring the informativeness of the SNP markers were calculated, including observed heterozygosity, expected heterozygosity, minor allele frequency, inbreeding coefficient and probability of identity (Evett and Weir, 1998; Waits et al., 2001). The program GenAlEx 6.2 (Peakall and Smouse, 2006, 2012) was used for computation. For the identification of mislabeling (off-types), SNP profiles of 100 reference trees maintained in the International Cacao Genebank, Trinidad (ICG,T) were used in the analysis. The genetic identity of the 100 reference trees has been characterized by both SNP (D. Zhang, USDA/ARS, Beltsville, personal communication) and SSR fingerprinting (Zhang et al., 2009b; Motilal et al., 2010; Johnson et al., 2009). Pairwise multilocus matching was applied to each pair of individual trees, including the reference trees from the international germplasm collections, using the same program. Accessions with the same names as the reference trees, but not matching them, were declared off-types. For the multilocus matching, the option to ignore missing data was used. Discriminating power of the SNP loci was computed using the probability of identity (PID) (Waits et al., 2001) option implemented in the same computer program.
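The pairwise matching step described above can be illustrated with a minimal Python sketch. It is not the GenAlEx implementation; the function names, the two-letter genotype coding, the use of None for missing calls and the example profiles are all assumptions made for illustration. Treating any single mismatched locus as a non-match (max_mismatch=0) is likewise an assumed threshold.

```python
from typing import Optional, Sequence, Tuple

Genotype = Optional[str]  # e.g. "AG"; None marks a missing call (assumed coding)


def normalize(call: Genotype) -> Genotype:
    """Sort the two alleles so that "GA" and "AG" compare as equal."""
    return None if call is None else "".join(sorted(call))


def compare_profiles(a: Sequence[Genotype], b: Sequence[Genotype]) -> Tuple[int, int]:
    """Return (loci compared, loci mismatched), skipping loci missing in either profile."""
    compared = mismatched = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # "ignore missing data" option
        compared += 1
        if normalize(x) != normalize(y):
            mismatched += 1
    return compared, mismatched


def is_off_type(sample: Sequence[Genotype], reference: Sequence[Genotype],
                max_mismatch: int = 0) -> bool:
    """Flag a tree that carries the reference name but differs in its SNP profile."""
    _, mismatched = compare_profiles(sample, reference)
    return mismatched > max_mismatch


# Hypothetical five-locus profiles, for illustration only.
reference = ["AA", "AG", "CC", "CT", "GG"]
tree_1 = ["AA", "GA", "CC", None, "GG"]   # matches at every scored locus
tree_2 = ["AG", "GG", "CC", "TT", "GG"]   # differs at three loci

print(is_off_type(tree_1, reference), is_off_type(tree_2, reference))
```

A tree with many missing calls can match trivially under this rule, so in practice the number of loci actually compared should be inspected alongside the mismatch count.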
For accessions without a reference tree but with a known pedigree record (for example, breeding lines selected in Ghana's breeding program), the genetic identities were verified using parentage analysis and/or a model-based assignment test. An example is the T clones (Table 1), which were hybrid families introduced into West Africa in 1944. Since these were the products of hybridization in Trinidad in the early 1940s, and the seed families were evaluated and selected in Ghana (Posnette, 1986), there are no existing reference trees available from the international cacao collections. Nonetheless, because pedigree records for these selections are available, the T clones were used as "offspring" and their parental clones in ICG,T were verified according to the recorded pedigree (Lockwood and Gyamfi, 1979). A likelihood-based method implemented in the program CERVUS 3.0 (Marshall et al., 1998; Kalinowski et al., 2007) was used for computation. For each parent-offspring pair, the natural logarithm of the likelihood ratio (LOD score) was calculated.
Critical LOD scores were determined for the assignment of parentage to a group of individuals without knowing the maternity or paternity. Simulations were run for 10000 cycles with the assumption that 80% of candidate parents were sampled and a total of 80% of loci were typed, with a typing error rate of 0.5%. The most probable single mother (or father) for each offspring was identified on the basis of the critical difference in LOD scores (Δ) between the most likely and next most likely candidate parent at greater than 95% or 80% confidence (Marshall et al., 1998; Kalinowski et al., 2007).
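To make the LOD score and Δ concrete, the following Python sketch computes a simplified single-parent LOD for biallelic SNPs. It is an illustration only and not the CERVUS algorithm: the way typing error is handled (a simple mixture of the Mendelian and random-genotype likelihoods), the 0/1/2 genotype coding, the function names and all data are assumptions. Allele frequencies are assumed to lie strictly between 0 and 1, and the simulation that yields the critical Δ values is not reproduced.

```python
import math
from typing import Optional, Sequence


def transmission_prob(offspring: int, parent: int, p: float) -> float:
    """P(offspring genotype | one candidate parent, other allele drawn from the population).
    Genotypes are counts (0, 1, 2) of allele A; p is the frequency of A."""
    t = parent / 2.0  # chance that the candidate parent transmits allele A
    q = 1.0 - p
    if offspring == 2:
        return t * p
    if offspring == 0:
        return (1.0 - t) * q
    return t * q + (1.0 - t) * p


def hwe_prob(genotype: int, p: float) -> float:
    """Hardy-Weinberg probability of a genotype for an unrelated individual."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[genotype]


def lod_score(offspring: Sequence[Optional[int]], parent: Sequence[Optional[int]],
              freqs: Sequence[float], error: float = 0.005) -> float:
    """Sum over loci of ln[L(parent) / L(unrelated)], skipping missing calls."""
    lod = 0.0
    for o, g, p in zip(offspring, parent, freqs):
        if o is None or g is None:
            continue
        unrelated = hwe_prob(o, p)
        # Assumed error model: with probability `error` the call is uninformative.
        related = (1.0 - error) * transmission_prob(o, g, p) + error * unrelated
        lod += math.log(related / unrelated)
    return lod


# Hypothetical three-locus example.
freqs = [0.5, 0.5, 0.5]
offspring = [2, 1, 0]
candidates = {"parent_A": [2, 1, 1], "parent_B": [0, 0, 2]}

scores = {name: lod_score(offspring, g, freqs) for name, g in candidates.items()}
ranked = sorted(scores.values(), reverse=True)
delta = ranked[0] - ranked[1]  # difference between the two best candidates
print(scores, delta)
```

A positive LOD means the candidate is more likely to be the true parent than a random individual; Mendelian incompatibilities (as for parent_B at two loci) drive the score strongly negative rather than to minus infinity because of the error term.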
For accessions lacking a reference tree, an assignment test was applied to infer their hidden membership in a known population or germplasm group, using a model-based clustering analysis implemented in the STRUCTURE software program (Pritchard et al., 2000). SNP profiles of 100 reference accessions were included in the analysis. These 100 accessions were taken from six known Forastero germplasm groups, including Amelonado, Scavina (SCA) and Ucayali, Iquitos Mixed Calabacillo (IMC), Morona (MO), Nanay (NA) and Parinari (PA). Classification of these accessions has been reported by Motamayor et al. (2008) and Zhang et al. (2009b). The number of clusters (K-value, which indicates the number of sub-populations that the program attempts to find) was set from two to ten, and the analysis was carried out without assuming any prior information about the genetic group or geographic origin of the samples. Ten independent runs were assessed for each fixed number of clusters (K). The ∆K value was computed to detect the most probable number of clusters (Evanno et al., 2005). Of the 10 independent runs, the one with the highest Ln Pr (X|K) value (log probability or log likelihood) was chosen and represented as a bar plot.
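The ∆K criterion can be sketched in a few lines of Python. This is a hedged illustration, not the script used in the study: it assumes the common formulation in which the absolute second difference of the mean Ln Pr (X|K) across runs is divided by the standard deviation of Ln Pr (X|K) at that K, and the log-likelihood values are invented. ∆K is undefined for the smallest and largest K tested.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp_by_k: Dict[int, List[float]]) -> Dict[int, float]:
    """Return ∆K for each interior K, given Ln Pr(X|K) values from replicate runs."""
    ks = sorted(lnp_by_k)
    result = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        if k - prev_k != 1 or next_k - k != 1:
            continue  # need consecutive K values for a second difference
        second_diff = abs(mean(lnp_by_k[next_k])
                          - 2.0 * mean(lnp_by_k[k])
                          + mean(lnp_by_k[prev_k]))
        spread = stdev(lnp_by_k[k])  # sample standard deviation across runs
        result[k] = second_diff / spread if spread > 0 else float("inf")
    return result


# Invented Ln Pr(X|K) values from three runs per K (illustration only).
runs = {
    2: [-1000.0, -1002.0, -998.0],
    3: [-800.0, -801.0, -799.0],
    4: [-790.0, -795.0, -785.0],
    5: [-785.0, -790.0, -780.0],
}

dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this toy data set the likelihood rises sharply up to K = 3 and then plateaus, so ∆K peaks at K = 3.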

Descriptive statistics of the SNP markers
In total, 53 SNP markers were reliably scored, each producing less than 10% missing genotypic data. Marker TcSNP 174 failed to generate SNP data and was therefore excluded from subsequent data analysis. The descriptive statistics of the remaining 53 SNP loci are presented in Table 2. The 53 SNP markers were polymorphic across the 39 cacao accessions. The mean expected heterozygosity was 0.343 and the mean observed heterozygosity was 0.274. The inbreeding coefficient averaged 0.218.
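The per-locus statistics summarized above can be computed as in the following Python sketch. It assumes a 0/1/2 coding (copies of one allele, None for missing data), expected heterozygosity 2pq and F = 1 - Ho/He; these are standard textbook definitions for a biallelic locus, but the function name and example genotypes are invented and the exact estimators used by GenAlEx may differ (for example, in small-sample corrections). Using the reported means, 1 - 0.274/0.343 ≈ 0.20, close to but not identical to the reported average of 0.218, which would be expected if F was averaged locus by locus.

```python
from typing import Dict, Optional, Sequence


def locus_stats(genotypes: Sequence[Optional[int]]) -> Dict[str, Optional[float]]:
    """Ho, He, minor allele frequency and F for one biallelic locus."""
    scored = [g for g in genotypes if g is not None]
    n = len(scored)
    if n == 0:
        raise ValueError("no scored genotypes at this locus")
    p = sum(scored) / (2.0 * n)                   # frequency of the counted allele
    ho = sum(1 for g in scored if g == 1) / n     # observed heterozygosity
    he = 2.0 * p * (1.0 - p)                      # expected heterozygosity under HWE
    f = None if he == 0 else 1.0 - ho / he        # undefined for a monomorphic locus
    return {"Ho": ho, "He": he, "MAF": min(p, 1.0 - p), "F": f}


# Invented genotypes for five trees at one locus (one missing call).
stats = locus_stats([0, 1, 2, 2, None])
print(stats)
```

A positive F, as in this example, indicates fewer heterozygotes than expected under random mating.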

Multilocus matching
Comparison of the multilocus SNP profiles with the reference accessions identified seven intraclonal mislabelings in accessions NA 79, PA 150 and IMC 76 (Figure 1). The multilocus matching also found that AMAZ 3-2 and PA 303 were mislabeled. These trees were defined as off-types, or homonymous mislabeling, because they shared the same name with the reference tree but differed in multilocus SNP profiles. In this experiment the mismatched accessions differed at a minimum of five loci. With all 53 loci considered, the combined probability of identity was on the order of 10⁻⁹ (Table 2). Overall, the procedure of multilocus matching with known reference trees led to the identification of 149 true-to-type trees out of 160 tested samples. Based on the verified result, 39 samples (a single sample from each accession) were used in the subsequent analyses of population structure and genealogical relationships. Among these 39 samples, the status of the nine T clones could not be decided solely based on multilocus matching, because they were selections made in Ghana and no reference trees were available. For these trees, the assignment test and parentage analysis were applied to verify their genetic identity.
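As an illustration of how a combined probability of identity is obtained, the sketch below applies the unrelated-individuals formula of Waits et al. (2001), which for a biallelic locus reduces to p⁴ + q⁴ + (2pq)², and multiplies across loci under the assumption that loci are independent. Which PID variant (unrelated individuals or siblings) underlies the value in Table 2 is not stated here, so this is an assumption, and the allele frequencies are invented.

```python
from math import prod
from typing import Sequence


def pid_locus(p: float) -> float:
    """Probability that two unrelated individuals share a genotype at a biallelic locus."""
    q = 1.0 - p
    return p ** 4 + q ** 4 + (2.0 * p * q) ** 2


def pid_combined(freqs: Sequence[float]) -> float:
    """Product of per-locus PID values, assuming independent loci."""
    return prod(pid_locus(p) for p in freqs)


# Invented allele frequencies for a small panel.
panel = [0.5, 0.3, 0.2, 0.4, 0.1]
print(pid_locus(0.5), pid_combined(panel))
```

Each additional polymorphic locus multiplies the combined PID by a factor below one, which is why a modest panel can reach very small values; loci with allele frequencies near 0.5 contribute the most discriminating power.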

Assignment test
Based on the value of ∆K, the model-based approach of STRUCTURE indicated K=5 as the most probable number of genetic clusters. The 39 tested cacao accessions from the Ghana cacao collection, as well as the 100 reference accessions, were stratified as germplasm groups of Amelonado, IMC, SCA/Ucayali, Morona, Nanay and Parinari, respectively (Figure 2). The assignment result largely agreed with the previously classified germplasm groups (Figure 2; Zhang et al., 2009b; Motamayor et al., 2008) except that the germplasm groups of SCA/Ucayali and Morona were not separated. The assigned memberships for all the tested trees from Ghana were compatible with their known parentage germplasm groups (Figure 2). The assignment test of the T clones confirmed their recorded parental germplasm groups, as shown in Figure 2. The parental groups of PA and IMC were clearly reflected in the admixed ancestry profiles of T60, T63, T65 and T79. A full genetic background of IMC was revealed for accession T85/799, supporting its recorded parentage of IMC 60 and NA 34 (a member of the IMC germplasm group; Motamayor et al., 2008). In addition, admixed ancestry of IMC and Amelonado was revealed for the T16/613 family, which not only supported the recorded parentage of IMC 24, but also detected that the other parent came from the Amelonado group.

Parentage analysis
Of the eight candidate parent-offspring relationships, the results of parentage inference confirmed six pairs at the 95% confidence level and one pair (NA 34 - T85/799) at the 80% confidence level (Table 3). For offspring T16/613, only one parent (Amelonado 22) was identified at the >80% confidence level because the reference genotype of maternal parent IMC 24 was not available. The result of parent-offspring assignment supported the outcome of model-based clustering analysis by the STRUCTURE program (Figure 2).

Multilocus matching
Over 50 cacao germplasm collections are present worldwide and of these, two are universal collections (representing nearly all of the known genetic diversity): CATIE (Centro Agronómico Tropical de Investigación y Enseñanza) in Costa Rica and ICG,T in Trinidad and Tobago (Motilal et al., 2013; Wadsworth and Harwood, 2000). Mislabeled plants have been identified as a serious problem in germplasm collections (Hurka et al., 2004). Significant efforts have been made to solve the problem in some international cacao collections (Motilal et al., 2013; Zhang et al., 2009a, b); however, the mislabeling problem in most of the various national collections has not been systematically addressed. Until recently, tools have not been available to clearly identify mislabeled germplasm accessions. Molecular markers such as AFLP (amplified fragment length polymorphism) have sufficient discriminatory power to distinguish cacao accessions; however, these tools often failed to reach clear conclusions, with convincing statistical rigor, that two genotypes are identical (Christopher et al., 1999; Perry et al., 1998; Sounigo et al., 2001).
In the past few years, microsatellite markers have been widely used in cacao genotyping and individual identification, enabling systematic assessment of genetic identity in national and international cacao genebanks (Zhang et al., 2009a; Motilal et al., 2009, 2010). In contrast to dominant markers, identical genotypes can have a 100% match in the multilocus SSR profiles without ambiguity; thus the accuracy of identification is significantly improved.
Reference SSR profiles of cacao clones have been deposited in the International Cacao Germplasm Database at the University of Reading, UK (http://www.icgd.rdg.ac.uk/index.php). However, comparison of genotyping results from different laboratories has not been straightforward. The effectiveness of clone identification via SSR fingerprints depends on the number of loci used for genotyping, as well as the rate of genotyping error. For example, it may require multiple repeated genotyping runs to reach the "consensus genotype". Moreover, data generated from different genotyping platforms can be difficult to compare with one another because the same allele may be binned differently, leading to false conclusions.
The present study demonstrated that using SNP-based multilocus fingerprints significantly improved the efficiency of genotype identification. Off-type identification, through comparison with reference SNP profiles, is straightforward when reference trees are available. The reference trees used in the present study were sampled from the original collections maintained at Marper Farm and San Juan Estate in Trinidad, and Cabiria Farm, CATIE, in Costa Rica. These reference trees have been genotyped by SSR markers and passed through rigorous statistical population genetics tests (Motamayor et al., 2008; Zhang et al., 2009a, b; Johnson et al., 2009).

Parentage verification and assignment test
Many national cacao germplasm collections also maintain local varieties and breeding lines, which do not have a reference tree in international germplasm collections. In this situation, indirect verification methods such as the Bayesian assignment test, parentage analysis, and sibship reconstruction need to be applied. The present study demonstrated how parentage analysis and the Bayesian assignment test can be used to verify genetic identity and pedigree information. Of the eight tested accessions, six were confirmed to have the correct maternal or paternal parent matching the breeding record. Among them, T63/967 and T63/971 were supposed to be siblings and their verified parentage supported each other. T16/613 was recorded as the open-pollinated progeny of IMC 24. Parentage analysis identified Amelonado 22, at a 95% confidence level, as the hidden pollen parent. For candidate parents that did not reach the 80% confidence level, the failure indicates mislabeling (off-type). Another possibility is contamination due to unwanted pollen or self-compatibility.
The SCA/Ucayali and Morona accessions represent two distinct geographical regions and were clustered as two different genetic groups when SSR markers were used (Zhang et al., 2009b; Motamayor et al., 2008). However, in the present study, the Bayesian clustering analysis based on 53 SNP markers did not significantly differentiate these two germplasm groups (Figure 2). Differences in genetic distances quantified by SNP and SSR markers have been reported in other crops. Yang et al. (2011) reported a correlation of 0.69 between kinship coefficients estimated by SSR and SNP markers in maize. Murray et al. (2009) found that some sorghum individuals shifted groups, depending upon whether SSR or SNP data were used in the STRUCTURE program. The discrepancy in stratification based on the two marker systems could also be due to the relatively small number of SNP markers used in the present study. Yu et al. (2009) showed that kinship estimated using 1,000 SNPs was consistent with that estimated with 100 SSRs in maize. Van Inghelandt et al. (2010) proposed that 7 to 11 times more SNPs than SSR markers should be used for analyzing population structure and genetic diversity in maize germplasm. Given that our previous stratification was based on 15 SSR markers, that ratio implies roughly 105 to 165 SNP markers, that is, more than 100, to reach the same precision level. Additional SNP markers need to be evaluated for cacao and the correlation between SNP markers and SSR markers needs to be systematically assessed.
In addition to the limited number of SNP markers, the discrepancy between the two marker systems might also be partially explained by the derivation of the SNP markers used in the present study from EST data. A set of unequivocally neutral SNP markers would be ideal. Despite the lack of differentiation between the SCA/Ucayali and Morona populations, the assignment test correctly excluded both groups in terms of parentage contribution to the tested T clones. The assignment of the T clones is fully consistent with the outcome of parentage analysis and with the recorded pedigree (Lockwood and Gyamfi, 1979). The high repeatability of the genotyping result, as demonstrated by the multiple trees for some cacao germplasm maintained in the Ghana collection, as well as the consistency between pedigree records and parentage analysis, demonstrated that these SNP markers provide a reliable and efficient solution for cacao genotype identification. This modest set of SNP markers thus constitutes a cost-effective marker resource, suitable for backstopping large-scale clone propagation in cacao. Nonetheless, the study also showed that a larger number of SNP markers would be needed for comprehensive diversity analysis.

Figure 1. Intraclonal mislabeling (off-type) identified in 160 cacao trees from Ghana cacao collections based on 53 SNP markers (of which only 21 loci are shown). The true-to-type clones are marked as "√". The SNP profiles of the reference clones were generated using original trees from the International Cacao Genebank, Trinidad.

Figure 2. Verification of genetic membership for ten T clones of cacao in Ghana cacao germplasm using assignment test. The computer program STRUCTURE was used, where K is the potential number of genetic clusters that may exist in the overall sample of individuals. Each vertical line represents one individual multilocus genotype. Individuals with multiple colors have admixed genotypes from multiple clusters. Each color represents the most likely ancestry of the cluster from which the genotype or partial genotype was derived. Clusters of individuals are represented by colors.

Table 1. List of the 39 cacao accessions (represented by 160 trees), their field plot and tree stand, from Ghana cacao germplasm collection.

Table 2. Observed and expected heterozygosities, inbreeding coefficient, minor allele frequency and probability of identity of the 53 SNP loci scored on 39 cacao accessions from the Ghana cacao germplasm collection.

Table 3. Parentage verification for cacao selections with known breeding pedigree, based on 53 SNP markers with LOD scores at 80 and 95% probability. The SNP profiles of the parental clones were generated using original trees from the International Cacao Genebank, Trinidad.