a Facultad de Agronomía, Universidad de la República, Av. Garzón 780 CP 12900. Montevideo, Uruguay
b International Maize and Wheat Improvement Center (CIMMYT), Apdo. Postal 6-641, 06600, Mexico D.F., Mexico
* Corresponding author (j.crossa{at}cgiar.org)
| ABSTRACT |
|---|
Abbreviations: MR, Modified Rogers; CE, Cavalli-Sforza and Edwards; SH, Shannon diversity index; HE, proportion of heterozygous loci per individual; NE, the number of effective alleles; PN, proportion of non-informative alleles in the sample.
| INTRODUCTION |
|---|
A sampling strategy involves defining a sampling intensity, a sampling method, and an allocation method (Thompson, 2002). The sampling intensity defines the sample size as a proportion of the population size, and for core subsets, several authors suggest intensities that range from 5 to 20% of the total number of accessions (Brown, 1989; Schoen and Brown, 1993; Brown and Spillane, 1999; van Hintum, 1999; van Hintum et al., 2000). Regarding sampling methods, researchers have recommended stratified sampling strategies (as opposed to simple random sampling) for managing genetic resources and forming core subsets (Peeters and Martinelli, 1989; Crossa et al., 1994, 1995a, 1995b; Spagnoletti Zeuli and Qualset, 1993; Charmet and Balfourier, 1995; Rincon et al., 1996; Chandra et al., 2002; Franco et al., 2003). These stratified sampling strategies suggest first classifying or clustering the genotypes on the basis of prior knowledge such as origin, passport data, or a numerical classification, followed by an allocation of accessions from each cluster to the subset. Franco et al. (1998, 1999, 2002, 2003) and Franco and Crossa (2002) propose a sequential clustering strategy for forming core subsets using discrete and continuous morphological data simultaneously. This classification approach has been extensively used for forming core subsets in tropical maize, Zea mays L. (Taba et al., 1998, 1999, 2001).
There are many classification methods and many measures of distance between individuals or populations. The most popular clustering methods include the unweighted pair-group method using arithmetic averages, UPGMA (Sokal and Michener, 1958), and Ward's minimum within-group variance method (Ward, 1963). Distances calculated between individuals or populations based on phenotypic data are generally the Euclidean distance for continuous variables and Gower's (1971) distance for mixtures of continuous and discrete variables. Two types of distances are used with molecular marker data, depending on the marker type: (i) non-informative distances (e.g., Simple Matching, Jaccard), which use a table of absence or presence (0s and 1s) of each marker for each individual or population, and (ii) informative, or genetic, distances (e.g., Modified Rogers, Cavalli-Sforza and Edwards), which use information on loci and alleles within loci for each individual or population.
When choosing a core subset using stratified random sampling, an allocation method determines the number of accessions to be selected from each cluster. For core subsets, Brown (1989) described three allocation methods whose sample sizes do not depend on the diversity of the clusters: (i) constant (or fixed) across clusters, (ii) proportional to the cluster size (P method), and (iii) proportional to the logarithm of the cluster size (L method). Franco et al. (2005), working with mixtures of phenotypic variables, proposed using Gower's distance between accessions within each cluster (D method) as the allocation criterion. They found that the D method used with Gower's distance outperformed the other allocation strategies they compared.
Schoen and Brown (1993) addressed the issue of how to use genetic markers to sample collections of wild crops while maximizing allele richness. They proposed the M (maximization) strategy that maximizes the number of observed alleles at each marker locus. The allele richness of a core subset formed by the M strategy is defined in terms of the number of allele classes represented in the sample. Another strategy that uses genetic markers to form subsamples is the H strategy, which seeks to maximize the number of alleles in the core subset by sampling accessions from groups in proportion to their within cluster genetic diversity. Bataillon et al. (1996) used computer simulation for comparing the retention of neutral alleles when forming core collections using nonmarker-based random sampling and stratified random sampling strategies versus the M strategy using genetic markers. The M strategy was more effective for retaining widespread and low frequency neutral alleles than the other sampling strategies. The MSTRAT algorithm developed by Gouesnard et al. (2001) implements the M strategy for selecting accessions that increase the number of allele classes. McKhann et al. (2004) formed core collections from 265 accessions of Arabidopsis thaliana (L.) Heynh. using MSTRAT and single nucleotide polymorphisms (SNPs) from a limited number of DNA fragments.
Marita et al. (2000) developed a computer algorithm that selects accessions for the core by ranking the genetic distance between each accession and the mean of all others. They suggest that core subsets can be formed either by including rare and localized alleles, which maximizes the total allele diversity in the core (as favored by taxonomists and geneticists), or by including widely adapted accessions, which maximizes the representativeness of the genetic diversity in the core (the breeder's perspective). It can be deduced that the breeder's perspective might not maximize the total allele diversity of the core, especially for traits not used to form the core.
Different classification and allocation methods were used by Franco et al. (2005) to form core subsets, employing continuous and categorical phenotypic variables. How the D method proposed by Franco et al. (2005) will perform with genetic distances or allele diversity calculated with molecular markers, and how these will compare with the M strategy, has not been tested.
In this study, 24 sampling allocation strategies were created from the combination of the following three factors: (i) two clustering methods (UPGMA and Ward), (ii) two initial genetic distance measures (Modified Rogers, or Cavalli-Sforza and Edwards), and (iii) six allocation criteria [two of them based on the size of the cluster and four based on the D method of Franco et al. (2005) used with two genetic distances and with three allele diversity indices]. The first objective of this research was to study the influence of these three factors and their interaction on the diversity of the core subsets formed. The second objective of the research was to compare these 24 sampling strategies with the M strategy implemented in the MSTRAT algorithm. Twenty independent stratified random samples were obtained from the molecular markers of three maize data sets, using one sampling intensity (20%) with the purpose of evaluating the different factors and allocation methods affecting the diversity of the core subsets.
| MATERIALS AND METHODS |
|---|
The values of the bulk data set are frequencies obtained with Genotyper 2.1 (PerkinElmer/Applied Biosystems) and the MSSC program, a SAS/IML code written by Dubreuil et al. (2003); allele frequency values fall within the interval [0, 1]; this data set had 1.5% missing values. The frequency values for the accession data set are 0.0, 0.5, or 1.0; these frequencies correspond to each codominant allele of the diploid individuals; no missing values were observed. The frequency values of the population data set are within the [0, 1] interval; no missing values were observed.
Cluster Analyses
The Ward (1963) and the UPGMA (Sokal and Michener, 1958) procedures for clustering observations are hierarchical techniques well described in Kaufman and Rousseeuw (1990). Both clustering strategies can be calculated from different initial matrices of distance between genotypes, such as the Modified Rogers (MR) and Cavalli-Sforza and Edwards (CE) genetic distance measurements. Both clustering methods were applied in this study with the CLUSTER procedure of SAS (2000). The number of groups was determined using the pseudo-F statistic (Calinski and Harabasz, 1974) and the pseudo-t² statistic, which is related to the J(e) statistic (Duda and Hart, 1973). The pseudo-F and the J(e) statistics were found by Milligan and Cooper (1985) to be the best two criteria (out of 30) for defining the number of groups. We used two criteria because there is no unique solution and the criteria can propose different solutions.
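As an illustration of the UPGMA idea only (the study itself used the SAS CLUSTER procedure), the following minimal sketch merges clusters by smallest average pairwise distance until a requested number of groups remains. The function name `upgma_clusters` and the toy distance matrix are hypothetical, the number of groups `g` is simply passed in rather than chosen with the pseudo-F or pseudo-t² statistics, and Ward's method is not covered.

```python
def upgma_clusters(dist, g):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    dist: list of lists with dist[i][j] the distance between genotypes i and j.
    g: number of clusters to stop at (assumed known here).
    Returns a list of clusters, each a list of genotype indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_distance(c1, c2):
        # UPGMA distance between two clusters: mean of all between-cluster pairs.
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > g:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        # Merge the two clusters that are closest on average.
        a, b = min(pairs,
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy matrix (made-up values): genotypes 0 and 1 are close, as are 2 and 3.
toy = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
print(upgma_clusters(toy, 2))
```

Recomputing the average from the original matrix at every step keeps the sketch short; production implementations update distances incrementally instead.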
Distances and Diversity Indices
To test the success of each sampling allocation strategy in conserving diversity in the core as compared with the original collection, we used two genetic distances between pairs of genotypes and three diversity indices. Recently, Reif et al. (2005) gave mathematical and genetic details of the distance measures most commonly used with molecular data. The genetic distances were the Modified Rogers (MR) and the Cavalli-Sforza and Edwards (CE), and the three diversity indices were the Shannon diversity index (SH), the expected proportion of heterozygous loci (HE), and the number of effective alleles (NE) of each cluster and the whole collection. In addition, we used the proportion of noninformative alleles in the sample (PN). These six measures were defined as follows:
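$$MR_{xy} = \frac{1}{\sqrt{2L}} \sqrt{\sum_{l=1}^{L} \sum_{a=1}^{n_l} \left(p_{xla} - p_{yla}\right)^2}, \quad 0 \le MR_{xy} \le 1,$$
where $p_{xla}$ is the estimated frequency of allele a, within locus l, at genotype x; L is the number of loci (SSR markers); and $n_l$ is the number of alleles within the lth locus;
$$CE_{xy} = \frac{1}{\sqrt{2L}} \sqrt{\sum_{l=1}^{L} \sum_{a=1}^{n_l} \left(\sqrt{p_{xla}} - \sqrt{p_{yla}}\right)^2}, \quad 0 \le CE_{xy} \le 1;$$
$$SH = -\sum_{a=1}^{A} p_a \ln(p_a),$$
where $A = \sum_{l=1}^{L} n_l$ is the total number of alleles in the sample, and $p_a$ is the frequency of the ath allele over the whole sample ($\sum_{a=1}^{A} p_a = 1$);
$$HE = L^{-1} \sum_{l=1}^{L} \left(1 - \sum_{a=1}^{n_l} p_{la}^2\right).$$
HE is a composite measure that summarizes genetic variation at the allele level (Berg and Hamrick, 1997), and it is computed as the mean of HE for each locus;
$$NE = L^{-1} \sum_{l=1}^{L} \left(\sum_{a=1}^{n_l} p_{la}^2\right)^{-1},$$
and it measures the number of alleles at a locus and the equality of the allele frequencies at that locus (Berg and Hamrick, 1997); and
PN, the proportion of noninformative alleles in the sample.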
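The measures above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SAS/IML code used in the study: MR and CE are taken in their standard forms scaled to the [0, 1] range, SH is the Shannon index over allele frequencies that sum to one across the whole sample, HE is the mean over loci of one minus the sum of squared allele frequencies, and NE is assumed to be the mean over loci of the reciprocal of that sum. All function names and the input layout (a genotype as a list of loci, each locus a list of allele frequencies) are hypothetical.

```python
import math


def modified_rogers(x, y):
    """Modified Rogers distance between two genotypes (lists of per-locus frequencies)."""
    n_loci = len(x)
    ss = sum((px - py) ** 2
             for locus_x, locus_y in zip(x, y)
             for px, py in zip(locus_x, locus_y))
    return math.sqrt(ss / (2 * n_loci))


def cavalli_sforza_edwards(x, y):
    """Chord-type CE distance, scaled to [0, 1] (assumed normalization)."""
    n_loci = len(x)
    ss = sum((math.sqrt(px) - math.sqrt(py)) ** 2
             for locus_x, locus_y in zip(x, y)
             for px, py in zip(locus_x, locus_y))
    return math.sqrt(ss / (2 * n_loci))


def shannon(freqs):
    """Shannon index; freqs are allele frequencies summing to 1 over the sample."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


def expected_heterozygosity(loci):
    """Mean over loci of 1 - sum of squared allele frequencies."""
    return sum(1 - sum(p * p for p in locus) for locus in loci) / len(loci)


def effective_alleles(loci):
    """Mean over loci of 1 / sum of squared allele frequencies (assumed averaging)."""
    return sum(1 / sum(p * p for p in locus) for locus in loci) / len(loci)


# Two toy genotypes scored at two loci with two alleles each (made-up values).
x = [[1.0, 0.0], [0.5, 0.5]]
y = [[0.0, 1.0], [0.5, 0.5]]
print(modified_rogers(x, y), cavalli_sforza_edwards(x, y))
print(expected_heterozygosity(x), effective_alleles(x))
```

In the toy example the genotypes differ completely at the first locus and are identical at the second, so both distances fall halfway up the squared scale.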
Allocation Methods
P and L Allocation Methods
These allocation methods are based on the size of the clusters, not on their diversity. The P allocation method uses the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$):
$$n_t = n \times \frac{N_t}{\sum_{t=1}^{g} N_t},$$
where n is the total sample size (10 or 20%) and g is the number of clusters. The L allocation method proposed by Brown (1989) uses the logarithm of the size of the tth cluster ($N_t$) to obtain the sample size of the tth cluster ($n_t$):
$$n_t = n \times \frac{\log(N_t)}{\sum_{t=1}^{g} \log(N_t)},$$
where n is the total sample size (10 or 20%).
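A short sketch may help make the two size-based allocations concrete. It assumes the P method sets the cluster sample size proportional to the cluster size and the L method proportional to the logarithm of the cluster size, as described by Brown (1989). The source does not say how fractional sample sizes are rounded, so the largest-remainder rounding used here is an assumption, as are all function names and the toy cluster sizes.

```python
import math


def largest_remainder(shares, n):
    """Round real-valued shares to integers summing to n (assumed rounding rule)."""
    floors = [math.floor(s) for s in shares]
    leftover = n - sum(floors)
    # Hand the remaining units to the clusters with the largest fractional parts.
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:leftover]:
        floors[i] += 1
    return floors


def allocate(weights, n):
    """Split a total sample size n across clusters in proportion to weights."""
    total = sum(weights)
    return largest_remainder([n * w / total for w in weights], n)


def p_allocation(cluster_sizes, n):
    """P method: proportional to cluster size."""
    return allocate(cluster_sizes, n)


def l_allocation(cluster_sizes, n):
    """L method: proportional to the logarithm of cluster size."""
    return allocate([math.log(size) for size in cluster_sizes], n)


# Toy collection of 160 accessions in three clusters, sampled at 20% (n = 32).
sizes = [100, 50, 10]
print(p_allocation(sizes, 32))
print(l_allocation(sizes, 32))
```

The L method shifts sample size toward small clusters relative to the P method. Note that a cluster of size 1 receives zero weight under the logarithm, and neither function checks that the allocation stays within the cluster size.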
D Allocation Method
The D allocation strategy was proposed by Franco et al. (2005) using, as D-criterion, the Gower distance for a mixture of discrete and continuous phenotypic traits. When used with molecular marker data, the D method determines the size of the sample to be drawn from each cluster, which should be proportional to a genetic distance or allele diversity measure within the cluster. Groups that are more diverse will have a larger mean genetic distance (or greater diversity) and therefore larger samples will be drawn from them.
For t = 1, 2, ..., g clusters, the number of accessions ($n_t$) to be drawn from the tth cluster is
$$n_t = n \times p_t = n \times \frac{\bar{d}_t}{\sum_{t=1}^{g} \bar{d}_t},$$
where n is the total sample size to be drawn from the collection (which in this study will be 20% of the entire collection), $p_t$ is the proportion of the sample size to be drawn from the tth cluster, and $\bar{d}_t$ is the mean of the genetic distances between accessions within the tth cluster (MR or CE) or the value for the selected diversity index of this cluster (SH, HE, or NE).
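The D allocation can be sketched as follows, assuming each cluster's share of the sample is its mean within-cluster distance (or diversity index value) divided by the sum of those values over all clusters. The function names are hypothetical, the distance function is passed in by the caller, and the shares are returned unrounded because the source does not specify a rounding rule.

```python
from itertools import combinations


def mean_within_distance(members, dist):
    """Mean of dist(a, b) over all pairs of accessions in one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # a single-accession cluster has no within-cluster distance
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def d_allocation(cluster_values, n):
    """Real-valued sample sizes proportional to each cluster's diversity value."""
    total = sum(cluster_values)
    return [n * value / total for value in cluster_values]


# Toy example: one-dimensional "accessions" with absolute difference as distance.
print(mean_within_distance([0, 1, 3], lambda a, b: abs(a - b)))

# Three clusters with made-up mean distances, total sample size n = 20.
print(d_allocation([0.2, 0.3, 0.5], 20))
```

Under this rule the most diverse cluster in the toy example receives half of the sample regardless of how many accessions it holds.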
Strategies Using Stratified Sampling
To study the influence of three factors (forming 24 factorial combinations) on the diversity of the resulting core subset, the 24 sampling allocation strategies were tested one at a time and the diversity present in the resulting core subsets compared. To simplify the notation we assigned a number (1–24) to each of the 24 strategies as shown in Table 1.
Alleles at a locus might be related to alleles at another locus because they share common ancestries or because the species has a certain mating system (e.g., self-fertilization) that favors gametophytic disequilibrium. The M strategy correlates the allele richness at a marker locus with the allelic richness at a target locus.
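To make the objective of the M strategy concrete, the sketch below counts allele classes (locus and allele combinations present in at least one selected accession) and picks accessions greedily by the number of new classes each adds. This is explicitly not the MSTRAT algorithm of Gouesnard et al. (2001); it is a simple greedy heuristic for the same allele-richness objective, and the function names and toy data are hypothetical.

```python
def allele_classes(freqs):
    """Set of (locus, allele) index pairs with non-zero frequency in one accession."""
    return {(l, a)
            for l, locus in enumerate(freqs)
            for a, p in enumerate(locus)
            if p > 0}


def greedy_richness(accessions, n):
    """Pick n accessions, each time adding the one contributing most new allele classes.

    accessions: dict mapping accession name to per-locus allele frequencies.
    Returns (chosen names in order of selection, number of allele classes covered).
    """
    classes = {name: allele_classes(f) for name, f in accessions.items()}
    chosen, covered = [], set()
    remaining = sorted(classes)  # sorted so that ties are broken deterministically
    while remaining and len(chosen) < n:
        best = max(remaining, key=lambda name: len(classes[name] - covered))
        chosen.append(best)
        covered |= classes[best]
        remaining.remove(best)
    return chosen, len(covered)


# Toy collection: one locus with three alleles (made-up frequencies).
collection = {
    "A": [[1.0, 0.0, 0.0]],
    "B": [[1.0, 0.0, 0.0]],
    "C": [[0.0, 0.5, 0.5]],
}
print(greedy_richness(collection, 2))
```

Accessions A and B are redundant in allele classes, so a subset of two that contains C and either of them already captures all three alleles.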
Independent Samples
Independent Stratified Random Samples
A preliminary study was conducted to determine the minimum number of independent stratified random samples (independent replicates or independent core subsets) required for detecting differences of less than 2% of the overall mean between factor levels at two sampling intensities, 10 and 20%. Five hundred independent repetitions (500 independent core subsets) for each of the 24 factorial combinations were used. Results showed that 20 independent stratified random samples (core subsets) for each of the 24 combinations of factors were sufficient to detect differences between factor levels of 2% or less of the overall mean, and that a sampling intensity of 20% always performed better than 10%.
The 20 independent stratified random samples from a population (collection) were generated with the SURVEYSELECT procedure of SAS (SAS Institute, 2000). Similarly, 20 independent core subsets were obtained for further comparison using the MSTRAT software (http://www.ensam.inra.fr/gap/resgen88; verified 1 December 2005) developed by Gouesnard et al. (2001) to apply the M strategy. The final numbers of iterations run on MSTRAT were 25, 55, and 105 for the population, bulk, and accession data sets, respectively. A higher number of iterations did not improve the values of the core subsets' diversity measurements.
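The stratified sampling step can be illustrated with Python's standard `random` module. This is a hedged sketch rather than a reproduction of the SURVEYSELECT call: it assumes the clusters and the per-cluster sample sizes are already known, draws without replacement within each cluster, and repeats the draw to obtain independent core subsets. The function names, the seed, and the toy clusters are hypothetical.

```python
import random


def stratified_sample(clusters, sample_sizes, rng):
    """Draw sample_sizes[t] accessions without replacement from clusters[t]."""
    core = []
    for members, n_t in zip(clusters, sample_sizes):
        core.extend(rng.sample(members, n_t))
    return core


def independent_core_subsets(clusters, sample_sizes, replicates=20, seed=0):
    """Generate independent stratified random samples (core subsets)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [stratified_sample(clusters, sample_sizes, rng)
            for _ in range(replicates)]


# Toy collection: two clusters, with two and one accessions drawn respectively.
clusters = [["a1", "a2", "a3", "a4"], ["b1", "b2"]]
subsets = independent_core_subsets(clusters, [2, 1], replicates=20, seed=1)
print(len(subsets), subsets[0])
```

Each replicate respects the per-cluster allocation, so differences among replicates reflect only which accessions were drawn within clusters.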
Selection Criteria for the Best Core Subset
The criteria used for selecting the best core subsets are the same as those used for creating the core subsets. We used two criteria for selecting the best core subset: (i) the average genetic distance (MRs or CEs) between pairs of accessions of the core subset and (ii) the diversity indices (SHs, HEs, and NEs) for measuring the allele richness of each core subset. Thus, the best core subset is the one with (i) the highest average genetic distance between accessions (measured with MRs and CEs), (ii) the highest allele richness (as measured with SHs, HEs, and NEs), and (iii) the lowest proportion of non-informative alleles (PNs). These criteria are in agreement with Marita et al. (2000), who suggest that core subsets can be formed either by maximizing the total diversity, thus ensuring the inclusion of restricted or rare alleles (taxonomists' or geneticists' perspective), or by maximizing the representativeness of the genetic diversity in the core subset by including "generalist" alleles (breeder's perspective). Values of MRs, CEs, SHs, HEs, NEs, and PNs were obtained from 20 independent replicates (core subsets) and subjected to statistical analyses where they were used as response variables.
Statistical Analyses
For evaluating the importance of the main effects of the three factors and their interaction (first objective), variance components and their contribution to the total variance were estimated by the VARCOMP procedure from SAS (2000) with the Restricted Maximum Likelihood Estimation (REML) option. Also, mean comparisons were performed for the significant sources of variation by the Tukey test at the 1% probability level. Twenty independent core subsets for each factorial combination were used and the response variables were MRs, CEs, SHs, HEs, NEs, and PNs.
For comparing the 24 sampling strategies with the M strategy (MSTRAT algorithm) (second objective), the 25 resulting treatments (with 20 independent replicates each) were compared for MRs, CEs, SHs, HEs, NEs, and PNs by the Dunnett test at the 0.01 probability level.
All procedures were implemented in SAS-IML software, the SAS-STAT module, and the MSTRAT software.
| RESULTS |
|---|
Effect of the Factors and Their Interactions on the Diversity of the Core Subset
Variance Components
For the bulk data set, the distance criteria MRs and CEs were mainly affected by the allocation method (ALLOC) (54 and 56% of the total variance, respectively) (Table 2) and the cluster method (CM) (28 and 15%, respectively). They were less affected by the cluster method × allocation method interaction (CM × ALLOC) (7 and 5%, respectively) and the cluster method × distance interaction (CM × DIST) (6 and 14%, respectively). For the selection criteria given by the diversity indices SHs, HEs, and NEs, the CM × DIST interaction was a relatively important source of variability (28, 14, and 32%, respectively). The variable PNs for the bulk data set did not seem to be highly affected by any of these sources of variability. A sizeable portion of the total variance was left unexplained for SHs, HEs, NEs, and PNs (58, 71, 59, and 88%, respectively).
The variance components due to CM, DIST, ALLOC, and their interactions for the population data set were very similar to those found in the bulk data set. Values of MRs and CEs were mainly affected by ALLOC (21 and 22% of the total variance, respectively, Table 2) and CM (33 and 38%, respectively), and less affected by the CM × ALLOC interaction (4 and 3%, respectively) and the CM × DIST interaction (10 and 3%, respectively). For SHs, HEs, and NEs, most of the total variance was left unexplained.
In summary, for the three data sets, the distance selection criteria (MRs and CEs) were affected by clustering method, allocation method, and their interaction. On the other hand, the diversity selection criteria (SHs, HEs, and NEs) were affected only slightly by the clustering method in the bulk and population data sets and by all factors in the accession data set. The variance of the auxiliary selection variable PNs was explained principally by the residual component, showing that all factors produce a similar number of non-informative alleles in the core groups.
Mean Comparisons
For the bulk data set, Tukey's tests indicated that, on average, UPGMA formed core subsets that were more diverse than those formed by Ward for the two genetic distances (MRs and CEs) but less diverse than Ward for the three diversity indices (SHs, HEs, NEs) (Table 3). Allocation criteria that used the CE genetic distance formed more diverse core subsets than those using the MR genetic distance for MRs, CEs, SHs, HEs, and NEs. The allocation method that produced the most diverse core subsets with respect to all six selection criteria was the D method used with MR or CE, followed by the D method used with the number of effective alleles (NE). This result suggests that for the bulk data, the D method is effective in selecting genotypes that are diverse in terms of genetic distances and have significantly higher allele richness.
For the population data set, core subsets assembled by UPGMA were significantly more diverse than those formed by Ward for all the selection criteria except NEs (Table 3). CE formed more diverse clusters than those formed using MR for all selection criteria except for MRs and PNs. The D allocation method used with MR or CE was the best for all six selection criteria.
For all three data sets, most of the allocation strategies formed core subsets that were more diverse than the entire collection as measured by the six selection criteria (Table 3). These results indicate that the primary objective, using allocation strategies that form core subsets that preserve the diversity of the original collection by eliminating redundant accessions, was achieved.
In summary, results show that the most important trends were (i) the UPGMA cluster method was almost always better than Ward for the distance criteria, (ii) the CE genetic distance tended to be better than the MR for all three data sets, and (iii) for the three data sets the best allocation method was D used with MR, CE, or NE.
Comparing All Strategies with the M Strategy by the Diversity of Their Core Subsets
For each data set, the average diversity of the 20 independent core subsets formed for each of the 24 factor combinations was compared with the average diversity of 20 independent core subsets formed by the M strategy (Table 4 and Fig. 1–3).
For the accession data set, strategies 14 through 18 and 21 through 24 formed core subsets significantly more diverse than those formed by the M strategy for the MRs response variable (Fig. 2a). With regard to the CEs selection criterion, strategies 15–18 and 21–24 formed core subsets with the same diversity as the M strategy (Fig. 2b); the best strategy (24) was only 0.1% better than the M strategy (Table 4). The best strategy in terms of SHs was the M strategy (Fig. 2c). For the HEs selection criterion, strategies 23 and 24 formed core subsets with the same diversity as the M strategy (Fig. 2d), and for the NEs selection criterion, strategies 21, 23, and 24 formed core subsets with the same diversity as the M strategy (Fig. 2e). Core subsets from the best strategy (24) had, on average, 16.4% noninformative alleles, whereas core subsets formed from the M strategy had no noninformative alleles.
Results for the population data set were very similar for MRs and CEs. Strategies 15 through 18 and 20 through 24 formed significantly more diverse cores than the M strategy (Fig. 3a and 3b), with gains over the M strategy of 5.6 and 4.3%, respectively (Table 4). For the selection criteria, SHs, NEs, and PNs, the M strategy formed statistically more diverse core subsets (Fig. 3c, 3e, 3f). For HEs, strategies 20, and 22 through 24 did not differ from the M strategy (Fig. 3d). The percentage of noninformative alleles was 26.4% for strategy 24 and 18.3% for the M strategy.
In summary, for criteria based on MRs, strategies 15, 17, 18, and 21–24 formed more diverse core subsets than the M strategy in the three data sets, whereas for CEs, strategies 18, 23, and 24 were better than the M strategy. For the response variables related to the diversity indices (SHs, HEs, and NEs) and for PNs, the M strategy was better than any of the D strategies, except for HEs in the accession and population data sets, where strategies 23 and 24 formed core subsets with diversity similar to those formed by the M strategy.
| DISCUSSION |
|---|
It can be speculated that the D method with the genetic distances MR or CE between genotypes should produce core subsets that would tend to represent the genotype diversity present in the entire collection (closer to the breeder's objectives). On the other hand, the D method with the allele diversity indices SH, HE, or NE will tend to produce core subsets that will fully represent the allele diversity present in the entire collection (closer to the taxonomist's objectives). The D method used with genetic distances or allelic diversity indices seems to combine well the two main criteria by which accessions are selected for forming a core subset: (i) allele representativeness obtained by the retention of alleles that are widespread in the collection but usually found at low frequencies in each accession and (ii) allele richness achieved by conserving very localized (rare) alleles that might be found at high frequency in only a few accessions. This agrees with the reasoning of Marita et al. (2000) regarding conserving alleles by using the breeders' and taxonomists' criteria simultaneously. The objective of the D method is to select the most diverse accessions in terms of genetic distances among genotypes (breeder's perspective for forming core subsets), whereas the M strategy emphasizes selecting accessions with the most diverse alleles (taxonomists' and geneticists' perspective). Nevertheless, some D strategies showed a good recovery of the allelic diversity when compared with the M strategy in the accession data set for HEs and NEs and in the population data set for HEs. The ultimate goal is to form core subsets that simultaneously maximize allele richness and representativeness. Results of this research indicated that the D method provides appropriate criteria for forming core subsets that will tend to increase both the genetic distance and the genetic diversity of the accessions in different clusters.
Results of this study agreed with Bataillon et al. (1996) and Gouesnard et al. (2001) in the sense that the M strategy is very effective in eliminating redundancy that comes from noninformative alleles because of possible correlations among loci (linkage disequilibrium), which can arise from shared coancestry (population mixtures) and certain assortative mating systems. The M strategy was effective for forming core subsets with high allele richness and a low proportion of noninformative alleles. Franco et al. (2001) determined that large numbers of molecular markers are noninformative for the purpose of classifying genotypes and proposed statistical analyses for discarding redundant markers. The D methods, however, were less effective than the M strategy at excluding noninformative alleles from the core subsets.
Simulated results from Bataillon et al. (1996) show that the M strategy did not outperform other strategies when migration and outcrossing are present in the collection, but it performed well when associations between loci are caused by population substructure due to selfing in the absence of gene flow. This might be a reason why the M strategy outperformed the D method on the diversity indices SHs and NEs in the population data set but not on HEs and NEs in the accession data set. In the accession data set, the HE correlation between the 26 loci was very low (0.05) but higher (0.14) when analyzed as populations. The HE correlations were obtained by computing the HE of each locus in each individual and then calculating the Pearson correlation between the HE values across individuals for every pair of loci. In the accession data set, it is expected that outcrossing between different landraces will be substantial because more than one accession was collected from neighboring geographical regions in Mexico (several accessions from the same states), where the farmers have been found to exchange seeds with their neighbors quite frequently (Pressoir and Berthaud, 2004). On the other hand, it is not expected that outcrossing will occur frequently between landraces that are geographically distant. Although the population data set was derived from the accession data set, their structures are different. The population and bulk data sets are similar in the sense that both contain allele frequencies combined over several individuals (per bulk or population); however, the accession data set had allele frequencies for each individual, disregarding the groups (populations) to which they belong.
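The HE correlation described in this paragraph can be sketched as follows: compute HE for each locus in each individual, then take the Pearson correlation of those values across individuals for every pair of loci. How the pairwise correlations were summarized into the single values reported (0.05 and 0.14) is not stated here, so the sketch stops at the pairwise correlations. Function names and the toy individuals are hypothetical.

```python
import math
from itertools import combinations


def locus_he(locus):
    """HE at one locus of one individual: 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in locus)


def pearson(u, v):
    """Pearson correlation; returns NaN when either variable has no variance."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if sd_u == 0 or sd_v == 0:
        return float("nan")
    return cov / (sd_u * sd_v)


def he_correlations(individuals):
    """Pearson correlation of per-individual HE for every pair of loci."""
    n_loci = len(individuals[0])
    he = [[locus_he(ind[l]) for ind in individuals] for l in range(n_loci)]
    return {(i, j): pearson(he[i], he[j])
            for i, j in combinations(range(n_loci), 2)}


# Toy diploid individuals at two loci: heterozygous (0.5, 0.5) or homozygous (1, 0).
het, hom = [0.5, 0.5], [1.0, 0.0]
individuals = [[het, het], [hom, hom], [het, het], [hom, hom]]
print(he_correlations(individuals))
```

In the toy data heterozygosity at the two loci always coincides, which is the extreme case of the between-locus association the paragraph discusses.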
The D method was always superior to the M strategy with respect to the breeders' perspective and equal to or worse than the M strategy concerning the taxonomists' point of view. One advantage of the D method is that it can be used with continuous and categorical variables because it is possible for the Ward-MLM or UPGMA-MLM strategies to form clusters with continuous and discrete variables (Franco et al., 2001). Using continuous variables with the M strategy requires the continuous variable to be broken into several series of discrete variables. Although the criterion of conserving allele diversity for qualitative loci is important, the challenge of preserving quantitative genetic variation in conjunction with marker variation should be considered given the potential for marker-based genetic resources conservation and germplasm enhancement. Further research is required to examine the effectiveness of the D allocation method when phenotypic and genetic marker data are used simultaneously. The presence of redundant markers can be detected, and the simultaneous use of relevant genetic markers and quantitative traits will form better core subsets. Nevertheless, further research is required to examine the performance of the D method as compared with the M strategy for sampling individuals and forming diverse core subsets for different maize materials and for other crops, as well as for genetic markers other than the SSRs used in this research.
Received for publication July 11, 2005.